1. Introduction
The main purpose of air traffic control (ATC) is to guide each aircraft safely and efficiently through the airspace towards its destination by following applicable regulations and agreements. Safety has the highest priority, and the air traffic controller (ATCO) must ensure that the air traffic is always kept separated and that aircraft maintain at least the minimum distance according to applicable regulations (commonly 5 nautical miles lateral and 1000 ft vertical in upper airspace). In air traffic control, an intrusion (or loss of minimum separation) occurs when two aircraft are closer than allowed by applicable regulations, and the term conflict is generally used when there is a predicted intrusion, i.e., the aircraft will end up in an intrusion if no intervention is taken by the ATCO. Efficiency is also a key factor in ATC—if there are no conflicts or restrictions affecting a flight, the ATCO commonly allows the aircraft to follow its requested trajectory or even provides shortcuts, when possible, to reduce environmental impact and costs for airlines. When actions are needed, the ATCO sends directives to the aircraft using radio, either by voice or data-link messages (CPDLC). It is desirable to use as few actions as possible, both to maintain a low workload for pilots and ATCOs and because of limitations in communication. Voice communication is slow and limits the transmission rate, and CPDLC, while faster, must still be handled by the pilot and is limited to non-time-critical commands.
Even though there have been periods of decline in traffic, e.g., during economic recessions and pandemics, the long-term trend is that the number of flights in controlled airspace increases, and it is expected to continue to do so in the future [
1]. This results in an increasing demand for the capacity of air traffic control, which, in some areas, is already congested. The capacity of the airspace is limited by the number of sectors it can reasonably be divided into, which need to be sufficiently large to allow the ATCO to solve conflicts and sequence traffic within their sector. One solution to this problem is to increase the level of automation in the air traffic control system by providing the ATCO with better support tools to reduce the workload. This may increase safety and efficiency and reduce staff costs in non-congested areas by allowing ATCOs to provide control in larger sectors. A key process to automate in this context is that of conflict detection and resolution (CD&R), which is an active area of research.
1.1. Related Work
Automated conflict resolution in air traffic control has been an active area of research for several decades, with a large variation of methods proposed. A common approach is to use various algorithms to search for suitable conflict resolutions when a conflict has been detected. Algorithms for finding numerical solutions for lateral conflicts were presented by Bicchi and Pallottino [
2], modeling aircraft as having constant velocity in a Cartesian coordinate system. Pallottino et al. [
3] also presented methods for using mixed-integer programming to solve conflicts by using changes in either velocity or heading. Methods that allow for more realistic aircraft trajectories often separate the search for solutions from the process of validating that the trajectories are conflict-free. The NASA Automated Airspace Concept [
4] uses an ordered set of solutions generated by an expert-like system, which are validated in turn until a valid resolution is found. Stroe and Andrei suggested an algorithm called Protocol-Based Conflict Resolution [
5]. Srivatsa et al. [
6] used lattice-based search space exploration to generate a ranked set of preferred solutions to validate. Other search approaches use ant colony optimization [
7], Monte Carlo methods [
8], and particle swarm optimization [
9] to search for conflict-free trajectories. A more complete review of conflict resolution methods can be found in [
10]. The algorithmic approach to solving conflicts is, in many cases, successful but is often very computationally intensive, and in complex situations, the calculations may not be able to find a solution within a required time.
Another approach for automated conflict solving is the use of Reinforcement Learning (RL), where an agent is trained to solve conflicts. Several different methods and RL algorithms have been suggested; comprehensive and in-depth reviews of the various approaches can be found in [
11,
12]. Related to this paper are methods using Multi-Agent Reinforcement Learning (MARL), in which each agent represents an aircraft and the task of solving conflicts is handled in a cooperative manner. In many studies, MARL has been successfully applied to the CD&R problem using either a discrete or continuous action space.
Methods using a discrete action space have been shown to be able to solve conflicts in several studies using Q-learning [
13,
14,
15,
16,
17]. This adapts well to centralized air traffic control, where a few decisive actions to solve conflicts are preferred, and it also allows the output to have an explicit representation for cases when no action is needed. The drawback is that these methods often have a limited set of actions, which also limits the agent's ability to find an efficient solution to the conflict. Increasing the size of the action space may alleviate this problem but at the cost of making the model more difficult to train. However, methods for handling this problem have been suggested [
18,
19].
Several methods using a continuous action space have also been suggested, e.g., [
20,
21,
22,
23,
24]. This allows each agent to exercise fine-grained control over how its state should be changed and significantly reduces the dimension of the output, e.g., an agent that can control speed, heading, and altitude only needs three outputs to do so. However, this approach often results in all agents making small adjustments in each step, which may not be a problem in decentralized CD&R (e.g., for drones and in free-flight airspace), but in centralized air traffic control, it would be impossible for the ATCO to communicate all actions to the pilots. One approach to reduce this problem is to use longer time steps between actions. This was suggested by Brittain et al. [
25,
26,
27,
28,
29], who used a time step of 12 s to incite the model to take fewer and more decisive actions. In [
30], Zhao and Liu explored the use of 20-, 40-, and 60-s time steps and concluded that “a short time-step will increase the pilot or controller’s workload, and a long time-step will increase the risk due to uncertainties in each time-step”. This method of reducing the number of actions also does not provide explicit means for the model to express that no action is needed other than ignoring values in a specific range, e.g., close to zero.
1.2. Contributions
In this paper, we introduce a novel method for adapting CD&R based on multi-agent reinforcement learning with a continuous action space to centralized air traffic control, an important step towards quicker adoption of automated support tools for conflict resolution. To accomplish this, we propose learning a priority level for each aircraft and formulating the joint action so that, at most, one aircraft can execute instructions in each time step. Furthermore, actions with a priority of less than zero are never executed, allowing the agents to explicitly signal that no action is needed. By formulating the reward function such that it penalizes high priority, we show that the agents can learn to take only a few decisive actions when necessary. We show that this method significantly reduces the number of actions taken while still maintaining a high level of performance in solving conflicts (see
Figure 1).
Additionally, we show how other design choices in the learning setup are of critical importance, such as individually terminating aircraft during training when they exit the sector. In a range of experiments, we evaluate our method and the design choices, demonstrating high performance for automated ATC. As such, this work presents a significant step towards an automated solution that can be used in real-world scenarios.
2. Method and Experimental Setup
Here, we describe our method and simulation setup, with different variations for pinpointing the critical design choices. We also formulate a number of alternative environments, representative of previous work and existing strategies for reducing the number of actions.
2.1. ATC Simulation
For this work, we developed a simplified air traffic control simulator similar to the environments used in [
14,
22,
24]. Although more realistic alternatives exist, e.g., BlueSky [
31], such high fidelity is not needed in our experiments and would have come at the cost of reduced execution speed during training. The simulator uses a single square sector with 120 NM sides for scenario generation; aircraft are allowed to travel outside the sector without any penalty. The airspace is assumed to be a free-flight airspace where the pilots are free to plan their routes without the need to follow any predefined airways. In the simulation, all aircraft are assumed to be on the same flight level (altitude), and the automated ATCO is limited to solving conflicts by turning the aircraft. Turns are executed immediately by adding the size of the turn to the current heading; in real ATC, it would take some time for the pilot to initiate the turn, and the turn rate is limited. When the simulation time is advanced, each aircraft moves linearly along its heading a distance corresponding to its speed. The objective is to guide all aircraft from their starting points to within 5 NM of their sector exit points while maintaining a separation of at least 5 NM. When the exit point has been reached, the aircraft is removed from the simulation. An image showing the simulator can be found in
Figure 2.
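For concreteness, the following Python sketch illustrates the simulator's update step under the simplifications stated above (immediate turns, linear movement, removal within 5 NM of the exit point). The class and method names are illustrative and not the actual simulator interface.

    import math

    KT_TO_NM_PER_S = 1.0 / 3600.0   # NM/h to NM/s

    class Aircraft:
        """Illustrative aircraft state: position in NM, heading in degrees (0 = north), speed in NM/h."""

        def __init__(self, x, y, heading_deg, speed_kt, exit_x, exit_y):
            self.x, self.y = x, y
            self.heading_deg = heading_deg
            self.speed_kt = speed_kt
            self.exit_x, self.exit_y = exit_x, exit_y

        def turn(self, turn_deg):
            # Turns are executed immediately by adding the turn size to the current heading.
            self.heading_deg = (self.heading_deg + turn_deg) % 360.0

        def advance(self, dt_s):
            # Linear movement along the current heading for dt_s seconds.
            dist = self.speed_kt * KT_TO_NM_PER_S * dt_s
            rad = math.radians(self.heading_deg)
            self.x += dist * math.sin(rad)
            self.y += dist * math.cos(rad)

        def has_exited(self, threshold_nm=5.0):
            # The aircraft is removed once it is within 5 NM of its exit point.
            return math.hypot(self.exit_x - self.x, self.exit_y - self.y) <= threshold_nm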
2.1.1. Scenario Generation
At the start of each episode, a new traffic scenario is generated by creating M aircraft with random starting points within the sector, at least 20 NM from the border, and random exit points on the border of the sector. Initially, each aircraft is assigned a random speed (in NM/h) within a fixed range and a heading towards its exit point. The scenario generator ensures that aircraft are separated by at least 10 NM (two times the separation minima) at the start of a scenario.
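As an illustration, a scenario generator consistent with the description above could look as follows. This is a sketch under stated assumptions: the speed range is given only as a placeholder (SPEED_RANGE_KT), and the coordinate and heading conventions are assumptions for illustration.

    import math
    import random

    SECTOR_SIDE_NM = 120.0
    BORDER_MARGIN_NM = 20.0
    MIN_START_SEPARATION_NM = 10.0          # two times the 5 NM separation minimum
    SPEED_RANGE_KT = (400.0, 500.0)         # placeholder; the actual range is defined in the paper

    def random_exit_point():
        # A random point on one of the four sides of the sector border.
        side = random.choice("NSEW")
        t = random.uniform(0.0, SECTOR_SIDE_NM)
        return {"N": (t, SECTOR_SIDE_NM), "S": (t, 0.0),
                "E": (SECTOR_SIDE_NM, t), "W": (0.0, t)}[side]

    def generate_scenario(m):
        """Create m aircraft with separated random start points and border exit points."""
        aircraft = []
        while len(aircraft) < m:
            x = random.uniform(BORDER_MARGIN_NM, SECTOR_SIDE_NM - BORDER_MARGIN_NM)
            y = random.uniform(BORDER_MARGIN_NM, SECTOR_SIDE_NM - BORDER_MARGIN_NM)
            # Ensure at least 10 NM separation to all previously placed aircraft.
            if any(math.hypot(x - a["x"], y - a["y"]) < MIN_START_SEPARATION_NM for a in aircraft):
                continue
            ex, ey = random_exit_point()
            aircraft.append({
                "x": x, "y": y,
                "exit": (ex, ey),
                "speed_kt": random.uniform(*SPEED_RANGE_KT),
                # Initial heading points towards the exit point (0 = north, 90 = east).
                "heading_deg": math.degrees(math.atan2(ex - x, ey - y)) % 360.0,
            })
        return aircraft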
2.1.2. Environments
Gymnasium [
32] environments wrapping the simulator were developed to provide the desired behavior for each of the experiments. The observation space for each environment can be found in
Table 1 and the reward in
Table 2. Common to all environments is that the action space has a parameter representing the size of the turn. This is a normalized value, which is mapped to a turn size in degrees by the simulation environment (a code sketch of this common behavior follows the environment list below). The episode terminates when all aircraft have reached their exit points or after 350 steps.
- SA
The single-action environment is the main focus of this experiment and allows for, at most, one aircraft to execute a turn in each step. To determine which turn should be executed, an additional parameter in the action space called priority is used to select, at most, one action in each step (see
Section 2.1.3). In this environment, each time step advances the simulation time by five seconds, which is a common update rate of radar trackers.
- PA
The parallel action environment is identical to the SA environment, except without the priority mechanism. Thus, each aircraft can execute a turn in each step. This environment is used as a reference to compare the performance with and without the priority mechanism.
- PL5, PL20
The parallel limited action environments allow any of the aircraft to execute a turn in each step. By only executing turns that are at least 1 degree in magnitude, we allow the agents to take no action. To reduce the number of turns used, we apply a fixed cost in the reward function for each executed turn. In the PL5 environment, the time step is 5 s, and in PL20, it is 20 s. These environments are used to compare the priority mechanism with the use of reward penalties and longer time steps to limit the number of actions.
- RE
Finally, the reference environment is used as a baseline to compare our results against the method suggested by Badea et al. [
24] by using their observation space and reward function. However, the action space used here is limited to turns only. In this environment, any turn with a magnitude greater than zero is executed.
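The common behavior shared by the environments (a normalized turn-size action mapped to degrees, a fixed time step, and truncation after 350 steps) can be sketched as a Gymnasium wrapper. This is a minimal sketch, assuming a [-1, 1] action range; MAX_TURN_DEG and the simulator calls are placeholders, not the actual implementation.

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    MAX_TURN_DEG = 45.0   # placeholder; the actual mapping range is defined in the paper
    MAX_STEPS = 350       # episodes are truncated after 350 steps

    class ATCEnvBase(gym.Env):
        """Sketch of the behavior shared by the environments: normalized turn actions and truncation."""

        def __init__(self, simulator, n_aircraft, time_step_s=5.0):
            self.sim = simulator          # illustrative simulator object, not the actual interface
            self.n_aircraft = n_aircraft
            self.time_step_s = time_step_s
            self.steps = 0
            # One normalized turn-size value per aircraft (assumed range [-1, 1]).
            self.action_space = spaces.Box(low=-1.0, high=1.0,
                                           shape=(n_aircraft,), dtype=np.float32)
            # observation_space omitted here; it follows the parameters listed in Table 1.

        def map_turn(self, normalized_turn):
            # Map the normalized action value to a turn size in degrees.
            return float(normalized_turn) * MAX_TURN_DEG

        def step(self, actions):
            for i, a in enumerate(actions):
                self.sim.turn(i, self.map_turn(a))        # illustrative call
            self.sim.advance(self.time_step_s)
            self.steps += 1
            terminated = self.sim.all_aircraft_exited()   # illustrative call
            truncated = self.steps >= MAX_STEPS
            return self.sim.observe(), self.sim.reward(), terminated, truncated, {}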
2.1.3. The Priority Filter
The priority mechanism is used to limit the number of simultaneous commands in the multi-agent environment, as well as to provide an option not to take an action. This is accomplished by having an extra output for each agent, called priority, in a normalized range
(thus, in our example, each agent has two outputs: priority and turn size). As in normal multi-agent reinforcement learning, the list of actions for all agents is sent to the environment in each step; however, in our method, the action list is processed by the priority filter (see
Figure 1). The priority filter removes all actions from the list except the one with the highest priority value. If the priority value of the remaining action is <0, it is also removed. Thus, the list of actions sent from the environment to the simulator contains, at most, one action. The agent must then learn to provide proper priorities during training, as well as learning to turn aircraft to avoid separation losses. To make the agent learn to avoid using unnecessarily high priorities, we also assigned a small negative reward for positive priorities (see Table 2). The magnitude of the priority reward was selected based on the magnitudes used for drift and turn size.
If we denote each agent's action with a tuple $(p, s)$, where $p$ is the priority and $s$ is the size of the turn, the action $(0.3, 0.44)$ would mean that the agent suggests a turn of size 0.44 (approximately 10 degrees) with a priority of 0.3, leaving room for agents with higher priority to execute their actions. Using a simple environment with three agents as an example, if the last action in the list has the highest priority, e.g., 0.9, the filter selects that action to be sent to the simulator for execution. If, on the other hand, the highest priority in the list is negative, e.g., belonging to the middle action, that action is first selected and then removed, since its priority is <0; therefore, no action is sent to the simulator in this case.
The idea behind this mechanism is to emulate how human ATCOs work by prioritizing the most important task first, then moving on to other tasks and explicitly stating that nothing should be done by providing priorities <0. The mechanism could also be used to select any number of actions based on priority. In this case, we selected one action in each step of 5 s, since this is a reasonable rate when using voice over radio to provide commands to the pilot.
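A minimal Python sketch of the priority filter is shown below. The tuple layout and the function name are illustrative; only the selection rule follows the description above: keep the single highest-priority action, and drop it if its priority is below zero.

    def priority_filter(actions):
        """Keep at most one action: the one with the highest priority, and only if that priority is >= 0.

        `actions` holds one (agent_id, priority, turn_size) tuple per remaining agent;
        the returned list contains a single tuple or nothing (no action this step).
        """
        if not actions:
            return []
        best = max(actions, key=lambda a: a[1])      # a[1] is the priority
        return [best] if best[1] >= 0.0 else []

    # Three agents: the last action is executed, since its priority (0.9) is the highest.
    print(priority_filter([(0, 0.3, 0.44), (1, -0.2, 0.10), (2, 0.9, -0.25)]))
    # All priorities negative: the highest one is still < 0, so no action is executed.
    print(priority_filter([(0, -0.4, 0.44), (1, -0.1, 0.10), (2, -0.7, -0.25)]))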
2.2. Model
Our simulation environment can be abstracted as a decentralized partially observable Markov decision process (Dec-POMDP) [
33]. The Dec-POMDP model extends the single-agent POMDP model by considering joint actions and observations. At every step $k$, each agent $i$ takes an action $a^i_k$, leading to one joint action $\mathbf{a}_k$ at every step; similarly, we have a joint partial observation $\mathbf{o}_k$ of the global state $s_k$. The reward function $R(s_k, \mathbf{a}_k)$ maps states and joint actions to a real-valued reward $r_k$ in each step, which is used to specify the goal of the agents. To fully specify an optimality criterion over all the steps, we use the expected cumulative reward $\mathbb{E}\left[\sum_{k} \gamma^{k} r_k\right]$, where $\gamma$ is the discount factor. The problem is then to find a joint policy that maximizes the optimality criterion.
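As a small worked example of this optimality criterion, the following snippet computes the discounted cumulative reward for one episode, assuming the same discount factor as used later in training.

    def discounted_return(rewards, gamma=0.99):
        """Cumulative discounted reward: sum over k of gamma**k * r_k for one episode."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Example with three steps and rewards -1, -1, 10:
    print(discounted_return([-1.0, -1.0, 10.0]))   # -1 - 0.99 + 0.9801 * 10 = 7.811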
In this research, we use multi-agent reinforcement learning to learn the joint policy. The policy is trained using the soft actor critic (SAC) algorithm with automatic tuning of entropy [
34]. SAC is a model-free actor–critic algorithm in the continuous domain based on the maximum entropy framework, which provides improvements in both exploration and robustness. Since SAC is an off-policy algorithm, it also increases the sample efficiency as compared to on-policy algorithms. SAC has been shown to outperform other state-of-the-art algorithms, e.g., DDPG and PPO, in many tasks and has produced good results in similar experiments [
24].
During training, the replay buffer is updated with the tuple $\langle o^i_k, a^i_k, r_k, o^i_{k+1}, t^i_k \rangle$ for each agent $i$ in each step $k$ of the simulation, where $o$ represents the observations, $a$ represents actions, $r$ is the shared reward, and $t$ represents terminal states. The observations for the continuous parameters were normalized so that the average was 0 and the standard deviation was 1 for these parameters; flags (boolean parameters) were kept as either 0 or 1. Samples from the replay buffer are then used to train the networks. The training algorithm is described in detail in Algorithm 1.
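The per-agent storage and observation normalization described above can be sketched as follows; the function names, the flag-mask representation, and the list-like buffer are assumptions for illustration.

    import numpy as np

    def normalize(obs, mean, std, flag_mask):
        """Standardize continuous features to zero mean and unit variance; keep boolean flags as 0/1."""
        normed = (obs - mean) / std
        return np.where(flag_mask, obs, normed)

    def store_transitions(buffer, obs, actions, reward, next_obs, terminals, remaining_agents):
        """Add one (o, a, r, o', t) tuple per remaining agent; the reward is shared by all agents."""
        for i in remaining_agents:
            buffer.append((obs[i], actions[i], reward, next_obs[i], terminals[i]))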
In MARL, the terminal state ($t^i_k$) is set to true for all agents ($i$) when the success criteria are met or when the episode is truncated due to reaching the maximum number of time steps allowed, i.e., a global termination. In the previous work that is most similar to ours [
24], global termination was used: only the agents still remaining in the environment when the episode reaches its terminal state are flagged as terminated, even though aircraft are removed earlier when reaching their exit points. A common setup in MARL is to keep all agents active until a global termination criterion is met; a key difference in our simulation, however, is that each agent is independently removed when it reaches its exit point and therefore no longer affects the outcome of the rest of the episode. We therefore suggest that the terminal flag be set on an individual basis when an aircraft reaches its exit point and is removed from the simulation, since the policy is trained on samples from the replay buffer that represent individual aircraft.
Algorithm 1 Training of the policy

    Initialize replay buffer D
    Initialize policy π
    for e = 1, …, E do                          ▹ E is the number of episodes
        Reset the simulation with seed from the seed sequence
        k ← 0                                   ▹ k is the time step
        o_k ← normalized observations
        d ← false                               ▹ Episode is terminated
        while not d do
            for all i ∈ A_k do                  ▹ A_k = remaining AC
                a_k^i ← sample the policy π using o_k^i
            end for
            Update the simulation with the actions a_k
            o_{k+1} ← normalized observations
            r_k ← reward from the simulation
            d ← true if episode is done or truncated
            for all i ∈ A_k do
                if using global termination then
                    t_k^i ← d
                else if agent i has exited then
                    t_k^i ← true
                else
                    t_k^i ← false
                end if
            end for
            for all i ∈ A_k do
                Add tuple (o_k^i, a_k^i, r_k, o_{k+1}^i, t_k^i) to D
            end for
            Sample mini-batch B from D
            Train policy π using the SAC algorithm using B
            k ← k + 1
        end while
    end for

Here, D denotes the replay buffer, π the policy, A_k the set of remaining aircraft (AC) at step k, and t_k^i the per-agent terminal flag.
2.3. Training, Validation, and Tests
To evaluate our approach, we ran several experiments and trained models using different environments and different scenario sizes (M) with 10, 15, or 20 initial aircraft. This represents a high to very high traffic load for a sector with all the aircraft at the same altitude. The number of observed closest aircraft (N) was varied to be either 2, 4, 8, or 12. With more observed neighbors, it should be possible to make better decisions when solving conflicts, although this also increases the complexity of the input data and may, therefore, be more difficult to train. We trained policies for each combination of termination method, observation size, and scenario size where the scenario size was greater than the observation size, giving, in total, 22 combinations for the SA environment. For the reference environments, we trained one model each of PA, PL5, PL20, and RE using an observation size of 4 and a scenario size of 15 and used individual termination. During training, we recorded the reward, the number of time steps, and the number of intrusions in each episode. Here, the number of intrusions for each time step is the number of aircraft pairs that have a distance of less than 5 NM, and the total number of intrusions in an episode is the sum of all the intrusions in each step. Thus, a pair of aircraft that has a separation loss that lasts for two time steps is counted twice, and an aircraft that has a separation loss with two different aircraft at the same time is also counted as two intrusions. Every 10th episode, a predefined set of 20 randomly generated validation scenarios was run, and the average of the same parameters was recorded. For each trained policy, the 10 models with the best validation results were saved. All models were trained for 30,000 epochs, except for PL20, where we used 120,000 to obtain an equivalent amount of training data.
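As a quick sanity check of the experimental grid, the snippet below enumerates the combinations of termination method, observation size N, and scenario size M with M > N, confirming the 22 SA policies mentioned above.

    from itertools import product

    terminations = ["individual", "global"]
    observation_sizes = [2, 4, 8, 12]      # N: number of observed closest aircraft
    scenario_sizes = [10, 15, 20]          # M: number of initial aircraft

    combos = [(t, n, m) for t, n, m in product(terminations, observation_sizes, scenario_sizes)
              if m > n]
    print(len(combos))   # 22 trained SA policies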
The network used in all environments was a fully connected network with three hidden layers of size 256 and an input and output that correspond to the dimensions of the state and action space, respectively. The ReLU activation function was used in all layers. The discount factor ($\gamma$) was set to 0.99, and the soft update coefficient ($\tau$) was set to 0.005. The replay buffer size was set to 1,000,000, with the first 100 steps randomly initialized. During training, the policy was updated every second step using a batch size of 256 and the AdamW optimizer with a learning rate of 0.0005.
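A minimal PyTorch sketch of the network described above is given below. It shows only the fully connected backbone and the optimizer settings from the text; the SAC-specific actor and critic heads are omitted, and the input dimension used here is purely illustrative (the real dimension follows from Table 1).

    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=256, n_hidden=3):
        """Fully connected network with three hidden layers of size 256 and ReLU activations."""
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        return nn.Sequential(*layers)

    # Example: an SA actor backbone with two outputs (turn size, priority).
    # in_dim=26 is illustrative only; the real observation dimension follows from Table 1.
    actor_backbone = mlp(in_dim=26, out_dim=2)
    optimizer = torch.optim.AdamW(actor_backbone.parameters(), lr=5e-4)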
The trained models were tested using scenarios with 10, 15, and 20 aircraft, using 1000 predefined randomly generated scenarios of each size (different from the training and validation scenarios). All models were tested on the different scenario sizes, regardless of the training scenario size, except models with an observation size of 12, where only scenarios with sizes of 15 and 20 were used. For each scenario, the number of intrusions, distance overhead, and actions were recorded. The number of intrusions is defined in the same way as in the training and validation data, and the distance overhead is the actual distance flown in an episode divided by the minimum possible distance along a straight line from the starting point to the exit point. To obtain a comparable value for the number of intrusions in the PL20 environment, the 20 s time step was conducted as 4 updates of 5 s each in the simulation layer; thus, an intrusion with the same duration was counted the same number of times in each environment. All the scenarios were also run without taking any action to calculate the baseline number of intrusions in the test scenarios.
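For clarity, the intrusion count and distance overhead metrics can be computed as in the following sketch; the function names are illustrative.

    import math

    SEPARATION_NM = 5.0

    def intrusions_in_step(positions):
        """Number of aircraft pairs closer than 5 NM in a single time step.

        Summing this over all steps gives the episode total, so a pair in intrusion
        for two consecutive steps is counted twice, matching the definition above.
        """
        count = 0
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                (x1, y1), (x2, y2) = positions[i], positions[j]
                if math.hypot(x1 - x2, y1 - y2) < SEPARATION_NM:
                    count += 1
        return count

    def distance_overhead(flown_distance_nm, start, exit_point):
        """Actual distance flown divided by the straight-line distance from start to exit point."""
        direct = math.hypot(exit_point[0] - start[0], exit_point[1] - start[1])
        return flown_distance_nm / direct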
4. Discussion and Conclusions
When using multi-agent reinforcement learning to provide conflict resolutions in centralized air traffic control, it is not only important to solve the conflicts but also to do so with a few decisive actions in order not to overload the ATCO and pilot. This is easier to achieve using a discrete action-space model, where one of the actions is to do nothing. However, these discrete models often need a high-dimensional action space to cover all combinations of actions, which makes them difficult to train. Models using a continuous action space do not suffer from this problem, but the trained models often solve conflicts by several minor adjustments performed in parallel for many agents. This results in a high number of actions, which is unsuitable for centralized air traffic control. Continuous models are therefore often suggested for scenarios with decentralized control, such as in free flight or unmanned airspace, in which each agent is responsible for solving its own conflicts. To the best of our knowledge, the only suggestions for reducing the number of actions in previous work have been to use the reward function to penalize actions and to increase the time between steps, thereby reducing the possible number of actions taken during an episode. These models also often lack an explicit way to signal that no action should be performed other than an action of exactly size zero (or in a small, designated range around zero).
In this paper, we have suggested a new method to adapt continuous action-space MARL models for conflict resolution in centralized air traffic control. In our method, each agent estimates the importance of its action as an additional output of its action space such that, in each step, the environment only executes the action with the highest priority. Furthermore, since actions with negative priorities are never executed, agents can also explicitly take no action. This forces the agents to work in a serial rather than parallel manner when solving conflicts, and they must also learn to provide reasonable priorities at a group level to solve them. This method of reducing the number of actions is also more similar to how ATCOs operate, i.e., monitoring their sector and executing the most important actions before attending to less pressing matters.
Our experimental results show that the agents learn to guide aircraft to their sector exit points while avoiding conflicts, using only a few actions per aircraft and a small distance overhead. Even though a competitive element was added, since only one agent could execute its action, the shared reward was enough to maintain cooperation. The number of unsolved intrusions was kept low, on par with PL5, where the reward function was used to reduce the number of actions but where parallel actions were allowed. However, in our method, the number of actions used was an order of magnitude lower. Increasing the time step, such as in PL20, reduced the number of actions, but compared to our method, the reduction was smaller, and the PL20 model also performed worse, with more intrusions and a larger distance overhead.
We also showed that increasing the traffic density in the training scenarios had a positive effect on the performance, even when used in lower-density test scenarios. This is in line with the work presented by Groot et al. [
35].
Increasing the number of aircraft the agent can observe should allow the agent to make better decisions; however, the test results show a reduction in performance. Using a larger observation size also results in less stability during training, which indicates that the policy had difficulties learning to utilize the increased information about its environment. Even though, in this case, observing only two aircraft results in the highest reward, it would be desirable to increase the performance when using more observations, since one could find examples of situations where only observing two aircraft may lead to sub-optimal or even unsafe actions. A solution to this may be the use of LSTM and attention modules, as suggested by Brittain et al. [
26,
27].
The experiments also showed that using individual termination when storing data in the replay buffer, i.e., where agents are removed from the simulation at separate times, significantly improves the stability and convergence when training a MARL policy.
Future Work
There are still many problems that need to be solved to make multi-agent reinforcement learning a usable tool for conflict resolution in operational air traffic control. One major aspect is increased realism in the simulation itself by extending it to include vertical traffic with realistic performance, as well as realistic delays when executing actions. For example, when initiating a turn, it should take some time for the pilot to react, and the aircraft should execute the turn with a realistic turn rate. It must also be considered that when issuing actions to change speed and flight level, the pilot may reject the instructions due to performance limitations. More realistic trajectories also need to be included. Since aircraft mostly follow their planned route, there may be planned trajectory change points where the aircraft turns without any interaction with air traffic control. It is also important to include more complex actions, such as combined actions and trajectory-based operations, which is currently being worked on in the SESAR 3 Joint Undertaking program. Finally, sector borders should be considered, and weather conditions should be included to introduce realistic uncertainties in the simulation.
Another key area that needs improvement is performance, especially when the number of observed neighbors is increased. Possible solutions to this problem may be to use more advanced architectures for the neural network in the policy, to find other ways to represent the data, or to use algorithms to enhance the data and/or select the most relevant neighbors, rather than simply selecting the neighbors that are closest in distance.