1. Introduction
The main purpose of air traffic control (ATC) is to guide each aircraft safely and efficiently through the airspace towards its destination by following applicable regulations and agreements. Safety has the highest priority, and the air traffic controller (ATCO) must ensure that the air traffic is always kept separated and that aircraft maintain at least the minimum distance according to applicable regulations (commonly 5 nautical miles lateral and 1000 ft vertical in upper airspace). In air traffic control, an intrusion (or loss of minimum separation) occurs when two aircraft are closer than allowed by applicable regulations, and the term conflict is generally used when there is a predicted intrusion, i.e., the aircraft will end up in an intrusion if no intervention is taken by the ATCO. Efficiency is also a key factor in ATC—if there are no conflicts or restrictions affecting a flight, the ATCO commonly allows the aircraft to follow its requested trajectory or even provides shortcuts, when possible, to reduce environmental impact and costs for airlines. When actions are needed, the ATCO sends directives to the aircraft using radio, either by voice or data-link messages (CPDLC). It is desirable to use as few actions as possible, both to maintain a low workload for pilots and ATCOs and because of limitations in communication. Voice communication is slow and limits the transmission rate, and CPDLC, while faster, must still be handled by the pilot and is limited to non-time-critical commands.
Even though there have been periods of decline in traffic, e.g., during economic recessions and pandemics, the long-term trend is that the number of flights in controlled airspace increases, and it is expected to continue to do so in the future [
1]. This results in an increasing demand for the capacity of air traffic control, which, in some areas, is already congested. The capacity of the airspace is limited by the number of sectors it can reasonably be divided into, which need to be sufficiently large to allow the ATCO to solve conflicts and sequence traffic within their sector. One solution to this problem is to increase the level of automation in the air traffic control system by providing the ATCO with better support tools to reduce the workload. This may increase safety and efficiency and reduce staff costs in non-congested areas by allowing ATCOs to provide control in larger sectors. A key process to automate in this context is that of conflict detection and resolution (CD&R), which is an active area of research.
1.1. Related Work
Automated conflict resolution in air traffic control has been an active area of research for several decades, with a large variation of methods proposed. A common approach is to use various algorithms to search for suitable conflict resolutions when a conflict has been detected. Algorithms for finding numerical solutions for lateral conflicts were presented by Bicchi and Pallottino [
2], modeling aircraft as having constant velocity in a Cartesian coordinate system. Pallottino et al. [
3] also presented methods for using mixed-integer programming to solve conflicts by using changes in either velocity or heading. Methods that allow for more realistic aircraft trajectories often separate the search for solutions from the process of validating that the trajectories are conflict-free. The NASA Automated Airspace Concept [
4] uses an ordered set of solutions generated by an expert-like system, which are validated in turn until a valid resolution is found. Stroe and Andrei suggested an algorithm called Protocol-Based Conflict Resolution [
5]. Srivatsa et al. [
6] used lattice-based search space exploration to generate a ranked set of preferred solutions to validate. Other search approaches use ant colony optimization [
7], Monte Carlo methods [
8], and particle swarm optimization [
9] to search for conflict-free trajectories. A more complete review of conflict resolution methods can be found in [
10]. The algorithmic approach to solving conflicts is, in many cases, successful but is often very computationally intensive, and in complex situations, the calculations may not be able to find a solution within a required time.
Another approach for automated conflict solving is the use of Reinforcement Learning (RL), where an agent is trained to solve conflicts. Several different methods and RL algorithms have been suggested; comprehensive and in-depth reviews of the various approaches can be found in [
11,
12]. Related to this paper are methods using Multi-Agent Reinforcement Learning (MARL), in which each agent represents an aircraft and the task of solving conflicts is handled in a cooperative manner. In many studies, MARL has been successfully applied to the CD&R problem using either a discrete or continuous action space.
Methods using a discrete action space have been shown to be able to solve conflicts in several studies using Q-learning [
13,
14,
15,
16,
17]. This adapts well to centralized air traffic control, where a few decisive actions to solve conflicts are preferred, and it also allows the output to have an explicit representation for cases when no action is needed. The drawback is that these methods often have a limited set of actions, which also limits the agent's ability to find an efficient solution to the conflict. Increasing the size of the action space may alleviate this problem but at the cost of making the model more difficult to train. However, methods for handling this problem have been suggested [
18,
19].
Several methods using a continuous action space have also been suggested, e.g., [
20,
21,
22,
23,
24]. This allows each agent to exercise fine-grained control over how its state should be changed and significantly reduces the dimension of the output, e.g., an agent that can control speed, heading, and altitude only needs three outputs to do so. However, this approach often results in all agents making small adjustments in each step, which may not be a problem in decentralized CD&R (e.g., for drones and in free-flight airspace), but in centralized air traffic control, it would be impossible for the ATCO to communicate all actions to the pilots. One approach to reduce this problem is to use longer time steps between actions. This was suggested by Brittain et al. [
25,
26,
27,
28,
29], who used a time step of 12 s to incite the model to take fewer and more decisive actions. In [
30], Zhao and Liu explored the use of 20-, 40-, and 60-s time steps and concluded that “a short time-step will increase the pilot or controller’s workload, and a long time-step will increase the risk due to uncertainties in each time-step”. This method of reducing the number of actions also does not provide explicit means for the model to express that no action is needed other than ignoring values in a specific range, e.g., close to zero.
1.2. Contributions
In this paper, we introduce a novel method for adapting CD&R based on multi-agent reinforcement learning with a continuous action space to centralized air traffic control, an important step towards quicker adoption of automated support tools for conflict resolution. To accomplish this, we propose learning a priority level for each aircraft and formulating the joint action so that, at most, one aircraft can execute instructions in each time step. Furthermore, actions with a priority of less than zero are never executed, allowing the agents to explicitly signal that no action is needed. By formulating the reward function such that it penalizes high priority, we show that the agents can learn to take only a few decisive actions when necessary. We show that this method significantly reduces the number of actions taken while still maintaining a high level of performance in solving conflicts (see
Figure 1).
Additionally, we show how other design choices in the learning setup are of critical importance, such as individually terminating aircraft during training when they exit the sector. In a range of experiments, we evaluate our method and the design choices, demonstrating high performance for automated ATC. As such, this work presents a significant step towards an automated solution that can be used in real-world scenarios.
2. Method and Experimental Setup
Here, we describe our method and simulation setup, with different variations for pinpointing the critical design choices. We also formulate a number of alternative environments, representative of previous work and existing strategies for reducing the number of actions.
2.1. ATC Simulation
For this work, we developed a simplified air traffic control simulator similar to the environments used in [
14,
22,
24]. Although more realistic alternatives exist, e.g., BlueSky [
31], such high fidelity is not needed in our experiments and would have come at the cost of reduced execution speed during training. The simulator uses a single square sector with 120 NM sides for scenario generation; aircraft are allowed to travel outside the sector without any penalty. The airspace is assumed to be a free-flight airspace where the pilots are free to plan their routes without the need to follow any predefined airways. In the simulation, all aircraft are assumed to be on the same flight level (altitude), and the automated ATCO is limited to solving conflicts by turning the aircraft. Turns are executed immediately by adding the size of the turn to the current heading; in real ATC, it would take some time for the pilot to initiate the turn, and the turn rate is limited. When the simulation time is advanced, each aircraft moves linearly along its heading a distance corresponding to its speed. The objective is to guide all aircraft from their starting points to within 5 NM of their sector exit points while maintaining a separation of at least 5 NM. When the exit point has been reached, the aircraft is removed from the simulation. An image showing the simulator can be found in
Figure 2.
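For concreteness, the following Python sketch illustrates the simulator's update step under the simplifications stated above (immediate turns, linear movement, removal within 5 NM of the exit point). The class and method names are illustrative and not the actual simulator interface.

    import math

    KT_TO_NM_PER_S = 1.0 / 3600.0   # NM/h to NM/s

    class Aircraft:
        """Illustrative aircraft state: position in NM, heading in degrees (0 = north), speed in NM/h."""

        def __init__(self, x, y, heading_deg, speed_kt, exit_x, exit_y):
            self.x, self.y = x, y
            self.heading_deg = heading_deg
            self.speed_kt = speed_kt
            self.exit_x, self.exit_y = exit_x, exit_y

        def turn(self, turn_deg):
            # Turns are executed immediately by adding the turn size to the current heading.
            self.heading_deg = (self.heading_deg + turn_deg) % 360.0

        def advance(self, dt_s):
            # Linear movement along the current heading for dt_s seconds.
            dist = self.speed_kt * KT_TO_NM_PER_S * dt_s
            rad = math.radians(self.heading_deg)
            self.x += dist * math.sin(rad)
            self.y += dist * math.cos(rad)

        def has_exited(self, threshold_nm=5.0):
            # The aircraft is removed once it is within 5 NM of its exit point.
            return math.hypot(self.exit_x - self.x, self.exit_y - self.y) <= threshold_nm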
2.1.1. Scenario Generation
At the start of each episode, a new traffic scenario is generated by creating M aircraft with random starting points within the sector, at least 20 NM from the border, and random exit points on the border of the sector. Initially, each aircraft is assigned a random speed (in NM/h) within a fixed range and a heading towards its exit point. The scenario generator ensures that aircraft are separated by at least 10 NM (two times the separation minima) at the start of a scenario.
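As an illustration, a scenario generator consistent with the description above could look as follows. This is a sketch under stated assumptions: the speed range is given only as a placeholder (SPEED_RANGE_KT), and the coordinate and heading conventions are assumptions for illustration.

    import math
    import random

    SECTOR_SIDE_NM = 120.0
    BORDER_MARGIN_NM = 20.0
    MIN_START_SEPARATION_NM = 10.0          # two times the 5 NM separation minimum
    SPEED_RANGE_KT = (400.0, 500.0)         # placeholder; the actual range is defined in the paper

    def random_exit_point():
        # A random point on one of the four sides of the sector border.
        side = random.choice("NSEW")
        t = random.uniform(0.0, SECTOR_SIDE_NM)
        return {"N": (t, SECTOR_SIDE_NM), "S": (t, 0.0),
                "E": (SECTOR_SIDE_NM, t), "W": (0.0, t)}[side]

    def generate_scenario(m):
        """Create m aircraft with separated random start points and border exit points."""
        aircraft = []
        while len(aircraft) < m:
            x = random.uniform(BORDER_MARGIN_NM, SECTOR_SIDE_NM - BORDER_MARGIN_NM)
            y = random.uniform(BORDER_MARGIN_NM, SECTOR_SIDE_NM - BORDER_MARGIN_NM)
            # Ensure at least 10 NM separation to all previously placed aircraft.
            if any(math.hypot(x - a["x"], y - a["y"]) < MIN_START_SEPARATION_NM for a in aircraft):
                continue
            ex, ey = random_exit_point()
            aircraft.append({
                "x": x, "y": y,
                "exit": (ex, ey),
                "speed_kt": random.uniform(*SPEED_RANGE_KT),
                # Initial heading points towards the exit point (0 = north, 90 = east).
                "heading_deg": math.degrees(math.atan2(ex - x, ey - y)) % 360.0,
            })
        return aircraft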
2.1.2. Environments
Gymnasium [
32] environments wrapping the simulator were developed to provide the desired behavior for each of the experiments. The observation space for each environment can be found in
Table 1 and the reward in
Table 2. Common to all environments is that the action space has a parameter representing the size of the turn. This is a normalized value, which is mapped to a turn size in degrees by the simulation environment (a code sketch of this common behavior follows the environment list below). The episode terminates when all aircraft have reached their exit points or after 350 steps.
- SA
The single-action environment is the main focus of this experiment and allows for, at most, one aircraft to execute a turn in each step. To determine which turn should be executed, an additional parameter in the action space called priority is used to select, at most, one action in each step (see
Section 2.1.3). In this environment, each time step advances the simulation time by five seconds, which is a common update rate of radar trackers.
- PA
The parallel action environment is identical to the SA environment, except without the priority mechanism. Thus, each aircraft can execute a turn in each step. This environment is used as a reference to compare the performance with and without the priority mechanism.
- PL5, PL20
The parallel limited action environments allow any of the aircraft to execute a turn in each step. By only executing turns that are at least 1 degree in magnitude, we allow the agents to take no action. To reduce the number of turns used, we apply a fixed cost in the reward function for each executed turn. In the PL5 environment, the time step is 5 s, and in PL20, it is 20 s. These environments are used to compare the priority mechanism with the use of reward penalties and longer time steps to limit the number of actions.
- RE
Finally, the reference environment is used as a baseline to compare our results against the method suggested by Badea et al. [
24] by using their observation space and reward function. However, the action space used here is limited to turns only. In this environment, any turn with a magnitude greater than zero is executed.
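The common behavior shared by the environments (a normalized turn-size action mapped to degrees, a fixed time step, and truncation after 350 steps) can be sketched as a Gymnasium wrapper. This is a minimal sketch, assuming a [-1, 1] action range; MAX_TURN_DEG and the simulator calls are placeholders, not the actual implementation.

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    MAX_TURN_DEG = 45.0   # placeholder; the actual mapping range is defined in the paper
    MAX_STEPS = 350       # episodes are truncated after 350 steps

    class ATCEnvBase(gym.Env):
        """Sketch of the behavior shared by the environments: normalized turn actions and truncation."""

        def __init__(self, simulator, n_aircraft, time_step_s=5.0):
            self.sim = simulator          # illustrative simulator object, not the actual interface
            self.n_aircraft = n_aircraft
            self.time_step_s = time_step_s
            self.steps = 0
            # One normalized turn-size value per aircraft (assumed range [-1, 1]).
            self.action_space = spaces.Box(low=-1.0, high=1.0,
                                           shape=(n_aircraft,), dtype=np.float32)
            # observation_space omitted here; it follows the parameters listed in Table 1.

        def map_turn(self, normalized_turn):
            # Map the normalized action value to a turn size in degrees.
            return float(normalized_turn) * MAX_TURN_DEG

        def step(self, actions):
            for i, a in enumerate(actions):
                self.sim.turn(i, self.map_turn(a))        # illustrative call
            self.sim.advance(self.time_step_s)
            self.steps += 1
            terminated = self.sim.all_aircraft_exited()   # illustrative call
            truncated = self.steps >= MAX_STEPS
            return self.sim.observe(), self.sim.reward(), terminated, truncated, {}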
2.1.3. The Priority Filter
The priority mechanism is used to limit the number of simultaneous commands in the multi-agent environment, as well as to provide an option not to take an action. This is accomplished by having an extra output for each agent, called priority, in a normalized range
(thus, in our example, each agent has two outputs: priority and turn size). As in normal multi-agent reinforcement learning, the list of actions for all agents is sent to the environment in each step; however, in our method, the action list is processed by the priority filter (see
Figure 1). The priority filter removes all actions from the list except the one with the highest priority value. If the priority value of the remaining action is <0, it is also removed. Thus, the list of actions sent from the environment to the simulator contains, at most, one action. The agent must then learn to provide proper priorities during training, as well as learning to turn aircraft to avoid separation losses. To make the agent learn to avoid using unnecessarily high priorities, we also assigned a small negative reward for positive priorities (see Table 2). The magnitude of the priority reward was selected based on the magnitudes used for drift and turn size.
If we denote each agent's action with a tuple $(p, s)$, where $p$ is the priority and $s$ is the size of the turn, the action $(0.3, 0.44)$ would mean that the agent suggests a turn of size 0.44 (approximately 10 degrees) with a priority of 0.3, leaving room for agents with higher priority to execute their actions. Using a simple environment with three agents as an example, if the last action in the list has the highest priority, e.g., 0.9, the filter selects that action to be sent to the simulator for execution. If, on the other hand, the highest priority in the list is negative, e.g., belonging to the middle action, that action is first selected and then removed, since its priority is <0; therefore, no action is sent to the simulator in this case.
The idea behind this mechanism is to emulate how human ATCOs work by prioritizing the most important task first, then moving on to other tasks and explicitly stating that nothing should be done by providing priorities <0. The mechanism could also be used to select any number of actions based on priority. In this case, we selected one action in each step of 5 s, since this is a reasonable rate when using voice over radio to provide commands to the pilot.
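A minimal Python sketch of the priority filter is shown below. The tuple layout and the function name are illustrative; only the selection rule follows the description above: keep the single highest-priority action, and drop it if its priority is below zero.

    def priority_filter(actions):
        """Keep at most one action: the one with the highest priority, and only if that priority is >= 0.

        `actions` holds one (agent_id, priority, turn_size) tuple per remaining agent;
        the returned list contains a single tuple or nothing (no action this step).
        """
        if not actions:
            return []
        best = max(actions, key=lambda a: a[1])      # a[1] is the priority
        return [best] if best[1] >= 0.0 else []

    # Three agents: the last action is executed, since its priority (0.9) is the highest.
    print(priority_filter([(0, 0.3, 0.44), (1, -0.2, 0.10), (2, 0.9, -0.25)]))
    # All priorities negative: the highest one is still < 0, so no action is executed.
    print(priority_filter([(0, -0.4, 0.44), (1, -0.1, 0.10), (2, -0.7, -0.25)]))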
2.2. Model
Our simulation environment can be abstracted as a decentralized partially observable Markov decision process (Dec-POMDP) [
33]. The Dec-POMDP model extends the single-agent POMDP model by considering joint actions and observations. At every step $k$, each agent $i$ takes an action $a^i_k$, leading to one joint action $\mathbf{a}_k$ at every step; similarly, we have a joint partial observation $\mathbf{o}_k$ of the global state $s_k$. The reward function $R(s_k, \mathbf{a}_k)$ maps states and joint actions to a real-valued reward $r_k$ in each step, which is used to specify the goal of the agents. To fully specify an optimality criterion over all the steps, we use the expected cumulative reward $\mathbb{E}\left[\sum_{k} \gamma^{k} r_k\right]$, where $\gamma$ is the discount factor. The problem is then to find a joint policy that maximizes the optimality criterion.
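As a small worked example of this optimality criterion, the following snippet computes the discounted cumulative reward for one episode, assuming the same discount factor as used later in training.

    def discounted_return(rewards, gamma=0.99):
        """Cumulative discounted reward: sum over k of gamma**k * r_k for one episode."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Example with three steps and rewards -1, -1, 10:
    print(discounted_return([-1.0, -1.0, 10.0]))   # -1 - 0.99 + 0.9801 * 10 = 7.811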
In this research, we use multi-agent reinforcement learning to learn the joint policy. The policy is trained using the soft actor critic (SAC) algorithm with automatic tuning of entropy [
34]. SAC is a model-free actor–critic algorithm in the continuous domain based on the maximum entropy framework, which provides improvements in both exploration and robustness. Since SAC is an off-policy algorithm, it also increases the sample efficiency as compared to on-policy algorithms. SAC has been shown to outperform other state-of-the-art algorithms, e.g., DDPG and PPO, in many tasks and has produced good results in similar experiments [
24].
During training, the replay buffer is updated with the tuple $\langle o^i_k, a^i_k, r_k, o^i_{k+1}, t^i_k \rangle$ for each agent $i$ in each step $k$ of the simulation, where $o$ represents the observations, $a$ represents actions, $r$ is the shared reward, and $t$ represents terminal states. The observations for the continuous parameters were normalized so that the average was 0 and the standard deviation was 1 for these parameters; flags (boolean parameters) were kept as either 0 or 1. Samples from the replay buffer are then used to train the networks. The training algorithm is described in detail in Algorithm 1.
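The per-agent storage and observation normalization described above can be sketched as follows; the function names, the flag-mask representation, and the list-like buffer are assumptions for illustration.

    import numpy as np

    def normalize(obs, mean, std, flag_mask):
        """Standardize continuous features to zero mean and unit variance; keep boolean flags as 0/1."""
        normed = (obs - mean) / std
        return np.where(flag_mask, obs, normed)

    def store_transitions(buffer, obs, actions, reward, next_obs, terminals, remaining_agents):
        """Add one (o, a, r, o', t) tuple per remaining agent; the reward is shared by all agents."""
        for i in remaining_agents:
            buffer.append((obs[i], actions[i], reward, next_obs[i], terminals[i]))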
In MARL, the terminal state ($t^i_k$) is set to true for all agents ($i$) when the success criteria are met or when the episode is truncated due to reaching the maximum number of time steps allowed, i.e., a global termination. In the previous work that is most similar to ours [
24], global termination was used: only the agents still remaining in the environment when the episode reaches its terminal state are flagged as terminated, even though aircraft are removed earlier when reaching their exit points. A common setup in MARL is to keep all agents active until a global termination criterion is met; a key difference in our simulation, however, is that each agent is independently removed when it reaches its exit point and therefore no longer affects the outcome of the rest of the episode. We therefore suggest that the terminal flag be set on an individual basis when an aircraft reaches its exit point and is removed from the simulation, since the policy is trained on samples from the replay buffer that represent individual aircraft.
Algorithm 1 Training of the policy

    Initialize replay buffer D
    Initialize policy π
    for e = 1, …, E do                          ▹ E is the number of episodes
        Reset the simulation with seed from the seed sequence
        k ← 0                                   ▹ k is the time step
        o_k ← normalized observations
        d ← false                               ▹ Episode is terminated
        while not d do
            for all i ∈ A_k do                  ▹ A_k = remaining AC
                a_k^i ← sample the policy π using o_k^i
            end for
            Update the simulation with the actions a_k
            o_{k+1} ← normalized observations
            r_k ← reward from the simulation
            d ← true if episode is done or truncated
            for all i ∈ A_k do
                if using global termination then
                    t_k^i ← d
                else if agent i has exited then
                    t_k^i ← true
                else
                    t_k^i ← false
                end if
            end for
            for all i ∈ A_k do
                Add tuple (o_k^i, a_k^i, r_k, o_{k+1}^i, t_k^i) to D
            end for
            Sample mini-batch B from D
            Train policy π using the SAC algorithm using B
            k ← k + 1
        end while
    end for

Here, D denotes the replay buffer, π the policy, A_k the set of remaining aircraft (AC) at step k, and t_k^i the per-agent terminal flag.
2.3. Training, Validation, and Tests
To evaluate our approach, we ran several experiments and trained models using different environments and different scenario sizes (M) with 10, 15, or 20 initial aircraft. This represents a high to very high traffic load for a sector with all the aircraft at the same altitude. The number of observed closest aircraft (N) was varied to be either 2, 4, 8, or 12. With more observed neighbors, it should be possible to make better decisions when solving conflicts, although this also increases the complexity of the input data and may, therefore, be more difficult to train. We trained policies for each combination of termination method, observation size, and scenario size where the scenario size was greater than the observation size, giving, in total, 22 combinations for the SA environment. For the reference environments, we trained one model each of PA, PL5, PL20, and RE using an observation size of 4 and a scenario size of 15 and used individual termination. During training, we recorded the reward, the number of time steps, and the number of intrusions in each episode. Here, the number of intrusions for each time step is the number of aircraft pairs that have a distance of less than 5 NM, and the total number of intrusions in an episode is the sum of all the intrusions in each step. Thus, a pair of aircraft that has a separation loss that lasts for two time steps is counted twice, and an aircraft that has a separation loss with two different aircraft at the same time is also counted as two intrusions. Every 10th episode, a predefined set of 20 randomly generated validation scenarios was run, and the average of the same parameters was recorded. For each trained policy, the 10 models with the best validation results were saved. All models were trained for 30,000 epochs, except for PL20, where we used 120,000 to obtain an equivalent amount of training data.
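As a quick sanity check of the experimental grid, the snippet below enumerates the combinations of termination method, observation size N, and scenario size M with M > N, confirming the 22 SA policies mentioned above.

    from itertools import product

    terminations = ["individual", "global"]
    observation_sizes = [2, 4, 8, 12]      # N: number of observed closest aircraft
    scenario_sizes = [10, 15, 20]          # M: number of initial aircraft

    combos = [(t, n, m) for t, n, m in product(terminations, observation_sizes, scenario_sizes)
              if m > n]
    print(len(combos))   # 22 trained SA policies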
The network used in all environments was a fully connected network with three hidden layers of size 256 and an input and output that correspond to the dimensions of the state and action space, respectively. The ReLU activation function was used in all layers. The discount factor ($\gamma$) was set to 0.99, and the soft update coefficient ($\tau$) was set to 0.005. The replay buffer size was set to 1,000,000, with the first 100 steps randomly initialized. During training, the policy was updated every second step using a batch size of 256 and the AdamW optimizer with a learning rate of 0.0005.
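A minimal PyTorch sketch of the network described above is given below. It shows only the fully connected backbone and the optimizer settings from the text; the SAC-specific actor and critic heads are omitted, and the input dimension used here is purely illustrative (the real dimension follows from Table 1).

    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=256, n_hidden=3):
        """Fully connected network with three hidden layers of size 256 and ReLU activations."""
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        return nn.Sequential(*layers)

    # Example: an SA actor backbone with two outputs (turn size, priority).
    # in_dim=26 is illustrative only; the real observation dimension follows from Table 1.
    actor_backbone = mlp(in_dim=26, out_dim=2)
    optimizer = torch.optim.AdamW(actor_backbone.parameters(), lr=5e-4)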
The trained models were tested using scenarios with 10, 15, and 20 aircraft, using 1000 predefined randomly generated scenarios of each size (different from the training and validation scenarios). All models were tested on the different scenario sizes, regardless of the training scenario size, except models with an observation size of 12, where only scenarios with sizes of 15 and 20 were used. For each scenario, the number of intrusions, distance overhead, and actions were recorded. The number of intrusions is defined in the same way as in the training and validation data, and the distance overhead is the actual distance flown in an episode divided by the minimum possible distance along a straight line from the starting point to the exit point. To obtain a comparable value for the number of intrusions in the PL20 environment, the 20 s time step was conducted as 4 updates of 5 s each in the simulation layer; thus, an intrusion with the same duration was counted the same number of times in each environment. All the scenarios were also run without taking any action to calculate the baseline number of intrusions in the test scenarios.
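For clarity, the intrusion count and distance overhead metrics can be computed as in the following sketch; the function names are illustrative.

    import math

    SEPARATION_NM = 5.0

    def intrusions_in_step(positions):
        """Number of aircraft pairs closer than 5 NM in a single time step.

        Summing this over all steps gives the episode total, so a pair in intrusion
        for two consecutive steps is counted twice, matching the definition above.
        """
        count = 0
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                (x1, y1), (x2, y2) = positions[i], positions[j]
                if math.hypot(x1 - x2, y1 - y2) < SEPARATION_NM:
                    count += 1
        return count

    def distance_overhead(flown_distance_nm, start, exit_point):
        """Actual distance flown divided by the straight-line distance from start to exit point."""
        direct = math.hypot(exit_point[0] - start[0], exit_point[1] - start[1])
        return flown_distance_nm / direct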
4. Discussion and Conclusions
When using multi-agent reinforcement learning to provide conflict resolutions in centralized air traffic control, it is not only important to solve the conflicts but also to do so with a few decisive actions in order not to overload the ATCO and pilot. This is easier to achieve using a discrete action-space model, where one of the actions is to do nothing. However, these discrete models often need a high-dimensional action space to cover all combinations of actions, which makes them difficult to train. Models using a continuous action space do not suffer from this problem, but the trained models often solve conflicts by several minor adjustments performed in parallel for many agents. This results in a high number of actions, which is unsuitable for centralized air traffic control. Continuous models are therefore often suggested for scenarios with decentralized control, such as in free flight or unmanned airspace, in which each agent is responsible for solving its own conflicts. To the best of our knowledge, the only suggestions for reducing the number of actions in previous work have been to use the reward function to penalize actions and to increase the time between steps, thereby reducing the possible number of actions taken during an episode. These models also often lack an explicit way to signal that no action should be performed other than an action of exactly size zero (or in a small, designated range around zero).
In this paper, we have suggested a new method to adapt continuous action-space MARL models for conflict resolution in centralized air traffic control. In our method, each agent estimates the importance of its action as an additional output of its action space such that, in each step, the environment only executes the action with the highest priority. Furthermore, since actions with negative priorities are never executed, agents can also explicitly take no action. This forces the agents to work in a serial rather than parallel manner when solving conflicts, and they must also learn to provide reasonable priorities at a group level to solve them. This method of reducing the number of actions is also more similar to how ATCOs operate, i.e., monitoring their sector and executing the most important actions before attending to less pressing matters.
Our experimental results show that the agents learn to guide aircraft to their sector exit points while avoiding conflicts, using only a few actions per aircraft and a small distance overhead. Even though a competitive element was added, since only one agent could execute its action, the shared reward was enough to maintain cooperation. The number of unsolved intrusions was kept low, on par with PL5, where the reward function was used to reduce the number of actions but where parallel actions were allowed. However, in our method, the number of actions used was an order of magnitude lower. Increasing the time step, such as in PL20, reduced the number of actions, but compared to our method, the reduction was smaller, and the PL20 model also performed worse, with more intrusions and a larger distance overhead.
We also showed that increasing the traffic density in the training scenarios had a positive effect on the performance, even when used in lower-density test scenarios. This is in line with the work presented by Groot et al. [
35].
Increasing the number of aircraft the agent can observe should allow the agent to make better decisions; however, the test results show a reduction in performance. Using a larger observation size also results in less stability during training, which indicates that the policy had difficulties learning to utilize the increased information about its environment. Even though, in this case, observing only two aircraft results in the highest reward, it would be desirable to increase the performance when using more observations, since one could find examples of situations where only observing two aircraft may lead to sub-optimal or even unsafe actions. A solution to this may be the use of LSTM and attention modules, as suggested by Brittain et al. [
26,
27].
The experiments also showed that using individual termination when storing data in the replay buffer, i.e., where agents are removed from the simulation at separate times, significantly improves the stability and convergence when training a MARL policy.
Future Work
There are still many problems that need to be solved to make multi-agent reinforcement learning a usable tool for conflict resolution in operational air traffic control. One major aspect is increased realism in the simulation itself by extending it to include vertical traffic with realistic performance, as well as realistic delays when executing actions. For example, when initiating a turn, it should take some time for the pilot to react, and the aircraft should execute the turn with a realistic turn rate. It must also be considered that when issuing actions to change speed and flight level, the pilot may reject the instructions due to performance limitations. More realistic trajectories also need to be included. Since aircraft mostly follow their planned route, there may be planned trajectory change points where the aircraft turns without any interaction with air traffic control. It is also important to include more complex actions, such as combined actions and trajectory-based operations, which is currently being worked on in the SESAR 3 Joint Undertaking program. Finally, sector borders should be considered, and weather conditions should be included to introduce realistic uncertainties in the simulation.
Another key area that needs improvement is performance, especially when the number of observed neighbors is increased. Possible solutions to this problem may be to use more advanced architectures for the neural network in the policy, to find other ways to represent the data, or to use algorithms to enhance the data and/or select the most relevant neighbors, rather than simply selecting the neighbors that are closest in distance.