1. Introduction
In multi-agent intelligence, applying only individual-agent learning methods commonly leads to failed training, such as non-convergence or low training speed [1]. Multi-agent learning methods have developed vigorously in recent years, and among them the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method is widely studied as a basic algorithm because of its excellent ability to produce effective actions in multi-agent problems [2]. For example, the MAGNet method [3] improves on it by refining the attention mechanism; the Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG) method [4] coordinates local and global rewards to speed up convergence; the MiniMax Multi-Agent Deep Deterministic Policy Gradient (M3DDPG) method [5] promotes multi-agent adaptation to the environment; and the CAA-MADDPG method [6] adds an attention mechanism. All of these methods inherit cooperative learning strategies and show satisfactory convergence, but they cannot achieve effective prediction and tracking when facing unfamiliar environments.
Some scholars have enhanced the predictive ability of their learning methods by introducing well-designed prediction frameworks. For example, Chien, JT and Hung, PY [7] combined a prediction network and an auxiliary replay memory in a Deep Q-Network (DQN) to estimate multiple target states under different actions, thereby supporting advanced motion prediction. Zhu, PM and Dai, W [8] proposed the PER-MADDPG method, which uses information from multiple robots during learning to better predict the robots' possible actions and thus improve learning efficiency. A deep neural network for evader pursuit that coordinates instant prediction and tracking was designed in [9]. Ure, NK and Yavas, MU [10] designed a novel learning framework for the prediction model used in an adaptive cruise control system, which enhances the situational awareness of the system by predicting the actions of surrounding drivers. In [11], a multi-agent observation and prediction system was established that enables the prediction and analysis of hurricane trajectories. Weng, XS [12] adopted the feature interaction technique from Graph Neural Networks (GNNs) to capture how agents interact with one another. These studies show that, for most multi-agent learning approaches based on a cooperative learning strategy, adding prediction network modules is a feasible way to reinforce predictive ability and improve learning efficiency. However, the expected self-adaptation in unfamiliar environments has not yet been achieved.
Aiming at that gap, Wei, XL [13] improved the RDPG algorithm and, by combining the MADDPG and RDPG methods, increased the action detection efficiency of a UAV operating together with radar stations in an unfamiliar battlefield environment. In [14], a new intrinsic motivation mechanism, the Group Intrinsic Curiosity Module (GICM), was introduced into the MADDPG method; the GICM encourages each agent to reach innovative and extraordinary states for a more comprehensive collaborative search of the current scenario, thereby improving adaptability to unfamiliar scenes. In [15], a parametric adaptive controller based on reinforcement learning was designed to improve the adaptability of a submarine in the deep sea. Raziei, Z [16] developed and tested a Hyper-Actor Soft Actor-Critic (HASAC) deep reinforcement learning framework to enhance the adaptability of an agent facing new tasks. Furuta, Y [17] designed an autonomous learning framework to ensure that a robot can adapt to different home environments. These studies focus on improving adaptability and do not give equal priority to prediction and tracking ability, so both capabilities cannot be demonstrated at the same time. A comparison of related research with this paper is shown in Table 1.
In summary, most multi-agent learning approaches are based on cooperative learning strategies rather than non-cooperative equilibrium, and most of them lack motion prediction and tracking functions. To achieve prediction and tracking of the targets' action intentions and, at the same time, self-adaptation to unfamiliar environments, this paper designs a performance discrimination module and a random mutation module and applies a non-cooperative strategy, aiming to enhance the adaptability of the agents in unfamiliar environments.
This approach is named the Multi-Agent Motion Prediction and Tracking method based on Non-Cooperative Equilibrium, abbreviated as MPT-NCE in the rest of this paper. Using the MPT-NCE, agents can quickly identify unfamiliar environment information based on previously acquired knowledge, figure out the action intentions of their counterparts, and then track them in a game antagonism environment. Compared with the MADDPG learning method, the proposed MPT-NCE has a faster convergence speed and better performance. The remaining sections are arranged as follows:
Section 2 reviews the related work and briefly introduces the MADDPG multi-agent intelligence learning method.
Section 3 presents the details of the MPT-NCE, in which the random mutation module and performance discrimination module are designed to realize the motion prediction and tracking of the agents.
Section 4 evaluates the effectiveness of our method through two different experiments. The discussion and summary are presented in
Section 5.
4. Experiments and Results
4.1. Experimental Environment
In order to verify the effectiveness of the MPT-NCE, two experiments were designed and conducted in this section. The two experiments differ in the number of agents and the complexity of the environments. For illustration, blue spheres represent the detection agents, red spheres represent the interference agents, and purple spheres represent the mutable interference agents. A mutated agent is shown as a purple sphere of expanded size.
Equation (11) is the reward formula for the detection agents; its terms include the number of interference agents in the experiment, the set of all interference agents, and the horizontal and vertical coordinates of the agent's location. Equation (12) is the reward formula for the interference agents; its terms include the number of detection agents in the experiment and the set of all detection agents. The reward of the detection agents increases as the distance between the detection agents and the interference agents becomes smaller, and decreases as that distance becomes larger. A collision between a detection agent and an interference agent indicates an effective detection. The interference agents are rewarded for keeping their distance from the detection agents, and when a purple interference agent is successfully detected it undergoes a random mutation, shown schematically by the purple sphere growing larger. Stability and convergence are the experimental evaluation indices. In this article, a reward curve is considered steady when its final fluctuations do not exceed 10% of the overall value. The number of episodes needed to reach a stable reward reflects convergence: the fewer episodes required, the better the convergence.
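For illustration, a minimal sketch of this distance-based reward shaping is given below; the function names, the use of the Euclidean distance to the nearest opponent, and the scaling are simplifying assumptions rather than the exact form of Equations (11) and (12).

```python
import numpy as np

def detection_reward(det_pos, intf_positions):
    """Illustrative detection-agent reward: larger (less negative) as the
    agent closes on the nearest interference agent."""
    dists = [np.linalg.norm(np.asarray(det_pos) - np.asarray(p))
             for p in intf_positions]
    return -min(dists)

def interference_reward(intf_pos, det_positions):
    """Illustrative interference-agent reward: larger as the agent keeps
    more distance from the closest detection agent."""
    dists = [np.linalg.norm(np.asarray(intf_pos) - np.asarray(p))
             for p in det_positions]
    return min(dists)
```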
The performance discrimination module guides the critic network update through the gradient descent method. Using the agent's own Q value in the loss function enhances the prediction ability of the MPT-NCE method, so that the reward curve converges better. The random mutation module guides the policy network update through the gradient ascent method, with the aim of obtaining a larger Q value, where the Q value is an assessment of the agent's state and action and is proportional to the increase in the agent's current reward.
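For reference, the underlying critic and policy updates that the two modules guide can be summarized by the generic actor-critic sketch below (in PyTorch); the network interfaces and hyperparameters are illustrative assumptions, and the sketch omits the module-specific logic.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, target_critic, target_actor, batch, critic_opt, gamma=0.95):
    """Generic temporal-difference update: gradient descent on the TD loss,
    which the performance discrimination module guides."""
    obs, act, rew, next_obs = batch
    with torch.no_grad():
        next_act = target_actor(next_obs)
        td_target = rew + gamma * target_critic(next_obs, next_act)
    td_loss = F.mse_loss(critic(obs, act), td_target)
    critic_opt.zero_grad()
    td_loss.backward()
    critic_opt.step()

def update_actor(actor, critic, obs, actor_opt):
    """Generic policy update: gradient ascent on the Q value (implemented as
    descent on its negative), which the random mutation module guides."""
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```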
In summary, a fast convergence speed and high reward values reflect the effectiveness of the MPT-NCE method. In the prediction experiments, the convergence speed of the reward curves reflects the prediction ability of the MPT-NCE method; a curve reaching stability indicates that prediction of the opposing agents' behavior has been achieved. In the tracking experiments, the reward values of the reward curves reflect the tracking ability of the MPT-NCE method; higher reward values indicate a smaller distance between the blue detection agents and the interference source.
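One possible way to operationalize the 10% steadiness criterion and the episode-to-stability measure is sketched below; the moving-average window is an assumed detail, not a parameter taken from the paper.

```python
import numpy as np

def episode_of_stability(rewards, tol=0.10, window=100):
    """Return the first index (in the smoothed curve) after which the reward
    stays within +/- tol of its final value, per the 10% criterion above.
    The moving-average window is an assumption used to damp episode noise."""
    rewards = np.asarray(rewards, dtype=float)
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    final_value = smoothed[-1]
    band = tol * abs(final_value)
    inside = np.abs(smoothed - final_value) <= band
    for i in range(len(inside)):
        if inside[i:].all():          # curve never leaves the band again
            return i
    return len(smoothed) - 1
```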
The experimental system runs Ubuntu 18.04 with an Intel(R) i7-10875H processor, and the experimental code is based on the PARL framework. The other parameters are set as listed in
Table 3.
4.2. Prediction Experiment
The prediction experiment contains 4 blue detection agents, 4 red interference agents, 2 gray hidden areas, and 2 black obstacle areas. All agents are physical entities and bounce off one another after a collision. Once an agent enters a hidden area, agents outside that area cannot obtain its position. The black obstacle areas are impassable; no agent can pass through them.
The process of the prediction experiment is shown in
Figure 4. To test the prediction ability of the blue detection agents, the active area of the red interference agents is limited to the red dashed box. As shown in
Figure 4a, the red interference agents are the targets of the blue detection agents, whose task is to detect them. At the early stage of training, the blue detection agents have no predictive capability and all appear outside the dashed box. As shown in
Figure 4b, as training progresses, some of the blue detection agents soon move into the red dashed box. This indicates that the blue detection agents do not act on a unified decision, which reflects the non-cooperative decision making of the MPT-NCE method. As shown in
Figure 4c, when the experimental scenario was re-run after training, all the blue detection agents had moved into the red dashed box, indicating that they are all able to predict the action pattern of the red interference agents. The results of the prediction experiment show that the trained blue detection agents can predict the behavior pattern of the red interference agents: the temporal difference function of the performance discrimination module guides the update of the critic network so that the blue detection agents acquire the prediction function. To assess the reliability of the prediction results, this paper adopts a binomial test of the prediction rate: 100 scenes were randomly selected for observation, with 1 indicating a successful prediction by an individual detection agent and 0 indicating a failed prediction. Success was observed in 92 of the 100 scenes, indicating that the prediction rate of the MPT-NCE method exceeds 90%.
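The prediction-rate check can be reproduced in a few lines; treating the 92 successes out of 100 scenes with a binomial test against a nominal 90% rate is one plausible reading of the procedure described above, not necessarily the paper's exact statistical setup.

```python
from scipy.stats import binomtest

successes, n_scenes = 92, 100            # 1 = successful prediction, 0 = failure
result = binomtest(successes, n_scenes, p=0.9)
low, high = result.proportion_ci(confidence_level=0.95)
print(f"observed rate = {successes / n_scenes:.2f}, "
      f"95% CI = [{low:.2f}, {high:.2f}], p-value vs 90% = {result.pvalue:.3f}")
```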
Figure 5 shows the rewards in the prediction experiment. In
Figure 5a, the multi-agent reward increases significantly during training, indicating that the random mutation module guides the Q value to increase, which demonstrates the effectiveness of the MPT-NCE method.
The reward curve then becomes steady, indicating that the policy network works as expected, prediction of the counterparts' behavior in an unfamiliar environment is realized, and non-cooperative equilibrium has been reached. Compared with the MADDPG method, the MPT-NCE method has a faster convergence speed and a higher reward value, indicating much better prediction ability. In
Figure 5b,c, during the training of the blue and red agents, each agent evolves independently under the non-cooperative distribution strategy. The reward curves of both sides finally reach the equilibrium steady state, indicating that the performance discrimination module improves the convergence speed of the method and confirming the effectiveness of the MPT-NCE method.
To further verify the prediction effect of the proposed method, the training times and the reward values are evaluated and compared with those of the MADDPG method. Figure 6 illustrates the comparison in detail; all the results presented in Figure 6 are obtained with the accelerations of the blue detection agents and the red interference agents both set to 4.
As shown in
Figure 6a, the training times of the MPT-NCE method are compared with those of the MADDPG method. The MPT-NCE method clearly reaches stability in a shorter time, and its convergence rate is 23.52% higher than that of the MADDPG method. As shown in
Figure 6b, the comparison of the reward function values shows that the reward of the MPT-NCE method is higher, an improvement of 16.89% over the MADDPG method.
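The reported percentages can be read as relative changes in the episodes needed to reach stability and in the stable reward value; a minimal sketch of that computation is given below under this assumed reading.

```python
def convergence_improvement(episodes_baseline, episodes_new):
    """Relative reduction in episodes needed to reach a stable reward (%)."""
    return (episodes_baseline - episodes_new) / episodes_baseline * 100.0

def reward_improvement(reward_baseline, reward_new):
    """Relative gain in the stable reward value (%)."""
    return (reward_new - reward_baseline) / abs(reward_baseline) * 100.0
```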
4.3. Tracking Experiment
The tracking experiment contains 4 blue detection agents, 1 red interference agent, 2 purple mutable interference agents, 2 gray hidden areas, and 2 black obstacle areas. The purple mutable interference agents mutate when they are detected by the detection agents.
Tracking experiment 1 is illustrated by
Figure 7. After training, all agents are clear about their targets and complete their tasks. In
Figure 7a, the targets of the blue detection agents are the red and purple interference agents, and their task is to detect these targets. The targets of the red and purple interference agents are the blue detection agents, and their task is to increase their distance from the blue detection agents. As shown in
Figure 7b, each agent acts according to its task target. As shown in
Figure 7c, the blue detection agents are able to track the red and purple interference agents, while the red and purple interference agents try to increase the distance from the blue detection agents.
Tracking experiment 2 is illustrated by
Figure 8; when a purple interference agent is successfully detected, its volume increases significantly. In
Figure 8a, it can be seen that the targets of the blue detection agents are the red and purple mutable interference agents. As shown in
Figure 8b, each agent acts according to its task target. As shown in
Figure 8c, it can be seen that the blue detection agents can track the mutated purple interference agents. The tracking experiment results show that the trained blue detection agents can effectively track the mutated purple interference agents. The random mutation module influences the output of the policy network on the basis of predictive learning and realizes tracking of the opposing agents' behavior in an unfamiliar environment, which reflects the self-adaptation of the MPT-NCE method.
In the tracking experiments, the reward value reflects the tracking ability of the MPT-NCE method. The closer the distance between the blue detection agent and the interference source, the higher the reward value.
Figure 9 shows the curve of the reward in the tracking experiment. In
Figure 9a, the multi-agent reward of the MPT-NCE method rises significantly during training, with a faster convergence rate and a higher reward value than the MADDPG method, which demonstrates the effectiveness of the MPT-NCE method. In
Figure 9b,c, when the blue detection agents and the red interference agents are trained, each agent evolves independently and the non-cooperative distribution strategy is implemented. The reward curves of both sides reach the equilibrium steady state, indicating that the experimental results reach the non-cooperative equilibrium.
To further verify the tracking effect of the MPT-NCE method compared with the MADDPG method, the training times and rewards are checked, using the approach introduced previously in
Section 4.2. The accelerations of all the agents, including the blue detection agents and the red and purple interference agents, were set to be the same.
As shown in
Figure 10a, the training times of the MPT-NCE method and the MADDPG method are compared. The MPT-NCE method clearly reaches stability in a shorter time, and its convergence rate is 11.76% higher than that of the MADDPG method.
Figure 10b compares the reward function values of the MPT-NCE method and the MADDPG method. The reward value of the MPT-NCE method is higher, and its tracking performance is improved by 25.85% compared with the MADDPG method.
4.4. Discussion of Experimental Results
The results of the prediction and tracking experiments are summarized in
Table 4 and
Table 5.
Table 4 shows the comparison of the stable reward values under different experimental environments. The rewards after training by the MPT-NCE method are all significantly improved compared to the MADDPG method. In the prediction experiments, the overall, detection and interference reward curves after training with the MPT-NCE method are improved by 16.89%, 8.85%, and 21.77%, respectively, compared with the MADDPG method. In addition, in the tracking experiment, the overall, detection and interference reward curves of the MPT-NCE method after training are improved by 25.86%, 13.46%, and 17.47%, respectively, compared with those of the MADDPG method.
Table 5 shows the comparison of the stable training times under different experimental environments. The training speed of the MPT-NCE method is significantly improved compared to the MADDPG method. In the prediction experiments, the convergence speed of the overall, detection, and interference curves of the MPT-NCE method after training are improved by 23.52%, 31.25%, and 20%, respectively, compared with the MADDPG method. In addition, in the tracking experiment, the convergence speed of the overall, detection, and interference curves of the MPT-NCE method after training are improved by 11.76%, 17.64%, and 18.75%, respectively, compared with those of the MADDPG method.
In summary, in the prediction experiment, the convergence speed of the curves reflects the prediction ability of the method, and the prediction ability of the MPT-NCE method is 23.52% higher than that of the MADDPG method. In the tracking experiment, the reward after the curves stabilize reflects the tracking performance of the method, and the tracking ability of the MPT-NCE method is 25.85% higher than that of the MADDPG method. Compared with the MADDPG method, the MPT-NCE method proposed in this paper has better prediction and tracking performance, so it can be applied to training scenarios for multi-agent prediction and tracking.
5. Conclusions
In this paper, a Multi-Agent Motion Prediction and Tracking method based on non-cooperative equilibrium (MPT-NCE) is proposed. Taking the MADDPG method as its foundation, a novel performance discrimination module and a random mutation module are designed to enable quick identification of, and strong adaptability to, unfamiliar environments, and to realize motion prediction and tracking of the counterparts in a game confrontation environment. The performance discrimination module uses the temporal difference function to guide the update of the critic network and improve the prediction ability of the method. The random mutation module guides the update of the policy network on the basis of predictive learning and realizes tracking of the opposing agents' behavior intentions. The proposed MPT-NCE method can be applied to various multi-target pursuit scenarios and has high potential for solving problems such as target prediction and task pursuit in other systems. Future research will focus on prediction and tracking problems in dynamic environments and on integrating the MPT-NCE with other systems.
In order to verify the multi-agent prediction and tracking ability of the proposed method, two groups of multi-agent experiments were conducted. The results of the prediction experiment show that, compared with the MADDPG method, the prediction ability of the MPT-NCE method is improved by 23.52%, its performance is improved by 16.89%, and its prediction rate exceeds 90%. The results of the tracking experiment show that the MPT-NCE method improves the convergence rate by 11.76% and the tracking ability by 25.85% compared with the MADDPG method. The experimental results effectively demonstrate that the MPT-NCE method has superior environmental adaptability and prediction and tracking ability.