An Effective Training Method for Counterfactual Multi-Agent Policy Network Based on Differential Evolution Algorithm

Qu, Shaochun; Guo, Ruiqi; Cao, Zijian; Liu, Jiawei; Su, Baolong; Liu, Minghao

doi:10.3390/app14188383


by Shaochun Qu ¹, Ruiqi Guo ¹, Zijian Cao ¹,*, Jiawei Liu ¹, Baolong Su ¹ and Minghao Liu ²

¹ School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China

² Jiangsu Automation Research Institute, Lianyungang 222061, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(18), 8383; https://doi.org/10.3390/app14188383

Submission received: 17 July 2024 / Revised: 26 August 2024 / Accepted: 12 September 2024 / Published: 18 September 2024


Abstract

Due to the advantages of a centralized critic that estimates the Q-function value and decentralized actors that optimize the agents’ policies, counterfactual multi-agent policy gradients (COMA) stands out among multi-agent reinforcement learning (MARL) algorithms. The sharing of policy parameters can improve sampling efficiency and learning effectiveness, but it may lead to a lack of policy diversity. Hence, balancing parameter sharing and diversity among agents in COMA has been a persistent research topic. In this paper, an effective training method for the COMA policy network based on the differential evolution (DE) algorithm is proposed, named DE-COMA. DE-COMA treats the individuals of a population as computational units and constructs the policy network through mutation, crossover, and selection operations. The average return of DE-COMA is used as the fitness function, and the best individual, that is, the best policy network, is chosen for the next generation. By preserving parameter sharing while enhancing parameter diversity, the multi-agent strategies become more exploratory. To validate the effectiveness of DE-COMA, experiments were conducted in the StarCraft II environment with the 2s_vs_1sc, 2s3z, 3m, and 8m battle scenarios. Experimental results demonstrate that DE-COMA significantly outperforms the traditional COMA and most other multi-agent reinforcement learning algorithms in terms of win rate and convergence speed.

Keywords:

multi-agent; reinforcement learning; parameter sharing; differential evolution

1. Introduction

Many complex reinforcement learning (RL) problems, such as drone control, distributed computing resource management, and video games, are naturally modeled as cooperative multi-agent systems [1,2,3]. RL is a primary approach to solving sequential decision-making problems, which requires agents to continuously interact and experiment with the environment. As agents interact with the environment through actions, the actions of agents are evaluated in the environment by providing immediate rewards [4]. Positive rewards increase the probability of taking the corresponding action, while negative rewards decrease this probability. Additionally, the actions of agents can modify the environment, and it requires repeated interactions to find the optimal strategy by maximizing the expected cumulative reward. In a single-agent task, RL only needs to consider the actions and states of a single agent. In contrast, multi-agent reinforcement learning (MARL) needs to consider a much larger action and state space. The reward of each agent is not only related to the environment but also related to the actions of other agents, therefore the multi-agent learning tasks become more complex [5].

To address the complexity of multi-agent systems, scholars have conducted extensive research on training and execution architectures. In the training architecture of multi-agent reinforcement learning, there are mainly two approaches. One is centralized training decentralized execution (CTDE) and the other is decentralized training decentralized execution (DTDE) [6]. As shown in Figure 1, CTDE architecture involves all agents sharing a common set of network parameters during the training process. The advantage of CTDE is that it can learn global information well, such as the strategies and states of other agents, and it can represent collaborative performance more effectively. However, during the execution phase, each agent operates independently by utilizing its own strategy without the need to share parameters to achieve decentralized execution. This method partially resolves the consistency issue between training and execution while achieving high execution speed and efficiency. Consequently, CTDE architecture has become the most commonly cited structure for MARL [7].

COMA [8] is a MARL algorithm based on an actor-critic (AC) framework [9]. Specifically, COMA introduces improvements and innovations in several aspects. Firstly, COMA utilizes a centralized critic and decentralized actors. The former is only used during the learning process, and the latter are only needed during execution. During the learning phase, COMA can leverage a critic that depends on joint actions and all available state information, while the policy of each agent depends only on its own action-observation history. This centralized learning mode can better utilize global information to optimize collaboration effects. Secondly, COMA employs a counterfactual baseline [10]. By using the centralized critic to calculate agent-specific advantage functions, it compares the expected return of the current joint action with a counterfactual baseline that marginalizes out the action of one agent while keeping the actions of the other agents fixed. A separate baseline is thus computed for each agent, relying on the centralized critic to infer counterfactuals in which only the action of that agent is changed. In this way, COMA can estimate more accurately the contribution of each agent’s policy to the global return. Thirdly, a critic representation is used to enable efficient computation of counterfactual baselines. In a single forward pass, Q-values are computed for all the different actions of a specific agent, conditioned on the actions of the other agents. These improvements make COMA perform exceptionally well in solving cooperative tasks in multi-agent scenarios.

COMA shares a set of policy network parameters, as depicted in Figure 2, which improves sampling efficiency and learning performance to some extent. However, this parameter sharing gives rise to a series of issues, which includes a lack of diversity and suboptimal convergence in the policy network parameters. Although an intuitive solution might be to allow each agent to possess different policy network parameters, as illustrated in Figure 3, this approach introduces its own set of problems. On one hand, in CTDE architecture, agents share a common set of network parameters to better learn global information, including the strategies and states of other agents, and it may lead to improved optimization of collaborative performance. The multiple sets of policy network parameters may complicate the global information and potentially affect the agents’ collaborative objectives. Therefore, parameter sharing demonstrates a clear advantage in the transmission of global information. On the other hand, in a multi-agent environment, the policies of agents are often highly correlated. If each agent has its own independent network parameters and needs to learn similar policies, it may require a larger number of samples to train each agent’s network and increases the required sample size for training. Conversely, by utilizing the similarity between agents, the method of sharing network parameters can effectively reduce the required training sample size and improve sampling efficiency.

The trade-off between the sharing of network parameters and personalization is an ongoing research topic. The sharing of network parameters can effectively improve sampling efficiency and learning effectiveness, but it may lead to a lack of diversity and poorer convergence of policies. On the other hand, independent parameters may increase training complexity and sample requirements.

To address the trade-off between the sharing of network parameters and personalization in MARL, researchers continuously explore various parameterization schemes and mechanisms to promote policy diversity. Among these, the evolutionary algorithm (EA) [11,12] is a family of random search algorithms developed based on the principles of natural evolution. The main goal of EA [13] is to iteratively explore the solution space through randomness and population-based search to seek the global optimum for optimization problems. The core characteristic of EA lies in its randomness to avoid getting trapped in local optima. By using adaptive selection and genetic operations, EA passes down excellent individuals and continuously improves and optimizes the current solution. This parallelism and global search characteristic make EA perform well in solving complex problems and situations. Therefore, EA can be a valuable method to increase policy diversity, optimize collaboration effects, and balance the relationship between parameter sharing and personalization.

Based on the above analysis, we propose a policy network training method based on the differential evolution (DE) algorithm, named DE-COMA, which utilizes the classic DE algorithm to optimize the policy networks of multiple agents in the COMA algorithm. DE-COMA encodes the policy network parameter model with floating-point vectors, generates an evolutionary population, and enriches the representation of the policy network through mutation and crossover operations. In each generation, the best agent policy network parameters are evolved through the selection operation. By combining evolution and learning, DE-COMA aims to enhance the diversity of agent policies and improve the training convergence. The result provides strong support for the application of EAs to MARL.

In this paper, we conducted comparative experiments between DE-COMA and other multi-agent algorithms in the StarCraft II challenge environment. The experimental results validate that DE-COMA exhibits higher win rates and more stable convergence performance compared to the original COMA algorithm in the SMAC environment. Furthermore, the performance of DE-COMA also tends to be competitive with the currently best benchmark algorithms. These experimental results confirm the effectiveness of the DE-COMA method in MARL tasks. The introduction of DE-COMA brings a new research direction to the field of MARL, combining evolutionary algorithms with COMA, and provides a new approach to address issues such as policy diversity and convergence. The experimental results show that it may bring new opportunities and challenges in the future.

The remaining sections of the paper are organized as follows. Section 2 provides a summary of the work related to MARL. Section 3 introduces the details of the COMA and DE. Section 4 elaborates on the design details of the proposed DE-COMA. In Section 5, experimental results and a comparative analysis of DE-COMA algorithms are presented. Finally, Section 6 concludes the paper.

2. Related Work

MARL has been applied in various environments [14], and significant progress has been made in deep multi-agent reinforcement learning to high-dimensional input and action spaces. For instance, Chen et al. [15] combined DQN with independent Q-learning [16] to learn cooperation in a two-player Pong game. Additionally, Leibo et al. [17] also used the same approach to study the emergence of cooperation and betrayal in sequential social dilemmas.

The use of evolutionary algorithms (EAs) to assist in reinforcement learning training has a rich history, dating back to the early work on neuroevolution. Neuroevolution leverages EAs to generate the weights and/or topological structures of artificial neural networks as agent policies [18]. Early neuroevolution works primarily focused on evolving weights of small-scale and fixed-architecture neural networks. However, recent advances have demonstrated the potential of evolving architectures together with neural network weights in complex reinforcement learning tasks [19]. Additionally, a new perspective [20] on policy search has been established by solely conducting architecture search while ignoring weights. Since the introduction of OpenAI ES [21], evolutionary reinforcement learning has garnered increasing attention in both the evolution computation and reinforcement learning communities. These advancements have brought new possibilities for applying EAs to RL tasks and have sparked broader research interest in the MARL field.

In applying RL to micro-management of units in StarCraft, most studies utilized centralized controllers with full state access and control over all units, although the controller’s architecture leveraged the multi-agent nature. For instance, Usunier et al. [22] used a greedy Markov decision process (MDP), where, at each time step, all previous actions were given, and actions were selected for agents one by one by combining with zeroth-order optimization. Additionally, Peng et al. [23] employed the actor-critic method, relying on recurrent neural networks to exchange information among agents.

This work directly builds upon the idea of EAs for policy search in MARL. The relationship between COMA and this research direction is discussed in Section 4 of the paper.

3. Background

3.1. Fully Cooperative Tasks under Partial Observability

A multi-agent task [24] can be described as a stochastic game $G$, defined by the tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$, in which $n$ agents, identified by $a \in A \equiv \{1, \dots, n\}$, choose sequential actions. The environment has a true state $s \in S$. At each time step, every agent simultaneously selects an action $u^{a} \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^{n}$, which induces a transition according to the state transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$. All agents share the same reward function $r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$, and $\gamma \in [0, 1]$ is a discount factor.

In partially observable scenarios, agents acquire observations $z \in Z$ according to the observation function $O(s, a) : S \times A \to Z$. Each agent has an action-observation history $\tau^{a} \in T \equiv (Z \times U)^{*}$, on which it conditions a stochastic policy $\pi^{a}(u^{a} \mid \tau^{a}) : T \times U \to [0, 1]$. This paper uses bold notation to denote joint quantities over agents and the superscript $-a$ to denote joint quantities over all agents other than a given agent $a$.

The discounted return is defined as $R_{t} = \sum_{l = 0}^{\infty} \gamma^{l} r_{t + l}$. The joint policy of the agents induces a value function, which is the expectation over $R_{t}$, $V^{\pi}(s_{t}) = E_{s_{t+1:\infty}, \mathbf{u}_{t:\infty}}[R_{t} \mid s_{t}]$, and an action-value function $Q^{\pi}(s_{t}, \mathbf{u}_{t}) = E_{s_{t+1:\infty}, \mathbf{u}_{t+1:\infty}}[R_{t} \mid s_{t}, \mathbf{u}_{t}]$. The advantage function is denoted by $A^{\pi}(s_{t}, \mathbf{u}_{t}) = Q^{\pi}(s_{t}, \mathbf{u}_{t}) - V^{\pi}(s_{t})$.
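As a small illustration of the discounted return defined above, the following sketch evaluates $R_{t}$ for a finite list of rewards. The function name `discounted_return` and the sample numbers are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_l gamma**l * rewards[l] for a finite reward list."""
    total = 0.0
    # Work backwards so that each step applies R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy rewards (assumed values): 1 + 0.5 * 0 + 0.25 * 2
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example)
```

The backward accumulation avoids computing powers of the discount factor explicitly, which is a common design choice when returns are needed for every time step of an episode.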

3.2. Counterfactual Baseline

Firstly, COMA employs a centralized critic. Note that in independent actor-critic (IAC) setups, each actor $\pi(u^{a} \mid \tau^{a})$ and each critic $Q(\tau^{a}, u^{a})$ or $V(\tau^{a})$ conditions solely on the agent’s own action-observation history $\tau^{a}$. However, the critic is used only during learning, while only the actor is needed during execution. Because learning is centralized, a centralized critic can condition on the true global state $s$ (if available) or on the joint action-observation history $\boldsymbol{\tau}$. Each actor conditions on its own action-observation history $\tau^{a}$, employing parameter sharing.

COMA utilizes counterfactual baselines, inspired by the concept of a differential reward [25], which is a shaped reward $D^{a} = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^{a}))$. It compares the global reward with the reward obtained when agent $a$’s action is replaced with a default action $c^{a}$. Any action by agent $a$ that improves $D^{a}$ also improves the actual global reward $r(s, \mathbf{u})$, because $r(s, (\mathbf{u}^{-a}, c^{a}))$ does not depend on agent $a$’s actions.
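The shaped reward $D^{a}$ can be sketched in a few lines. The reward function `toy_reward` below (one point per agent choosing an assumed "attack" action, encoded as 1) and the default action 0 are hypothetical stand-ins chosen only to make the example runnable.

```python
def difference_reward(reward_fn, state, joint_action, agent, default_action):
    """D^a = r(s, u) - r(s, (u^{-a}, c^a)) for a single agent."""
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action  # replace only agent a's action
    return reward_fn(state, list(joint_action)) - reward_fn(state, counterfactual)


def toy_reward(state, joint_action):
    """Hypothetical global reward: number of agents choosing action 1."""
    return float(sum(1 for u in joint_action if u == 1))


joint = [1, 0, 1]
d_agent0 = difference_reward(toy_reward, None, joint, agent=0, default_action=0)
d_agent1 = difference_reward(toy_reward, None, joint, agent=1, default_action=0)
print(d_agent0, d_agent1)
```

Note that every call to `difference_reward` evaluates the reward function a second time on a counterfactual joint action, which in a real environment would require an extra simulation.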

A key aspect of COMA is that it leverages a centralized critic to implement differential rewards. COMA learns a centralized critic $Q(s, \mathbf{u})$, which estimates the $Q$-values of the joint action $\mathbf{u}$ given the central state $s$. For each agent $a$, an advantage function is computed by comparing the $Q$-value of the current action $u^{a}$ with a counterfactual baseline that marginalizes out $u^{a}$ while keeping the actions of the other agents $\mathbf{u}^{-a}$ fixed:

$$A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^{a}} \pi^{a}(u'^{a} \mid \tau^{a}) \, Q(s, (\mathbf{u}^{-a}, u'^{a}))$$

(1)

where $A^{a}(s, \mathbf{u})$ computes a separate baseline for each agent, utilizing the centralized critic to infer counterfactuals in which only the action of agent $a$ changes. This inference is learned directly from the agents’ experience, avoiding the need for additional simulations, reward models, or user-designed default actions.
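To make Formula (1) concrete, here is a minimal sketch of the counterfactual advantage for one agent. It assumes the critic has already produced one Q-value per action of agent $a$ with the other agents' actions held fixed; the function name and the numbers are illustrative assumptions.

```python
def counterfactual_advantage(q_values, policy_probs, taken_action):
    """A^a(s, u) = Q(s, u) - sum_k pi^a(k | tau^a) * Q(s, (u^{-a}, k)).

    q_values[k]     : critic output for action k of agent a, others fixed
    policy_probs[k] : probability agent a's policy assigns to action k
    """
    if len(q_values) != len(policy_probs):
        raise ValueError("q_values and policy_probs must have equal length")
    # Counterfactual baseline: expected Q under agent a's own policy.
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return q_values[taken_action] - baseline


q = [1.0, 3.0, 2.0]        # assumed critic outputs
pi = [0.5, 0.25, 0.25]     # assumed policy probabilities
adv = counterfactual_advantage(q, pi, taken_action=1)
print(adv)
```

A positive value indicates that the chosen action did better than the agent's policy does on average against the same actions of its teammates.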

3.3. Differential Evolution

Differential evolution (DE) [26] is a stochastic heuristic search algorithm that consists of four main steps: initialization, mutation, crossover, and selection. In DE, a population of feasible solutions is first randomly initialized to form an evolving population, where the genes of each individual represent the dimensions of the problem. Subsequently, a mutation operator performs a differential vector operation on randomly selected individuals from the population, generating a mutated individual. This mutated individual is then combined with another individual using the crossover operator to produce a new individual. Finally, a selection operator is employed to retain superior individuals to form the next generation population.

This optimization approach employs a problem-specific fitness function to iteratively refine the encoded population of individuals through the aforementioned steps, making it particularly suitable for problems that involve finding optimal solutions. Therefore, this paper integrates the principles of differential evolution into the evolution of the COMA policy network model to enhance the diversity of agent strategies and improve training effectiveness.

4. Training Method of COMA Policy Network Based on Differential Evolution

4.1. COMA Policy Network

In COMA, each agent utilizes the same policy network (actor network) to choose actions based on the current state. The objectives of these actor networks are to maximize the individual reward for each agent. The role of actor networks in COMA is to generate probability distributions for actions for each agent. These probability distributions are used to calculate gradients, enabling updates to the policy network. By introducing a joint action-value function, the policy updates for each agent take into account the actions of other agents to facilitate collaborative decision making.

4.2. Optimization of COMA Policy Network with DE

As is well known, the ultimate goal of reinforcement learning tasks is to train an excellent policy that guides the agent to make the most reasonable actions at each time step. Therefore, policy training requires special attention. The trade-off between parameter sharing and individualization is an ongoing research topic. Sharing parameters can effectively improve sampling efficiency and learning effectiveness but may lead to a lack of policy diversity and poor convergence. On the other hand, independent parameters may increase training complexity and sample requirements. Firstly, the centralized training and decentralized execution architecture allow agents to share a set of network parameters for better learning of global information. However, a single policy network may result in poor algorithm convergence and a lack of policy diversity among agents. On the other hand, using multiple sets of policy network parameters may lead to complex global information and increase the required sample size for training, thus affecting the agents’ collaborative objectives.

Therefore, the balance between policy diversity and parameter sharing remains an unresolved issue. Additionally, due to the powerful representation capabilities of neural networks and their deep connection with parameter settings and adjustments, they play a crucial role in determining the algorithm’s convergence ability. Finding the best output that aligns with the current state is the primary task of neural networks. This study models the problem as a combinatorial optimization problem and employs evolutionary algorithms to find the most suitable network parameters for solving the problem model.

To address the trade-off between policy network diversity and parameter sharing in COMA, this paper optimizes the COMA policy network using the DE algorithm. The policy network is encoded as a set of vectors, representing the evolutionary individuals. Throughout the evolutionary process, survival of the fittest occurs within the population.

4.2.1. Individual Encoding

In DE-COMA, population individuals serve as the smallest units for operations.

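$$x_{i} = \{\mathit{fc}_{1}\,\mathit{weight},\ \mathit{fc}_{1}\,\mathit{bias},\ \mathit{rnn}_{1}\,\mathit{weight},\ \mathit{rnn}_{1}\,\mathit{bias},\ \mathit{rnn}_{2}\,\mathit{weight},\ \mathit{rnn}_{2}\,\mathit{bias},\ \mathit{rnn}_{3}\,\mathit{weight},\ \mathit{rnn}_{3}\,\mathit{bias},\ \mathit{fc}_{2}\,\mathit{weight},\ \mathit{fc}_{2}\,\mathit{bias}\}$$

(2)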

The individual coding information is given in Table 1, and the encoding of an individual is given by Formula (2). In Figure 4, the role of the CONVERTER is to transform each row of the policy network listed in Table 1 into one component of an individual in the population during the evolution process. On the left side of the CONVERTER is the representation of the network structure of the policy network, while on the right side is a visual representation of an individual. The content of each row in Table 1 corresponds to one dimension within the individual. In summary, the CONVERTER encodes the policy network into individuals, which then form a population to facilitate multi-agent reinforcement learning training.

In Table 1, Index 1 ($\mathit{fc}_{1}\,\mathit{weight}$) and Index 2 ($\mathit{fc}_{1}\,\mathit{bias}$) represent a set of parameters for a fully connected layer, i.e., weights and biases, mapping the input features to the hidden layer. Indices 3–8 pertain to a gated recurrent unit (GRU) [27], a type of gated recurrent neural network cell that involves three weight matrices corresponding to the reset gate, update gate, and calculation of the candidate hidden state. Specifically, Index 3 ($\mathit{rnn}_{1}\,\mathit{weight}$) and Index 4 ($\mathit{rnn}_{1}\,\mathit{bias}$) represent the weights and biases for the reset gate, Index 5 ($\mathit{rnn}_{2}\,\mathit{weight}$) and Index 6 ($\mathit{rnn}_{2}\,\mathit{bias}$) those for the update gate, and Index 7 ($\mathit{rnn}_{3}\,\mathit{weight}$) and Index 8 ($\mathit{rnn}_{3}\,\mathit{bias}$) the weights and biases related to the candidate hidden state. Index 9 ($\mathit{fc}_{2}\,\mathit{weight}$) and Index 10 ($\mathit{fc}_{2}\,\mathit{bias}$) similarly denote the weights and biases for another fully connected layer, which maps the hidden state to the probability distribution of the output actions.
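The role of the CONVERTER can be sketched as a pair of functions that turn named parameter groups into a single floating-point vector and back. This is only an illustration under stated assumptions: the key names in `LAYER_NAMES`, the flattening of each group into a plain list, and the tiny sizes are hypothetical, and the original work may keep each row of Table 1 as a tensor-valued dimension instead.

```python
# Hypothetical keys mirroring the ten rows of Table 1.
LAYER_NAMES = [
    "fc1_weight", "fc1_bias",
    "rnn1_weight", "rnn1_bias",
    "rnn2_weight", "rnn2_bias",
    "rnn3_weight", "rnn3_bias",
    "fc2_weight", "fc2_bias",
]


def encode(params):
    """Concatenate per-layer parameter lists into one individual."""
    vector, sizes = [], []
    for name in LAYER_NAMES:
        vector.extend(params[name])
        sizes.append(len(params[name]))  # remembered so decode can split again
    return vector, sizes


def decode(vector, sizes):
    """Split an individual back into named per-layer parameter lists."""
    params, start = {}, 0
    for name, size in zip(LAYER_NAMES, sizes):
        params[name] = vector[start:start + size]
        start += size
    return params


# Toy network with two values per parameter group (assumed sizes).
toy_params = {name: [float(i), float(i) + 0.5] for i, name in enumerate(LAYER_NAMES)}
individual, layer_sizes = encode(toy_params)
restored = decode(individual, layer_sizes)
print(len(individual))
```

Keeping the per-layer sizes alongside the vector lets the evolved individual be loaded back into the policy network without ambiguity.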

4.2.2. Design of Fitness Function

In EAs, the fitness value is the primary indicator describing an individual’s performance. Individuals are selected for survival and reproduction based on their fitness values. Fitness serves as the driving force for evolutionary algorithms. The choice of the fitness function directly affects the convergence speed of the evolutionary algorithm and whether the optimal solution can be found. Establishing a reasonable fitness function for reinforcement learning problems is a key point in algorithm optimization. The fitness function is defined as the average return of DE-COMA over n training steps.

$$\mathit{fitness} = \frac{1}{n} \sum_{i = 1}^{n} \mathit{reward}_{i}$$

(3)

The specific components of the $\mathit{reward}$ are given as follows.

$$\mathit{reward} = \mathit{delta\_enemy} + \mathit{delta\_deaths} - \mathit{delta\_ally}$$

(4)

where $\mathit{delta\_enemy}$ represents the cumulative damage reward inflicted by our agents on enemy agents, $\mathit{delta\_deaths}$ represents the reward for eliminating enemies, and $\mathit{delta\_ally}$ represents the penalty term for our agents being eliminated.
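Formulas (3) and (4) can be written as two small helper functions. The function names and the sample values below are illustrative assumptions; in practice the three reward components would come from the environment at each step.

```python
def step_reward(delta_enemy, delta_deaths, delta_ally):
    """Formula (4): damage dealt plus kill reward minus the ally term."""
    return delta_enemy + delta_deaths - delta_ally


def fitness(rewards):
    """Formula (3): average reward over the n recorded training steps."""
    if not rewards:
        raise ValueError("at least one reward is required")
    return sum(rewards) / len(rewards)


# Assumed sample values for three training steps.
rewards = [
    step_reward(5.0, 10.0, 3.0),
    step_reward(2.0, 0.0, 1.0),
    step_reward(0.0, 0.0, 4.0),
]
print(fitness(rewards))
```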

4.2.3. DE Steps

The DE algorithm was inspired by genetic algorithms, and its algorithmic steps align with the traditional evolutionary algorithm approach. It primarily consists of three steps: mutation, crossover, and selection.

Mutation

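For each target vector $x_{i,G}$, $i = 1, 2, \dots, NP$, a mutant vector is generated according to

$$v_{i,G+1} = x_{r_{1},G} + F \cdot (x_{r_{2},G} - x_{r_{3},G})$$

(5)

where $NP$ is the population size, $F$ is the scaling factor applied to the difference vector, and $r_{1}$, $r_{2}$, and $r_{3}$ are mutually different, randomly chosen integer indices in $\{1, 2, \dots, NP\}$. They are also chosen to be different from the running index $i$, so $NP$ must be greater than or equal to four to allow for this condition.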
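The mutation step of Formula (5) can be sketched as follows. Individuals are represented as plain lists of floats, and the function name `mutate` and its `rng` argument are assumptions made for this illustration.

```python
import random


def mutate(population, i, f, rng):
    """Build the mutant v_i = x_r1 + F * (x_r2 - x_r3) for target index i."""
    if len(population) < 4:
        raise ValueError("population size NP must be at least four")
    # r1, r2, r3 are mutually different and all different from i.
    candidates = [k for k in range(len(population)) if k != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + f * (b - c) for a, b, c in zip(x1, x2, x3)]


rng = random.Random(0)
population = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
mutant = mutate(population, i=0, f=0.5, rng=rng)
print(mutant)
```

Passing a seeded `random.Random` instance keeps the example reproducible without touching the global random state.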

Crossover

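In order to increase the diversity of the perturbed parameter vectors, crossover is introduced. To this end, the trial vector $u_{i,G+1} = (u_{1i,G+1}, u_{2i,G+1}, \dots, u_{Di,G+1})$ is formed, where

$$u_{ji,G+1} = \begin{cases} v_{ji,G+1} & \text{if } \mathit{randb}(j) \leq CR \text{ or } j = \mathit{rnbr}(i) \\ x_{ji,G} & \text{if } \mathit{randb}(j) > CR \text{ and } j \neq \mathit{rnbr}(i) \end{cases}, \quad j = 1, 2, \dots, D.$$

(6)

Here, $\mathit{randb}(j)$ is the $j$-th evaluation of a uniform random number generator with outcome in $[0, 1]$, $CR \in [0, 1]$ is the crossover rate, and $\mathit{rnbr}(i)$ is a randomly chosen index in $\{1, 2, \dots, D\}$ which ensures that the trial vector takes at least one component from the mutant vector.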
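A minimal sketch of the binomial crossover in Formula (6) is given below. The names `binomial_crossover`, `forced`, and `rng` are assumptions for this illustration; `forced` plays the role of the randomly chosen index that guarantees at least one component comes from the mutant.

```python
import random


def binomial_crossover(target, mutant, cr, rng):
    """Mix target and mutant component-wise to form the trial vector."""
    dim = len(target)
    forced = rng.randrange(dim)  # index that always comes from the mutant
    trial = []
    for j in range(dim):
        if rng.random() <= cr or j == forced:
            trial.append(mutant[j])
        else:
            trial.append(target[j])
    return trial


rng = random.Random(0)
target = [0.0, 0.0, 0.0, 0.0]
mutant = [1.0, 1.0, 1.0, 1.0]
trial = binomial_crossover(target, mutant, cr=0.5, rng=rng)
print(trial)
```

With a crossover rate of 1 the trial vector equals the mutant, while a rate close to 0 changes only the forced component.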

Selection

To decide whether or not it should become a member of generation $G + 1$, the trial vector $u_{i,G+1}$ is compared to the target vector $x_{i,G}$ using the greedy criterion. If $u_{i,G+1}$ yields a smaller cost function value than $x_{i,G}$, then $x_{i,G+1}$ is set to $u_{i,G+1}$; otherwise, the old value $x_{i,G}$ is retained. Since the fitness in DE-COMA is an average return, a smaller cost corresponds to a higher fitness value.
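The greedy criterion can be sketched as below. This version assumes that a higher fitness (average return) is better, which mirrors the smaller-cost rule stated above, and it passes fitness values in explicitly so that no individual has to be evaluated twice. The function name is an illustrative assumption.

```python
def greedy_select(target, target_fitness, trial, trial_fitness):
    """Keep the trial vector only if it is strictly fitter than the target."""
    if trial_fitness > target_fitness:
        return trial, trial_fitness
    return target, target_fitness


survivor, survivor_fitness = greedy_select([0.1, 0.2], 3.0, [0.4, 0.1], 4.5)
print(survivor, survivor_fitness)
```

Caching the fitness alongside each individual matters when an evaluation is expensive, as it is when the fitness has to be measured by running training steps in an environment.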

4.3. Main Procedure of DE-COMA

The core of DE-COMA lies in optimizing the agent’s policy network with DE to enhance its convergence. The pseudocode of DE-COMA is given in Algorithm 1, and the flowchart is shown in Figure 5.

Algorithm 1. DE-COMA

Figure 6 represents the initialization phase of the entire algorithm. Here, the agent’s policy is abstracted into individuals within the population, and the fitness values of the population are initialized.

Figure 7 illustrates the training process of the reinforcement learning component. It is important to note that the actors share a single set of parameters, with the only distinction being the addition of the agent’s ID to the input to differentiate the action outputs of different agents. Furthermore, to efficiently compute the counterfactual baseline, the input to the critic network includes the actions of the other agents $\mathbf{u}_{t}^{-a}$, the global state $s_{t}$, the local observation of this agent $o_{t}^{a}$, the agent’s ID, and the actions of all agents at the previous time step $\mathbf{u}_{t-1}$. This allows the critic network to directly output the counterfactual Q-values for each action of this agent.
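One way to assemble such a critic input is sketched below. The one-hot encoding, the zeroing of the agent's own action slot, the ordering of the parts, and all names are assumptions made for illustration; the actual layout in the original implementation is not specified here.

```python
def one_hot(index, size):
    """Return a list of zeros of the given size with a one at index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def critic_input(state, obs, agent_id, actions, prev_actions, n_actions):
    """Concatenate state, observation, other agents' actions,
    previous joint action, and agent ID into one feature list."""
    n_agents = len(actions)
    features = list(state) + list(obs)
    for other, u in enumerate(actions):
        # Agent a's own action is masked so only u^{-a} is visible.
        features += [0.0] * n_actions if other == agent_id else one_hot(u, n_actions)
    for u in prev_actions:
        features += one_hot(u, n_actions)
    features += one_hot(agent_id, n_agents)
    return features


features = critic_input(
    state=[0.1, 0.2], obs=[0.3], agent_id=1,
    actions=[2, 0, 1], prev_actions=[0, 0, 0], n_actions=3,
)
print(len(features))
```

Because the agent's own current action is absent from the input, a single forward pass can output one Q-value per candidate action of that agent.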

Figure 8 describes the update of the reinforcement learning module, which is based on Figure 7.

Figure 9 illustrates the update module for differential evolution, where each individual in the population of each generation undergoes mutation, crossover, and selection operations to generate the offspring population and return the best individual. It serves as the training core of DE-COMA.
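The three DE operations can be combined into one generation update, sketched below on a toy problem. This is not the DE-COMA implementation: `toy_fitness` is a hypothetical stand-in for the average return, which in DE-COMA would be measured by running the policy network in the environment, and all names and parameter values are assumptions.

```python
import random


def de_generation(population, fitness_fn, f, cr, rng):
    """One generation of mutation, crossover, and greedy selection
    (higher fitness is better). Returns the new population and its best."""
    n_pop, dim = len(population), len(population[0])
    if n_pop < 4:
        raise ValueError("population size must be at least four")
    next_population = []
    for i, target in enumerate(population):
        # Mutation: three distinct indices, all different from i.
        r1, r2, r3 = rng.sample([k for k in range(n_pop) if k != i], 3)
        mutant = [
            population[r1][j] + f * (population[r2][j] - population[r3][j])
            for j in range(dim)
        ]
        # Crossover: at least the forced component comes from the mutant.
        forced = rng.randrange(dim)
        trial = [
            mutant[j] if (rng.random() <= cr or j == forced) else target[j]
            for j in range(dim)
        ]
        # Selection: keep the trial only if it is strictly fitter.
        better = fitness_fn(trial) > fitness_fn(target)
        next_population.append(trial if better else target)
    best = max(next_population, key=fitness_fn)
    return next_population, best


def toy_fitness(x):
    """Hypothetical fitness: negative squared distance from the origin."""
    return -sum(v * v for v in x)


rng = random.Random(0)
population = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(8)]
for _ in range(20):
    population, best = de_generation(population, toy_fitness, f=0.5, cr=0.9, rng=rng)
print(toy_fitness(best))
```

Because selection is greedy, the fitness of each population slot never decreases from one generation to the next.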

5. Experiment Analysis

StarCraft II [28] is a real-time strategy game with built-in environments (2s_vs_1sc, 2s3z, 3m, 8m, etc.) that are highly suitable for validating algorithms for micro-management control tasks. To further validate the performance of the proposed DE-COMA algorithm, the experiments use the popular StarCraft Multi-Agent Challenge (SMAC) platform. SMAC is built on PySC2 (the StarCraft II Learning Environment) and the StarCraft II API to create micro-management environments. It incorporates advanced multi-agent algorithms such as Counterfactual Multi-Agent Policy Gradients (COMA) [8], Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning (QTRAN) [29], Multi-Agent Variational Exploration (MAVEN) [30], and others.

5.1. Evaluation Metrics and Fitness Function

The fitness function used for training the DE-COMA algorithm is the one defined in Section 4.2.2. In the comparative experiments, we continue to use the average return obtained by our agents within $n$ training steps, together with the win rate recorded every $\mathit{evaluate\_per\_epoch}$ training steps. The win rate, denoted as $\mathit{win\_rate}$, is calculated as shown in Formula (7).

$$\mathit{win\_rate} = \frac{\mathit{win\_flag}}{\mathit{evaluate\_per\_epoch}}$$

(7)

where $\mathit{win\_flag}$ represents the number of victories within $\mathit{evaluate\_per\_epoch}$ training steps. Specifically, when all agents achieve success in the decision-making process, the $\mathit{win\_rate}$ is 1. If they all fail, the $\mathit{win\_rate}$ is 0. In all other situations, the $\mathit{win\_rate}$ is the ratio of $\mathit{win\_flag}$ to $\mathit{evaluate\_per\_epoch}$. Overall, $\mathit{win\_rate}$ is a value between 0 and 1.
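The win rate calculation can be sketched directly from its definition. The function name and the input checks are assumptions added for this illustration.

```python
def win_rate(win_flag, evaluate_per_epoch):
    """Ratio of victories to the length of the evaluation window."""
    if evaluate_per_epoch <= 0:
        raise ValueError("evaluate_per_epoch must be positive")
    if not 0 <= win_flag <= evaluate_per_epoch:
        raise ValueError("win_flag must lie between 0 and evaluate_per_epoch")
    return win_flag / evaluate_per_epoch


# Assumed counts: no wins, some wins, and all wins in a window of 20.
rates = [win_rate(w, 20) for w in (0, 5, 20)]
print(rates)
```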

5.2. Experimental Results Analysis

The parameter settings of multi-agent reinforcement learning algorithms in the StarCraft II environment are crucial. To ensure the objectivity and fairness of the experiments, we set the same parameters across different algorithms. The DE-COMA algorithm has four additional parameters: population size, evolution generations, mutation rate, and crossover rate. These are common parameters in evolutionary algorithms. Since DE-COMA is an improvement based on the differential evolution algorithm, only this algorithm includes these parameters. Parameter details are shown in Table 2.

5.2.1. Win Rate Analysis

In the win-rate-based analysis of this section, we record the win rates during the multi-agent training process. The win rate is calculated as shown in Formula (7).

The experimental results demonstrate that DE-COMA improves the low convergence of the COMA algorithm and outperforms other excellent algorithms in specific environments. The experiment was conducted for a total of 1e6 time steps with an evaluation generation of 5000.

The specific analysis is as follows: as shown in Figure 10a, in the 2s_vs_1sc environment, the convergence curve of the improved algorithm proposed in this paper does not exhibit significant fluctuations as seen in the original algorithm. This improvement resolves the issue of large fluctuations in the win rate observed in COMA, maintaining efficient performance in the later stages. As indicated in Table 3, for the 2s1sc scenario, DE-COMA shows a 21.06% increase in win rate compared to COMA. In this environment, we conducted a Friedman test on the recorded win rate data. The test results and rankings are shown in Table 4.

As shown in Figure 10b, in the 2s3z environment, DE-COMA addresses the convergence issue of the COMA algorithm, leading to a significant improvement in the win rate. However, there is still a gap compared to value-based algorithms (QMIX, VDN) due to the policy-based update approach. As depicted in Table 3, for the 2s3z scenario, DE-COMA exhibits a remarkable 74.6% increase in win rate compared to COMA. In this environment, we conducted a Friedman test on the recorded win rate data. The test results and rankings are shown in Table 5.

As illustrated in Figure 10c, in the 3m environment, the DE-COMA algorithm resolves the instability in convergence that COMA experiences in later stages. It also demonstrates the smallest standard deviation among all tested algorithms, approaching the performance of excellent value-based algorithms. As shown in Table 3, for the 3m scenario, DE-COMA yields a 12.35% improvement in win rate compared to COMA. In this environment, we conducted a Friedman test on the recorded win rate data. The test results and rankings are shown in Table 6.

In Figure 10d, in the 8m environment, our DE-COMA algorithm performs second only to the QMIX algorithm, maintaining an advantage in standard deviation compared to other algorithms. As indicated in Table 3, for the 8m scenario, DE-COMA shows a 12.84% improvement in win rate compared to COMA. In this environment, we conducted a Friedman test on the recorded win rate data. The test results and rankings are shown in Table 7.

Overall, the introduction of the differential evolution algorithm to optimize the COMA policy network has proven to be effective, outperforming many multi-agent reinforcement learning algorithms. However, since the COMA algorithm itself is policy-based, there is still a gap compared to other advanced value-based algorithms.
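The percentage improvements quoted in this subsection are relative to the COMA baseline. A minimal sketch of that calculation follows, using the mean win rates from Table 3; the function name is hypothetical. Because Table 3 reports means rounded to three significant figures, values recomputed this way need not match the quoted percentages exactly.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (DE-COMA mean, COMA mean) win rates as listed in Table 3.
table3_mean_win_rate = {
    "2s_vs_1sc": (0.569, 0.470),
    "2s3z": (0.441, 0.250),
    "3m": (0.750, 0.668),
    "8m": (0.791, 0.701),
}

for scenario, (de_coma, coma) in table3_mean_win_rate.items():
    print(f"{scenario}: {relative_improvement(de_coma, coma):.2f}%")
```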

5.2.2. Average Return Analysis

The average return experiments in this section use the same parameter settings and environments as the win rate analysis in Section 5.2.1. Unlike win rate, average return is recorded by tracking the immediate reward at each time step of every training episode. When a terminal state is reached, the cumulative return is divided by the number of time steps to obtain the average return. Like win rate, average return reflects changes in the agent’s strategy during decision making. A larger accumulated return indicates that the agent has interacted with the environment for a longer time. However, this does not directly imply a higher win rate, as the agent might exhibit lazy behavior, repeatedly performing actions that yield rewards but are irrelevant to the final objective.
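The bookkeeping described above can be sketched as follows. This is only an illustration under the stated definition (cumulative return divided by the number of time steps in the episode); the function names and reward values are hypothetical.

```python
from typing import List, Sequence


def episode_average_return(rewards: Sequence[float]) -> float:
    """Cumulative return of one episode divided by its number of time steps."""
    if not rewards:
        raise ValueError("an episode needs at least one time step")
    return sum(rewards) / len(rewards)


def mean_average_return(episodes: List[Sequence[float]]) -> float:
    """Mean of the per-episode average returns over several episodes."""
    return sum(episode_average_return(ep) for ep in episodes) / len(episodes)


# Hypothetical rewards: a short decisive episode and a long, low-reward one.
short_episode = [2.0, 3.0, 5.0]  # cumulative return 10 over 3 steps
long_episode = [1.0] * 12        # cumulative return 12 over 12 steps

print(sum(short_episode), episode_average_return(short_episode))
print(sum(long_episode), episode_average_return(long_episode))
```

In this toy case the longer episode accumulates more return in total but less per time step, which is why a larger accumulated return alone says little about the win rate.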

Detailed experimental information based on average return can be found in Table 8.

As shown in Figure 11a, in the 2s_vs_1sc environment, the average return fluctuates considerably (see the 2s1sc row in Table 8). DE-COMA shows an average improvement of 13.29% under the fitness-based evaluation criterion compared to the original COMA algorithm. The Friedman test results and rankings are shown in Table 9.

In Figure 11b, in the 2s3z environment, the proposed DE-COMA method resolves the poor convergence of the COMA algorithm. However, as with the win rate criterion, the policy-based update leaves a gap relative to the value-based QMIX and VDN methods, as shown in the 2s3z row in Table 8. Compared to the original COMA algorithm, DE-COMA shows an improvement of 103.71%. The Friedman test results and rankings are shown in Table 10.

As depicted in Figure 11c, in the 3m environment, the introduced DE-COMA algorithm addresses the instability in the late convergence of the COMA algorithm. Although COMA achieves good values in the initial stages, its return drops significantly in the later stages, whereas DE-COMA approaches the value-based algorithms, as shown in the 3m row in Table 8. DE-COMA, compared to COMA, exhibits a 9.3% improvement under the fitness-based evaluation criterion. In this environment, we conducted a Friedman test on the recorded average return data. The test results and rankings are shown in Table 11.

In Figure 11d, in the 8m environment, our proposed DE-COMA algorithm outperforms the COMA algorithm and is second only to the QMIX algorithm. As indicated in the 8m row in Table 8, the DE-COMA algorithm shows a 7.8% improvement compared to the original COMA algorithm. In this environment, we conducted a Friedman test on the recorded average return data. The test results and rankings are shown in Table 12.

5.3. Diversity of Actions Analysis

In the previous section, we compared DE-COMA with the original COMA algorithm and analyzed their performance in terms of win rate and average return. Since the primary motivation behind DE-COMA is to enhance the diversity of the agents’ strategies, we now examine how this diversity manifests in different environments. Before this analysis, we briefly introduce the micro-scenarios in StarCraft II and the action space of the agents in this context.

5.3.1. Micro-Scenarios

The scenarios are either homogeneous or heterogeneous. In a homogeneous setting, all allied units are of the same type (such as Marines), and winning strategies typically involve focusing fire while ensuring the survival of allied units. A heterogeneous setting involves multiple unit types within the allied force (such as Stalkers and Zealots). In this configuration, the agents must resolve conflicts between roles in order to protect teammates from attacks. Table 13 gives an overview of the scenarios used in this paper.

5.3.2. Action Space

The action set in the discrete space includes: move[direction] (four directions: north, south, east, or west), attack[enemy_id], stop, and no-op. Deceased agents can only perform the no-op action, while surviving agents cannot execute the no-op action. Depending on the scenario, agents can take between 7 and 70 different actions.
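The action set described above can be sketched in a few lines. The sketch assumes one attack action per enemy unit, which is consistent with the action counts given for the four scenarios in Section 5.3.3; the function names are hypothetical, and the availability mask deliberately ignores range and terrain restrictions.

```python
from typing import List

MOVE_DIRECTIONS = ("north", "south", "east", "west")


def build_action_set(n_enemies: int) -> List[str]:
    """no-op, stop, four moves, and one attack action per enemy (assumed)."""
    actions = ["no-op", "stop"]
    actions += [f"move[{d}]" for d in MOVE_DIRECTIONS]
    actions += [f"attack[{enemy_id}]" for enemy_id in range(n_enemies)]
    return actions


def availability_mask(actions: List[str], alive: bool) -> List[int]:
    """Dead agents may only take no-op; living agents may take anything else."""
    if alive:
        return [0 if a == "no-op" else 1 for a in actions]
    return [1 if a == "no-op" else 0 for a in actions]


# Enemy counts follow Table 13: 5 (2s3z), 1 (2s_vs_1sc), 3 (3m), 8 (8m).
for name, n_enemies in [("2s3z", 5), ("2s_vs_1sc", 1), ("3m", 3), ("8m", 8)]:
    print(name, len(build_action_set(n_enemies)))
```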

5.3.3. Experimental Results Analysis of Actions Diversity

In reinforcement learning, the strategy of an agent can be viewed as a mapping from states to actions. Therefore, when evaluating the diversity of strategies, we can sample the distribution of actions taken by the agent. Different micro-scenarios have varying numbers and content of actions available to agents. We store the frequency of each action taken by the agent and examine the probability distribution of actions taken by the agent in the early stages of algorithm training (first 100,000 time steps). In the case of the 2s3z scenario, where the agent has 11 possible actions, the action distribution is illustrated in Figure 12.

In the case of the 2s_vs_1sc scenario, where the agent has 7 possible actions, the action distribution is illustrated in Figure 13.

In the case of the 3m scenario, where the agent has 9 possible actions, the action distribution is illustrated in Figure 14.

In the case of the 8m scenario, where the agent has 14 possible actions, the action distribution is illustrated in Figure 15.
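The action sampling behind these figures amounts to counting how often each action is chosen and normalizing the counts. A minimal sketch is given below; the normalized entropy is added here only as one possible scalar summary of diversity and is an assumption of this sketch, not a measure used in the paper. The function names and the sample log are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


def action_distribution(actions: Iterable[int], n_actions: int) -> List[float]:
    """Empirical probability of each action index in a log of chosen actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no actions were recorded")
    return [counts.get(a, 0) / total for a in range(n_actions)]


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy scaled to [0, 1]; needs at least two actions."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical log of action indices for a scenario with 7 actions.
log = [1, 2, 2, 3, 6, 6, 6, 4, 5, 1]
dist = action_distribution(log, n_actions=7)
print(dist, normalized_entropy(dist))
```

A value near 1 means the sampled actions are spread almost uniformly, while a value near 0 means the agent concentrates on a few actions.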

The above experimental results clearly demonstrate the advantages of DE-COMA over the original COMA algorithm. During the initial training phase, DE-COMA tends to adopt a diverse range of actions, reflecting its emphasis on comprehensive exploration of the environment. This extensive exploration of actions helps the intelligent agent gain a more comprehensive understanding of potential strategies and environmental feedback. However, as the training progresses, DE-COMA gradually adjusts its strategy, focusing more on actions that prove to be more successful in adversarial encounters. This dynamic learning process indicates that DE-COMA possesses the ability to adapt flexibly to different training stages, establishing a knowledge foundation through extensive action exploration in the early stages and subsequently improving win rates by selecting effective strategies in the later stages. This balanced learning approach enables DE-COMA to adjust its strategies more flexibly in response to complex and changing environments, ultimately enhancing performance more effectively.

6. Conclusions

To address the challenge of balancing parameter sharing and policy diversity in COMA under the CTDE framework, this paper proposes DE-COMA, which optimizes the COMA policy network with the DE algorithm to enhance its convergence. Comparative experiments in four environments on the SMAC platform demonstrate that DE-COMA achieves a significant improvement in average win rate compared to the original COMA. In several respects, in simple and moderately difficult environments, DE-COMA outperforms strong multi-agent algorithms such as VDN and QMIX. Furthermore, because DE is used to optimize the neural networks of the multi-agent algorithm, the method is not specific to COMA and can be applied to any MARL algorithm that uses deep neural networks. However, DE-COMA still has limitations. Its overall performance is constrained by the underlying COMA algorithm: in highly complex multi-agent environments, simple actor-critic (AC) methods like COMA may perform poorly, and training may become unstable. Therefore, future work should explore how to improve agent coordination efficiency and algorithm stability in complex multi-agent environments.

Author Contributions

Methodology, S.Q. and M.L.; Validation, J.L.; Investigation, S.Q.; Data curation, R.G.; Writing—original draft, R.G.; Writing—review & editing, B.S.; Project administration, Z.C. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Foreign Expert Program of the Ministry of Science and Technology (Grant No. G2023041037L), the Shaanxi Natural Science Basic Research Project (Grant No. 2024JC-YBMS-502), and the Science and Technology Program of Xi’an, China (Grant No. 23ZDCYJSGG0018-2023).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. CTDE framework vs. DTDE framework.

Figure 2. Parameter sharing model.

Figure 3. Parameter independence model.

Figure 4. Policy network coding method.

Figure 5. Flowchart of DE-COMA.

Figure 6. DE-COMA initialization module.

Figure 7. DE-COMA interaction module.

Figure 8. DE-COMA RL update module.

Figure 9. DE-COMA differential evolution update module.

Figure 10. Training convergence plots of win rates for six algorithms on StarCraft II 2s_vs_1sc (a), 2s3z (b), 3m (c), 8m (d).

Figure 11. Training convergence plots of average return for six algorithms on StarCraft II 2s_vs_1sc (a), 2s3z (b), 3m (c), 8m (d).

Figure 12. 2s3z scenario action sampling.

Figure 13. 2s_vs_1sc scenario action sampling.

Figure 14. 3m scenario action sampling.

Figure 15. 8m scenario action sampling.

Table 1. Individual coding information.

Num	Definition	Size ([row, col])	Symbol
1	fc1.weight	[input_shape, rnn_hidden_dim]	$fc_{1}weight$
2	fc1.bias	[rnn_hidden_dim]	$fc_{1}bias$
3	reset_gate_weight	[rnn_hidden_dim, rnn_hidden_dim]	$rnn_{1}weight$
4	reset_gate_bias	[rnn_hidden_dim]	$rnn_{1}bias$
5	update_gate_weight	[rnn_hidden_dim, rnn_hidden_dim]	$rnn_{2}weight$
6	update_gate_bias	[rnn_hidden_dim]	$rnn_{2}bias$
7	con_hidden_status.weight	[rnn_hidden_dim, rnn_hidden_dim]	$rnn_{3}weight$
8	con_hidden_status.bias	[rnn_hidden_dim]	$rnn_{3}bias$
9	fc2.weight	[rnn_hidden_dim, n_actions]	$fc_{2}weight$
10	fc2.bias	[n_actions]	$fc_{2}bias$
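The sizes in Table 1 determine how many real-valued genes one individual carries. The sketch below assumes that an individual is the flat concatenation of these ten blocks in the listed order; the function names and the example dimensions (input_shape = 30, rnn_hidden_dim = 64, n_actions = 9) are hypothetical.

```python
from typing import Dict, Tuple


def individual_layout(
    input_shape: int, rnn_hidden_dim: int, n_actions: int
) -> Dict[str, Tuple[int, ...]]:
    """Parameter blocks and their shapes, in the order listed in Table 1."""
    h = rnn_hidden_dim
    return {
        "fc1.weight": (input_shape, h),
        "fc1.bias": (h,),
        "reset_gate_weight": (h, h),
        "reset_gate_bias": (h,),
        "update_gate_weight": (h, h),
        "update_gate_bias": (h,),
        "con_hidden_status.weight": (h, h),
        "con_hidden_status.bias": (h,),
        "fc2.weight": (h, n_actions),
        "fc2.bias": (n_actions,),
    }


def individual_length(layout: Dict[str, Tuple[int, ...]]) -> int:
    """Total number of genes when all blocks are flattened and concatenated."""
    total = 0
    for shape in layout.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


# Hypothetical dimensions, chosen only to show the order of magnitude.
layout = individual_layout(input_shape=30, rnn_hidden_dim=64, n_actions=9)
print(individual_length(layout))
```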

Table 2. Experimental algorithm parameter table.

Name	Variable	DE-COMA	COMA	VDN	QMIX	QTRAN	MAVEN
Training epochs	n_timesteps	1 × 10⁶	1 × 10⁶	1 × 10⁶	1 × 10⁶	1 × 10⁶	1 × 10⁶
Evaluation generations	evaluate_epoch	5000	5000	5000	5000	5000	5000
Replay buffer size	batch_size	32	32	32	32	32	32
Greedy rate	epsilon	0.5	0.5	1	1	1	1
Population size	NP	10	-	-	-	-	-
Evolution generations	rounds	5	-	-	-	-	-
Mutation rate	factor	0.7	-	-	-	-	-
Crossover rate	CR	0.7	-	-	-	-	-
Decay rate	min_epsilon	0.05	0.02	0.05	0.05	0.05	0.05
Discount rate	gamma	0.99	0.99	0.99	0.99	0.99	0.99
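To make the DE-specific entries of Table 2 concrete, the following sketch runs differential evolution with NP = 10, factor = 0.7, CR = 0.7, and 5 rounds on a toy problem. It assumes the common DE/rand/1/bin variant and a toy fitness function; in DE-COMA the fitness is the average return of the policy network, and the exact DE steps are those defined in Section 4.2.3. All names here are hypothetical.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def de_generation(
    population: List[Vector],
    fitness: Callable[[Vector], float],
    factor: float = 0.7,
    cr: float = 0.7,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """One generation of DE/rand/1/bin that maximizes `fitness` (needs NP >= 4)."""
    rng = rng or random.Random()
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Mutation: combine three distinct individuals other than the target.
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)
        mutant = [a[d] + factor * (b[d] - c[d]) for d in range(dim)]
        # Binomial crossover: at least one gene always comes from the mutant.
        forced = rng.randrange(dim)
        trial = [
            mutant[d] if (rng.random() < cr or d == forced) else target[d]
            for d in range(dim)
        ]
        # Greedy selection: keep whichever of trial and target is fitter.
        next_population.append(trial if fitness(trial) >= fitness(target) else target)
    return next_population


def toy_fitness(vector: Vector) -> float:
    """Stand-in for average return: higher is better, optimum at the origin."""
    return -sum(x * x for x in vector)


rng = random.Random(0)
population = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(10)]  # NP = 10
for _ in range(5):  # rounds = 5
    population = de_generation(population, toy_fitness, factor=0.7, cr=0.7, rng=rng)
print(max(toy_fitness(ind) for ind in population))
```

Because selection is greedy per individual, the best fitness in the population cannot decrease from one generation to the next.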

Table 3. Win rate data for six algorithms on StarCraft II 2s_vs_1sc, 2s3z, 3m, 8m.

Env	Results	DE-COMA	COMA	QMIX	VDN	QTRAN	MAVEN
2s1sc	Mean	5.69 × 10⁻⁰¹	4.70 × 10⁻⁰¹	7.21 × 10⁻⁰¹	8.75 × 10⁻⁰¹	8.36 × 10⁻⁰¹	6.63 × 10⁻⁰¹
	Std	3.13 × 10⁻⁰¹	3.46 × 10⁻⁰¹	3.59 × 10⁻⁰¹	2.78 × 10⁻⁰¹	3.16 × 10⁻⁰¹	3.27 × 10⁻⁰¹
	Max	1.00 × 10⁰⁰	6.00 × 10⁻⁰¹	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	9.50 × 10⁻⁰¹
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
2s3z	Mean	4.41 × 10⁻⁰¹	2.50 × 10⁻⁰¹	7.75 × 10⁻⁰¹	7.42 × 10⁻⁰¹	4.70 × 10⁻⁰¹	5.36 × 10⁻⁰¹
	Std	1.97 × 10⁻⁰¹	1.14 × 10⁻⁰¹	2.61 × 10⁻⁰¹	2.38 × 10⁻⁰¹	2.70 × 10⁻⁰¹	3.19 × 10⁻⁰¹
	Max	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
3m	Mean	7.50 × 10⁻⁰¹	6.68 × 10⁻⁰¹	7.66 × 10⁻⁰¹	7.46 × 10⁻⁰¹	7.61 × 10⁻⁰¹	7.11 × 10⁻⁰¹
	Std	1.68 × 10⁻⁰¹	1.70 × 10⁻⁰¹	2.76 × 10⁻⁰¹	2.49 × 10⁻⁰¹	2.70 × 10⁻⁰¹	3.77 × 10⁻⁰¹
	Max	1.00 × 10⁰⁰	9.50 × 10⁻⁰¹	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
8m	Mean	7.91 × 10⁻⁰¹	7.01 × 10⁻⁰¹	8.18 × 10⁻⁰¹	7.37 × 10⁻⁰¹	7.52 × 10⁻⁰¹	7.06 × 10⁻⁰¹
	Std	2.42 × 10⁻⁰¹	2.34 × 10⁻⁰¹	2.37 × 10⁻⁰¹	2.80 × 10⁻⁰¹	2.37 × 10⁻⁰¹	2.82 × 10⁻⁰¹
	Max	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰	1.00 × 10⁰⁰
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰

Table 4. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 2s_vs_1sc win rate.

Algorithms	Average Ranking	Final Rank
DE-COMA	3.87	4
COMA	4.63	6
QMIX	3.36	3
VDN	2.25	1
QTRAN	2.53	2
MAVEN	3.96	5

Table 5. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 2s3z win rate.

Algorithms	Average Ranking	Final Rank
DE-COMA	4.19	5
COMA	5.66	6
QMIX	1.63	1
VDN	2.06	2
QTRAN	3.86	4
MAVEN	3.59	3

Table 6. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 3m win rate.

Algorithms	Average Ranking	Final Rank
DE-COMA	3.19	4
COMA	4.74	6
QMIX	3.07	2
VDN	3.46	5
QTRAN	3.17	3
MAVEN	3.02	1

Table 7. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 8m win rate.

Algorithms	Average Ranking	Final Rank
DE-COMA	3.52	2
COMA	3.76	5
QMIX	2.43	1
VDN	3.56	3
QTRAN	3.69	4
MAVEN	4.04	6

Table 8. Average return data for six algorithms on StarCraft II 2s_vs_1sc, 2s3z, 3m, 8m.

Env	Results	DE-COMA	COMA	QMIX	VDN	QTRAN	MAVEN
2s1sc	Mean	1.04 × 10⁰¹	9.18 × 10⁰⁰	1.30 × 10⁰¹	1.53 × 10⁰¹	1.46 × 10⁰¹	1.20 × 10⁰¹
	Std	5.06 × 10⁰⁰	5.46 × 10⁰⁰	5.79 × 10⁰⁰	4.80 × 10⁰⁰	5.34 × 10⁰⁰	5.39 × 10⁰⁰
	Max	1.93 × 10⁰¹	1.90 × 10⁰¹	1.95 × 10⁰¹	1.95 × 10⁰¹	1.95 × 10⁰¹	1.95 × 10⁰¹
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
2s3z	Mean	8.21 × 10⁰⁰	3.53 × 10⁰⁰	1.48 × 10⁰¹	1.42 × 10⁰¹	9.32 × 10⁰⁰	1.05 × 10⁰¹
	Std	3.64 × 10⁰⁰	2.18 × 10⁰⁰	4.83 × 10⁰⁰	4.46 × 10⁰⁰	4.94 × 10⁰⁰	5.88 × 10⁰⁰
	Max	1.99 × 10⁰¹	1.17 × 10⁰¹	1.98 × 10⁰¹	1.96 × 10⁰¹	1.88 × 10⁰¹	1.89 × 10⁰¹
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
3m	Mean	1.41 × 10⁰¹	1.29 × 10⁰¹	1.46 × 10⁰¹	1.41 × 10⁰¹	1.44 × 10⁰¹	1.36 × 10⁰¹
	Std	3.28 × 10⁰⁰	3.25 × 10⁰⁰	5.19 × 10⁰⁰	4.68 × 10⁰⁰	5.00 × 10⁰⁰	6.92 × 10⁰⁰
	Max	1.95 × 10⁰¹	1.88 × 10⁰¹	1.97 × 10⁰¹	1.97 × 10⁰¹	1.96 × 10⁰¹	1.98 × 10⁰¹
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰
8m	Mean	1.47 × 10⁰¹	1.43 × 10⁰¹	1.56 × 10⁰¹	1.41 × 10⁰¹	1.43 × 10⁰¹	1.35 × 10⁰¹
	Std	4.36 × 10⁰⁰	4.43 × 10⁰⁰	4.43 × 10⁰⁰	5.25 × 10⁰⁰	4.45 × 10⁰⁰	5.17 × 10⁰⁰
	Max	1.98 × 10⁰¹	1.98 × 10⁰¹	1.98 × 10⁰¹	1.98 × 10⁰¹	1.97 × 10⁰¹	1.98 × 10⁰¹
	Min	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰	0.00 × 10⁰⁰

Table 9. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 2s_vs_1sc average return.

Algorithms	Average Ranking	Final Rank
DE-COMA	4.17	5
COMA	4.49	6
QMIX	3.34	3
VDN	2.46	1
QTRAN	2.66	2
MAVEN	3.87	4

Table 10. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 2s3z average return.

Algorithms	Average Ranking	Final Rank
DE-COMA	4.20	5
COMA	5.68	6
QMIX	1.66	1
VDN	2.04	2
QTRAN	3.85	4
MAVEN	3.58	3

Table 11. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 3m average return.

Algorithms	Average Ranking	Final Rank
DE-COMA	3.68	5
COMA	4.70	6
QMIX	3.03	1
VDN	3.50	4
QTRAN	3.04	2
MAVEN	3.05	3

Table 12. Average ranking of DE-COMA, COMA, QMIX, VDN, QTRAN, MAVEN according to the Friedman test on 8m average return.

Algorithms	Average Ranking	Final Rank
DE-COMA	3.49	2
COMA	3.71	5
QMIX	2.51	1
VDN	3.54	3
QTRAN	3.67	4
MAVEN	4.08	6

Table 13. SMAC scenarios.

Name	All Units	Enemy Units	Type
2s3z	2 Stalkers and 3 Zealots	2 Stalkers and 3 Zealots	Heterogeneous and Symmetric
2s_vs_1sc	2 Stalkers	1 Spine Crawler	Micro-Trick: Alternating Fire
3m	3 Marines	3 Marines	Homogeneous and Symmetric
8m	8 Marines	8 Marines	Homogeneous and Symmetric

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
