Article

Leveraging Multi-Agent Reinforcement Learning for Digital Transformation in Supply Chain Inventory Optimization

1
College of Computing and Data Science, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798, Singapore
2
Singapore Institute of Manufacturing Technology (SIMTech), Agency for Science, Technology and Research (A*STAR), 5 CleanTech Loop, Singapore 636732, Singapore
3
School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798, Singapore
*
Authors to whom correspondence should be addressed.
Sustainability 2024, 16(22), 9996; https://doi.org/10.3390/su16229996
Submission received: 1 October 2024 / Revised: 8 November 2024 / Accepted: 12 November 2024 / Published: 16 November 2024
(This article belongs to the Special Issue Resilient Supply Chains, Green Logistics, and Digital Transformation)

Abstract
In today’s volatile supply chain (SC) environment, competition has shifted beyond individual companies to the entire SC ecosystem. Reducing overall SC costs is crucial for success and benefits all participants. One effective way to achieve this is through digital transformation: enhancing SC coordination via information sharing and establishing decision policies among entities. However, the risk of unauthorized leakage of sensitive information poses a significant challenge. We propose a Privacy-preserving Multi-agent Reinforcement Learning (PMaRL) method that leverages machine learning techniques to enhance SC visibility, coordination, and performance in inventory management while effectively mitigating the risk of information leakage. The SC inventory policies are optimized using multi-agent reinforcement learning with additional SC connectivity information to improve training performance. The simulation-based evaluation results illustrate that the PMaRL method surpasses traditional optimization methods and achieves cost performance comparable to that of full-visibility methods, all while preserving privacy. This research addresses the dual objectives of information security and cost reduction in SC inventory management, aligning with the broader trend of digital transformation.

1. Introduction

In recent years, there has been increasing research on supply chain (SC) coordination and information sharing aimed at improving overall SC operation through digital processes [1,2,3]. Digital transformation empowers SCs to use data-driven decision making to increase coordination and visibility. It has proven beneficial in enhancing operational profitability and service quality and in mitigating demand uncertainties for individual firms as well as the SC ecosystem [3,4,5]. During the COVID-19 pandemic, SC uncertainties rose, making SCs even more fragile [6]. Furthermore, with fierce global competition and the increasing complexity of SC networks, coordination and visibility play an increasingly crucial role in reducing overall costs and uncertainties for all entities in the SC [7,8,9]. However, achieving effective coordination and information sharing is challenging. Firms usually seek to improve their performance through centralized coordination and decision making within complex SC networks [1].
Data-driven coordination in SCs can be categorized into two main types: full information sharing and partial information sharing [10]. Full information sharing involves gathering data from various firms and centralizing them with a trusted party. A decision model is then developed on a central server using historical data and makes decisions for all firms from real-time data [10]. Since decisions are made for all firms using real-time data, information must be coordinated at every step. In partial information sharing, on the other hand, each SC firm receives, for example, the demand forecast of its downstream firms and aims to meet this demand as fully as possible, using its inventory policy model to generate its own demand prediction. However, the error in demand prediction inevitably accumulates over time. Moreover, although the above-mentioned approaches to collaborating and sharing information offer advantages to firms, they also expose firms to the risk that the shared information may be leaked. Information leakage may lead to financial loss, reputational setbacks, and a negative brand image [11,12]. Generally speaking, current approaches improve SC visibility, but they come with heightened concerns regarding information security. As a result, protecting the privacy of shared information is a major obstacle to SC coordination and impedes the digital transformation of SC optimization.
The development of multi-agent reinforcement learning (MaRL) presents a promising data-driven decision-making solution for improving SC coordination and mitigating the risks of information leakage. MaRL has achieved great success in solving various problems; it involves optimizing the behavior of multiple agents within a shared environment [13]. Data gathered from within the SC can be used to train the MaRL system for decision making. Each agent in MaRL represents a firm in the SC, allowing each firm to make its own decisions. A key feature of MaRL that helps to preserve data privacy is decentralized decision making [14]. Once the decision model has been trained, each learning agent can make decisions independently. This eliminates the need to share information at every decision step, thereby reducing the risk of information leakage.
We present a Privacy-preserving Multi-agent Reinforcement Learning (PMaRL) approach that balances privacy concerns and SC performance. In line with the broader trend of digital transformation, PMaRL trains in a semi-private manner, requiring firms to share information only during training. This shared information comprises embedding data, connectivity details, and training losses. Based on it, each firm can train its own policy and critic networks without disclosing any raw information. The policy networks make decisions based on the current state, while the critic networks assess the value of the decisions made by the policy networks and guide their parameter updates. Once the policy and critic networks have been trained for all firms, each firm can make decisions independently using its local state.
In addition, PMaRL incorporates topology information into the models to guide agent training to focus on useful shared information. As a result, PMaRL outperforms the MaRL methods that do not utilize network information in terms of training time, convergence speed, and final optimization objective.
The contributions of our work are as follows:
  • Our proposed method only requires sharing limited information during the training stage, overcoming the limitation of existing approaches that require full information visibility across the SC.
  • The proposed method incorporates the SC network topology to optimize agent training by classifying SC firms into distinct groups and leveraging their relationships to enhance the training process.
  • Simulation-based evaluations are used to demonstrate the superior performance of PMaRL compared to current optimization methods on practical scenarios of real-world SC networks.
The remaining sections are organized as follows: Section 2 presents related works on information sharing, inventory optimization approaches, and MaRL. Section 3 describes the problem statement of SC optimization. Section 4 presents the details of the PMaRL method. Section 5 evaluates the performance of PMaRL against the existing methods using SC simulation. Section 6 concludes our paper and discusses limitations and future work.

2. Literature Review

2.1. Information Sharing

Multi-stage inventory optimization is a well-studied problem since it is fundamental to studying bullwhip effects, operation cost reduction, and firm cooperation in SCs [15,16]. In SC optimization, information sharing has been viewed as a significant strategy to counter the bullwhip effect and reduce inventory cost [17]. For instance, downstream firms share their projected demand quantity to minimize the disturbances in the SC. This collaborative approach allows all firms to benefit from increased revenues, more agile demand planning, and reduced SC risk [18]. Highlighting the importance of sharing forecasting information within the SC, studies have demonstrated that such sharing results in lower demand prediction errors [19]. Furthermore, information sharing also helps to smooth out the demand curve [17,20].
However, sharing information among different firms introduces the risk of sensitive information leakage, which is a significant challenge for SCs [21,22]. This issue is compounded by the need to balance data privacy with effective collaboration. During the training process, it is crucial to determine what information can be shared without potentially exposing sensitive data. Additionally, it is important to assess whether the distributed decision-making process can increase the benefits of individual firm autonomy while reducing the likelihood of information leakage and potential negative consequences.

2.2. Inventory Optimization Approach

Existing approaches to SC inventory optimization are varied, including stochastic programming, heuristic search, and reinforcement learning (RL) [23,24]. SC inventory optimization involves setting inventory levels to minimize costs and maximize overall efficiency. SC networks can be classified into single-chain and multi-stage networks: a single-chain network is a linear network, whereas a multi-stage SC consists of multiple tiers or levels of SC entities. SC inventory optimization is confronted by challenges such as uncertain demand, complex SC networks, and complicated interactions among SC entities.
To tackle the single-chain inventory problem, Barlas and Gunduz [17] proposed both optimal and heuristic approaches. For multi-stage SC optimization problems, Rong et al. [25] utilized recursive optimization and decomposition aggregation heuristics. In practice, multi-stage SC networks are complex [26] and finding optimal solutions for these complex networks is challenging due to the curse of dimensionality [8,16,23]. Given uncertain demand and complex multi-stage networks, only the optimal solution for the single-chain network can be computed in polynomial time [27,28]. Achieving optimal policy for multi-stage SCs using stochastic programming remains difficult [29]. Existing approaches in such cases are NP-hard, with computational time increasing exponentially with model complexity [28].

2.3. Reinforcement Learning

In an increasingly unstable and complex SC environment, managing SC inventory continues to be challenging in many areas, such as decision making in continuous and discrete processes, the increasing amount of process data, and the intricate external SC environment [7]. Advances in computational resources and machine learning algorithms offer new potential for dynamic, data-driven decision making [30]. Specifically, RL approaches provide end-to-end data-driven solutions for managing SC inventory.
RL primarily focuses on optimizing the actions of either a single agent or multiple agents, typically within simulated environments. In single-agent RL, an agent learns an action policy by exploring and interacting with the environment. In contrast, MaRL involves agents not only interacting with the environment but also developing strategies for competition or cooperation with other agents. This distinct feature of MaRL sets it apart from single-agent RL, as it necessitates that agents account for the dynamics of the multi-agent system and adapt their policies accordingly.
In single-agent RL, Barat et al. [13] leveraged the actor–critic approach in an SC replenishment scenario, focusing on minimizing wastage and backorder quantities within a firm. Additionally, a deep Q-network was introduced to optimize inventory policies, specifically in the context of the beer game [31]. While single-agent RL approaches excel at optimizing the cost for individual entities within the SC, they achieve this independently. However, when each entity optimizes its operation without cooperation, this can lead to a “prisoner’s dilemma” situation [32], potentially degrading overall SC performance. The main drawback of single-agent RL is its limited focus on optimizing the performance of a single entity, often neglecting the broader impact on the entire SC.
In MaRL, studies by Dehaybe et al. [33], Kotecha and Chanona [34], and Nurkasanah [35] applied deep RL approaches to optimize overall SC performance in inventory management and replenishment strategies. These approaches require raw data, such as inventory level and order quantity, from SC entities to train a shared model, which is then used by each SC entity to make informed decisions. While these approaches help SC entities avoid the “prisoner’s dilemma”, they necessitate the sharing of original raw data and even decision models among entities. The primary goal of MaRL is to find optimized policies for multi-agents within a system, but it does not account for data privacy concerns. Although MaRL demonstrates exceptional performance in SC optimization, the issue of privacy remains a significant concern.

3. Research Objective

In this section, we describe the problem settings for multi-stage SC optimization. In many instances, SC optimization occurs under the conditions of incomplete information. For example, some firms may not be willing to share their data or business objectives. Additionally, customer demand is often unpredictable and subject to fluctuations. As a result, obtaining complete information during the optimization process is typically impractical.
The objective of SC optimization is to minimize the inventory costs across the entire SC. A multi-stage SC has the following characteristics [32]:
  • Cooperation with minor competition;
  • Incomplete information;
  • Prisoners’ dilemmas.
In a multi-stage SC, the optimal strategy for firms is to coordinate with one another to avoid competition. When firms act individually, for example by ordering excessive quantities to reduce short-term costs, this ultimately results in higher overall SC costs and increased long-term costs for individual firms. Moreover, incomplete information scenarios are common in a multi-stage SC, where firms can only access their own data. The inventory management decisions of each firm also affect others, leading to “prisoner’s dilemma” situations. In these cases, firms must consider the potential actions of others and cooperate accordingly.
Figure 1 shows an instance of a multi-stage SC where each stage consists of a single entity. The SC network consists of several entities, namely, the retailer, wholesaler, distributor, manufacturer, and supplier. Each of these entities is represented by an agent. In a multi-stage SC, upstream agents (e.g., wholesaler) receive stochastic orders from downstream agents (e.g., retailer). In response, they fulfill orders by sending shipments to the downstream agents.
To simulate real-world scenarios, each agent in the network experiences varying order and shipping lead times. Furthermore, each agent within the SC has its own distinct holding and backorder costs, reflecting the unique cost considerations for each agent.
In the multi-stage SC, $C_i^t$ is the SC cost at time $t$ for agent $i$:
$$C_i^t = c_i^b \, BO_i^t + c_i^h \, IL_i^t, \quad (1)$$
where $c_i^b$ and $c_i^h$ are the backorder and holding cost coefficients for agent $i$, respectively. $BO_i^t$ and $IL_i^t$ are the backorder and inventory levels for agent $i$ at time $t$, respectively.
Then, the long-run system-wide cost $Z$ that we aim to minimize is
$$Z(\pi) = \sum_{t=1}^{T} \sum_{i=1}^{n} C_i^t, \quad (2)$$
where $\pi$ is the set of order policies for all agents and $T$ indicates the maximum number of time steps of the system.
The objective of the SC optimization problem is to find a policy set $\pi^*$ that minimizes the expected system-wide SC cost $Z$:
$$\pi^* = \arg\min_{\pi \in \Pi} \mathbb{E}\left[Z(\pi)\right], \quad (3)$$
where $\Pi$ is the set of all possible order policy sets. In the next section, we elaborate on how PMaRL can be used to optimize the SC costs.
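To make the cost objective concrete, the following minimal Python sketch evaluates Equation (1) for each agent and accumulates the system-wide cost of Equation (2) over a toy trajectory; the coefficients and inventory values are illustrative only and are not taken from the paper.

```python
# Minimal sketch of Equations (1)-(2); all numbers are illustrative.

def agent_cost(backorder_level: float, inventory_level: float,
               c_b: float, c_h: float) -> float:
    """Equation (1): C_i^t = c_i^b * BO_i^t + c_i^h * IL_i^t."""
    return c_b * backorder_level + c_h * inventory_level

def system_cost(trajectory) -> float:
    """Equation (2): sum per-agent costs over all agents and time steps.

    trajectory[t][i] holds (BO_i^t, IL_i^t, c_i^b, c_i^h) for agent i at time t.
    """
    return sum(agent_cost(bo, il, c_b, c_h)
               for step in trajectory
               for (bo, il, c_b, c_h) in step)

# Toy example: two agents over three time steps.
trajectory = [
    [(0, 12, 2.0, 0.5), (3, 0, 2.0, 0.5)],   # t = 1
    [(1, 8, 2.0, 0.5),  (0, 5, 2.0, 0.5)],   # t = 2
    [(0, 10, 2.0, 0.5), (2, 1, 2.0, 0.5)],   # t = 3
]
print(system_cost(trajectory))  # total SC cost Z for this toy trajectory
```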

4. Methods

In this section, we explore the optimization of the overall inventory cost in a multi-stage SC. To tackle this challenge, we introduce a PMaRL model that incorporates network topology information. The model is trained in a centralized manner to enable effective coordination among agents, but during the decision-making phase it operates in a decentralized manner to ensure privacy. Each agent has its own critic network, which includes an encoder, decoder, and the shared attention structure. During training, agents exchange the embedding information and corresponding loss value. While some raw data could potentially be reconstructed through reverse engineering, this approach still offers a viable solution for preserving data privacy.

4.1. Fundamentals

A multi-stage SC can be considered as a special case of a Markov decision process, defined by a tuple $(S, A, P_A, R_A)$, where $S$ is a set of states; $A = \{A_1, \dots, A_n\}$, with $A_i$ the action set of agent $i$; $P_A$ is the probability distribution over all possible next states given state $S$ and action set $A$, $S \times A_1 \times \dots \times A_n \to P_A(S)$; $R_A = \{r_1, \dots, r_n\}$, with $r_i$ the expected immediate reward of agent $i$, $S \times A_i \to r_i$; and $n$ is the total number of agents.
Furthermore, each agent $i$ has its own observation $o_i$, representing its observable partial information from the global state $s \in S$, and its own policy $\pi_i: o_i \to P(A_i)$, corresponding to the action probability distribution under the agent's observation. Agents optimize their policies by maximizing their expected returns:
$$J_i(\pi_i) = \mathbb{E}_{a_i \sim \pi_i, \dots, a_n \sim \pi_n}\left[\sum_{t=0}^{\infty} \gamma^t \, r_i^t\left(s^t, a_i^t, \dots, a_n^t\right)\right], \quad (4)$$
where $a_i \in A_i$ and $\gamma \in [0, 1]$ is the discount factor that determines the extent to which the policy prioritizes immediate rewards over long-term benefits.
The gradient of agent policy networks can be estimated using the policy gradient technique [36] as follows:
$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \log\left(\pi_\theta(a_t \mid s_t)\right) \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'}), \quad (5)$$
where $\theta$ represents the learnable parameter. In practice, especially in complex scenarios, it is hard to fully exploit and explore the entire environment to obtain the precise value of the term $\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'})$, as the expected returns in the policy gradient estimator can vary greatly in different training episodes. To address this issue, actor–critic methods [37] are used to augment the original policy gradient approach. These estimate the expected returns using the Q-value function:
$$Q(s_t, a_t) = \mathbb{E}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'})\right]. \quad (6)$$
The Q-value function can be learned by a temporal-difference learning technique that minimizes the regression loss of the past Q-value and the reward:
$$L_Q = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(Q(s, a) - y\right)^2\right], \quad (7)$$
$$y = r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi(s')}\left[Q(s', a')\right], \quad (8)$$
where $D$ is a replay memory storing past experiences.
To encourage the RL agents to explore the environment and avoid premature convergence to a locally optimal policy, Haarnoja et al. [38] propose the soft actor–critic approach, which introduces the entropy term $\log(\pi(a \mid s))$ and incorporates it into the policy gradient:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\nabla_\theta \log\left(\pi_\theta(a \mid s)\right)\left(Q(s, a) - \alpha \log\left(\pi_\theta(a \mid s)\right) - b(s)\right)\right], \quad (9)$$
where $b(s)$ is the baseline of the Q-value function that measures the value of the current state $s$, and $\alpha$ is a temperature parameter. Accordingly, the loss function of the Q-value is revised by adding the entropy term to Equation (8):
$$y = r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi(s')}\left[Q(s', a') - \alpha \log\left(\pi(a' \mid s')\right)\right]. \quad (10)$$
The actor–critic methods involve updating the policy network and critic network using Equations (7) and (9) iteratively. The policy network makes decisions based on the current state and the critic network estimates the expected return of the policy network using the Q-value function.
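For illustration, the PyTorch-style sketch below computes the entropy-augmented target of Equation (10) and the critic regression loss of Equation (7) for a batch of replayed transitions. It is a simplified single-critic sketch under assumed interfaces (e.g., a policy object exposing a `sample` method that returns an action and its log-probability), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_critic_loss(q_net, target_q_net, policy, batch, gamma=0.99, alpha=0.2):
    """Critic loss of Equation (7) with the entropy-augmented target of Equation (10).

    `batch` holds tensors sampled from the replay memory D: obs, action, reward,
    next_obs. The networks and the policy interface are assumptions of this sketch.
    """
    obs, action = batch["obs"], batch["action"]
    reward, next_obs = batch["reward"], batch["next_obs"]

    with torch.no_grad():
        # Sample a' ~ pi(.|s') and its log-probability for the entropy term.
        next_action, next_log_prob = policy.sample(next_obs)
        next_q = target_q_net(next_obs, next_action)
        # Equation (10): y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')).
        y = reward + gamma * (next_q - alpha * next_log_prob)

    # Equation (7): mean squared regression loss between Q(s, a) and the target y.
    return F.mse_loss(q_net(obs, action), y)
```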

4.2. Policy Networks

In the PMaRL method, all agents train in a centralized manner and make decisions based on their own local information. Each agent only observes local information during decision making and learns to cooperate during the training process. Each agent has its own policy network and the critic network is partially shared among all the agents. To learn agents’ policies, we propose an approach using MaRL and an SC network topology and we provide further details on this approach below.
State variables: In a multi-stage SC, agents are required to make decisions at every time step. Agent $i$ interacts with other agents and the environment by observing state $o_i^t$, taking action $a_i^t$, observing the new state $o_i^{t+1}$, and obtaining reward $r_i^t$. The local state of agent $i$, $s_i^t$, is a tuple $(BO_i^t, IL_i^t, OO_i^t, AS_i^t, AO_i^t)$, where $BO_i^t$, $IL_i^t$, $OO_i^t$, $AS_i^t$, and $AO_i^t$ are the backorder level, inventory level, on-order items, arrival shipment, and arrival order of agent $i$ at time $t$, respectively. The observation $o_i^t$ of agent $i$ at time step $t$ contains its local states over the last $m$ periods, $(s_i^{t-m}, s_i^{t-m+1}, \dots, s_i^t)$; $m$ is set to 5 in our experiments [14].
Action space: In an actual scenario, each agent can order an arbitrary quantity in $[0, \infty)$ from its upstream agent(s). In MaRL, however, the policy network's output must have a finite size. Hence, the order quantity needs to be constrained to a finite action space $A$, which represents all possible actions an RL agent can take in the environment. To increase the robustness and stability of the policy network, we define that each RL agent sends an order quantity of $d + a_i$ to its upstream agent(s), where $d$ is the total demand quantity from its downstream agent(s) [31] and $a_i$ is the output of agent $i$'s policy network. To regulate the demand fluctuation of the downstream agent(s), the policy network uses $a_i$ to control how much to order relative to $d$. After the total order quantity $d + a_i$ is determined, it is divided equally among the upstream agents and sent to them accordingly.
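As a simple illustration of this mapping (our own sketch; names are illustrative), the total order $d + a_i$ can be split evenly across the upstream agents, with clipping at zero and assignment of any integer remainder to the first upstream agent as assumptions of the sketch:

```python
def place_orders(downstream_demand: int, policy_output: int,
                 upstream_agents: list) -> dict:
    """Map the policy output a_i to order quantities sent upstream.

    Total order = max(0, d + a_i), split evenly among upstream agents; the
    clipping and the handling of the integer remainder are sketch assumptions.
    """
    total_order = max(0, downstream_demand + policy_output)
    n_up = len(upstream_agents)
    base, remainder = divmod(total_order, n_up)
    orders = {agent: base for agent in upstream_agents}
    orders[upstream_agents[0]] += remainder
    return orders

# Example: demand of 32, policy output -4, two upstream suppliers.
print(place_orders(32, -4, ["supplier_a", "supplier_b"]))  # {'supplier_a': 14, 'supplier_b': 14}
```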
Reward function: At each time step $t$, agent $i$ observes its own state $o_i^t$ and selects an action $a_i^t$. The environment then provides a reward $r_i^t$ and generates a new state $s_i^{t+1}$. To minimize the overall SC cost, $r_i^t$ is measured based on the agent's cost at time step $t$, which includes the sum of inventory holding and backorder costs. However, due to order and shipment lead times in a multi-stage SC, the reward is not immediately observed after taking action $a_i^t$. Moreover, the reward $r_i^t$ reflects not only a single action but also the combined effect of joint actions from previous periods. As a result, decomposing $r_i^t$ to evaluate the corresponding single action is challenging. To estimate action rewards, we therefore define the local reward $r_i^t$ of agent $i$ at time step $t$ as follows:
$$r_i^t = \frac{1}{\Delta t} \sum_{t'=t+lt_i}^{t+\Delta t} \gamma^{t'-t} \cdot c_r \cdot C_i^{t'}, \quad (11)$$
where $\Delta t$ represents a time window used to estimate the future rewards of a given action, $lt_i$ is the order lead time of agent $i$, $c_r$ is the cost coefficient, and $C_i^{t'}$ is defined by Equation (1). Since the action $a_i^t$ impacts the agent's rewards only after the products have arrived, and the arrival time is at least $t + lt_i$, the local reward $r_i^t$ for agent $i$ can be considered the average of future costs within this time window. This allows agent $i$ to approximate the reward for its action.
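A small sketch of Equation (11) is given below; it averages the discounted future costs inside the window. The sign convention (a negative cost coefficient $c_r$ so that lower costs yield higher rewards) and all numbers are assumptions for illustration.

```python
def local_reward(costs, t, lead_time, window, gamma=0.99, c_r=-1.0):
    """Equation (11): discounted average of costs in [t + lead_time, t + window].

    costs[t'] is the agent's cost C_i^{t'} at time t'; c_r < 0 turns costs into a
    reward to be maximized (a sketch assumption).
    """
    horizon = range(t + lead_time, t + window + 1)
    return sum(gamma ** (tp - t) * c_r * costs[tp] for tp in horizon) / window

# Example: a recorded cost series, action taken at t = 0, lead time 2, 5-step window.
costs = [4.0, 5.0, 6.0, 3.0, 2.0, 4.0, 5.0]
print(local_reward(costs, t=0, lead_time=2, window=5))
```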

4.3. Critic Networks

Iqbal and Sha [14] proposed a Multi-Actor-Attention-Critic (MAAC) framework, which demonstrated exceptional performance in general MaRL scenarios. However, in SC networks, agents may be connected to varying numbers of upstream and downstream agents. Instead of adopting a broad approach like MAAC, where attention is evenly distributed among all agents, each agent should focus more on its immediate neighbors and adopt a decision-making policy that enhances cooperation with adjacent agents.
To address this issue, we incorporate SC network topological information into our proposed PMaRL framework. An overview of a critic model is depicted in Figure 2. Generally, each agent has its own encoder and decoder layers, while the embedding and attention structure are shared across all the agents. The encoder layer generates state embedding by mapping the observation o into a lower-dimensional representation. By sharing these state embeddings, each agent obtains the data encapsulating the current states and actions of the entire SC. Using the topological information, the shared attention structure extracts the hidden relationships among agents from the agents’ state embeddings, capturing the contributions of neighboring agents. Finally, the decoder maps these contributions and state embeddings into the Q-values to assess the performance of the agent’s action.
Initially, each agent's observation $o_i^t$ and action $a_i^t$ are fed into the agent-specific encoder $f_i$, which is a multi-layer perceptron (MLP) network. This process generates the state embedding $e_i^t$ at each time step $t$, as described by the following equation:
$$e_i^t = f_i(o_i^t, a_i^t). \quad (12)$$
To model the relationship among agents in the critic networks, we use an adjacency matrix $M$, an $n \times n$ matrix representing the network topology, where $n$ is the number of agents. The SC network can be considered as a directed graph. Each element $M_{i,j} \in \{-1, 0, 1\}$ captures the relation between agent $i$ and agent $j$: a value of $-1$ indicates that agent $i$ is the predecessor (upstream) of agent $j$, $0$ means agent $j$ is not a neighbor of agent $i$, and $1$ signifies that agent $i$ is the successor (downstream) of agent $j$. This matrix enables each agent to focus on the information from its neighboring agents during training. For each agent $i$, the list of connected neighbors $nb_i$ can be derived from the adjacency matrix.
To promote cooperation among agents, we use multiple attention heads for each agent in the shared attention structure [39], shown in Figure 3. Each head has its own set of parameters $(W_q, W_k, W_v)$. It takes the embeddings of agent $i$ and its neighbors as input to generate the neighbor contribution $X_i$ for agent $i$:
$$X_i = \sum_{j \in nb_i} \beta_j W_v e_j, \quad (13)$$
$$\beta_j \propto \exp\left(e_j^{T} W_k^{T} W_q e_i\right), \quad (14)$$
where $\beta_j$ is the attention weight calculated from the embedding similarity between agent $i$ and agent $j$ using a bilinear mapping (i.e., a scaled dot-product operation). Each head captures the latent relationships extracted from both the neighbors' embeddings and the self-embedding. These latent relationships from all heads are then concatenated to form the final neighbor contribution $X_i$ for agent $i$.
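The NumPy sketch below illustrates a single attention head restricted to an agent's neighbors, following Equations (13) and (14); the matrix shapes, the softmax normalization of the attention weights, and the toy three-agent chain are our own illustrative choices.

```python
import numpy as np

def neighbor_contribution(i, embeddings, adjacency, W_q, W_k, W_v):
    """One attention head over agent i's neighbors (Equations (13)-(14)).

    embeddings: (n, d) state embeddings e_j; adjacency: (n, n) matrix with
    entries in {-1, 0, 1}; nonzero entries in row i mark agent i's neighbors.
    """
    neighbors = np.nonzero(adjacency[i])[0]          # nb_i from the adjacency matrix
    e_i = embeddings[i]
    # Unnormalized attention logits: e_j^T W_k^T W_q e_i (bilinear similarity).
    logits = np.array([embeddings[j] @ W_k.T @ W_q @ e_i for j in neighbors])
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()                # beta_j, normalized over neighbors
    # Equation (13): X_i = sum_j beta_j * W_v e_j.
    return sum(b * (W_v @ embeddings[j]) for b, j in zip(weights, neighbors))

# Toy example: 3 agents, 4-dimensional embeddings, identity parameters.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))
adj = np.array([[0, -1, 0], [1, 0, -1], [0, 1, 0]])  # serial chain 0 -> 1 -> 2
W_q = W_k = W_v = np.eye(4)
print(neighbor_contribution(1, emb, adj, W_q, W_k, W_v))
```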
Once the attention structure generates $X_i$, the decoder $g_i$, which is an MLP network, converts $X_i$ to the expected Q-value $Q_i$:
$$Q_i = g_i(e_i, X_i). \quad (15)$$
In the SC network, agents may differ in the number of input/output edges, and therefore in their neighbor contributions. To classify the connectivity of agents, we define the number of incoming edges as the total number of downstream agents, and the number of outgoing edges as the total number of upstream agents. Agents with the same numbers of incoming and outgoing edges are grouped together based on the adjacency matrix; this group is denoted as $G(i)$, and $|G(i)|$ denotes the size of group $G(i)$. Agents within the same group are expected to have similar structures and inventory policies.
To train the critic model, we create a replay memory that stores historical transition data $(s^t, a^t, s^{t+1}, r^t)$ to ensure training stability. At each training epoch, the loss function is defined as follows:
$$L_\theta = \sum_{i=1}^{N} \sigma_{G(i)} \left(Q_i(o_i, a_i) - y_i\right)^2, \quad (16)$$
$$\sigma_{G(i)} = \frac{1 + \log(|G(i)|)}{|G(i)|}, \quad (17)$$
where $y_i$ is defined in Equation (10), $1 + \log(|G(i)|)$ is the log-scaled group weight for group $G(i)$, and $\sigma_{G(i)}$ denotes the weight assigned to each agent within group $G(i)$. This loss incorporates the log-scaled group weight into the regression against the target of Equation (10), balancing the influence across groups of varying sizes and encouraging agents from different groups to explore.
Agents with the same number of incoming and outgoing edges exhibit similar functionalities. To account for this, we assign log-scaled weights based on group sizes. In contrast, treating every agent equally, as in the MAAC approach [14], causes the critic model to disproportionately emphasize the loss of agents in larger groups, biasing optimization toward these agents and discouraging exploration by agents in small groups. For example, if an SC network consists of 100 retailers and one wholesaler, the MAAC approach might cause the critic model to focus primarily on the retailers due to their similar structure and functionality. As a result, the wholesaler's critic model may adopt the retailers' policy to minimize the cumulative cost, limiting the scope of exploration. In such cases, the MAAC approach often converges to a local minimum with inadequate exploration. Unfortunately, real SC networks frequently have a large number of agents within the same group [26]. The MAAC approach may struggle to balance exploration and exploitation, leading to suboptimal SC performance, as we demonstrate in Section 5.
In PMaRL, we tackle this issue by incorporating log-scaled group weights, allowing the critic model to fairly assess the actions of agents across different groups. The log-scaled group weight for group $G(i)$ is calculated as $1 + \log(|G(i)|)$. The weights assigned to agents within the same group, $\sigma_{G(i)}$, are distributed equally according to these log-scaled group weights. Compared to the MAAC approach, this method ensures that agents in small groups have a greater impact on the loss function, enabling the critic models to better understand and address the imbalance in the SC network topology.
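A minimal sketch of this group-weighting scheme is shown below, assuming the grouping key is the pair of incoming and outgoing edge counts derived from the adjacency matrix as described above; the data structures are illustrative.

```python
import math
from collections import defaultdict

def group_weights(adjacency):
    """Per-agent weights sigma_G(i) = (1 + log|G(i)|) / |G(i)| (Equation (17)).

    Agents are grouped by (incoming, outgoing) edge counts, i.e., by the number
    of downstream (-1 entries) and upstream (+1 entries) neighbors in row i.
    """
    n = len(adjacency)
    groups = defaultdict(list)
    for i in range(n):
        incoming = sum(1 for j in range(n) if adjacency[i][j] == -1)  # downstream agents
        outgoing = sum(1 for j in range(n) if adjacency[i][j] == 1)   # upstream agents
        groups[(incoming, outgoing)].append(i)

    weights = {}
    for members in groups.values():
        sigma = (1 + math.log(len(members))) / len(members)
        for i in members:
            weights[i] = sigma
    return weights

# 100 identical retailers would each get weight (1 + log 100)/100 ~ 0.056, while a
# lone wholesaler keeps weight 1, so small groups are not drowned out in the loss.
adj = [[0, -1, 0], [1, 0, -1], [0, 1, 0]]
print(group_weights(adj))
```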
Subsequently, we apply the loss function to all agents and update their policy network gradients using Equation (9). The baseline $b(s)$ is set to $\mathbb{E}\left[Q_i(o_i, (a_i, a_{\setminus i}))\right]$ to estimate the value of the current state. Each agent samples the actions $a_{\setminus i}$ of other agents from their current policies to estimate its own gradient. Essentially, Equation (9) compares the value of the current action with the average action values of other agents to evaluate whether the current action contributes to increased rewards.

5. Experiments and Results

In this section, we present a multi-stage SC simulation as an evaluation testbed. The experiments evaluate SC performance in terms of cost for various approaches, as well as the convergence of different MaRL methods. The simulation experiments were conducted on a platform equipped with an 11th Gen Intel Core i9 11900KF CPU, 64 GB memory, and NVIDIA RTX 3090 GPU.

5.1. Simulation Setup

To evaluate the performance of various inventory management approaches across different SC networks, we developed an SC simulation model as a test bed for assessing total SC costs. In this simulation model, each agent represents an entity within the SC network and is connected according to the network’s topology. At every time step of the simulation, each agent follows its respective inventory management policy to determine an order quantity, which is then sent to its upstream partner(s). The upstream agent fulfills the order based on its current inventory levels and initiates a shipment. If the order quantity exceeds the available inventory, the unfulfilled portion is recorded as a backorder, which incurs additional backorder costs. Upon receiving a new shipment, agents prioritize fulfilling any outstanding backorders. The total SC costs are calculated as outlined in Equation (2).
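The core bookkeeping of one simulation step for a single agent can be sketched as follows (a simplified illustration, not the authors' simulator): incoming shipments first clear outstanding backorders, orders are shipped up to the available inventory, and the unfilled remainder is recorded as a backorder.

```python
class AgentState:
    """Minimal inventory bookkeeping for one SC agent (illustrative)."""

    def __init__(self, inventory: int):
        self.inventory = inventory
        self.backorder = 0

    def receive_shipment(self, quantity: int) -> None:
        """Incoming stock first clears outstanding backorders, then adds to inventory."""
        cleared = min(self.backorder, quantity)
        self.backorder -= cleared
        self.inventory += quantity - cleared

    def fulfill_order(self, order_quantity: int) -> int:
        """Ship as much as inventory allows; the unfilled part becomes a backorder."""
        shipped = min(self.inventory, order_quantity)
        self.inventory -= shipped
        self.backorder += order_quantity - shipped
        return shipped

    def step_cost(self, c_b: float, c_h: float) -> float:
        """Per-step cost as in Equation (1)."""
        return c_b * self.backorder + c_h * self.inventory

# Example: a wholesaler with 20 units facing an order of 32.
wholesaler = AgentState(inventory=20)
shipped = wholesaler.fulfill_order(32)   # ships 20, backorders 12
wholesaler.receive_shipment(15)          # clears the 12 backorders, stocks 3
print(shipped, wholesaler.inventory, wholesaler.backorder, wholesaler.step_cost(2.0, 0.5))
```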
In our experiments, we construct four distinct SC networks: one based on the beer game (BG) [40] and three real SC networks (R1, R2, and R3) [26], illustrated in Figure 4; Table 1 summarizes their characteristics [41]. The columns labeled stages, nodes, edges, and network complexity give the number of stages, nodes, and edges in each SC network and its network complexity, which represents the uncertainty inherent in dynamic logistics within SC systems [41].
As shown in Table 1, the real SC networks are selected to show the SC performance at different levels of network complexity. For instance, R1 and R3 exhibit higher network complexity with numerous linkages (see Figure 4a,c). In contrast, R2 shows lower network complexity, featuring a simple and repetitive connection topology among wholesaler agents (see Figure 4b). The objective is to evaluate the SC costs of the PMaRL model across different network complexity levels. For each network topology, we adjust the shipment lead time from 1 to 3 days. Increasing the lead time results in higher inventory and backorder costs, thereby allowing us to assess the total cost under various scenarios.
In the simulation, the customer demand for agent $i$ at each time step is modeled as $P(\lambda_i)$, where $P$ denotes the Poisson distribution and $\lambda_i$ is its mean. The Poisson distribution is chosen because it is commonly used for demand pattern generation [28]. To ensure a fair comparison, the total average demand generated across all retailers in an SC is fixed at 32, i.e., $\sum_i \lambda_i = 32$ [42]. The action spaces and demand values are defined in Table 2.
The number of retailers varies by network: there are 1, 4, 9, and 8 retailers in BG, R1, R2, and R3, respectively, based on the SC network topology in the dataset. The action space A is influenced by the Poisson distribution, with lower Poisson means leading to smaller action spaces [31].
In the SC simulation, the duration is set to 50 time steps, each representing one day. The first 5 days serve as a warm-up period. The total cost Z (see Equation (2)) is calculated by summing the backorder and holding costs of all agents in the SC after the warm-up period. To ensure the robustness of the experimental results, the total costs are averaged over 100 simulation runs.
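For concreteness, the snippet below shows one way to generate such retailer demand streams, with per-retailer Poisson means chosen so that their sum equals 32, and to average the post-warm-up total cost over repeated runs; the seed, horizon, and helper names are illustrative assumptions.

```python
import numpy as np

def retailer_demand(n_retailers: int, total_mean: float = 32.0,
                    horizon: int = 50, seed: int = 0) -> np.ndarray:
    """Poisson demand per retailer and time step, with the sum of means = total_mean."""
    rng = np.random.default_rng(seed)
    lam = total_mean / n_retailers
    return rng.poisson(lam, size=(horizon, n_retailers))

def average_total_cost(cost_per_run, warmup: int = 5) -> float:
    """Average the post-warm-up total cost (Equation (2)) over simulation runs."""
    return float(np.mean([np.sum(run[warmup:]) for run in cost_per_run]))

demand = retailer_demand(n_retailers=8, seed=42)   # e.g., the R3 network
print(demand.shape, demand[:2])                    # (50, 8) and the first two days
```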

5.2. SC Inventory Management Settings

We compare our proposed PMaRL method with other SC inventory management methods, including both traditional and MaRL approaches, across different SC networks, as shown in Table 1.
For traditional methods, the first comparison is the base-stock (BS) policy, where a firm places a replenishment order whenever inventory falls below a predetermined base-stock level [43]. The BS method is chosen because it is straightforward to implement and widely used in SC inventory optimization scenarios [3]. The base-stock level is calculated using the formula $\bar{D} \cdot LT + Z_{ser} \cdot std(D) \cdot \sqrt{LT}$, where $D$ is the set of retailer demands, $\bar{D}$ is its mean, $LT$ is the order lead time, $Z_{ser}$ is the standard score for a 95% service level, and $std(D)$ is the standard deviation of demand.
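The base-stock calculation above can be computed as in the short sketch below; the standard-score value of about 1.645 for a 95% service level and the sample demand history are illustrative, and the square-root scaling of the variability term follows the common safety-stock convention assumed here.

```python
from statistics import mean, stdev
from math import sqrt

def base_stock_level(demand_history, lead_time: float, z_service: float = 1.645) -> float:
    """Base-stock level: mean(D) * LT + z * std(D) * sqrt(LT).

    z_service = 1.645 corresponds to a 95% service level; the sqrt(LT) scaling of
    demand variability is an assumption of this sketch.
    """
    return mean(demand_history) * lead_time + z_service * stdev(demand_history) * sqrt(lead_time)

# Example: recent daily retailer demand and a 3-day order lead time.
history = [30, 35, 28, 33, 31, 36, 29, 34]
print(round(base_stock_level(history, lead_time=3), 1))
```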
The second traditional method is the decomposition–aggregation (DA) heuristic, which approximates the optimal base-stock level by breaking down the SC network into simpler serial SCs [25]. The key advantage of using the DA method is that it does not require the details of the cost function and achieves high SC cost reduction in complex SC networks [44]. DA calculates the optimal base-stock levels for these serial SCs and then aggregates the results to derive a heuristic base-stock level. Similarly, the service level in DA is also set to 95 % . The key distinction between these methods is that the BS approach uses local information to determine the base-stock level, while DA uses the information from all entities in the network to approximate the optimal level.
For the MaRL-based methods, we compare three approaches. The first is the MAAC approach [14], which allows agents to exchange their attention embeddings for better coordination. The second is our proposed PMaRL method. The third is a variation of the PMaRL approach with full visibility (FV). The differences between the compared MaRL approaches are given in Table 3. In the critic network, both the PMaRL and FV models utilize the SC network topology, whereas MAAC does not. In the policy network, both MAAC and PMaRL rely only on local information as input, while the FV model requires information from all agents.

5.3. Results—Comparison with Traditional Methods

Figure 5 presents a comparison of the normalized costs across different methods for various lead times, with costs normalized against those of the PMaRL approach. In the figure, the green, orange, and blue bars represent the normalized costs for the BS, DA, and PMaRL approaches, respectively.
The data in Figure 5 indicate that as the network complexity increases, the PMaRL approach consistently outperforms other methods, achieving greater cost savings in the SC. Specifically, the PMaRL approach reduces inventory costs by a factor of 1.4 to 1.6 compared to other methods in high-network-complexity scenarios. The PMaRL approach surpasses the BS method in all cases and shows notable inventory cost savings in high-network-complexity scenarios. The DA approach outperforms BS due to its increased visibility.
The PMaRL approach achieves comparable performance to DA in low-network-complexity scenarios and significantly better performance in high-network-complexity scenarios.
Overall, the experiment results consistently demonstrate that the PMaRL approach effectively reduces inventory costs, underscoring its effectiveness and versatility in handling complicated SC systems. Its ability to achieve cost reductions that benefit all entities highlights the practical value and applicability of the PMaRL approach in real-world SC scenarios.

5.4. Results—Comparison Among Different Multi-Agent-Based RL Methods

Figure 6 demonstrates the normalized cost for different MaRL methods under various lead times, with costs normalized against those of the FV approach. In the figure, the red, purple, and green bars represent the MAAC, FV, and PMaRL methods, respectively. The comparison with the FV approach highlights the differences in performance when full visibility is available versus when it is restricted, whereas the comparison with the MAAC approach emphasizes the advantages of the PMaRL method, which incorporates network topology and logarithmic group weight scaling.
In general, the PMaRL approach outperforms the MAAC approach as the number of nodes in the SC network increases. This is because in larger networks with numerous agents that have similar structures and functions, the MAAC approach tends to focus on minimizing the losses of these agents. This potentially leads to local optima and limits further exploration. By incorporating grouping based on network topology, the PMaRL approach strikes a more effective balance of exploration and exploitation, preventing premature convergence to suboptimal solutions.
When comparing the PMaRL approach to the FV approach, we observe comparable performance across all scenarios. This demonstrates that the PMaRL approach is capable of effective cooperation even in cases where information sharing is restricted or where privacy concerns exist. This capability is crucial for maintaining data privacy while still achieving performance levels on par with full visibility models.

5.5. Results—Training Time and Convergence

Figure 7 compares the average training time per episode among the PMaRL, FV, and MAAC approaches. The variations between different training episodes are negligible. The PMaRL approach is slightly faster than the FV approach. This small difference arises because the PMaRL approach only requires local information as input in its policy network, while the FV approach processes global information. As the number of nodes in the SC network increases, the PMaRL approach demonstrates faster training time per episode, leveraging network topology to focus agents’ attention on their neighbors. This helps learning to focus on relevant information, thereby reducing training time. In contrast, the MAAC approach requires agents to compute attention across all other agents, leading to distractions from less relevant information and slower training time.
Figure 8 shows the normalized inventory cost curves during the training process for R3, with the cost normalized against the corresponding BS approach costs. The green, red, and purple lines represent the inventory costs of the PMaRL, MAAC, and FV approaches, respectively. Although the MAAC approach converges faster, the PMaRL approach achieves lower overall inventory costs, further demonstrating its ability to effectively balance exploration and exploitation during training. Compared to the FV approach, the PMaRL approach yields similar inventory costs but converges more quickly.

5.6. Discussion

Overall, the PMaRL approach consistently outperforms traditional methods in complex SC networks. This success is due to the adaptability of the MaRL model, which generates effective decisions across various scenarios.
Traditional methods typically excel in simple SC networks, especially in the BG network, but struggle with large and complex ones [33]. In contrast, the PMaRL approach utilizes network topology information and explores various inventory policies to establish a robust strategy for the entire SC network. The MaRL approach enables policy models of agents to learn collaboratively during training, enhancing coordination in distributed decision making.
The PMaRL approach is particularly effective for optimizing complex SC networks because it leverages network topology information and does not require prior knowledge for decision making. In comparison, the MAAC approach performs well only in simple SC network topologies. The topology of an SC network has a major impact on the agent’s actions. As each agent’s demand originates exclusively from its downstream, the behaviors of the agent are significantly influenced by its neighbors. Without incorporating network topology information, the MAAC approach fails to fully explore these relationships, leading to higher inventory costs [45]. The PMaRL approach, on the other hand, incorporates a shared attention structure that utilizes network topology and incorporates logarithmic scale group weights in the loss function, making it more adept at handling complex SC network scenarios. Consequently, as the number of nodes in the SC network increases, the PMaRL approach consistently outperforms the MAAC approach.
The PMaRL approach delivers similar performance with faster convergence compared to the FV approach. This indicates that agents can effectively cooperate while making decisions based on local information. The FV approach, however, requires full visibility of information from all agents, which poses privacy concerns in real-world SC scenarios. The main contribution of the PMaRL approach is to address these privacy concerns by sharing only partial information during the training process, allowing agents to make independent decisions during actual operation.

6. Conclusions and Future Works

This paper introduces a novel data-driven decision-making approach, PMaRL, using a MaRL model that leverages network topology to optimize inventory costs across the entire SC. PMaRL trains in a semi-private manner, requiring firms to share information only during training while enabling each firm to make decisions independently using just its local states. Our SC inventory optimization experiments show that the PMaRL approach outperforms traditional methods and MAAC while achieving performance comparable to the FV method. The main contributions of the PMaRL approach are as follows: (1) enhanced SC coordination and improved training time; (2) decentralized decision making to reduce the risk of information leakage.
However, some data still need to be shared during training, e.g., the model losses. Moving forward, our future research will focus on enhancing privacy protection for firms by integrating federated learning (FL) into our model. This approach will allow firms to train their models locally without sharing raw data. By combining FL with MaRL, we aim to reduce the risk of data leakage while simultaneously improving performance through better SC coordination among firms.
In addition, further research will focus on the issue of SC resilience. Especially after the COVID-19 pandemic, the highly uncertain environment has accelerated the demand for building resilient SCs [46]. It is difficult for SC agents both to make good decisions in normal scenarios and to recognize an emergency scenario and swiftly recover from unexpected disruptions. A possible solution is to use a transfer learning technique: when demand patterns change, well-trained decision models can quickly adapt with minimal additional data, allowing SC agents to adapt to disruptions.

Author Contributions

Conceptualization, B.Z., W.J.T., W.C., and A.N.Z.; data curation, B.Z.; formal analysis, B.Z.; funding acquisition, W.C.; investigation, B.Z.; methodology, B.Z.; resources, B.Z.; software, B.Z.; supervision, W.C. and A.N.Z.; validation, B.Z.; writing—original draft, B.Z. and W.J.T.; writing—review and editing, W.C. and A.N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2022-031).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding authors.

Acknowledgments

Bo Zhang is supported under NTU PhD scholarship.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Panahifar, F.; Byrne, P.J.; Salam, M.A.; Heavey, C. Supply chain collaboration and firm’s performance: The critical role of information sharing and trust. J. Enterp. Inf. Manag. 2018, 31, 358–379. [Google Scholar] [CrossRef]
  2. Zhang, B.; Tan, W.J.; Cai, W.; Zhang, A.N. Forecasting with Visibility Using Privacy Preserving Federated Learning. In Proceedings of the 2022 Winter Simulation Conference (WSC), Singapore, 11–14 December 2022; pp. 2687–2698. [Google Scholar]
  3. Sánchez-Flores, R.B.; Cruz-Sotelo, S.E.; Ojeda-Benitez, S.; Ramírez-Barreto, M.E. Sustainable supply chain management—A literature review on emerging economies. Sustainability 2020, 12, 6972. [Google Scholar] [CrossRef]
  4. Zhang, A.N.; Goh, M.; Meng, F. Conceptual modelling for supply chain inventory visibility. Int. J. Prod. Econ. 2011, 133, 578–585. [Google Scholar] [CrossRef]
  5. Zavala-Alcívar, A.; Verdecho, M.J.; Alfaro-Saiz, J.J. A conceptual framework to manage resilience and increase sustainability in the supply chain. Sustainability 2020, 12, 6300. [Google Scholar] [CrossRef]
  6. Nikolopoulos, K.; Punia, S.; Schäfers, A.; Tsinopoulos, C.; Vasilakis, C. Forecasting and Planning During a Pandemic: COVID-19 Growth Rates, Supply Chain Disruptions, and Governmental Decisions. Eur. J. Oper. Res. 2021, 290, 99–115. [Google Scholar] [CrossRef]
  7. Rolf, B.; Jackson, I.; Müller, M.; Lang, S.; Reggelin, T.; Ivanov, D. A review on reinforcement learning algorithms and applications in supply chain management. Int. J. Prod. Res. 2022, 61, 7151–7179. [Google Scholar] [CrossRef]
  8. Jackson, I.; Ivanov, D.; Dolgui, A.; Namdar, J. Generative artificial intelligence in supply chain and operations management: A capability-based framework for analysis and implementation. Int. J. Prod. Res. 2024, 62, 6120–6145. [Google Scholar] [CrossRef]
  9. Lazar, S.; Klimecka-Tatar, D.; Obrecht, M. Sustainability orientation and focus in logistics and supply chains. Sustainability 2021, 13, 3280. [Google Scholar] [CrossRef]
  10. Ramanathan, U. Performance of supply chain collaboration–A simulation study. Expert Syst. Appl. 2014, 41, 210–220. [Google Scholar] [CrossRef]
  11. Chen, Y.; Özer, Ö. Supply Chain Contracts that Prevent Information Leakage. Manag. Sci. 2019, 65, 5619–5650. [Google Scholar] [CrossRef]
  12. Kumar, A.; Shrivastav, S.K.; Shrivastava, A.K.; Panigrahi, R.R.; Mardani, A.; Cavallaro, F. Sustainable supply chain management, performance measurement, and management: A review. Sustainability 2023, 15, 5290. [Google Scholar] [CrossRef]
  13. Barat, S.; Khadilkar, H.; Meisheri, H.; Kulkarni, V.; Baniwal, V.; Kumar, P.; Gajrani, M. Actor based simulation for closed loop control of supply chain using reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 1802–1804. [Google Scholar]
  14. Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970. [Google Scholar]
  15. Chen, L.; Dong, T.; Peng, J.; Ralescu, D. Uncertainty analysis and optimization modeling with application to supply chain management: A systematic review. Mathematics 2023, 11, 2530. [Google Scholar] [CrossRef]
  16. Agrawal, T.K.; Kalaiarasan, R.; Olhager, J.; Wiktorsson, M. Supply chain visibility: A Delphi study on managerial perspectives and priorities. Int. J. Prod. Res. 2024, 62, 2927–2942. [Google Scholar] [CrossRef]
  17. Barlas, Y.; Gunduz, B. Demand Forecasting and Sharing Strategies to Reduce Fluctuations and the Bullwhip Effect in Supply Chains. J. Oper. Res. Soc. 2011, 62, 458–473. [Google Scholar] [CrossRef]
  18. Somapa, S.; Cools, M.; Dullaert, W. Characterizing Supply Chain Visibility—A Literature Review. Int. J. Logist. Manag. 2018, 29, 308–339. [Google Scholar] [CrossRef]
  19. Yang, D.; Zhang, A.N. Impact of information sharing and forecast combination on fast-moving-consumer-goods demand forecast accuracy. Information 2019, 10, 260. [Google Scholar] [CrossRef]
  20. Feizabadi, J. Machine Learning Demand Forecasting and Supply Chain Performance. Int. J. Logist. Res. Appl. 2022, 25, 119–142. [Google Scholar] [CrossRef]
  21. Ried, L.; Eckerd, S.; Kaufmann, L.; Carter, C. Spillover Effects of Information Leakages in Buyer-Supplier-Supplier Triads. J. Oper. Manag. 2021, 67, 280–306. [Google Scholar] [CrossRef]
  22. Tan, K.H.; Wong, W.P.; Chung, L. Information and Knowledge Leakage in Supply Chain. Inf. Syst. Front. 2016, 18, 621–638. [Google Scholar] [CrossRef]
  23. Saha, E.; Ray, P.K. Modelling and analysis of inventory management systems in healthcare: A review and reflections. Comput. Ind. Eng. 2019, 137, 106051. [Google Scholar] [CrossRef]
  24. Fokouop, R.; Sahin, E.; Jemai, Z.; Dallery, Y. A heuristic approach for multi-echelon inventory optimisation in a closed-loop supply chain. Int. J. Prod. Res. 2024, 62, 3435–3459. [Google Scholar] [CrossRef]
  25. Rong, Y.; Atan, Z.; Snyder, L.V. Heuristics for base-stock levels in multi-echelon distribution networks. Prod. Oper. Manag. 2017, 26, 1760–1777. [Google Scholar] [CrossRef]
  26. Willems, S.P. Data set—Real-world multiechelon supply chains used for inventory optimization. Manuf. Serv. Oper. Manag. 2008, 10, 19–23. [Google Scholar] [CrossRef]
  27. Shang, K.H.; Song, J.S. Newsvendor bounds and heuristic for optimal policies in serial supply chains. Manag. Sci. 2003, 49, 618–638. [Google Scholar] [CrossRef]
  28. Lesnaia, E. Optimizing Safety Stock Placement in General Network Supply Chains. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2004. [Google Scholar]
  29. Ahmadi, E.; Masel, D.T.; Hostetler, S. A robust stochastic decision-making model for inventory allocation of surgical supplies to reduce logistics costs in hospitals: A case study. Oper. Res. Health Care 2019, 20, 33–44. [Google Scholar] [CrossRef]
  30. Kim, B.; Kim, J.G.; Lee, S. A multi-agent reinforcement learning model for inventory transshipments under supply chain disruption. IISE Trans. 2024, 56, 715–728. [Google Scholar] [CrossRef]
  31. Oroojlooyjadid, A.; Nazari, M.; Snyder, L.V.; Takáč, M. A deep q-network for the beer game: Deep reinforcement learning for inventory optimization. Manuf. Serv. Oper. Manag. 2022, 24, 285–304. [Google Scholar] [CrossRef]
  32. Fuji, T.; Ito, K.; Matsumoto, K.; Yano, K. Deep multi-agent reinforcement learning using dnn-weight evolution to optimize supply chain performance. Hawaii Int. Conf. Syst. Sci. 2018. [Google Scholar] [CrossRef]
  33. Dehaybe, H.; Catanzaro, D.; Chevalier, P. Deep Reinforcement Learning for inventory optimization with non-stationary uncertain demand. Eur. J. Oper. Res. 2024, 314, 433–445. [Google Scholar] [CrossRef]
  34. Kotecha, N.; Chanona, A.d.R. Leveraging Graph Neural Networks and Multi-Agent Reinforcement Learning for Inventory Control in Supply Chains. arXiv 2024, arXiv:2410.18631. [Google Scholar]
  35. Nurkasanah, I. Reinforcement learning approach for efficient inventory policy in multi-echelon supply chain under various assumptions and constraints. J. Inf. Syst. Eng. Bus. Intell. 2021, 7, 138–148. [Google Scholar] [CrossRef]
  36. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12 (NIPS 1999). Available online: https://papers.nips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html (accessed on 30 September 2024).
  37. Konda, V.; Tsitsiklis, J. Actor-Critic Algorithms. Advances in Neural Information Processing Systems 12 (NIPS 1999). Available online: https://papers.nips.cc/paper_files/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html (accessed on 30 September 2024).
  38. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 30 September 2024).
  40. Chen, F.; Samroengraja, R. The stationary beer game. Prod. Oper. Manag. 2000, 9, 19–30. [Google Scholar] [CrossRef]
  41. Cheng, C.Y.; Chen, T.L.; Chen, Y.Y. An analysis of the structural complexity of supply chain networks. Appl. Math. Model. 2014, 38, 2328–2344. [Google Scholar] [CrossRef]
  42. Zhao, Y. Evaluation and optimization of installation base-stock policies in supply chains with compound Poisson demand. Oper. Res. 2008, 56, 437–452. [Google Scholar] [CrossRef]
  43. Graves, S.C.; Willems, S.P. Optimizing strategic safety stock placement in supply chains. Manuf. Serv. Oper. Manag. 2000, 2, 68–83. [Google Scholar] [CrossRef]
  44. Goldberg, D.A.; Reiman, M.I.; Wang, Q. A survey of recent progress in the asymptotic analysis of inventory systems. Prod. Oper. Manag. 2021, 30, 1718–1750. [Google Scholar] [CrossRef]
  45. Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
  46. Ivanov, D. Exiting the COVID-19 pandemic: After-shock risks and avoidance of disruption tails in supply chains. Ann. Oper. Res. 2024, 335, 1627–1644. [Google Scholar] [CrossRef]
Figure 1. Multi-stage SC.
Figure 2. Overview of critic models.
Figure 3. Shared attention structure.
Figure 4. Three real SC networks [26]. The terms Ret, Whole, Dist, Manu, and Sup represent retailer, wholesaler, distributor, manufacturer, and supplier SC agents in order from downstream to upstream, respectively.
Figure 5. Normalized costs of different optimization methods with increasing network complexity. The value in parentheses denotes the network complexity.
Figure 6. Normalized costs of different MaRL methods with increasing number of nodes. The value in parentheses denotes the number of nodes.
Figure 7. Average training time per episode of different MaRL methods with increasing number of nodes. The value in parentheses denotes the number of nodes.
Figure 8. Convergence of different MaRL methods.
Table 1. SC network characteristics.
SC Network | No. Stages | No. Nodes | No. Edges | Network Complexity
Beer Game (BG) | 4 | 4 | 3 | 1.5
Real Network 1 (R1) | 5 | 17 | 18 | 2.82
Real Network 2 (R2) | 4 | 22 | 39 | 1.91
Real Network 3 (R3) | 5 | 27 | 31 | 2.27
Table 2. SC network parameters.
SC Network | Retailers | Retailer Demand | Action Space A
BG | 1 | $d_1^t \sim P(32)$ | {−8, 24}
R1 | 4 | $d_1^t = d_2^t = d_3^t = d_4^t \sim P(8)$ | {−4, 10}
R2 | 9 | $d_1^t = d_2^t = \dots = d_9^t \sim P(32/9)$ | {−3, 6}
R3 | 8 | $d_1^t = d_2^t = \dots = d_8^t \sim P(4)$ | {−3, 6}
Table 3. Differences between MaRL methods.
Method | Critic Network | Policy Network
MAAC | Without SC network topology | Local Information
PMaRL | With SC network topology | Local Information
FV | With SC network topology | Global Information
