1. Introduction
Modern communication systems are transforming vehicles into increasingly connected devices, with projections indicating that by 2030 there will be over 645 million connected cars worldwide [1]. These connections support the deployment of various services, including infotainment, cooperative traffic control, navigation, and safety features [2], all aimed at enhancing the experience of drivers and passengers. Looking ahead, vehicles are envisioned as fully autonomous systems capable of transporting people or goods by simply receiving instructions at the start of a journey. In this paradigm, the task of driving will be entirely eliminated, removing the need for human drivers to control the vehicle's behavior. Autonomous vehicles are expected to enhance road safety for pedestrians, optimize the utilization of smart roadside infrastructure, and significantly improve access to transportation for individuals such as the elderly, children, or those with injuries [3]. Vehicles will achieve autonomy by maintaining constant communication with surrounding vehicles (V2V, vehicle-to-vehicle communications) and the supporting infrastructure. These secure, real-time communications must be highly reliable, synchronized, and characterized by low latency. Collectively, these interactions are encompassed within the Internet of Vehicles (IoV) framework.
Autonomous vehicles will also require secure, efficient, and reliable internal communication systems to facilitate data exchange among sensors, cameras, and controllers. Traditional in-vehicle communication systems rely on proprietary solutions, which are heavily manufacturer-dependent, suffer from interoperability challenges, and fail to meet the bandwidth demands of modern devices [4]. These limitations have prompted vehicle manufacturers and researchers to explore the feasibility of transitioning to solutions based on open standards [5]. One such proposed standard for implementing real-time systems is time-sensitive networking (TSN), a suite of IEEE standards designed to enable ultra-reliable, real-time, and cost-effective communications over Ethernet [6]. Ethernet is widely recognized for its flexibility and extensive adoption, making it an ideal foundation for TSN-based systems.
Future autonomous vehicles are anticipated to connect to traffic infrastructure—and by extension, the Internet—via components known as roadside units (RSUs) [7]. These RSUs are directly linked to roadside cabinet electronics (RSCE) systems housed within transportation field cabinets (TFCs). Equipped with computing capabilities, TFCs are responsible for collecting and analyzing traffic data transmitted by both vehicles and infrastructure devices [8]. They can also be configured to manage local traffic by using traffic management systems (TMSs). In the IoV framework, the communication between vehicles and RSUs is referred to as vehicle-to-infrastructure (V2I) communication. Many researchers advocate for implementing V2I communications using 5G technology due to its advanced capabilities [9]. RSCEs must be interconnected to enable the seamless exchange of traffic information, forming the basis of infrastructure-to-infrastructure (I2I) communications. These communications must ensure reliability, synchronization, and the capability to transmit vehicle data to other vehicles, traffic management centers (TMCs), and the Internet [10]. The network topology connecting TFCs can vary significantly depending on the distribution and density of roads. Consequently, the communication technology used between infrastructure devices must be both scalable and flexible to accommodate these variations [11]. Unlike in-vehicle communications, where network topologies are relatively simple (e.g., rings or buses) and involve a limited number of devices [12], I2I communications present a much more complex optimization challenge.
Considering (i) the ongoing advancements in designing TSN deployments for in-vehicle communications [13], (ii) the similarity in requirements between in-vehicle and I2I communications [4,11], and (iii) the capability to interconnect TSN networks via 5G [14], we propose adopting a smart infrastructure based on deep reinforcement learning (DRL) and TSN standards. This framework aims to facilitate efficient and reliable communication between IoV infrastructure devices, including RSUs, RSCEs, and TMCs from different TFCs. The structure of TSN networks typically adheres to a software-defined networking (SDN) approach, enabling centralized management to achieve the necessary levels of scalability and flexibility. In this proposal, the introduction of DRL facilitates the automatic and real-time creation of reconfigurable, reliable, synchronous, low-latency, and secure service channels between the road infrastructure and the various TMCs of providers. The proposed mechanism manages online the arrival of requests for flow creation associated with service channels. When an entity requests a synchronous flow to another entity, the system creates the optimal route in real time, keeping the flow synchronized and bounded by a maximum end-to-end communication delay. However, this solution introduces significant management challenges, as creating optimal paths between devices is a computationally complex problem involving both routing and scheduling [15]. This article describes the design, implementation, and evaluation of an automated management model for I2I service channels based on multi-agent reinforcement learning (MARL) frameworks using DRL. These models efficiently manage the routing and scheduling of data frames between IoV infrastructure devices through TSN networks, enabling real-time, synchronous I2I communications. Routing and scheduling models are developed, and a simulated shared environment is created to evaluate the behavior of both agents, which operate within the TSN control plane. The choice of TSN for communications between infrastructure devices ensures the required levels of quality of service (QoS) to support critical applications, such as traffic congestion management or broadcasting accident-related alerts. Additionally, this approach facilitates the deployment of TSN for all wired IoV communications and enables the integration of 5G and TSN for V2I communications. However, the detailed study of V2I communications leveraging 5G networks and TSN is identified as an area for future work.
This work makes two primary contributions:
1. The development of a critical communications network: A TSN-based framework is proposed to interconnect all components of I2I communications, serving as the backbone for future driverless car services. This network achieves ultra-low response times of less than one millisecond, ensuring the reliability and efficiency required for autonomous vehicle operations.
2. The automated real-time management of services: A system leveraging DRL and multi-agent techniques is introduced to manage service demands across network elements (V2I, I2I, and TMC) in real time. This approach eliminates human intervention while ensuring optimal performance and scalability.
The rest of this article is organized as follows: Section 2 introduces the research context and reviews related works. Section 3 defines the problem formulation for integer linear programming (ILP) and DRL models for a joint routing and scheduling solution. Section 4 and Section 5 detail the proposed routing and scheduling solutions, respectively. Section 6 presents the evaluation methodology and results for the models. Finally, Section 7 concludes with a summary of findings and their implications.
2. Related Work
One of the primary applications of TSN is in the industrial sector, where it is anticipated to play a pivotal role in advancing Industry 4.0. In [16], several migration strategies were proposed to transition from existing proprietary industrial networks to a unified TSN-based architecture. The study emphasized that this migration should be implemented gradually to mitigate potential disruptions and accommodate the complexities of the transition. In the automotive sector, ref. [5] proposed several architectures for in-vehicle networks, emphasizing the significant advantages of utilizing TSN in these scenarios. The authors concluded their study by underscoring the necessity of implementing architectural modifications in in-vehicle networks to fully leverage the capabilities of TSN. The authors in [17] proposed a TSN-based SDN architecture for in-vehicle communications and validated its performance on a commercial car. The study highlighted the solution's capability to manage both time-critical and best-effort traffic effectively. The architecture assumes a zoned topology, enabling the coexistence of Ethernet and controller area network (CAN) protocols by implementing gateways at each zone. In this setup, sensors continue to communicate using the CAN protocol, while the core of the network operates on Ethernet, ensuring improved scalability and performance. In [14], a comprehensive review of the technologies enabling the integration of 5G networks with TSN is presented, offering a detailed summary of the state-of-the-art literature on this topic. The integration of 5G and TSN holds significant potential, particularly in scenarios where both in-vehicle communication networks and infrastructure communications are TSN-based. This convergence promises to enhance network performance, ensuring ultra-reliable, low-latency communication critical for emerging automotive and industrial applications.
Research on RSUs mainly focuses on three key areas: the optimal placement of RSUs along roads, the optimal resource allocation for autonomous vehicles, and the development of innovative architectures for the IoV. The authors in [10] analyzed the effectiveness of an RSU deployment in a highway scenario, comparing independent RSUs with interconnected RSUs. They concluded that deploying disconnected RSUs offers minimal improvement in reducing message dissemination delays. In contrast, interconnected RSUs significantly enhance performance, reducing delays by orders of magnitude. The researchers in [18] developed an RSU cloud infrastructure that provides computation capabilities, which are utilized by connected vehicles as virtual machines. To optimize the system, they proposed a multi-objective ILP model aimed at minimizing the infrastructure delay, the number of deployed RSUs, and the frequency of virtual machine reconfigurations. The model tries to keep virtual machines as close as possible to the connected vehicle they are serving. Furthermore, a reinforcement learning (RL) agent was implemented to ensure that the number of reconfigurations is minimized over the long term, enhancing the overall system efficiency. In [19], the authors proposed an ILP method for the joint deployment of RSUs and the assignment of vehicles' service tasks to the deployed RSUs. The authors assumed the allocation of RSUs at crossroads and defined service areas along the surrounding roads, ensuring overlapping coverage between neighboring RSUs. The model incorporates deployment and maintenance costs as factors and enforces a maximum end-to-end delay as a constraint. The authors of [20] proposed a 5G-based IoV architecture based on SDN and fog-cloud computing. RSUs are grouped in fog clusters managed by an RSU controller, which is connected to a centralized SDN controller overseeing multiple fog clusters. The researchers proposed leveraging 5G base stations for V2I communications as a fallback when RSUs become congested. They also presented an optimization model aimed at minimizing service delay and energy consumption while ensuring load balancing across the network.
To the best of our knowledge, no existing literature addresses the DRL-based optimization of TSN routing and scheduling for IoV and I2I communication networks. Furthermore, the solution proposed in this study is the first to employ MARL to tackle TSN's routing and scheduling challenges. A related study, ref. [12], developed an ILP-based routing and scheduling model for TSN-SDN in-vehicle networks considering some vulnerabilities in the Time-Aware Shaper (TAS), a critical traffic shaper for TSN networks. The proposed approach mitigated TAS vulnerabilities, making it applicable to in-vehicle networks. Ref. [21] proposes an algorithm that solves the Edge-Disjoint Rooted Distance-Constrained Minimum Spanning Trees problem with a limited budget. The proposal develops two local operators based on a local search algorithm, which obtain disjoint routes between clients and facilities under budget limits while ensuring fault tolerance. The algorithm requires that all requests between clients and facilities be known in advance. In [22], the authors propose a static routing and scheduling algorithm based on mixed integer programming (MIP) for the in-vehicle network. The algorithm controls the TSN gate control lists by simultaneously optimizing the routing and scheduling of dynamic traffic as the vehicle starts driving. The work focuses on the static scheduling of load-balanced flows across TSN slots by optimizing dynamic requests. This offline model is evaluated for a small number of nodes with a priori knowledge of the demands. Similarly, ref. [23] evaluated an ILP-based routing and scheduling model for TSN networks, analyzing its performance under increasing network sizes and traffic loads across different topologies. While the schedules produced were near-optimal, the high computational runtime rendered the model unsuitable for real-time scenarios, particularly in automotive environments. In contrast, ref. [24] proposed a DRL-based model to solve the routing problem in TSN networks. A key innovation of their work was the integration of graph convolutional networks within the DRL agent. The model accounted for the coexistence of time-critical and best-effort traffic by using multiple priority-based queues. However, the DRL agent did not address the assignment of data flows to these priority queues, leaving a gap in the comprehensive handling of TSN scheduling.
3. Problem Definition
This section outlines the modeling of the infrastructure environment and the problems addressed in this study. Consistent with previous works on RSU deployment [19], we assume RSUs are deployed at urban crossroads. In line with established standards [7], it is further assumed that all the RSUs at a given crossroad are connected to a TFC equipped with an RSCE unit, forming a star topology. The TFCs are interconnected to ensure multiple paths between any pair of nodes, providing enhanced reliability in terms of connectivity. In the proposed model, each TFC is equipped with a TSN switch to enable TSN functionalities. Additionally, each RSU is equipped with cellular or wireless communication technology and a global navigation satellite system (GNSS), which are utilized to establish V2I communications.
Figure 1 illustrates the proposed infrastructure architecture at a crossroad. By deploying synchronous network systems such as TSN, the model achieves very high levels of reliability and ultralow delays, with submillisecond latency.
The I2I infrastructure consists of a network of interconnected TFC nodes, equipped with a secure credential management system (SCMS) and integrated with the TMS of various service operators via a high-capacity TSN fixed network. Each node of the I2I network features a TSN switch, enabling the creation of low-latency service channels between the TFC nodes and their corresponding service operators. These channels are designed to be reliable and secure, ensuring that the controlled traffic between different service providers remains independent. The proposed architecture is based on a centralized TSN network governed by a control plane. This control plane is managed by MARL algorithms, which automate the processes for creating, maintaining, and closing the service channels of the data plane. This real-time operation eliminates the need for human intervention, ensuring seamless and efficient management of the infrastructure.
As shown in Figure 2, the communication control plane is composed of a centralized user configuration (CUC) module and a centralized network configuration (CNC) module. The CUC module is responsible for handling end-user service requests and processing basic network parameters. The CNC module, in turn, receives these service requests from the CUC and configures the TSN switches accordingly to meet the user demands. Our proposed models are designed to be integrated into the CNC module, enabling it to efficiently manage the communication data plane in response to dynamic service requests over time. This integration ensures seamless operation and adaptability to changing network requirements, leveraging the CNC module's centralized control capabilities.
The TSN network connecting all of the infrastructure devices is modeled as a directed graph G = (V, E), where V represents the set of TSN switches located within TFCs and TMSs, and E denotes the set of directed links connecting these entities. The cumulative delay introduced by an edge from source node i to destination node j is defined as d_ij, which accounts for transmission, propagation, and queuing delays. One of the simulated topologies is depicted in Figure 3. In the figure, arrows indicate unidirectional edges between TFCs and TMSs, while the numbers next to the edges represent the delays d_ij in milliseconds. Although the figure focuses on the data plane, it is essential to note that entities of the control plane are also included in this study, following the centralized approach described earlier.
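As an illustration of this graph model, the following minimal Python sketch represents the directed links and their delays; the node names and delay values are invented for illustration and are not taken from the simulated topology of Figure 3.

```python
# Minimal sketch of the I2I graph model: nodes are TSN switches inside
# TFCs/TMSs, and each directed edge carries a per-link delay d_ij (in ms).
# Topology and delay values are illustrative assumptions only.
delays = {
    ("TFC1", "TFC2"): 0.2,
    ("TFC2", "TFC3"): 0.3,
    ("TFC3", "TMS1"): 0.1,
}

def path_delay(path):
    """Cumulative end-to-end delay of a route: the sum of its edge delays."""
    return sum(delays[(a, b)] for a, b in zip(path, path[1:]))

print(round(path_delay(["TFC1", "TFC2", "TFC3", "TMS1"]), 3))  # 0.6
```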
The TSN architecture provides an infrastructure capable of handling critical services, ensuring that information frames are transmitted synchronously with a transmission period chosen by the transmitting and receiving entities through the CUC module. The proposed model ensures that whenever a vehicle or any network element needs to transfer critical information, a free and synchronous channel is always available. This channel guarantees a constant transmission rate with minimal transmission delay and limited delay variation between entities, ensuring reliable and efficient communications.
This synchronous flow of control data establishes a service channel, defined as a TSN unidirectional transmission of data frames from a source device, referred to as a talker, to a destination device, referred to as a listener. Connections between RSCEs and their corresponding service providers managing the TMS are assumed to facilitate the exchange of control data. These data must adhere to strict delivery requirements, including maximum delay and jitter bounds. These connections are periodic and have fixed frame sizes. A service request sent from an end device, such as a TFC or a TMS, to the CUC module is based on a tuple (src, dst, fsize, p, dmax). For each parameter in this tuple, the following assumptions are made:
src and dst are integer identifiers for the TFCs and TMSs where the source and destination devices are directly connected, with src ≠ dst.
fsize represents the frame size in bytes.
p is the frame transmission period in milliseconds.
dmax is the maximum acceptable end-to-end delay in milliseconds.
The proposed scenario is inspired by a subset of streets and crossroads of downtown Barcelona. Service requests are then randomly generated based on aggregated traffic information.
Moreover, some concepts related to time division and link scheduling are defined. Since data frames are sent periodically, the scheduling of frames is also periodic. Given the multiple possible period values, the overall period, referred to as the hyperperiod, is the least common multiple (LCM) of all period values. The data flow scheduling along a hyperperiod follows a consistent pattern unless a new service request is received or an existing connection stops sending frames. The hyperperiod is divided into equally long slots of 100 microseconds, referred to as the slot time. Assuming that all network links operate at 1 Gbps, it is possible to transmit 12,500 bytes in each slot, which is termed the slot size. These concepts—hyperperiod, slot time, and slot size—are illustrated in Figure 4.
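These quantities can be computed directly, as the following sketch shows; the 100 µs slot time and 1 Gbps link rate come from the text, while the example period values (2, 4, and 8 ms) are assumptions for illustration.

```python
import math

SLOT_TIME_US = 100             # slot time: 100 microseconds (from the text)
LINK_RATE_BPS = 1_000_000_000  # all links assumed to operate at 1 Gbps

def hyperperiod_ms(periods_ms):
    """Hyperperiod = least common multiple (LCM) of all flow periods."""
    return math.lcm(*periods_ms)

def slot_size_bytes():
    """Bytes transmittable per slot: rate (bits/s) * slot time (s) / 8."""
    return int(LINK_RATE_BPS * SLOT_TIME_US * 1e-6 / 8)

print(hyperperiod_ms([2, 4, 8]))  # 8
print(slot_size_bytes())          # 12500
```

With the stated parameters, each 100 µs slot indeed carries 12,500 bytes, matching the slot size given above.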
To simplify the scheduling of data flows with specific periodicities, we introduce the concept of positions, which represent groups of slots that can be used for transmitting a data flow. If the hyperperiod is divided into groups of slots whose total duration equals the period p of a data flow, that flow can be transmitted by selecting the same time-ordered slot within each group. These positions satisfy the following property: their slots are equally spaced by p milliseconds. As a result, selecting a single position ensures that the periodicity between consecutive data frames is maintained within the time duration of a slot, providing a straightforward and consistent mechanism for periodic data flow scheduling. The definition of positions depends on the period of the service request, meaning that the number of possible positions for scheduling frames differs for each period value.
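The slots belonging to one position can be enumerated as in the sketch below; the 100 µs slot time is from the text, while the 8 ms hyperperiod and 2 ms period used in the example are illustrative assumptions.

```python
SLOT_TIME_MS = 0.1  # 100-microsecond slots (from the text)

def position_slots(hyperperiod_ms, period_ms, position):
    """Slot indices forming one position: the same time-ordered slot within
    each period-long group of the hyperperiod, so consecutive slots of the
    position are spaced exactly one period apart."""
    slots_per_group = int(period_ms / SLOT_TIME_MS)
    groups = int(hyperperiod_ms / period_ms)
    if not 0 <= position < slots_per_group:
        raise ValueError("position does not exist for this period")
    return [g * slots_per_group + position for g in range(groups)]

# Position 3 for a 2 ms flow in an 8 ms hyperperiod: slots spaced 2 ms apart.
print(position_slots(8, 2, 3))  # [3, 23, 43, 63]
```

Note that a 2 ms period yields 20 positions (slots per group), while a 4 ms period yields 40, which is why the number of possible positions differs per period value.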
With the architecture of critical I2I communications now defined, we propose a multi-agent solution based on DRL to identify and maintain optimal routes while performing optimal scheduling for the requested traffic. This approach offers the significant advantage of delivering optimal flow services in real time without requiring intervention from network operators or service providers. The following sections describe the implementation of the routing and scheduling solutions, which are built using multi-agent value-based DRL agents. Each solution is modeled as a Markov decision process (MDP) to formalize the decision-making framework.
ILP Formulation
This section provides the mathematical formulation for managing the routing and scheduling of data frames between IoV infrastructure devices through TSN to ensure real-time synchronous communications. The problem can be defined as an ILP model that determines the routing and periodic slots for each arriving service request so as to meet its requirements. The objective is to minimize the end-to-end delay in transmitting data frames. The ILP is called the One-Request Joint Routing and Slot-Capacity-Aware Flow Placement (One-Req-RSCAFP) problem.
As defined in the previous section, the TSN network scenario is modeled by a directed weighted graph G = (V, E) adopting a specified topology, with V the set of nodes and E the set of edges. The nodes are the TSN switches inside the TFCs and TMSs. Each link (i, j) ∈ E is characterized by a delay d_ij, which allows us to evaluate and optimize the delay performance of the transmissions under various conditions. Each link is slotted with a hyperperiod S that emulates the TSN synchronous network. Moreover, let c_s be the capacity of the link slot s, which constrains communication on the network to fixed-length, repeating time cycles determined by the period of the service request.
Let us consider a service request R defined by the tuple (src, dst, fsize, p, dmax).
Thus, this work defines an iterative procedure that applies the One-Req-RSCAFP model at each request arrival, determining a route with enough periodic slot capacity under different background traffic levels. Then, given the optimal valid route, the final scheduling is determined using a simple heuristic.
To formulate the One-Req-RSCAFP problem, let us define a binary variable x_ij indicating whether a link (i, j) ∈ E is part of the constrained route of the service request, and let y_ij^s be a binary variable specifying whether a slot s of the link (i, j) is used for the transmission.
With all of these considerations, the ILP for the One-Req-RSCAFP problem comprises the objective (3) subject to constraints (4)–(12), described below.
The objective given in (3) aims to minimize the total cost delay. Constraints (4) and (5) formulate the flow conservation property at the slot level needed to ensure a correct path leading to the destination, the first for the source node and the second for other nodes in the network. Constraint (6) ensures that the data frame transmitted on a slot does not exceed its capacity.
Constraint (7) determines that if a link (src, j) starting from the request's origin is active (i.e., x_src,j = 1, so it is a link of the path used) and has a slot s with sufficient periodic capacity within the link, the corresponding variable y_src,j^s is set to 1. This constraint is non-linear, but it can be linearized into an equivalent linear restriction defined for each link (src, j) ∈ E and each valid slot s, i.e., each slot from which enough periodic capacity exists.
Constraint (8) considers two contiguous links, (i, j) and (j, k), which are part of the path to be followed by the request, in which the first link has a slot s with sufficient periodic capacity, that is, y_ij^s = 1. In this situation, the second link must have a slot with enough capacity, according to the periodicity of the request, at the same position as the first link or later in time, i.e., a slot s' ≥ s. This constraint is also nonlinear and, through linearization techniques, is equivalent to a linear restriction defined for each link (i, j) with initial node i different from the origin src, each contiguous link (j, k) ∈ E, and each pair of valid slots.
Constraint (9) ensures that a slot on an inactive link (i.e., not being part of the resulting path) is not occupied. In addition, constraint (10) guarantees that the total path delay does not exceed the maximum acceptable end-to-end delay of the request.
Finally, all variables are defined as binary in constraints (11) and (12).
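Since the numbered formulas are not reproduced here, the following LaTeX sketch illustrates a formulation consistent with the constraint descriptions above; the symbols x_ij, y_ij^s, d_ij, fsize, and dmax are those introduced earlier, the equation numbering is omitted, and the exact published equations may differ.

```latex
% Sketch of One-Req-RSCAFP, reconstructed from the constraint descriptions.
\begin{align*}
\min \;& \sum_{(i,j)\in E} d_{ij}\, x_{ij}
  && \text{(objective: total path delay)}\\
\text{s.t.}\;
 & \sum_{j} x_{\mathit{src},j} - \sum_{j} x_{j,\mathit{src}} = 1
  && \text{(flow leaves the source)}\\
 & \sum_{j} x_{ij} - \sum_{j} x_{ji} = 0
  && \forall\, i \in V\setminus\{\mathit{src},\mathit{dst}\}\\
 & \mathit{fsize}\cdot y_{ij}^{s} \le c_{s}
  && \text{(slot capacity)}\\
 & y_{ij}^{s} \le x_{ij}
  && \text{(no slots on inactive links)}\\
 & \sum_{(i,j)\in E} d_{ij}\, x_{ij} \le d_{\max}
  && \text{(end-to-end delay bound)}\\
 & x_{ij},\, y_{ij}^{s} \in \{0,1\}
\end{align*}
```

The slot-level flow conservation and slot-ordering couplings of constraints (7) and (8) are omitted from this sketch, as their linearized forms are not given in the text.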
The ILP requires a high computation time to obtain a solution, making its use in real time infeasible. For example, in a scenario with 600 to 900 background flows, the average computation time required to obtain a result with the ILP is 42.25 ms, while the time spent by the proposed solution is about 4 ms.
For this reason, this work chooses a method based on MARL integrated with DRL. Once the trained model is obtained, this method allows for finding a time-efficient solution close to the optimum and with low computational complexity.
5. MARL-Based Scheduling Solution
Once the routing agent identifies the next link for traffic scheduling, the scheduling agent takes responsibility for assigning data frames to time slots, aiming to minimize buffering time at intermediate TSN switches. The primary objective is to select positions that allow data frames arriving at an intermediate node to be forwarded immediately, avoiding queuing delays. This process is based on the following assumption: if node n schedules a data flow at position i, the frames will arrive at node n + 1 at a time slot corresponding to position i. Ideally, the optimal policy is to maintain consistency by scheduling data frames in the same position as the previous link whenever possible. If resource constraints make this infeasible, the next available position in the temporal domain should be selected, following the minimum residence time criterion. Conversely, selecting a position immediately before the optimal one in the timing domain is highly inefficient, as it forces data frames to wait nearly an entire period before being forwarded, thereby significantly increasing end-to-end latency. To mitigate this, the model ensures that each episode lasts only a single time step, with one action taken per episode. As a result, scheduling frames along the entire path between end devices requires initiating a separate episode for each hop. Building on these principles, a scheduling Markov decision process (MDP) has been formulated, as detailed below.
5.1. State Space
For the scheduling model, the environment sends the following information to the agent:
The position identifier where the data flow has been scheduled at the previous TSN switch, referred to as the current position.
A vector containing information about all possible positions. For each position, the corresponding value in the vector is 1 if there are sufficient resources to assign the given data flow, or −1 if not. A position is considered to have enough free bytes for scheduling if all of the slots within that position have at least fsize free bytes available. The length of the vector is equal to the maximum number of possible positions, which is computed from the hyperperiod and the frame periodicity. Since smaller periods yield fewer positions, the initial values in the vector contain accurate information about all possible positions, while the remaining entries are padded with −1 until the vector reaches its full length.
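The resulting state vector might be built as in this sketch; the availability flags and the maximum vector length used in the example are illustrative assumptions.

```python
def build_state_vector(available, max_positions):
    """State sketch: 1 where the position has enough free bytes for the flow,
    -1 where it does not; entries beyond the positions that actually exist
    for this period are padded with -1 up to the maximum vector length."""
    vec = [1 if ok else -1 for ok in available]
    vec += [-1] * (max_positions - len(vec))
    return vec

# A flow whose period yields 4 positions, padded to a maximum of 8 entries.
print(build_state_vector([True, False, True, True], 8))
# [1, -1, 1, 1, -1, -1, -1, -1]
```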
5.2. Action Space
The action space in the scheduling model corresponds to the number of possible positions available for scheduling data frames, which matches the size of the positions vector defined in the state space. An additional action allows the agent to reject the service request when no available positions exist. To improve the efficiency and stability of the learning process, filters have been applied to the action space. These filters are described as follows:
Non-existent positions filter: All positions that do not exist for the period specified in the service request are filtered out to prevent the agent from attempting to schedule frames at invalid positions.
Insufficient resources filter: All positions without enough resources to schedule the data flow described by the service request are also filtered out.
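Both filters amount to masking invalid actions before the agent selects one; the sketch below shows this idea, where the reject action being modeled as index -1 is our own assumption.

```python
def valid_actions(state_vector, n_existing_positions):
    """Apply the two action-space filters: drop positions that do not exist
    for the request's period, and positions without enough resources
    (entries equal to -1). Returns the action indices the agent may choose;
    if none remain, only the reject action (index -1 here) is left."""
    actions = [
        i for i, v in enumerate(state_vector[:n_existing_positions]) if v == 1
    ]
    return actions if actions else [-1]

print(valid_actions([1, -1, 1, 1, -1, -1, -1, -1], 4))  # [0, 2, 3]
print(valid_actions([-1, -1, -1, -1], 4))               # [-1]
```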
5.3. Reward Modeling
The reward function is designed to guide the agent toward minimizing the residence time at intermediate TSN switches and is expressed in Equation (16), where:
p_sel is the position selected by the agent.
p_opt is the optimal position that the agent should have chosen (i.e., the next available position starting from the current one).
δ is a binary value set to 1 when p_sel and p_opt have the same value; otherwise, it is 0.
The agent is rewarded with 10 points when making optimal decisions (when p_sel matches p_opt) and penalized in the rest of the cases. The penalty increases with the time-ordered distance between the selected position and the optimal position, making the worst policy the selection of the position immediately preceding the current one, as this leads to maximum queuing delays. Since the episode length is only one time step, there is no previous reward to consider in the calculation.
Several training rounds have been performed with more gradual penalties provided by exponential decay functions, as formulated in Equation (17), where a is a scaling factor that regulates the slope of the exponential function. Results show that the model is unable to converge to an optimal solution for values of a lower than 4, whereas for higher values, the model converges to solutions whose optimality is comparable to the levels provided by the models based on linear decay functions.
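Since Equations (16) and (17) are not reproduced here, the sketch below only illustrates the qualitative shape described above: the 10-point optimum is from the text, while the exact penalty expressions are our own assumptions.

```python
import math

def distance(p_sel, p_opt, n_positions):
    """Time-ordered (circular) distance from the optimal position to the
    selected one; the worst case is the position just before the optimum."""
    return (p_sel - p_opt) % n_positions

def reward_linear(p_sel, p_opt, n_positions):
    """10 points for the optimal position; otherwise a penalty that grows
    linearly with the distance (assumed form; Eq. (16) is not shown)."""
    d = distance(p_sel, p_opt, n_positions)
    return 10.0 if d == 0 else -10.0 * d / (n_positions - 1)

def reward_exponential(p_sel, p_opt, n_positions, a=4.0):
    """More gradual penalty via exponential decay, with scaling factor a
    regulating the slope (assumed form; Eq. (17) is not shown)."""
    d = distance(p_sel, p_opt, n_positions)
    return 10.0 if d == 0 else -10.0 * (1 - math.exp(-a * d / n_positions))

# Optimal choice is rewarded; drifting further past the optimum is worse.
assert reward_linear(5, 5, 20) == 10.0
assert reward_linear(6, 5, 20) > reward_linear(4, 5, 20)  # d=1 vs d=19
```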
6. Evaluation
This section describes the training and evaluation of the previously defined models, presenting the results of this work. Before diving into specific details, the overall procedure for training and evaluating the models is outlined. Initially, each agent is trained independently of the other: the scheduling of data frames is treated as transparent to the routing agent, and routing decisions are transparent to the scheduling agent. Once the training phase is complete, the trained models are jointly evaluated. For the evaluation, a joint simulated environment has been developed. This environment sequentially interacts with both agents, sending states to and receiving actions from each. All simulations were conducted on a computer equipped with a 3.30 GHz Intel Core i9 processor and 64 GB of RAM.
6.1. Agent Description and Configuration
Both the routing and scheduling agents are implemented using the Rainbow algorithm [
25], an enhanced version of the deep Q-network (DQN), one of the most popular value-based DRL algorithms. Rainbow incorporates several improvements to DQN, focused on the neural networks and the replay buffer. The deployed neural networks include noisy layers and physically separated value and advantage branches, following the Rainbow architecture. Additionally, Rainbow employs two copies of the same neural network: one for training and one for target evaluation. The replay buffer is also enhanced to prioritize experiences that provide greater learning value; this modified version is called prioritized experience replay (PER). The learning process is further improved by calculating distributions of expected Q-values instead of single discrete values, and a mechanism called n-step learning is introduced to assess the long-term effect of each action. The noisy layers force the agent to explore until it learns to ignore the noise, serving as an alternative exploration strategy to ε-greedy. Moreover, an ADAM optimizer is used for updating the neural network weights. The complete set of hyperparameters configured for both agents is provided in
Table 1.
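As an illustration, the hyperparameter values stated in the descriptions below can be gathered into a single configuration sketch. This is a reconstruction from the text, not a verbatim copy of Table 1; values that appear only in the table (such as the PER α and β or the ADAM ε) are omitted.

```python
# Hyperparameters for the Rainbow agents, as stated in the text.
RAINBOW_CONFIG = {
    "replay_buffer_size": 1_000_000,  # experiences kept before the oldest are discarded
    "batch_size": 32,                 # experiences sampled per training step
    "target_update_period": 1000,     # time steps between target-network updates
    "discount_factor": 0.99,          # relevance of long-term rewards
    "learning_rate": 1e-4,            # initial ADAM learning rate
    "soft_update_tau": 0.005,         # fraction of online weights blended into the target
    "n_step": 3,                      # n-step learning window size
    "v_min": -60.0,                   # lower margin of the return distribution
    "v_max": 150.0,                   # upper margin of the return distribution
    "atom_size": 51,                  # number of atoms in the return distribution
}
```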
A deeper description of each hyperparameter is provided to facilitate understanding of the selected values:
The replay buffer size represents the number of experiences that can be accumulated in the replay buffer. Experiences are used to train the agent for as long as they remain in the buffer; once it is full, the oldest experiences are discarded. A size of 1,000,000 samples is typical, as larger buffers do not usually provide better results, and smaller ones slow training convergence.
The batch size is the number of experiences used to train the agent at each time step. In general, as this value increases, training becomes faster but less accurate, so a trade-off must be established. Based on previous experience, we use a size of 32 samples.
The target network update periodicity configures how often the weights and biases of the online neural network are copied to the target neural network. As this value decreases, training becomes faster, but the risk of losing convergence increases; as with the batch size, a trade-off must be achieved. We use a periodicity of 1000 time steps based on previous experience.
The discount factor modifies the relevance of long-term future rewards against short-term ones. Values for this hyperparameter are usually high, which is recommended when the last reward of the episode is very relevant to the agent. For this reason, we use a discount factor equal to 0.99.
The learning rate is the factor used to update the neural network weights during the backpropagation process. As an ADAM optimizer is used, the provided value is just the initial one, as the optimizer modifies it during training. It is recommended to use initial values between 0.0001 and 0.01, so we use the minimum value of this range.
The soft update coefficient is related to the percentage of weights not passed to the target network when updating it. This mechanism provides smoother variations in the target network, which, as a consequence, enhances learning stability. We use a soft update coefficient of 0.005 to benefit from this feature.
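The soft update described above follows the standard Polyak-averaging rule θ_target ← (1 − τ)·θ_target + τ·θ_online. A minimal sketch over flat weight lists (the real agents update tensors, so this is illustrative only):

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Blend a fraction tau of the online weights into the target weights.

    With tau = 0.005, 99.5% of each target weight is retained, producing
    the smooth target-network variations that stabilize learning.
    """
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]
```

For example, `soft_update([1.0], [0.0])` moves the target weight only from 1.0 to 0.995.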
Both hyperparameters related to the PER mechanism (α and β) are used to give priority to those experiences from which the agent can learn better policies. While α is the hyperparameter that provides the desired priorities, it might over-prioritize some experiences, so β is used as a scaling factor. Typical values for α are between 0.4 and 0.6, and β typically ranges from 0.3 to 0.5. We provide significant levels of prioritization with the α value listed in Table 1 and scale those priorities with the listed β value.
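The roles of α and β in PER can be sketched as follows: α shapes the sampling probabilities from the TD errors, and β controls the importance-sampling correction applied to the sampled experiences. This is a generic PER sketch, not the paper's implementation.

```python
def per_probabilities(td_errors, alpha):
    """Sampling probabilities proportional to |TD error|^alpha.

    A small constant keeps every experience's probability non-zero;
    alpha = 0 recovers uniform sampling.
    """
    scaled = [(abs(e) + 1e-6) ** alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta):
    """Importance-sampling weights (N * P(i))^(-beta), normalized by the
    maximum so the largest weight is 1; beta scales down the bias
    introduced by prioritized sampling."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]
```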
The value for the ADAM optimizer is a very small number that prevents the agent from exceptions related to dividing by zero. A typically used value is .
The n-step window size sets the number of future rewards accumulated by the n-step mechanism to evaluate the long-term effects of current actions. As this value increases, the computational cost also increases, although better policies may be achieved. As episodes in our environment are generally short, we use a window size of three steps.
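With a window of three steps, the n-step target accumulates the first three discounted rewards and bootstraps from the value estimate of the state reached afterwards. A minimal sketch of that computation:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    """n-step return: the first n discounted rewards plus the discounted
    value estimate of the state reached after n steps."""
    partial = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return partial + gamma ** n * bootstrap_value
```

For instance, with rewards `[1, 1, 1]`, a bootstrap value of 10, γ = 0.5, and n = 3, the target is 1 + 0.5 + 0.25 + 0.125 · 10 = 3.0.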
The margins of the distribution of Q-values are the minimum and maximum values considered in the distribution of returns in distributional DQN, and the atom size is the number of values present in that distribution. Increasing the atom size raises computational costs, but more refined distributions are calculated, which speeds up the training process. Based on previous experience, we define a distribution between −60 and 150 with 51 atoms.
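In distributional DQN, the chosen margins and atom size define a fixed, evenly spaced support over which the return distribution is estimated. With the values above, the atoms are spaced (150 − (−60)) / 50 = 4.2 apart:

```python
def distribution_support(v_min=-60.0, v_max=150.0, atom_size=51):
    """Fixed support of the return distribution: atom_size evenly spaced
    values between v_min and v_max (spacing of 4.2 with these defaults)."""
    delta_z = (v_max - v_min) / (atom_size - 1)
    return [v_min + i * delta_z for i in range(atom_size)]
```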
An efficient neural network architecture is essential to balance the trade-off between learning capability and simplicity. A network that is too simple could result in inadequate learning, while an overly complex one could significantly decrease learning speed due to the increased computational cost. For this reason, each agent uses a neural network with a first standard layer of 60 neurons, whose output is fully connected to a value branch with 60 neurons and an advantage branch with another 60 neurons. All subsequent layers are fully connected, and all neurons use the ReLU activation function.
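The separate value and advantage branches are recombined into Q-values using the standard dueling aggregation, which subtracts the mean advantage to keep the decomposition identifiable. A sketch of that final step, independent of any deep learning framework:

```python
def dueling_q_values(state_value, advantages):
    """Combine the value and advantage branch outputs into Q-values:
    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

For example, a state value of 10 with advantages `[1, 2, 3]` yields Q-values `[9, 10, 11]`.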
6.2. DRL Agent Training
As previously described, the routing and scheduling agents are independently trained using the I2I topology shown in
Figure 3, but some training-related concepts are common to both agents. First, all of the service requests between TFCs and other TFCs or TMSs have been artificially created and saved in a dataset. This allows for fair comparisons of model performance when changes are made to the implementation. Once the agent is trained, additional datasets are used to evaluate the created models. To simulate network load, a configurable number of background data flows is routed and scheduled at the beginning of each episode. This ensures the agents can find links or positions within the hyperperiod that lack sufficient resources, forcing them to explore other valid routes. Background traffic data flows share the same properties as regular flows: they are routed along the shortest available path and scheduled at the first available position starting from the current one. Background traffic data flows are described by the same types of service requests as regular data flows. Specifically, 300 background traffic data flows are generated at the beginning of each episode between all 13 nodes in the network. These background flows remain constant throughout the episode—no new service requests are generated, and no existing connections terminate. This behavior is based on the assumption that control channels between TFCs and TMSs are established as long-term connections with minimal variability over time. Each training round for both agents lasts 500,000 time steps, which takes approximately one hour to complete.
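The background-traffic setup above can be sketched as a small dataset generator: random, distinct source/destination pairs among the 13 nodes, fixed at the start of each episode. The function name and the use of a seed for reproducibility are illustrative assumptions, not the paper's actual tooling.

```python
import random

def generate_background_flows(num_flows=300, num_nodes=13, seed=0):
    """Generate background service requests between random, distinct node
    pairs. A fixed seed reproduces the same dataset across runs, which
    supports fair comparisons between model variants."""
    rng = random.Random(seed)
    flows = []
    for _ in range(num_flows):
        src, dst = rng.sample(range(num_nodes), 2)  # distinct endpoints
        flows.append((src, dst))
    return flows
```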
Several training rounds are launched for the routing agent to enable it to learn how to establish optimal paths between any pair of TSN switches.
Figure 5 illustrates the evolution of rewards and losses experienced by the agent during the training process. The blue lines indicate the actual results gathered during training, and the orange lines indicate averaged values. The figure shows a steady increase in the average reward as the agent continues to explore and learn, until the reward eventually stabilizes at a constant level. The final training reward of the routing model is 103.87 points, indicating that the agent learned most of the routes optimally. Regarding loss, it remains consistently low throughout the entire training process, highlighting the stability and efficiency of the training methodology.
For the scheduling agent, several training rounds are also launched. The objective is to enable the agent to schedule data flows such that the residence time in intermediate nodes remains as low as possible.
Figure 6 illustrates the evolution of the rewards and losses during the training process. The average rewards quickly increase and stabilize at a constant level above nine points. However, the actual evolution of rewards provides deeper insights. From the early stages of training, the minimum reward value was around −1, indicating that when the agent fails to schedule a data flow optimally, it often selects the subsequent position, incurring only a minor penalty. This behavior demonstrates the agent’s ability to adapt and minimize penalties even in suboptimal scenarios. The final average reward is 9.23 points. Regarding loss, although it is significantly higher than in the routing training rounds, it becomes stable early in the training process, reflecting the model’s convergence and reliability.
6.3. Simulation Results
As previously described, the trained models are evaluated to test their performance, with all evaluations using the pre-trained routing and scheduling models. The evaluations involve assigning resources to 1000 service requests and analyzing the results at the end of each episode. The joint evaluation of the routing and scheduling models follows a sequential decision-making process between the two agents. Given that scheduling states depend on routing actions (as shown in
Figure 7), the evaluation begins by sending the first environmental state to the routing agent, which decides on the first link of the path. This action is then used to determine the position vector of the chosen link, and the scheduling agent decides which position should be used for scheduling data frames on it. Both actions are sent back to the environment, allowing it to transition to the next state. This process is repeated until the service request is fully processed.
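The sequential loop above can be sketched as follows. All names here (`env.routing_state`, `env.position_vector`, `agent.act`, and so on) are illustrative assumptions, not the paper's actual API:

```python
def evaluate_request(env, routing_agent, scheduling_agent):
    """Sequential joint evaluation: the routing agent picks the next link,
    the scheduling agent picks a position on that link, and both actions
    are applied to the environment until the request is fully processed."""
    state = env.routing_state()
    while not env.request_done():
        link = routing_agent.act(state)             # next link of the path
        positions = env.position_vector(link)       # resource availability on that link
        position = scheduling_agent.act(positions)  # slot position on the link
        state = env.step(link, position)            # apply both actions
    return env.result()
```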
In terms of routing,
Table 2 shows the obtained results. The rows represent all the routes originating from each TSN switch, while the columns represent all routes ending at each TSN switch. Each field is marked with the following symbols:
✓ indicates that the shortest path is always established.
∼ indicates that the data flows can reach their destination without following the shortest path but within the maximum delay bound.
× indicates that no path can be established within the maximum delay bound.
Table 2.
Learning results for each possible route.
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | ∼ | ✓ | ∼ | ∼ | ✓ | ✓∼ |
| 1 | ✓ | - | ✓ | ✓ | ∼ | ✓ | ∼ | ∼ | ✓ | ∼ | ✓ | ✓ | ✓ |
| 2 | ✓ | ✓ | - | ✓ | ✓ | ∼ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ∼ |
| 3 | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ |
| 4 | ✓ | ✓ | ✓ | ✓ | - | ∼ | ✓ | ∼ | ✓ | ✓ | ✓ | ∼ | ✓∼ |
| 5 | ✓ | ✓ | ∼× | ✓ | ∼ | - | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ∼ | ∼ | ✓ | ✓ | ∼ | ∼ |
| 7 | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ∼ | ∼ | ∼× | ✓ | ✓ |
| 8 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | ∼ | - | ∼ | ✓ | ✓ | ✓ |
| 9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ∼ | - | ✓ | ✓ | ✓ |
| 10 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓∼ | ✓ | × | ∼ | ✓ | - | ✓ | ✓ |
| 11 | ✓ | ✓ | ✓ | ✓ | ∼ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ |
| 12 | ✓ | ✓ | ✓ | ∼ | ✓ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ | - |
In summary, there are 156 combinations of source and destination TSN switches, representing 156 optimal routes to learn. Among these, 111 are optimally learned (representing 71% of cases), 3 routes are partially learned optimally while still reaching their destinations within delay limits, 38 routes are established without following optimal paths but stay below the delay limits (24% of cases), 2 routes can only be partially established, and 2 routes cannot be established at all.
Regarding scheduling, 97% of operations are optimal, with the maximum error being one position, which adds an extra residence time equivalent to a single slot time (100 microseconds). The remaining 3% of operations are suboptimal because of the model’s performance limitations. However, the agent has learned a policy that adds the minimum possible extra sojourn time when its decisions are not optimal.
Figure 8 shows the cumulative distribution function (CDF) of relative delays after jointly evaluating the routing and the scheduling models. The deviation was calculated as the difference between the actual end-to-end delay (from the agents’ decisions) and the theoretical delay of the optimal route. Approximately 80% of data flows reach their destinations with the minimum possible delay, and 95% of flows experienced extra delays of 6 milliseconds or less. Most of the extra delays are multiples of 1 millisecond due to routing issues, while smaller increments, in multiples of 100 microseconds, were caused by scheduling issues. Overall, routing issues have a more significant impact on extra delay than scheduling issues.
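The relative-delay figures above correspond to evaluating an empirical CDF over the per-flow extra delays (actual end-to-end delay minus the theoretical optimum). A minimal sketch of that evaluation:

```python
def empirical_cdf(extra_delays_ms, threshold_ms):
    """Fraction of data flows whose extra end-to-end delay (actual minus
    theoretical optimum) is at or below the given threshold, in ms."""
    return sum(1 for d in extra_delays_ms if d <= threshold_ms) / len(extra_delays_ms)
```

Applied to the paper's results, `empirical_cdf(delays, 0)` would be about 0.80 (minimum-delay flows) and `empirical_cdf(delays, 6)` about 0.95.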
The trained models are also tested with several background traffic patterns; specifically, evaluations with 600 and 900 background traffic data flows are performed. Regarding routing, adding more background traffic does not affect the number of learned routes, as the number of routes that cannot be established remains very low in all cases. The most significant changes are due to the congestion of optimal paths experienced in the network when the background traffic level is increased. In these cases, the agent is able to find alternative paths whose delay is higher but still below the maximum bounds established in the service requests, which constitutes an important reliability feature of the developed solution. Regarding scheduling, the agent performs similarly at any background traffic level, achieving 97% optimal schedules with 300 background traffic data flows, 94% with 600, and 93% with 900.
A larger topology with 24 nodes is used as a secondary scenario, depicted in
Figure 9, where arrows represent unidirectional links between TFCs and TMSs, while the numbers next to the edges represent the delays
in milliseconds. The objective is to analyze the scalability of both the routing and scheduling models. Specifically, a new training sequence is launched, generating 1000 background data flows with random source and destination pairs at each episode. Regarding scheduling, no new models are required, as those trained for the smaller scenario remain valid for the larger one: routing decisions are transparent to the scheduling agent, which simply learns how to schedule data frames over time given the resource availability information of a link, without knowing on which link it is scheduling frames. Regarding routing, there are 552 combinations of source and destination TSN switches, representing 552 optimal routes to learn. Among these, 464 are successfully established (84% of cases), while the remaining 88 routes (16%) cannot be established. This scalability problem makes the proposed model perform less optimally in large scenarios. A possible solution, and a future research topic, is a geographical multi-agent approach in which each pair of routing and scheduling agents manages a subset of the network.
It is crucial to note that, while end-to-end delays in the proposed solution may vary between different data flows originating from the same source and destination nodes, the periodicity of frame arrivals at the destination is guaranteed, ensuring the provision of synchronous services across the network. This guarantee is achieved by the definition of positions, which forces the scheduler to allocate the required resources to all designated slots to maintain the expected periodicity. Furthermore, since time is slotted and periodicity is preserved, the maximum jitter in this solution is limited to the duration of a slot, i.e., 100 microseconds. The proposed MARL solution demonstrates that the two agents manage the TSN network automatically and in real time, creating optimal routes for synchronous flows with an end-to-end latency variation of less than 6 ms and a jitter of less than 100 microseconds, achieving an effectiveness of over 95%.