1. Introduction
Modern communication systems are transforming vehicles into increasingly connected devices, with projections indicating that by 2030 there will be over 645 million connected cars worldwide [1]. These connections support the deployment of various services, including infotainment, cooperative traffic control, navigation, and safety features [2], all aimed at enhancing the experience of drivers and passengers. Looking ahead, vehicles are envisioned as fully autonomous systems capable of transporting people or goods by simply receiving instructions at the start of a journey. In this paradigm, the task of driving will be entirely eliminated, removing the need for human drivers to control the vehicle's behavior. Autonomous vehicles are expected to enhance road safety for pedestrians, optimize the utilization of smart roadside infrastructure, and significantly improve access to transportation for individuals such as the elderly, children, or those with injuries [3]. Vehicles will achieve autonomy by maintaining constant communication with surrounding vehicles (V2V, vehicle-to-vehicle communications) and the supporting infrastructure. These secure, real-time communications must be highly reliable, synchronized, and characterized by low latency. Collectively, these interactions are encompassed within the Internet of Vehicles (IoV) framework.
Autonomous vehicles will also require secure, efficient, and reliable internal communication systems to facilitate data exchange among sensors, cameras, and controllers. Traditional in-vehicle communication systems rely on proprietary solutions, which are heavily manufacturer-dependent, suffer from interoperability challenges, and fail to meet the bandwidth demands of modern devices [4]. These limitations have prompted vehicle manufacturers and researchers to explore the feasibility of transitioning to solutions based on open standards [5]. One such proposed standard for implementing real-time systems is time-sensitive networking (TSN), a suite of IEEE standards designed to enable ultra-reliable, real-time, and cost-effective communications over Ethernet [6]. Ethernet is widely recognized for its flexibility and extensive adoption, making it an ideal foundation for TSN-based systems.
Future autonomous vehicles are anticipated to connect to traffic infrastructure—and by extension, the Internet—via components known as roadside units (RSUs) [7]. These RSUs are directly linked to roadside cabinet electronics (RSCE) systems housed within transportation field cabinets (TFCs). Equipped with computing capabilities, TFCs are responsible for collecting and analyzing traffic data transmitted by both vehicles and infrastructure devices [8]. They can also be configured to manage local traffic by using traffic management systems (TMSs). In the IoV framework, the communication between vehicles and RSUs is referred to as vehicle-to-infrastructure (V2I) communication. Many researchers advocate for implementing V2I communications using 5G technology due to its advanced capabilities [9]. RSCEs must be interconnected to enable the seamless exchange of traffic information, forming the basis of infrastructure-to-infrastructure (I2I) communications. These communications must ensure reliability, synchronization, and the capability to transmit vehicle data to other vehicles, traffic management centers (TMCs), and the Internet [10]. The network topology connecting TFCs can vary significantly depending on the distribution and density of roads. Consequently, the communication technology used between infrastructure devices must be both scalable and flexible to accommodate these variations [11]. Unlike in-vehicle communications, where network topologies are relatively simple (e.g., rings or buses) and involve a limited number of devices [12], I2I communications present a much more complex optimization challenge.
Considering (i) the ongoing advancements in designing TSN deployments for in-vehicle communications [13], (ii) the similarity in requirements between in-vehicle and I2I communications [4,11], and (iii) the capability to interconnect TSN networks via 5G [14], we propose adopting a smart infrastructure based on deep reinforcement learning (DRL) and TSN standards. This framework aims to facilitate efficient and reliable communication between IoV infrastructure devices, including RSUs, RSCEs, and TMCs from different TFCs. The structure of TSN networks typically adheres to a software-defined networking (SDN) approach, enabling centralized management to achieve the necessary levels of scalability and flexibility. In this proposal, the introduction of DRL facilitates the automatic and real-time creation of reconfigurable, reliable, synchronous, low-latency, and secure service channels between the road infrastructure and the various TMCs of providers. The proposed mechanism manages online the arrival of requests for flow creation associated with service channels. When an entity requests a synchronous flow to another entity, the system creates the optimal route in real time, keeping the flow synchronized and bounded by a maximum end-to-end communication delay. However, this solution introduces significant management challenges, as creating optimal paths between devices is a computationally complex problem involving both routing and scheduling [15]. This article describes the design, implementation, and evaluation of an automated management model for I2I service channels based on multi-agent reinforcement learning (MARL) frameworks using DRL. These models efficiently manage the routing and scheduling of data frames between IoV infrastructure devices through TSN networks, enabling real-time, synchronous I2I communications. Routing and scheduling models are developed, and a simulated shared environment is created to evaluate the behavior of both agents, which operate within the TSN control plane. The choice of TSN for communications between infrastructure devices ensures the required levels of quality of service (QoS) to support critical applications, such as traffic congestion management or broadcasting accident-related alerts. Additionally, this approach facilitates the deployment of TSN for all wired IoV communications and enables the integration of 5G and TSN for V2I communications. However, the detailed study of V2I communications leveraging 5G networks and TSN is identified as an area for future work.
This work makes two primary contributions:
1. The development of a critical communications network: A TSN-based framework is proposed to interconnect all components of I2I communications, serving as the backbone for future driverless car services. This network achieves ultra-low response times of less than one millisecond, ensuring the reliability and efficiency required for autonomous vehicle operations.
2. The automated real-time management of services: A system leveraging DRL and multi-agent techniques is introduced to manage service demands across network elements (V2I, I2I, and TMC) in real time. This approach eliminates human intervention while ensuring optimal performance and scalability.
The rest of this article is organized as follows: Section 2 introduces the research context and reviews related works. Section 3 defines the problem formulation for integer linear programming (ILP) and DRL models for a joint routing and scheduling solution. Section 4 and Section 5 detail the proposed routing and scheduling solutions, respectively. Section 6 presents the evaluation methodology and results for the models. Finally, Section 7 concludes with a summary of findings and their implications.
2. Related Work
One of the primary applications of TSN is in the industrial sector, where it is anticipated to play a pivotal role in advancing Industry 4.0. In [16], several migration strategies were proposed to transition from existing proprietary industrial networks to a unified TSN-based architecture. The study emphasized that this migration should be implemented gradually to mitigate potential disruptions and accommodate the complexities of the transition. In the automotive sector, ref. [5] proposed several architectures for in-vehicle networks, emphasizing the significant advantages of utilizing TSN in these scenarios. The authors concluded their study by underscoring the necessity of implementing architectural modifications in in-vehicle networks to fully leverage the capabilities of TSN. The authors in [17] proposed a TSN-based SDN architecture for in-vehicle communications and validated its performance on a commercial car. The study highlighted the solution's capability to manage both time-critical and best-effort traffic effectively. The architecture assumes a zoned topology, enabling the coexistence of Ethernet and controller area network (CAN) protocols by implementing gateways at each zone. In this setup, sensors continue to communicate using the CAN protocol, while the core of the network operates on Ethernet, ensuring improved scalability and performance. In [14], a comprehensive review of the technologies enabling the integration of 5G networks with TSN is presented, offering a detailed summary of the state-of-the-art literature on this topic. The integration of 5G and TSN holds significant potential, particularly in scenarios where both in-vehicle communication networks and infrastructure communications are TSN-based. This convergence promises to enhance network performance, ensuring ultra-reliable, low-latency communication critical for emerging automotive and industrial applications.
Research on RSUs mainly focuses on three key areas: the optimal placement of RSUs along roads, the optimal resource allocation for autonomous vehicles, and the development of innovative architectures for the IoV. The authors in [10] analyzed the effectiveness of an RSU deployment in a highway scenario, comparing independent RSUs with interconnected RSUs. They concluded that deploying disconnected RSUs offers minimal improvement in reducing message dissemination delays. In contrast, interconnected RSUs significantly enhance performance, reducing delays by orders of magnitude. The researchers in [18] developed an RSU cloud infrastructure that provides computation capabilities, which are utilized by connected vehicles as virtual machines. To optimize the system, they proposed a multi-objective ILP model aimed at minimizing the infrastructure delay, the number of deployed RSUs, and the frequency of virtual machine reconfigurations. The model tries to keep virtual machines as close as possible to the connected vehicle they are serving. Furthermore, a reinforcement learning (RL) agent was implemented to ensure that the number of reconfigurations is minimized over the long term, enhancing the overall system efficiency. In [19], the authors proposed an ILP method for the joint deployment of RSUs and the assignment of vehicles' service tasks to the deployed RSUs. The authors assumed the allocation of RSUs at crossroads and defined service areas along the surrounding roads, ensuring overlapping coverage between neighboring RSUs. The model incorporates deployment and maintenance costs as factors and enforces a maximum end-to-end delay as a constraint. The authors of [20] proposed a 5G-based IoV architecture based on SDN and fog-cloud computing. RSUs are grouped in fog clusters managed by an RSU controller, which is connected to a centralized SDN controller overseeing multiple fog clusters. The researchers proposed leveraging 5G base stations for V2I communications as a fallback when RSUs become congested. They also presented an optimization model aimed at minimizing service delay and energy consumption while ensuring load balancing across the network.
To the best of our knowledge, no existing literature addresses the DRL-based optimization of TSN routing and scheduling for IoV and I2I communication networks. Furthermore, the solution proposed in this study is the first to employ MARL to tackle TSN's routing and scheduling challenges. A related study, ref. [12], developed an ILP-based routing and scheduling model for TSN-SDN in-vehicle networks considering some vulnerabilities in the Time-Aware Shaper (TAS), a critical traffic shaper for TSN networks. The proposed approach mitigated TAS vulnerabilities, making it applicable to in-vehicle networks. Ref. [21] proposes an algorithm that solves the Edge-Disjoint Rooted Distance-Constrained Minimum Spanning Trees problem with a limited budget. The proposal develops two local operators based on a local search algorithm, which obtain disjoint routes between clients and facilities under budget limits while ensuring fault tolerance. The algorithm requires that all requests between clients and facilities be known in advance. In [22], the authors propose a static routing and scheduling algorithm based on mixed integer programming (MIP) for the in-vehicle network. The algorithm controls the TSN gate control lists by simultaneously optimizing the routing and scheduling of dynamic traffic as the vehicle starts driving. The work focuses on the static scheduling of load-balanced flows across TSN slots by optimizing dynamic requests. This offline model is evaluated for a small number of nodes with a priori knowledge of the demands. Similarly, ref. [23] evaluated an ILP-based routing and scheduling model for TSN networks, analyzing its performance under increasing network sizes and traffic loads across different topologies. While the schedules produced were near-optimal, the high computational runtime rendered the model unsuitable for real-time scenarios, particularly in automotive environments. In contrast, ref. [24] proposed a DRL-based model to solve the routing problem in TSN networks. A key innovation of their work was the integration of graph convolutional networks within the DRL agent. The model accounted for the coexistence of time-critical and best-effort traffic by using multiple priority-based queues. However, the DRL agent did not address the assignment of data flows to these priority queues, leaving a gap in the comprehensive handling of TSN scheduling.
3. Problem Definition
This section outlines the modeling of the infrastructure environment and the problems addressed in this study. Consistent with previous works on RSU deployment [19], we assume RSUs are deployed at urban crossroads. In line with established standards [7], it is further assumed that all the RSUs at a given crossroad are connected to a TFC equipped with an RSCE unit, forming a star topology. The TFCs are interconnected to ensure multiple paths between any pair of nodes, providing enhanced reliability in terms of connectivity. In the proposed model, each TFC is equipped with a TSN switch to enable TSN functionalities. Additionally, each RSU is equipped with cellular or wireless communication technology and a global navigation satellite system (GNSS), which are utilized to establish V2I communications.
Figure 1 illustrates the proposed infrastructure architecture at a crossroad. By deploying synchronous network systems such as TSN, the model achieves very high levels of reliability and ultralow delays, with submillisecond latency.
The I2I infrastructure consists of a network of interconnected TFC nodes, equipped with a secure credential management system (SCMS) and integrated with the TMS of various service operators via a high-capacity TSN fixed network. Each node of the I2I network features a TSN switch, enabling the creation of low-latency service channels between the TFC nodes and their corresponding service operators. These channels are designed to be reliable and secure, ensuring that the controlled traffic between different service providers remains independent. The proposed architecture is based on a centralized TSN network governed by a control plane. This control plane is managed by MARL algorithms, which automate the processes for creating, maintaining, and closing the service channels of the data plane. This real-time operation eliminates the need for human intervention, ensuring seamless and efficient management of the infrastructure.
As shown in Figure 2, the communication control plane is composed of a centralized user configuration (CUC) module and a centralized network configuration (CNC) module. The CUC module is responsible for handling end-user service requests and processing basic network parameters. The CNC module, in turn, receives these service requests from the CUC and configures the TSN switches accordingly to meet the user demands. Our proposed models are designed to be integrated into the CNC module, enabling it to efficiently manage the communication data plane in response to dynamic service requests over time. This integration ensures seamless operation and adaptability to changing network requirements, leveraging the CNC module's centralized control capabilities.
The TSN network connecting all of the infrastructure devices is modeled as a directed graph G = (V, E), where V represents the set of TSN switches located within TFCs and TMSs, and E denotes the set of directed links connecting these entities. The cumulative delay introduced by an edge from source node i to destination node j is defined as d_ij, which accounts for transmission, propagation, and queuing delays. One of the simulated topologies is depicted in Figure 3. In the figure, arrows indicate unidirectional edges between TFCs and TMSs, while the numbers next to the edges represent the delays d_ij in milliseconds. Although the figure focuses on the data plane, it is essential to note that entities of the control plane are also included in this study, following the centralized approach described earlier.
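As an illustration of this graph model, the following minimal Python sketch represents the directed links and their delays; the node names and delay values are invented for illustration and are not taken from the simulated topology of Figure 3.

```python
# Minimal sketch of the I2I graph model: nodes are TSN switches inside
# TFCs/TMSs, and each directed edge carries a per-link delay d_ij (in ms).
# Topology and delay values are illustrative assumptions only.
delays = {
    ("TFC1", "TFC2"): 0.2,
    ("TFC2", "TFC3"): 0.3,
    ("TFC3", "TMS1"): 0.1,
}

def path_delay(path):
    """Cumulative end-to-end delay of a route: the sum of its edge delays."""
    return sum(delays[(a, b)] for a, b in zip(path, path[1:]))

print(round(path_delay(["TFC1", "TFC2", "TFC3", "TMS1"]), 3))  # 0.6
```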
The TSN architecture provides an infrastructure capable of handling critical services, ensuring that information frames are transmitted synchronously with a transmission period chosen by the transmitting and receiving entities through the CUC module. The proposed model ensures that whenever a vehicle or any network element needs to transfer critical information, a free and synchronous channel is always available. This channel guarantees a constant transmission rate with minimal transmission delay and limited delay variation between entities, ensuring reliable and efficient communications.
This synchronous flow of control data establishes a service channel, defined as a TSN unidirectional transmission of data frames from a source device, referred to as a talker, to a destination device, referred to as a listener. Connections between RSCEs and their corresponding service providers managing the TMS are assumed to facilitate the exchange of control data. These data must adhere to strict delivery requirements, including maximum delay and jitter bounds. These connections are periodic and have fixed frame sizes. A service request sent from an end device, such as a TFC or a TMS, to the CUC module is based on a tuple (src, dst, fsize, p, dmax). For each parameter in this tuple, the following assumptions are made:
src and dst are integer identifiers for the TFCs and TMSs where the source and destination devices are directly connected, with src ≠ dst.
fsize represents the frame size in bytes.
p is the frame transmission period in milliseconds.
dmax is the maximum acceptable end-to-end delay in milliseconds.
The proposed scenario is inspired by a subset of streets and crossroads of downtown Barcelona. Service requests are then randomly generated based on aggregated traffic information.
Moreover, some concepts related to time division and link scheduling are defined. Since data frames are sent periodically, the scheduling of frames is also periodic. Given the multiple possible period values, the overall period, referred to as the hyperperiod, is the least common multiple (LCM) of all period values. The data flow scheduling along a hyperperiod follows a consistent pattern unless a new service request is received or an existing connection stops sending frames. The hyperperiod is divided into equally long slots of 100 microseconds, referred to as the slot time. Assuming that all network links operate at 1 Gbps, it is possible to transmit 12,500 bytes in each slot, which is termed the slot size. These concepts—hyperperiod, slot time, and slot size—are illustrated in Figure 4.
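These quantities can be computed directly, as the following sketch shows; the 100 µs slot time and 1 Gbps link rate come from the text, while the example period values (2, 4, and 8 ms) are assumptions for illustration.

```python
import math

SLOT_TIME_US = 100             # slot time: 100 microseconds (from the text)
LINK_RATE_BPS = 1_000_000_000  # all links assumed to operate at 1 Gbps

def hyperperiod_ms(periods_ms):
    """Hyperperiod = least common multiple (LCM) of all flow periods."""
    return math.lcm(*periods_ms)

def slot_size_bytes():
    """Bytes transmittable per slot: rate (bits/s) * slot time (s) / 8."""
    return int(LINK_RATE_BPS * SLOT_TIME_US * 1e-6 / 8)

print(hyperperiod_ms([2, 4, 8]))  # 8
print(slot_size_bytes())          # 12500
```

With the stated parameters, each 100 µs slot indeed carries 12,500 bytes, matching the slot size given above.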
To simplify the scheduling of data flows with specific periodicities, we introduce the concept of positions, which represent groups of slots that can be used for transmitting a data flow. If the hyperperiod is divided into groups of slots whose total duration equals the period p of a data flow, that flow can be transmitted by selecting the same time-ordered slot within each group. These positions satisfy the following property: their slots are equally spaced by p milliseconds. As a result, selecting a single position ensures that the periodicity between consecutive data frames is maintained within the time duration of a slot, providing a straightforward and consistent mechanism for periodic data flow scheduling. The definition of positions depends on the period of the service request, meaning that the number of possible positions for scheduling frames differs for each period value.
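The slots belonging to one position can be enumerated as in the sketch below; the 100 µs slot time is from the text, while the 8 ms hyperperiod and 2 ms period used in the example are illustrative assumptions.

```python
SLOT_TIME_MS = 0.1  # 100-microsecond slots (from the text)

def position_slots(hyperperiod_ms, period_ms, position):
    """Slot indices forming one position: the same time-ordered slot within
    each period-long group of the hyperperiod, so consecutive slots of the
    position are spaced exactly one period apart."""
    slots_per_group = int(period_ms / SLOT_TIME_MS)
    groups = int(hyperperiod_ms / period_ms)
    if not 0 <= position < slots_per_group:
        raise ValueError("position does not exist for this period")
    return [g * slots_per_group + position for g in range(groups)]

# Position 3 for a 2 ms flow in an 8 ms hyperperiod: slots spaced 2 ms apart.
print(position_slots(8, 2, 3))  # [3, 23, 43, 63]
```

Note that a 2 ms period yields 20 positions (slots per group), while a 4 ms period yields 40, which is why the number of possible positions differs per period value.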
With the architecture of critical I2I communications now defined, we propose a multi-agent solution based on DRL to identify and maintain optimal routes while performing optimal scheduling for the requested traffic. This approach offers the significant advantage of delivering optimal flow services in real time without requiring intervention from network operators or service providers. The following sections describe the implementation of the routing and scheduling solutions, which are built using multi-agent value-based DRL agents. Each solution is modeled as a Markov decision process (MDP) to formalize the decision-making framework.
ILP Formulation
This section provides the mathematical formulation for managing the routing and scheduling of data frames between IoV infrastructure devices through TSN to ensure real-time synchronous communications. The problem can be defined as an ILP model that determines the routing and periodic slots for each arriving service request so as to meet its requirements. The objective is to minimize the end-to-end delay in transmitting data frames. The ILP is called the One-Request Joint Routing and Slot-Capacity-Aware Flow Placement (One-Req-RSCAFP) problem.
As defined in the previous section, the TSN network scenario is modeled by a directed weighted graph G = (V, E) adopting a specified topology, with V the set of nodes and E the set of edges. The nodes are the TSN switches inside the TFCs and TMSs. Each link (i, j) ∈ E is characterized by a delay d_ij, which allows us to evaluate and optimize the delay performance of the transmissions under various conditions. Each link is slotted with a hyperperiod S that emulates the TSN synchronous network. Moreover, let c_s be the capacity of the link slot s, which constrains communication on the network to fixed-length, repeating time cycles determined by the period of the service request.
Let us consider a service request R defined by the tuple (src, dst, fsize, p, dmax).
Thus, this work defines an iterative procedure that applies the One-Req-RSCAFP model at each request arrival, determining a route with enough periodic slot capacity under different background traffic levels. Then, given the optimal valid route, the final scheduling is determined using a simple heuristic.
To formulate the One-Req-RSCAFP problem, let us define a binary variable x_ij indicating whether a link (i, j) ∈ E is part of the constrained route of the service request, and let y_ij^s be a binary variable specifying whether a slot s of the link (i, j) is used for the transmission.
With all of these considerations, the ILP for the One-Req-RSCAFP problem comprises the objective (3) subject to constraints (4)–(12), described below.
The objective given in (3) aims to minimize the total cost delay. Constraints (4) and (5) formulate the flow conservation property at the slot level needed to ensure a correct path leading to the destination, the first for the source node and the second for other nodes in the network. Constraint (6) ensures that the data frame transmitted on a slot does not exceed its capacity.
Constraint (7) determines that if a link (src, j) starting from the request's origin is active (i.e., x_src,j = 1, so it is a link of the path used) and has a slot s with sufficient periodic capacity within the link, the corresponding variable y_src,j^s is set to 1. This constraint is non-linear, but it can be linearized into an equivalent linear restriction defined for each link (src, j) ∈ E and each valid slot s, i.e., each slot from which enough periodic capacity exists.
Constraint (8) considers two contiguous links, (i, j) and (j, k), which are part of the path to be followed by the request, in which the first link has a slot s with sufficient periodic capacity, that is, y_ij^s = 1. In this situation, the second link must have a slot with enough capacity, according to the periodicity of the request, at the same position as the first link or later in time, i.e., a slot s' ≥ s. This constraint is also nonlinear and, through linearization techniques, is equivalent to a linear restriction defined for each link (i, j) with initial node i different from the origin src, each contiguous link (j, k) ∈ E, and each pair of valid slots.
Constraint (9) ensures that a slot on an inactive link (i.e., not being part of the resulting path) is not occupied. In addition, constraint (10) guarantees that the total path delay does not exceed the maximum acceptable end-to-end delay of the request.
Finally, all variables are defined as binary in constraints (11) and (12).
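Since the numbered formulas are not reproduced here, the following LaTeX sketch illustrates a formulation consistent with the constraint descriptions above; the symbols x_ij, y_ij^s, d_ij, fsize, and dmax are those introduced earlier, the equation numbering is omitted, and the exact published equations may differ.

```latex
% Sketch of One-Req-RSCAFP, reconstructed from the constraint descriptions.
\begin{align*}
\min \;& \sum_{(i,j)\in E} d_{ij}\, x_{ij}
  && \text{(objective: total path delay)}\\
\text{s.t.}\;
 & \sum_{j} x_{\mathit{src},j} - \sum_{j} x_{j,\mathit{src}} = 1
  && \text{(flow leaves the source)}\\
 & \sum_{j} x_{ij} - \sum_{j} x_{ji} = 0
  && \forall\, i \in V\setminus\{\mathit{src},\mathit{dst}\}\\
 & \mathit{fsize}\cdot y_{ij}^{s} \le c_{s}
  && \text{(slot capacity)}\\
 & y_{ij}^{s} \le x_{ij}
  && \text{(no slots on inactive links)}\\
 & \sum_{(i,j)\in E} d_{ij}\, x_{ij} \le d_{\max}
  && \text{(end-to-end delay bound)}\\
 & x_{ij},\, y_{ij}^{s} \in \{0,1\}
\end{align*}
```

The slot-level flow conservation and slot-ordering couplings of constraints (7) and (8) are omitted from this sketch, as their linearized forms are not given in the text.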
The ILP requires a high computation time to obtain a solution, making its use in real time infeasible. For example, in a scenario with 600 to 900 background flows, the average computation time required to obtain a result with the ILP is 42.25 ms, while the time spent by the proposed solution is about 4 ms.
For this reason, this work chooses a method based on MARL integrated with DRL. Once the trained model is obtained, this method allows for finding a time-efficient solution close to the optimum and with low computational complexity.
5. MARL-Based Scheduling Solution
Once the routing agent identifies the next link for traffic scheduling, the scheduling agent takes responsibility for assigning data frames to time slots, aiming to minimize buffering time at intermediate TSN switches. The primary objective is to select positions that allow data frames arriving at an intermediate node to be forwarded immediately, avoiding queuing delays. This process is based on the following assumption: if node n schedules a data flow at position i, the frames will arrive at node n + 1 at a time slot corresponding to position i. Ideally, the optimal policy is to maintain consistency by scheduling data frames in the same position as the previous link whenever possible. If resource constraints make this infeasible, the next available position in the temporal domain should be selected, following the minimum residence time criterion. Conversely, selecting a position immediately before the optimal one in the timing domain is highly inefficient, as it forces data frames to wait nearly an entire period before being forwarded, thereby significantly increasing end-to-end latency. To mitigate this, the model ensures that each episode lasts only a single time step, with one action taken per episode. As a result, scheduling frames along the entire path between end devices requires initiating a separate episode for each hop. Building on these principles, a scheduling Markov decision process (MDP) has been formulated, as detailed below.
5.1. State Space
For the scheduling model, the environment sends the following information to the agent:
The position identifier where the data flow has been scheduled at the previous TSN switch, referred to as the current position.
A vector containing information about all possible positions. For each position, the corresponding value in the vector is 1 if there are sufficient resources to assign the given data flow, or −1 if not. A position is considered to have enough free bytes for scheduling if all of the slots within that position have at least fsize free bytes available. The length of the vector is equal to the maximum number of possible positions, which is computed from the hyperperiod and the frame periodicity. Since smaller periods yield fewer positions, the initial values in the vector contain accurate information about all possible positions, while the remaining entries are padded with −1 until the vector reaches its full length.
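The resulting state vector might be built as in this sketch; the availability flags and the maximum vector length used in the example are illustrative assumptions.

```python
def build_state_vector(available, max_positions):
    """State sketch: 1 where the position has enough free bytes for the flow,
    -1 where it does not; entries beyond the positions that actually exist
    for this period are padded with -1 up to the maximum vector length."""
    vec = [1 if ok else -1 for ok in available]
    vec += [-1] * (max_positions - len(vec))
    return vec

# A flow whose period yields 4 positions, padded to a maximum of 8 entries.
print(build_state_vector([True, False, True, True], 8))
# [1, -1, 1, 1, -1, -1, -1, -1]
```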
5.2. Action Space
The action space in the scheduling model corresponds to the number of possible positions available for scheduling data frames, which matches the size of the positions vector defined in the state space. An additional action allows the agent to reject the service request when no available positions exist. To improve the efficiency and stability of the learning process, filters have been applied to the action space. These filters are described as follows:
Non-existent positions filter: All positions that do not exist for the period specified in the service request are filtered out to prevent the agent from attempting to schedule frames at invalid positions.
Insufficient resources filter: All positions without enough resources to schedule the data flow described by the service request are also filtered out.
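Both filters amount to masking invalid actions before the agent selects one; the sketch below shows this idea, where the reject action being modeled as index -1 is our own assumption.

```python
def valid_actions(state_vector, n_existing_positions):
    """Apply the two action-space filters: drop positions that do not exist
    for the request's period, and positions without enough resources
    (entries equal to -1). Returns the action indices the agent may choose;
    if none remain, only the reject action (index -1 here) is left."""
    actions = [
        i for i, v in enumerate(state_vector[:n_existing_positions]) if v == 1
    ]
    return actions if actions else [-1]

print(valid_actions([1, -1, 1, 1, -1, -1, -1, -1], 4))  # [0, 2, 3]
print(valid_actions([-1, -1, -1, -1], 4))               # [-1]
```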
5.3. Reward Modeling
The reward function is designed to guide the agent toward minimizing the residence time at intermediate TSN switches and is expressed in Equation (16), where:
p_sel is the position selected by the agent.
p_opt is the optimal position that the agent should have chosen (i.e., the next available position starting from the current one).
δ is a binary value set to 1 when p_sel and p_opt have the same value; otherwise, it is 0.
The agent is rewarded with 10 points when making optimal decisions (when p_sel matches p_opt) and penalized in the rest of the cases. The penalty increases with the time-ordered distance between the selected position and the optimal position, making the worst policy the selection of the position immediately preceding the current one, as this leads to maximum queuing delays. Since the episode length is only one time step, there is no previous reward to consider in the calculation.
Several training rounds have been performed with more gradual penalties provided by exponential decay functions, as formulated in Equation (17), where a is a scaling factor that regulates the slope of the exponential function. Results show that the model is unable to converge to an optimal solution for values of a lower than 4, whereas for higher values, the model converges to solutions whose optimality is comparable to the levels provided by the models based on linear decay functions.
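Since Equations (16) and (17) are not reproduced here, the sketch below only illustrates the qualitative shape described above: the 10-point optimum is from the text, while the exact penalty expressions are our own assumptions.

```python
import math

def distance(p_sel, p_opt, n_positions):
    """Time-ordered (circular) distance from the optimal position to the
    selected one; the worst case is the position just before the optimum."""
    return (p_sel - p_opt) % n_positions

def reward_linear(p_sel, p_opt, n_positions):
    """10 points for the optimal position; otherwise a penalty that grows
    linearly with the distance (assumed form; Eq. (16) is not shown)."""
    d = distance(p_sel, p_opt, n_positions)
    return 10.0 if d == 0 else -10.0 * d / (n_positions - 1)

def reward_exponential(p_sel, p_opt, n_positions, a=4.0):
    """More gradual penalty via exponential decay, with scaling factor a
    regulating the slope (assumed form; Eq. (17) is not shown)."""
    d = distance(p_sel, p_opt, n_positions)
    return 10.0 if d == 0 else -10.0 * (1 - math.exp(-a * d / n_positions))

# Optimal choice is rewarded; drifting further past the optimum is worse.
assert reward_linear(5, 5, 20) == 10.0
assert reward_linear(6, 5, 20) > reward_linear(4, 5, 20)  # d=1 vs d=19
```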
6. Evaluation
This section describes the training and evaluation of the previously defined models, presenting the results of this work. Before diving into specific details, the overall procedure for training and evaluating the models is outlined. Initially, each agent is trained independently of the other: the scheduling of data frames is treated as transparent to the routing agent, and routing decisions are transparent to the scheduling agent. Once the training phase is complete, the trained models are jointly evaluated. For the evaluation, a joint simulated environment has been developed. This environment sequentially interacts with both agents, sending states to and receiving actions from each. All simulations were conducted on a computer equipped with a 3.30 GHz Intel Core i9 processor and 64 GB of RAM.
6.1. Agent Description and Configuration
Both the routing and scheduling agents are implemented using the Rainbow algorithm [
25], an enhanced version of the deep Q-network (DQN), one of the most popular value-based DRL algorithms. Rainbow incorporates several improvements to DQN, focused on the neural networks and the replay buffer. The deployed neural networks include noisy layers and physically separated value and advantage branches, following the Rainbow architecture. Additionally, Rainbow employs two copies of the same neural network: one for training and one for target evaluation. The replay buffer is also enhanced to prioritize experiences that provide greater learning value; this modified version is called prioritized experience replay (PER). The learning process is further improved by calculating distributions of expected Q-values instead of single discrete values, and a mechanism called n-step learning is introduced to assess the long-term effect of each action. The noisy layers force the agent to explore until it learns to ignore the noise, serving as an alternative exploration strategy to ε-greedy. Moreover, an ADAM optimizer is used for updating the neural network weights. The complete set of hyperparameters configured for both agents is provided in
Table 1.
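As an illustration, the hyperparameter values stated in the descriptions below can be gathered into a single configuration sketch. This is a reconstruction from the text, not a verbatim copy of Table 1; values that appear only in the table (such as the PER α and β or the ADAM ε) are omitted.

```python
# Hyperparameters for the Rainbow agents, as stated in the text.
RAINBOW_CONFIG = {
    "replay_buffer_size": 1_000_000,  # experiences kept before the oldest are discarded
    "batch_size": 32,                 # experiences sampled per training step
    "target_update_period": 1000,     # time steps between target-network updates
    "discount_factor": 0.99,          # relevance of long-term rewards
    "learning_rate": 1e-4,            # initial ADAM learning rate
    "soft_update_tau": 0.005,         # fraction of online weights blended into the target
    "n_step": 3,                      # n-step learning window size
    "v_min": -60.0,                   # lower margin of the return distribution
    "v_max": 150.0,                   # upper margin of the return distribution
    "atom_size": 51,                  # number of atoms in the return distribution
}
```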
A deeper description of each hyperparameter is provided to facilitate understanding of the selected values:
The replay buffer size represents the number of experiences that can be accumulated in the replay buffer. Experiences are used to train the agent for as long as they remain in the buffer; once it is full, the oldest experiences are discarded. A size of 1,000,000 samples is typical, as larger buffers do not usually provide better results, and smaller ones slow training convergence.
The batch size is the number of experiences used to train the agent at each time step. In general, as this value increases, training becomes faster but less accurate, so a trade-off must be established. Based on previous experience, we use a size of 32 samples.
The target network update periodicity configures how often the weights and biases of the online neural network are copied to the target neural network. As this value decreases, training becomes faster, but the risk of losing convergence increases; as with the batch size, a trade-off must be achieved. We use a periodicity of 1000 time steps based on previous experience.
The discount factor modifies the relevance of long-term future rewards against short-term ones. Values for this hyperparameter are usually high, which is recommended when the last reward of the episode is very relevant to the agent. For this reason, we use a discount factor equal to 0.99.
The learning rate is the factor used to update the neural network weights during the backpropagation process. As an ADAM optimizer is used, the provided value is just the initial one, as the optimizer modifies it during training. It is recommended to use initial values between 0.0001 and 0.01, so we use the minimum value of this range.
The soft update coefficient is related to the percentage of weights not passed to the target network when updating it. This mechanism provides smoother variations in the target network, which, as a consequence, enhances learning stability. We use a soft update coefficient of 0.005 to benefit from this feature.
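The soft update described above follows the standard Polyak-averaging rule θ_target ← (1 − τ)·θ_target + τ·θ_online. A minimal sketch over flat weight lists (the real agents update tensors, so this is illustrative only):

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Blend a fraction tau of the online weights into the target weights.

    With tau = 0.005, 99.5% of each target weight is retained, producing
    the smooth target-network variations that stabilize learning.
    """
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]
```

For example, `soft_update([1.0], [0.0])` moves the target weight only from 1.0 to 0.995.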
Both hyperparameters related to the PER mechanism (α and β) are used to give priority to those experiences from which the agent can learn better policies. While α is the hyperparameter that provides the desired priorities, it might over-prioritize some experiences, so β is used as a scaling factor. Typical values for α are between 0.4 and 0.6, and β typically ranges from 0.3 to 0.5. We provide significant levels of prioritization with the α value listed in Table 1 and scale those priorities with the listed β value.
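The roles of α and β in PER can be sketched as follows: α shapes the sampling probabilities from the TD errors, and β controls the importance-sampling correction applied to the sampled experiences. This is a generic PER sketch, not the paper's implementation.

```python
def per_probabilities(td_errors, alpha):
    """Sampling probabilities proportional to |TD error|^alpha.

    A small constant keeps every experience's probability non-zero;
    alpha = 0 recovers uniform sampling.
    """
    scaled = [(abs(e) + 1e-6) ** alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta):
    """Importance-sampling weights (N * P(i))^(-beta), normalized by the
    maximum so the largest weight is 1; beta scales down the bias
    introduced by prioritized sampling."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]
```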
The value for the ADAM optimizer is a very small number that prevents the agent from exceptions related to dividing by zero. A typically used value is .
The n-step window size sets the number of future rewards accumulated by the n-step mechanism to evaluate the long-term effects of current actions. As this value increases, the computational cost also increases, although better policies may be achieved. As episodes in our environment are generally short, we use a window size of three steps.
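With a window of three steps, the n-step target accumulates the first three discounted rewards and bootstraps from the value estimate of the state reached afterwards. A minimal sketch of that computation:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    """n-step return: the first n discounted rewards plus the discounted
    value estimate of the state reached after n steps."""
    partial = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return partial + gamma ** n * bootstrap_value
```

For instance, with rewards `[1, 1, 1]`, a bootstrap value of 10, γ = 0.5, and n = 3, the target is 1 + 0.5 + 0.25 + 0.125 · 10 = 3.0.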
The margins of the distribution of Q-values are the minimum and maximum values considered in the distribution of returns in distributional DQN, and the atom size is the number of values present in that distribution. Increasing the atom size raises computational costs, but more refined distributions are calculated, which speeds up the training process. Based on previous experience, we define a distribution between −60 and 150 with 51 atoms.
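In distributional DQN, the chosen margins and atom size define a fixed, evenly spaced support over which the return distribution is estimated. With the values above, the atoms are spaced (150 − (−60)) / 50 = 4.2 apart:

```python
def distribution_support(v_min=-60.0, v_max=150.0, atom_size=51):
    """Fixed support of the return distribution: atom_size evenly spaced
    values between v_min and v_max (spacing of 4.2 with these defaults)."""
    delta_z = (v_max - v_min) / (atom_size - 1)
    return [v_min + i * delta_z for i in range(atom_size)]
```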
An efficient neural network architecture is essential to balance the trade-off between learning capability and simplicity. A network that is too simple could result in inadequate learning, while an overly complex one could significantly decrease learning speed due to the increased computational cost. For this reason, each agent uses a neural network with a first standard layer of 60 neurons, whose output is fully connected to a value branch with 60 neurons and an advantage branch with another 60 neurons. All subsequent layers are fully connected, and all neurons use the ReLU activation function.
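The separate value and advantage branches are recombined into Q-values using the standard dueling aggregation, which subtracts the mean advantage to keep the decomposition identifiable. A sketch of that final step, independent of any deep learning framework:

```python
def dueling_q_values(state_value, advantages):
    """Combine the value and advantage branch outputs into Q-values:
    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

For example, a state value of 10 with advantages `[1, 2, 3]` yields Q-values `[9, 10, 11]`.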
6.2. DRL Agent Training
As previously described, the routing and scheduling agents are independently trained using the I2I topology shown in
Figure 3, but some training-related concepts are common to both agents. First, all of the service requests between TFCs and other TFCs or TMSs have been artificially created and saved in a dataset. This allows for fair comparisons of model performance when changes are made to the implementation. Once the agent is trained, additional datasets are used to evaluate the created models. To simulate network load, a configurable number of background data flows is routed and scheduled at the beginning of each episode. This ensures the agents can find links or positions within the hyperperiod that lack sufficient resources, forcing them to explore other valid routes. Background traffic data flows share the same properties as regular flows: they are routed along the shortest available path and scheduled at the first available position starting from the current one. Background traffic data flows are described by the same types of service requests as regular data flows. Specifically, 300 background traffic data flows are generated at the beginning of each episode between all 13 nodes in the network. These background flows remain constant throughout the episode—no new service requests are generated, and no existing connections terminate. This behavior is based on the assumption that control channels between TFCs and TMSs are established as long-term connections with minimal variability over time. Each training round for both agents lasts 500,000 time steps, which takes approximately one hour to complete.
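The background-traffic setup above can be sketched as a small dataset generator: random, distinct source/destination pairs among the 13 nodes, fixed at the start of each episode. The function name and the use of a seed for reproducibility are illustrative assumptions, not the paper's actual tooling.

```python
import random

def generate_background_flows(num_flows=300, num_nodes=13, seed=0):
    """Generate background service requests between random, distinct node
    pairs. A fixed seed reproduces the same dataset across runs, which
    supports fair comparisons between model variants."""
    rng = random.Random(seed)
    flows = []
    for _ in range(num_flows):
        src, dst = rng.sample(range(num_nodes), 2)  # distinct endpoints
        flows.append((src, dst))
    return flows
```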
Several training rounds are launched for the routing agent to enable it to learn how to establish optimal paths between any pair of TSN switches.
Figure 5 illustrates the evolution of rewards and losses experienced by the agent during the training process. The blue lines indicate the actual results gathered during training, and the orange lines indicate averaged values. The figure shows a steady increase in the average reward as the agent continues to explore and learn, until the reward eventually stabilizes at a constant level. The final training reward of the routing model is 103.87 points, indicating that the agent learned most of the routes optimally. Regarding loss, it remains consistently low throughout the entire training process, highlighting the stability and efficiency of the training methodology.
For the scheduling agent, several training rounds are also launched. The objective is to enable the agent to schedule data flows such that the residence time in intermediate nodes remains as low as possible.
Figure 6 illustrates the evolution of the rewards and losses during the training process. The average rewards quickly increase and stabilize at a constant level above nine points. However, the actual evolution of rewards provides deeper insights. From the early stages of training, the minimum reward value was around −1, indicating that when the agent fails to schedule a data flow optimally, it often selects the subsequent position, incurring only a minor penalty. This behavior demonstrates the agent’s ability to adapt and minimize penalties even in suboptimal scenarios. The final average reward is 9.23 points. Regarding loss, although it is significantly higher than in the routing training rounds, it becomes stable early in the training process, reflecting the model’s convergence and reliability.
6.3. Simulation Results
As previously described, the trained models are evaluated to test their performance, with all evaluations using the pre-trained routing and scheduling models. The evaluations involve assigning resources to 1000 service requests and analyzing the results at the end of each episode. The joint evaluation of the routing and scheduling models follows a sequential decision-making process between the two agents. Given that scheduling states depend on routing actions (as shown in
Figure 7), the evaluation begins by sending the first environmental state to the routing agent, which decides on the first link of the path. This action is then used to determine the position vector of the chosen link, and the scheduling agent decides which position should be used for scheduling data frames on it. Both actions are sent back to the environment, allowing it to transition to the next state. This process is repeated until the service request is fully processed.
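The sequential loop above can be sketched as follows. All names here (`env.routing_state`, `env.position_vector`, `agent.act`, and so on) are illustrative assumptions, not the paper's actual API:

```python
def evaluate_request(env, routing_agent, scheduling_agent):
    """Sequential joint evaluation: the routing agent picks the next link,
    the scheduling agent picks a position on that link, and both actions
    are applied to the environment until the request is fully processed."""
    state = env.routing_state()
    while not env.request_done():
        link = routing_agent.act(state)             # next link of the path
        positions = env.position_vector(link)       # resource availability on that link
        position = scheduling_agent.act(positions)  # slot position on the link
        state = env.step(link, position)            # apply both actions
    return env.result()
```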
In terms of routing,
Table 2 shows the obtained results. The rows represent all the routes originating from each TSN switch, while the columns represent all routes ending at each TSN switch. Each field is marked with the following symbols:
✓ indicates that the shortest path is always established.
∼ indicates that the data flows can reach their destination without following the shortest path but within the maximum delay bound.
× indicates that no path can be established within the maximum delay bound.
Table 2.
Learning results for each possible route.
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | ∼ | ✓ | ∼ | ∼ | ✓ | ✓∼ |
| 1 | ✓ | - | ✓ | ✓ | ∼ | ✓ | ∼ | ∼ | ✓ | ∼ | ✓ | ✓ | ✓ |
| 2 | ✓ | ✓ | - | ✓ | ✓ | ∼ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ∼ |
| 3 | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ |
| 4 | ✓ | ✓ | ✓ | ✓ | - | ∼ | ✓ | ∼ | ✓ | ✓ | ✓ | ∼ | ✓∼ |
| 5 | ✓ | ✓ | ∼× | ✓ | ∼ | - | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ∼ | ∼ | ✓ | ✓ | ∼ | ∼ |
| 7 | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ∼ | ∼ | ∼× | ✓ | ✓ |
| 8 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | ∼ | - | ∼ | ✓ | ✓ | ✓ |
| 9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ∼ | - | ✓ | ✓ | ✓ |
| 10 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓∼ | ✓ | × | ∼ | ✓ | - | ✓ | ✓ |
| 11 | ✓ | ✓ | ✓ | ✓ | ∼ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ |
| 12 | ✓ | ✓ | ✓ | ∼ | ✓ | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | ✓ | - |
In summary, there are 156 combinations of source and destination TSN switches, representing 156 optimal routes to learn. Among these, 111 are optimally learned (representing 71% of cases), 3 routes are partially learned optimally while still reaching their destinations within delay limits, 38 routes are established without following optimal paths but stay below the delay limits (24% of cases), 2 routes can only be partially established, and 2 routes cannot be established at all.
Regarding scheduling, 97% of operations are optimal, with the maximum error being one position, which adds an extra residence time equivalent to a single slot time (100 microseconds). The remaining 3% of operations are suboptimal because of the model’s performance limitations. However, the agent has learned a policy that adds the minimum possible extra sojourn time when its decisions are not optimal.
Figure 8 shows the cumulative distribution function (CDF) of relative delays after jointly evaluating the routing and the scheduling models. The deviation was calculated as the difference between the actual end-to-end delay (from the agents’ decisions) and the theoretical delay of the optimal route. Approximately 80% of data flows reach their destinations with the minimum possible delay, and 95% of flows experienced extra delays of 6 milliseconds or less. Most of the extra delays are multiples of 1 millisecond due to routing issues, while smaller increments, in multiples of 100 microseconds, were caused by scheduling issues. Overall, routing issues have a more significant impact on extra delay than scheduling issues.
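The relative-delay figures above correspond to evaluating an empirical CDF over the per-flow extra delays (actual end-to-end delay minus the theoretical optimum). A minimal sketch of that evaluation:

```python
def empirical_cdf(extra_delays_ms, threshold_ms):
    """Fraction of data flows whose extra end-to-end delay (actual minus
    theoretical optimum) is at or below the given threshold, in ms."""
    return sum(1 for d in extra_delays_ms if d <= threshold_ms) / len(extra_delays_ms)
```

Applied to the paper's results, `empirical_cdf(delays, 0)` would be about 0.80 (minimum-delay flows) and `empirical_cdf(delays, 6)` about 0.95.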
The trained models are also tested with several background traffic patterns; specifically, evaluations with 600 and 900 background traffic data flows are performed. Regarding routing, adding more background traffic does not affect the number of learned routes, as the number of routes that cannot be established remains very low in all cases. The most significant changes are due to the congestion of optimal paths experienced in the network when the background traffic level is increased. In these cases, the agent is able to find alternative paths whose delay is higher but still below the maximum bounds established in the service requests, which constitutes an important reliability feature of the developed solution. Regarding scheduling, the agent performs similarly at any background traffic level, achieving 97% optimal schedules with 300 background traffic data flows, 94% with 600, and 93% with 900.
A larger topology with 24 nodes is used as a secondary scenario, depicted in
Figure 9, where arrows represent unidirectional links between TFCs and TMSs, while the numbers next to the edges represent the delays
in milliseconds. The objective is to analyze the scalability of both the routing and scheduling models. Specifically, a new training sequence is launched, generating 1000 background data flows with random source and destination pairs at each episode. Regarding scheduling, no new models are required, as those trained for the smaller scenario remain valid for the larger one: routing decisions are transparent to the scheduling agent, which simply learns how to schedule data frames over time given the resource availability information of a link, without knowing on which link it is scheduling frames. Regarding routing, there are 552 combinations of source and destination TSN switches, representing 552 optimal routes to learn. Among these, 464 are successfully established (84% of cases), while the remaining 88 routes (16%) cannot be established. This scalability problem makes the proposed model perform less optimally in large scenarios. A possible solution, and a future research topic, is a geographical multi-agent approach in which each pair of routing and scheduling agents manages a subset of the network.
It is crucial to note that, while end-to-end delays in the proposed solution may vary between different data flows originating from the same source and destination nodes, the periodicity of frame arrivals at the destination is guaranteed, ensuring the provision of synchronous services across the network. This guarantee is achieved by the definition of positions, which forces the scheduler to allocate the required resources to all designated slots to maintain the expected periodicity. Furthermore, since time is slotted and periodicity is preserved, the maximum jitter in this solution is limited to the duration of a slot, i.e., 100 microseconds. The proposed MARL solution demonstrates that the two agents manage the TSN network automatically and in real time, creating optimal routes for synchronous flows with an end-to-end latency variation of less than 6 ms and a jitter of less than 100 microseconds, achieving an effectiveness of over 95%.