1. Introduction
Future wireless networks must continually evolve to adapt to changing wireless communication environments. In harsh and complex scenarios such as wilderness exploration and underwater sensor deployments, the high mobility of nodes leads to frequent changes in network topology, resulting in message loss or significant delays. Traditional routing protocols are often unreliable and ineffective in such circumstances. To facilitate networking and communication in these challenging environments, researchers have introduced a specialized network paradigm that relies on the mobility of nodes to relay information: Delay Tolerant Networks (DTNs) [
1]. As an innovative communication approach, DTNs provide a new way to address the instability of connections in wireless networks. Rather than depending on the end-to-end connectivity required by traditional networks, DTNs rely on opportunistic encounters between mobile nodes for data transmission: nodes relay messages when they come into contact with each other, even in resource-constrained environments. Store-carry-forward routing [
2] is one of the key mechanisms for addressing routing challenges in DTNs. If a node is not connected to other nodes, it stores messages in its buffer until another node enters its communication range. Once this occurs, the message is forwarded to that node, with the hope that intermediate nodes can relay the message to the destination node.
Based on this concept, researchers have proposed various DTN routing protocols. The design of message routing and forwarding protocols is a central focus of DTN research, as these protocols play a key role in DTN message transmission. They fall into three main types: historical information-based routing protocols, community-based routing protocols, and reinforcement learning-based routing protocols.
History-based routing protocols utilize historical information about nodes to assess the probability of encounters between nodes, enabling the selection of the next-hop node for message forwarding. The Prophet protocol [
3] calculates the probability of nodes encountering the destination node based on their historical information and uses this encounter probability as a utility value to evaluate the likelihood of successfully forwarding messages to the destination node. Ref. [
4] proposes a privacy-preserving protocol for utility-based routing (PPUR) in Delay Tolerant Networks (DTNs). The protocol aims to address privacy concerns while optimizing routing decisions based on utility. Ref. [
5] presents the Flooding and Forwarding History-Based Routing (FFHBR) algorithm, which determines the best relay node by analyzing the node encounter history and then decides whether to flood the message or forward it directly to the target node. Ref. [
6] proposes an Enhanced Message Replication Technique (EMRT) for DTN routing protocols, in which EMRT dynamically adjusts the number of message replicas, based on encounter-based routing metrics, network congestion, and capacity, to minimize overhead and maximize the delivery rate.
In terms of community attributes, ref. [
7] proposes an opportunistic social network routing algorithm (UADT) based on user-adaptive data transmission that takes into account factors such as user preferences, social relationships, and network conditions to improve the efficiency and effectiveness of data transmission in opportunistic social networks. Ref. [
8] introduces an adaptive multiple spray-and-wait routing algorithm based on social circles in delay-tolerant networks (DTNs). The algorithm dynamically adjusts the number of message copies sprayed based on the encounter history and social relationships between nodes. Ref. [
9] presents a Hybrid Social-Based Routing (HSBR) protocol that utilizes social characteristics, such as community structure and social similarity, to make forwarding decisions. Ref. [
10] presents the Community Trend Message Locking (CTML) routing protocol. The protocol locks messages to specific communities based on their content and trends, improving message delivery efficiency in DTNs. However, these routing protocols struggle to adapt quickly to node mobility and changes in network topology. In recent years, researchers have increasingly recognized the application of reinforcement learning [
11] in DTNs. Q-learning [
12] is a value iteration-based reinforcement learning algorithm that learns a value function called the Q-function. The double Q-learning proposed in [
13] is an enhanced version of the Q-learning algorithm. Existing routing algorithms based on reinforcement learning models often depend on the reward function, which significantly influences the effectiveness of the learning strategy. Depending on different routing optimization criteria, tailored reward functions are designed for use during the training and learning processes. Ref. [
14] uses the Q-learning algorithm to evaluate the encounter probability between nodes for packet forwarding in vehicular opportunistic networks and selects the next-hop node based on this probability together with the relative velocity. Ref. [
15] introduces an adaptive routing protocol based on improved double Q-learning, where the algorithm determines the next-hop node for data packets based on a hybrid Q-value. Although the Q-learning-based DTN protocols above achieve good results, they have difficulty learning sufficiently under extreme network conditions and cannot guarantee that the best route is found in all situations. Ref. [
16] proposes a delay-tolerant network routing algorithm called Double Q-learning Routing (DQLR); it utilizes a double Q-learning algorithm to address the issue of packet delivery delay in DTNs. Ref. [
17] proposes a Double Q-learning-based routing protocol for opportunistic networks called Off-Policy Reinforcement-based Adaptive Learning (ORAL). The protocol utilizes a weighted double Q-estimator to make routing decisions, addressing the limitations of existing protocols.
Table 1 compares the types and drawbacks of the DTN routing protocols mentioned above.
On the basis of the above, we employ a reinforcement learning algorithm, double Q-learning, to adapt to rapid changes in the network, and we further define the relationships between nodes in the DTN and synthesize message attributes. By performing multi-decision routing based on these factors, a forwarding node can more effectively judge whether the next hop will successfully deliver the message to the destination node.
Most existing routing protocols lack real-time awareness during network transmission and struggle with efficient message forwarding. In response, we designed a Multi-Decision Dynamic Intelligence (MDDI) routing protocol. This protocol combines reinforcement learning algorithms with node relationships and message attributes to achieve multi-decision routing selection. Its goal is to maintain good message transmission performance in all states of the network. In MDDI, the Q-values in the Q-tables within the nodes are continuously adjusted as the network changes, and a double Q-updating strategy is used to avoid overestimation and provide more accurate Q-values for evaluating node performance. Meanwhile, to adapt to every state of the network, we also take node relationships into account when deciding the best next hop, based on node interaction information as well as message attributes, thereby improving the overall performance of the DTN.
The main contributions of this paper are as follows:
(1) Real-time sensing of network performance. When calculating the reward value and the discount factor, we use the network's average delay, the average number of hops, and the message attributes at the nodes after a message is forwarded as the content for learning and training.
(2) Introducing node relationships to improve routing decisions. Combining the historical interaction information between nodes and considering the global network, nodes are classified into three types of relationships: friends, colleagues, and strangers.
(3) Reinforcement learning is combined with node relationships as well as message attributes to decide the best next hop during the routing process to achieve dynamic intelligent routing decisions.
The rest of this paper is organized as follows: In
Section 2 and
Section 3, routing decisions in a double Q-learning environment and dynamic decisions based on node relationships in DTNs are described, and a multi-decision dynamic intelligent routing protocol is proposed. The performance of the proposed protocol is evaluated and compared with conventional protocols in
Section 4.
Section 5 discusses the impact of the proposed protocol and some potential limitations. Finally,
Section 6 concludes this work and proposes future research directions. To clearly describe the proposed routing protocols in the following sections, we provide a list of notations and abbreviations in
Table 2.
2. Routing Decision in Double Q-Learning Environment
In this section, we explore the application of Q-learning algorithms for routing decisions in DTNs. Our main focus is on two key components: the double Q-learning algorithm and the Q-value update.
2.1. Double Q-Learning Algorithm in DTNs
Q-Learning is a value-based algorithm in the field of reinforcement learning, and its core involves constructing a table called the Q-table. In this table, each row corresponds to a state, each column corresponds to an action, and the value in each cell represents the maximum expected future reward for taking a specific action in a particular state. By progressively updating the Q-Table, the algorithm can gradually learn the optimal action strategy for each state. In this way, by selecting the action with the maximum value in the corresponding row for each state, the goal of maximizing cumulative rewards can be achieved. The Q-learning model iteratively trains through various components, including agents, states, actions, rewards, episodes, and Q-values. The update formula for the state-action values in the Q-learning algorithm is shown in (
1).
$S$ and $A$ represent a state and an action, respectively. $Q(S,A)$ represents the Q-value for taking action $A$ in state $S$, $S'$ represents the next state, $A'$ represents the next action, $R$ represents the actual reward obtained from taking that action, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\max_{A'} Q(S',A')$ represents the Q-value of the action with the highest Q-value among all possible actions in the next state $S'$.
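Assuming (1) is the canonical temporal-difference form of the Q-learning update, it can be written with the symbols above as:

$$
Q(S,A) \leftarrow Q(S,A) + \alpha \Big[ R + \gamma \max_{A'} Q(S',A') - Q(S,A) \Big]
$$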
The significance of (1) is to gradually establish the optimal policy by updating the Q-value in the Q-table for the action that yields the highest reward in the current state $S$. The discount factor $\gamma$ takes into account future rewards, making the algorithm more focused on long-term benefits rather than short-sighted strategies. It also results in smoother updates of Q-values, avoiding abrupt changes. It can be observed that the Q-learning algorithm's update is a typical application of temporal difference methods.
During the message delivery process from source nodes to destination nodes, the entire network provides the necessary information for message forwarding. Therefore, in this paper, we consider the entire mobile network as a reinforcement learning environment, where nodes that relay messages are treated as intelligent agents. The action space of these agents consists of forwarding data packets to the next-hop nodes, and selecting the appropriate next-hop to forward the message is considered an action choice in reinforcement learning. All nodes in the network can serve as storage nodes for data packets, so the collection of all nodes in the network forms the state space of the agents. When a message is successfully forwarded to the next-hop node, the agent receives an immediate reward value from the environment, which is used for updating the Q-values.
Figure 1 depicts a typical reinforcement learning example within a delay-tolerant network. In this illustration, a message resides at node $a$, which can be regarded as a state. $m_i^d$ represents message $i$ whose destination node is $d$. Nodes $b$, $c$, and $e$ represent the next action choices available to node $a$ for forwarding $m_i^d$ toward the destination node $d$. The Q-table stores the Q-values associated with node $a$ choosing nodes $b$, $c$, and $e$ as forwarding nodes for $m_i^d$. When the message eventually reaches its destination node $d$, we update the rewards associated with node $a$'s choices of nodes $b$, $c$, and $e$ as forwarding nodes for $m_i^d$.
In the proposed protocol, each node in the network stores and maintains two Q-tables, referred to here as $Q_1$ and $Q_2$. These Q-tables consist of columns and rows storing the reward estimates of the best actions for each state. When a node in the network begins to move without having forwarded any messages yet, it initializes the corresponding Q-values in its Q-tables to 0 upon encountering other nodes. Whenever a message is successfully relayed to the destination node, each node that relayed the message updates its two Q-tables and shares its local information only with neighboring nodes. The two Q-values stored in the respective Q-tables serve complementary roles: one Q-value is used for selecting the best action, and the other is used to estimate the maximum Q-value, up to the termination state (successfully forwarding the message to the destination node). Double Q-learning is employed to address the overestimation problem that may lead to local optima in routing. The two Q-values change with variations in the network's topology and node states, allowing the proposed algorithm to adapt to highly dynamic environments.
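As an illustration of this bookkeeping (a minimal sketch, not the paper's implementation), a node might maintain its two Q-tables as follows; the class, method, and parameter names are assumptions introduced here for clarity.

```python
import random

class NodeAgent:
    """Minimal sketch of a DTN node keeping two Q-tables for double Q-learning.
    Keys are (destination, next_hop) pairs, values are Q-estimates; all names
    are illustrative and not taken from the paper's implementation."""

    def __init__(self, node_id, alpha=0.1):
        self.node_id = node_id
        self.alpha = alpha          # learning rate
        self.q1 = {}                # first Q-table
        self.q2 = {}                # second Q-table
        self.contacts = set()       # nodes encountered so far

    def on_encounter(self, other_id):
        # On first contact, Q-entries are implicitly 0 (see .get defaults below).
        self.contacts.add(other_id)

    def update(self, destination, next_hop, reward, gamma):
        """Double Q-learning style update after forwarding a message bound
        for `destination` via `next_hop`."""
        key = (destination, next_hop)
        # Randomly pick which table to update; select the best follow-up relay
        # with one table and evaluate it with the other to curb overestimation.
        if random.random() < 0.5:
            select, evaluate = self.q1, self.q2
        else:
            select, evaluate = self.q2, self.q1
        best = max(self.contacts,
                   key=lambda x: select.get((destination, x), 0.0),
                   default=next_hop)
        target = reward + gamma * evaluate.get((destination, best), 0.0)
        select[key] = select.get(key, 0.0) + self.alpha * (target - select.get(key, 0.0))
```

The random choice of which table receives the update is the standard double Q-learning device for decoupling action selection from value estimation.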
2.2. Q-Value Update
In DTNs, the dynamic movement of nodes allows for opportunistic connections between nodes. Messages are transmitted hop by hop towards the destination node, starting from the source node, following routing protocols, and leveraging opportunistic contacts between nodes. During this process, the Q-value update formula is used to establish and update the corresponding state-action values. In this reinforcement learning environment, the learning task is assigned to each node, and as a result, the learning process involves updating the Q-tables. $Q_1$ and $Q_2$ learn from each other; they represent future reward values for the same action and state, with the same update formula but different values. Their update formulas are shown in (2) and (3).
$Q_1(a,b)$ and $Q_2(a,b)$ represent the expected reward when node $a$ selects node $b$ as the relay for $m_i^d$, that is, the Q-value of node $a$ selecting node $b$ for relaying. $\alpha$ represents the learning rate of a node in the network, which is used to control the extent to which Q-values change with dynamic network variations. $R$ represents the actual reward obtained after $m_i^d$ is forwarded from node $a$ to node $b$. $\gamma$ is the discount factor associated with node $a$'s forwarding of $m_i^d$. By adjusting $\gamma$, we can control the degree to which the node weighs short-term and long-term consequences when selecting the next hop. In the extreme case $\gamma = 0$, only the current outcome of the action is considered, i.e., whether the next hop is the message's destination node; when $\gamma$ approaches 1, more emphasis is placed on previous learning results. $\max Q_1$ and $\max Q_2$ refer to the highest expected reward values when node $a$ forwards $m_i^d$ to all possible nodes, with $x$ and $y$ representing the corresponding relay nodes at that time, drawn from the set of contact nodes of node $a$, i.e., all nodes that node $a$ encounters during its movement. From the equations, it can be observed that Q-value updates primarily depend on the learning rate, the actual reward value, and the discount factor.
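Assuming the canonical double Q-learning form adapted to this notation (a sketch; the exact expressions in (2) and (3) may differ), the two updates read:

$$
Q_1(a,b) \leftarrow Q_1(a,b) + \alpha \Big[ R + \gamma \, Q_2\big(a, \arg\max_{x} Q_1(a,x)\big) - Q_1(a,b) \Big]
$$

$$
Q_2(a,b) \leftarrow Q_2(a,b) + \alpha \Big[ R + \gamma \, Q_1\big(a, \arg\max_{y} Q_2(a,y)\big) - Q_2(a,b) \Big]
$$

where $x$ and $y$ range over the set of contact nodes of node $a$.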
When $m_i^d$ reaches the destination node $d$, the $R$ values of all nodes along its forwarding path are updated. The actual reward value $R$ reflects the quality of a single forwarding decision. End-to-end delay represents the time it takes from when a data packet is sent from the source node until it is received by the destination node. The magnitude of delay directly impacts the usable range of a DTN and is an important metric to consider in the design of routing protocols. The average hop count is the total number of hops experienced by the copies of all messages in the DTN, divided by the total number of messages generated in the network. The average hop count, along with end-to-end delay, reflects the overall performance of routing protocols, with fewer hops and lower delay indicating more efficient routing. Therefore, in this paper, when calculating reward values, both hop count and delay are considered in order to control energy consumption. The update formula for the reward value $R$ is defined in (4).
$D_{avg}$ represents the average delay in the network at that moment. $d_{ab}$ signifies the time it takes for $m_i^d$ to reach the destination node after being forwarded from node $a$ to node $b$. The corresponding term in (4) represents the contribution of node $a$'s choice of node $b$ as the next hop to reducing the average network delay: the smaller $d_{ab}$ is, the greater the contribution to reducing the average network delay and the larger the reward value. $h_{ab}$ signifies the number of hops it takes for $m_i^d$ to reach the destination node after being forwarded from node $a$ to node $b$, and the corresponding term denotes the contribution of node $a$'s selection of node $b$ as the next hop to reducing the average number of hops in the network: a smaller $h_{ab}$ signifies a greater contribution to reducing network overhead, resulting in a higher reward value. In this context, the constant 3 represents the average number of hops obtained through multiple simulations of the Prophet routing algorithm.
The discount factor affects the likelihood of selecting a node that has previously been chosen for forwarding. In this paper, the discount factor is updated when $m_i^d$ is forwarded or reaches the destination node. It is calculated by (5).
$\gamma_0$ is a discount factor constant, $T_{ttl}$ represents the total lifetime of the message, and $T_{rem}$ indicates the remaining time to live for the message. $n_f$ denotes the number of times $m_i^d$ has been forwarded up to this point. The larger $n_f$ is, the more frequently $m_i^d$ has been forwarded, indicating that the efficiency of previously selected forwarding nodes is only moderate. The smaller $T_{rem}$ is, the shorter the remaining lifetime of $m_i^d$, emphasizing that node $a$ should prioritize the result obtained from selecting the current next hop.
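Equations (4) and (5) are described only verbally above; the following Python sketch gives one plausible instantiation consistent with that description. The function names, the variable names, and the exact functional forms are illustrative assumptions, not the paper's formulas.

```python
def reward(avg_delay, delay_ab, hops_ab, ref_hops=3.0):
    """Illustrative reward in the spirit of (4): larger when forwarding via b
    reduces delay and hop count. The paper's exact form may differ."""
    delay_term = avg_delay / max(delay_ab, 1e-9)   # contribution to reducing average delay
    hop_term = ref_hops / max(hops_ab, 1)          # contribution to reducing average hops (3 = Prophet baseline)
    return delay_term + hop_term


def discount(gamma0, ttl_total, ttl_remaining, forward_count):
    """Illustrative discount in the spirit of (5): shrinks as the remaining
    lifetime shrinks and grows with the number of past forwardings.
    The paper's exact form may differ."""
    lifetime_ratio = ttl_remaining / max(ttl_total, 1e-9)
    forwarding_weight = forward_count / (forward_count + 1)
    return gamma0 * lifetime_ratio * forwarding_weight
```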
Incorporating the relationship between nodes and messages, as well as real-time performance metrics in the network, such as average message hops and average latency, enables the double Q-learning algorithm to accurately assess the corresponding Q-values when nodes forward messages through learning from the network environment.
3. Proposed Routing Protocol
In this section, we introduce a dynamic decision-making routing approach based on node relationships and combine it with the algorithms introduced in
Section 2 to propose and elaborate a multi-decision dynamic intelligent routing protocol.
3.1. Dynamic Decision Based on Node Relationships
In the design of community-based routing schemes, the primary challenge lies in defining the social relationships between nodes and determining the mechanism for message propagation. Due to intermittent connections between nodes, messages must be routed opportunistically: message exchange occurs only when two nodes are within each other's communication range, and the node with the lower delivery probability hands the message to the one with the higher probability. Node features can serve as metrics for quantifying social relationships between nodes. Previous research has mainly relied on historical encounter information to derive social relationship attributes. However, depending solely on social relationships can lead to an uneven distribution of load on nodes with higher degrees of connectivity. Moreover, most existing schemes have primarily focused on node attributes and have not comprehensively considered message attributes and interaction quality. In this paper, when defining network node relationships, we take into account both the global knowledge of nodes and comprehensive node interaction information. The network node relationships are defined as the following three types:
Friendship Relationship: A friend node, compared with other nodes in the network, interacts with the local node more frequently, and the quality of interaction is higher.
Colleague Relationship: A colleague node, compared with other nodes in the network, has occasional contact with the local node within a certain timeframe, but the quality of interaction is generally lower, indicating a less intimate connection.
Stranger Relationship: A stranger node, compared with other nodes in the network, has minimal or almost no social interaction with the local node.
Figure 2 illustrates that when nodes
a and
d established a connection at time t0, they also connected at moments t1, t2, t3, and t4 within the past time interval
T. When they are in the process of connecting, both nodes update their historical interaction information, including the average number of encounters within a period
T, the average encounter duration, and the average number of forwarded messages.
Next, let us take nodes $a$, $b$, and message $m_i^d$ as an example and suppose we need to decide whether to forward $m_i^d$ from node $a$ to node $b$. First, the historical encounter information between node $a$ and node $b$ is computed. Node-to-node relationships are determined by historical encounter information and interaction data. Nodes $a$ and $b$ have $n_{ab}$ encounters within the time interval $T$, and each encounter lasts for a duration of $t_i$. The average encounter duration is given by (6).
The quality of interaction between node $a$ and node $b$ is evaluated based on the number of messages successfully forwarded to each other. The number of messages forwarded during one encounter is $f_i$. The average number of messages forwarded between nodes $a$ and $b$ within the time interval $T$ is calculated by (7).
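Assuming (6) and (7) are the plain averages implied by the description (the notation $\bar{T}_{ab}$ and $\bar{F}_{ab}$ is introduced here for illustration), they can be sketched as:

$$
\bar{T}_{ab} = \frac{1}{n_{ab}} \sum_{i=1}^{n_{ab}} t_i , \qquad \bar{F}_{ab} = \frac{1}{n_{ab}} \sum_{i=1}^{n_{ab}} f_i
$$

where $\bar{T}_{ab}$ is the average encounter duration and $\bar{F}_{ab}$ the average number of forwarded messages between nodes $a$ and $b$ within $T$.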
Within the time interval $T$, the total number of historical encounters between node $a$ and all other nodes it has come into contact with, the total of the historical average encounter durations, and the total of the historical average numbers of forwarded messages are calculated by (8), (9), and (10), respectively.
Substituting (8)–(10) into (11) yields the contact value between node $a$ and node $b$.
The relationship between node $a$ and node $b$ is defined by (12). In (12), the two thresholds represent the friend threshold and the colleague threshold; after multiple simulation runs, the network performs best when the friend threshold and the colleague threshold are set to 0.15 and 0.09, respectively.
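To make the classification concrete, the following Python sketch computes a contact value in the spirit of (8)–(11) and applies the two thresholds from (12). The normalization used here (averaging the pairwise shares of node $a$'s totals) is an assumption; only the threshold values 0.15 and 0.09 come from the text.

```python
FRIEND_THRESHOLD = 0.15      # best-performing value reported in the text
COLLEAGUE_THRESHOLD = 0.09   # best-performing value reported in the text

def contact_value(n_ab, t_ab, f_ab, n_total, t_total, f_total):
    """Illustrative contact value: each pairwise metric (encounter count,
    average encounter duration, average forwarded messages) is normalized by
    node a's totals and the three shares are averaged. The exact combination
    used in the paper's Equation (11) may differ."""
    shares = [
        n_ab / n_total if n_total else 0.0,
        t_ab / t_total if t_total else 0.0,
        f_ab / f_total if f_total else 0.0,
    ]
    return sum(shares) / len(shares)

def relationship(value):
    """Map a contact value to the three relationship types of (12)."""
    if value >= FRIEND_THRESHOLD:
        return "friend"
    if value >= COLLEAGUE_THRESHOLD:
        return "colleague"
    return "stranger"
```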
Based on node relationships, integrating message remaining time and message forwarding count as one of the decision factors allows for more accurate and efficient forwarding decisions. When nodes a and b establish a connection, if nodes a and b have a friend relationship with the message destination node, then compare their relationship values and forward the message to the node with a greater relationship value.
If nodes $a$ and $b$ have a colleague relationship with the message destination node, not only the relationship value is considered but also the message forwarding count, the remaining message lifetime, and the historical encounter interval. The encounter interval between node $a$ and node $b$ is $I_i$, and their average encounter interval within the period $\tau$ ($\tau < T$) is calculated by (13), where $n$ is the number of intermittent encounters between them within the period $\tau$.
If two nodes have not encountered each other for more than one period $\tau$, then the historical encounter interval is updated using the following rule, in which the average encounter interval within the period $T$ is used, $N$ represents the number of periods during which the nodes have not encountered each other, and the encounter count is taken as 0 when the two nodes have not met for more than one period $T$. The remaining lifetime of $m_i^d$ is then compared with the historical average encounter interval: if the hop count of $m_i^d$ is less than 3 and its remaining lifetime is greater than the average encounter intervals between node $a$ and the destination node and between node $b$ and the destination node, then a relationship value comparison is performed; otherwise, the message is forwarded to node $b$.
If nodes a and b have a stranger relationship with the message destination node d, meaning that the probability of nodes a and b transmitting the message to the destination node is low, then the message is not forwarded and remains at node a, awaiting the next connection.
3.2. Multi-Decision Dynamic Intelligent Routing Protocol
This section combines the double Q-learning algorithm with node relationships and message attributes to propose a multi-decision dynamic intelligent routing protocol. In this protocol, Q-values and node relationship values serve as the primary decision criteria for nodes to determine whether to forward messages. Additionally, it takes into account changes in message states to ensure effective message transmission. Within the double Q-learning framework, reward computation considers the network’s overall average delay and average hops, effectively controlling the number of message duplicates and reducing message delivery latency during the intelligent learning process.
Figure 3 shows the general workflow of the proposed MDDI routing protocol. The MDDI protocol consists of two main parts: a single decision based on double Q-values (SDDQ) and a dynamic decision based on node relationships (DDNR). SDDQ mainly uses the Q-values of the double Q-learning algorithm to make decisions; the discount factor changes dynamically and is determined by the state of the message at the node, while the actual reward value is updated after the message reaches the destination node. When the connecting node is not the destination node of the message, the SDDQ part is applied first; if the message cannot be forwarded according to SDDQ, DDNR is used. DDNR makes a dynamic decision based on node relationships and message attributes: the node relationship is derived from the interaction information between nodes and then combined with the current state of the message to decide whether to forward it or not. Finally, the Q-values and relationship values are updated.
The following section provides a detailed description of the application process of this routing algorithm within the network. In this multi-decision dynamic intelligent routing model, each node in the network stores and updates four tables: two Q-tables, a dynamic node attribute table for participating in routing decisions, and a dynamic message attribute table. The updating of the Q-tables is based on the description in
Section 2.2. The dynamic node attribute table contains global information related to the node. It stores encounter information about all the connected nodes for this node, including the number of historical connections within time interval
T, the average connection duration, the average message forwarding count, and the average encounter interval. The dynamic message attribute table stores message- and node-related attributes, including the time and the number of hops a message takes, after being forwarded from this node to a next-hop node, until it reaches the destination node, and the discount value used by this node when calculating the Q-value for selecting that message's next-hop node. As an example, let us consider a general node $a$ within the network, as shown in the tables below. In Table 3, $b$, $c$, and $d$ denote the nodes that have established a connection with node $a$. The table stores their historical number of encounters with node $a$ during the period $T$, the average encounter duration, the average number of forwarded messages, and the average encounter interval during the period $\tau$. In Table 4, the messages stored in node $a$ are listed together with their corresponding dynamic attributes: the discount factor associated with each message, and the number of hops and the time needed to reach the destination node after the message is forwarded to different nodes.
In
Figure 4, when nodes a and b establish a connection, they examine their respective dynamic attribute tables (
Table 3) when deciding whether to forward a message. They determine their relationship with the message’s destination node based on node relationship values. Simultaneously, node a uses the content of its dynamic message attribute table (
Table 4) to calculate the corresponding R-value and discount value for forwarding the message to $d$ and to compute the corresponding Q-value. When nodes $a$ and $b$ have the same type of relationship with the destination node, the message state is also taken into account.
Algorithm 1 provides a detailed description of the specific decision process of this protocol. When node
a has established a connection with node
b, the message
m is pending forwarding within node
a. First, it is checked whether the connected node $b$ is the destination node $d$ of message $m$. If it is not, then a Q-table is randomly selected, and the corresponding Q-value for the connected node is checked to see whether it is the maximum. If the Q-table entry does not exist or the value is not the maximum, a further decision is made based on the node relationships.
Algorithm 1: Actions when node a establishes a connection with node b
If both nodes have a friend relationship with the destination node, only the relationship values are compared, and the message is forwarded to the node with the higher relationship value. If node b has a colleague relationship with the destination node, and node a also has a colleague relationship with the destination node, then their contact with the destination node is not frequent enough, so decisions cannot be based solely on encounter information.
The forwarding hop count and remaining message time are compared. If the forwarding count is less than three and the remaining message time is greater than the average encounter interval between the two nodes and the destination node, then further relationship value comparisons are performed. If node a and the destination node are strangers, the message is forwarded to node b.
If both node a and node b are strangers to the message destination node, meaning that the probability of both node a and node b transmitting the message to the destination node is low, then the message is not forwarded and remains in node a, awaiting the next connection.
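Since the body of Algorithm 1 is presented only as a figure, the following Python sketch outlines the decision flow described above. It reuses the q1/q2/contacts fields from the earlier NodeAgent sketch; the helper methods (relationship_with, relationship_value, avg_interval_to) and the message fields are hypothetical stand-ins for the tables of Section 3.2, not the paper's implementation.

```python
import random

HOP_LIMIT = 3  # hop-count threshold mentioned in the text

def on_connection(node_a, node_b, message):
    """Sketch of the MDDI decision when node_a connects to node_b while
    holding `message`. Returns "forward" or "keep"."""
    dest = message.destination
    if node_b.node_id == dest:
        return "forward"                                   # direct delivery

    # --- SDDQ: single decision based on double Q-values ---
    q_table = random.choice((node_a.q1, node_a.q2))        # pick one Q-table at random
    q_b = q_table.get((dest, node_b.node_id), 0.0)
    best_q = max((q_table.get((dest, x), 0.0) for x in node_a.contacts), default=0.0)
    if q_b > 0.0 and q_b >= best_q:
        return "forward"

    # --- DDNR: dynamic decision based on node relationships ---
    rel_a = node_a.relationship_with(dest)
    rel_b = node_b.relationship_with(dest)
    if rel_a == "friend" and rel_b == "friend":
        # Only relationship values are compared; forward to the larger one.
        forward = node_b.relationship_value(dest) > node_a.relationship_value(dest)
        return "forward" if forward else "keep"
    if rel_a == "colleague" and rel_b == "colleague":
        # Contacts with the destination are infrequent, so the message state
        # (hop count, remaining lifetime) is checked before comparing values.
        if (message.hop_count < HOP_LIMIT
                and message.ttl_remaining > node_a.avg_interval_to(dest)
                and message.ttl_remaining > node_b.avg_interval_to(dest)):
            forward = node_b.relationship_value(dest) > node_a.relationship_value(dest)
            return "forward" if forward else "keep"
        return "forward"
    if rel_a == "stranger" and rel_b != "stranger":
        return "forward"
    # Both strangers: hold the message and wait for the next contact.
    return "keep"
```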
5. Discussion
In Delay Tolerant Networks (DTN), nodes can only transmit messages through opportunistic connections, posing a challenge in effectively utilizing network resources for message forwarding. In previous protocol studies, achieving both high delivery rates and low overhead has been a difficult balance. Our protocol employs dynamic multi-decision-making, ensuring excellent performance in various network states. In
Section 4, simulations demonstrate the MDDI routing protocol's ability to fully utilize network resources and enhance message transmission efficiency in mobile network environments.
Compared to other routing protocols introduced in
Section 1, our proposed protocol has significant advantages in multiple aspects. Firstly, even in situations where network resources are limited, the MDDI routing protocol maintains high delivery rates and low latency. The protocol exploits the dynamic changes in node and message attributes during network topology changes, ensuring outstanding performance regardless of whether the network has few nodes or experiences congestion. Secondly, the MDDI routing protocol consistently incurs lower network overhead. In contrast to some of the routing protocols discussed above, which also maintain high delivery rates but generate numerous message duplicates due to flooding, the MDDI routing protocol, through its multiple decision-making steps, precisely selects the next hop for each message, ensuring delivery to the destination node while effectively controlling the number of message duplicates.
However, despite the numerous advantages of the MDDI routing protocol, there are potential limitations that need consideration. One potential limitation lies in the complexity of maintaining the protocol's attributes. Each node in the network needs to maintain four tables (two Q-tables, a node attribute table, and a message attribute table). In intermittently connected routing environments, these table values require real-time updates, potentially consuming significant node energy. Moreover, frequent interactions between nodes may lead to security issues, making them susceptible to attacks from malicious nodes. For instance, a malicious node might attempt to impersonate another node to access its private information.
In addition to the potential limitations discussed above, the protocol has not been widely deployed in practical networks. Implementing this protocol in real-world scenarios may face various challenges. One challenge is that the protocol might not always maintain high performance in different scenarios. For instance, in disaster and emergency rescue scenarios, historical connections among nodes might not follow strong patterns, and urgent messages should be assigned a higher security level, not solely based on encounter history. Another challenge is privacy and security. In military scenarios, selecting nodes with high trust levels for data transmission is crucial.
Addressing these challenges requires a comprehensive consideration of the characteristics of different scenarios and flexible adjustment of the MDDI protocol’s parameters and strategies. In disaster and emergency rescue scenarios, adopting more flexible message priority algorithms ensures the timely transmission of urgent messages. In military scenarios, establishing trust models, selecting trustworthy nodes for data transmission, and enhancing network encryption and identity verification measures can improve data security.
To sum up, the introduction of the MDDI routing protocol not only underwent thorough theoretical exploration but also received validation through practical simulations. Its advantages, including high delivery rates, low latency, and minimal network overhead, make it an ideal choice in mobile network environments. These research findings not only supplement the existing knowledge base but also provide valuable insights for the development of more advanced network systems in the future.
6. Conclusions
In this paper, we propose a Multi-Decision Dynamic Intelligent (MDDI) routing algorithm for delay-tolerant networks. The algorithm fully considers the characteristics of node interactions, message attributes, and the variability of the network topology. First, we introduce the intelligent double Q-learning algorithm, enabling nodes to learn across the entire network. Each node maintains two Q-tables and decides whether to forward a message based on the Q-values in these tables. If the connected node does not hold the maximum Q-value in the Q-table, the decision is made based on the relationship between the nodes: we first determine whether they are friends, colleagues, or strangers, and then combine this with the message attributes to make the final decision. Through multiple decisions and intelligent algorithms, we can effectively utilize the network's resources and sense the states of nodes and messages in real time, thus improving network performance.
The simulation results show that our proposed protocol consistently achieves the highest delivery rate and the lowest latency and network overhead across different cache sizes, message lifetimes, and message generation intervals. Moreover, the result graphs show that the network overhead of MDDI remains low in all network states, while the message delivery rate stays high even in extreme cases such as node congestion and short message lifetimes.
In future work, we plan to enhance the cache management component to minimize message loss. Additionally, we will design a node authentication mechanism to enhance data transmission security. Moreover, we intend to implement an acknowledgment mechanism to promptly discard duplicates of messages that have already been successfully delivered in the network.