1. Introduction
Cognitive radio is a technology that addresses spectrum scarcity by enabling secondary users (SUs) to access the licensed spectrum of primary users (PUs) in a dynamic and non-interfering manner. It mediates the tension between spectrum regulation and spectrum utilization through time and spatial multiplexing [1]. One of the major challenges in the design of cognitive radio networks (CRNs) is radio resource management, which must efficiently handle spectrum mobility and the quality of service (QoS) requirements of different services and nodes [2]. Consequently, techniques for resource management have received considerable attention [3,4].
However, most research in the CRN field has focused on direct (single-hop) transmission. More recently, the research community has turned to applying the cognitive paradigm in multi-hop networks to supply more spectrum resources for a wider range of applications [5,6]. To fully exploit the characteristics of a multi-hop CRN, it is critical to coordinate route selection with radio resource management and to design a cross-layer routing protocol for high spectrum utility. Because the hops of a multi-hop path traverse different regions, the set of available frequency bands differs from one SU to the next [7]. Designing an efficient and robust spectrum-aware routing protocol is therefore challenging, owing to the absence of topology information and to spectrum dynamics in CRNs.
Traditional protocols decompose the cross-layer design problem into two sub-problems, route selection and resource management, which are optimized separately to reduce computational complexity. Ding et al. [8] proposed a distributed and localized scheme for joint relay selection and channel assignment, and studied a cooperative strategy for the optimization problem based on real-time decentralized policies. However, full cooperation is difficult to achieve among SUs in a CRN because of its inherent uncertainty. A centralized routing protocol was proposed by Lai et al. [9], which expands paths hop-by-hop and discards unnecessary ones; the resource management problem was then formulated as an optimization problem under the corresponding objective and constraints. Nevertheless, the centralized method is computationally more complex and less flexible than distributed schemes. An economic framework was presented by Amini et al. [10] that integrates route selection and spectrum assignment to improve QoS performance in cognitive mesh networks. The decomposed model allows a decentralized implementation of routing and spectrum allocation, which increases the robustness of the algorithm. These works aim at finding a fixed path from the source node to the destination node [11]. However, fixed routing strategies have two disadvantages: a priori knowledge of the topology and spectrum dynamics is required, which is hard to obtain in a multi-hop CRN, and fixed protocols may become invalid because of the dynamic and uncertain nature of the available frequencies.
To realize true cognition, a cognitive radio (CR) should be capable of learning and reasoning [12]. Machine learning has therefore been widely used to handle dynamic channel statistics in CRNs. Raj et al. [13] proposed a two-stage reinforcement approach that selects a channel via a multi-armed bandit and then predicts how long the channel will remain unoccupied; sensing becomes more energy efficient and throughput is improved by saving on spectrum detection. Al-Rawi et al. [14] developed the cognitive radio Q-routing algorithm, which adopts a reinforcement learning (RL) method to enable flexible and efficient routing decisions. Both [13] and [14] adopt single-agent reinforcement learning, whose capacity is limited, whereas a learning strategy with multiple agents is better suited to the complicated problems arising in multi-hop CRNs. Multi-agent learning approaches in CRNs have accordingly drawn the interest of researchers for their superior performance. A conjecture-based multi-agent Q-learning scheme was presented by Chen et al. [15] to perform power adaptation in a partially observable environment, but route selection in a multi-hop CRN was not considered. Pourpeighambar et al. [16] modeled the routing problem as a stochastic game (SG), which was then solved through a non-cooperative multi-agent learning method in which each secondary user (SU) conjectures the other nodes' strategies without acquiring global information. In [16], however, the current conjecture belief is determined only by the previous one, which causes strong correlations between samples, and power adaptation is not considered when solving the routing problem, which affects both power efficiency and the routing decision.
In our previous work [17], we designed a single-agent intelligent joint routing and resource assignment scheme for CRNs that maximizes the cumulative reward. In this paper, we adopt a quasi-cooperative multi-agent learning scheme for routing and radio resource management, which is more efficient than a single-agent strategy in a multi-hop CRN. The scheme aims to minimize end-to-end delay and improve energy efficiency with only finite information exchange between competing SUs. Our contributions are summarized as follows:
- (i)
To jointly capture end-to-end latency and power efficiency, a comprehensive utility function is designed that forms a reasonable tradeoff between the two and accommodates the maximum transmission latency requirement. Queuing theory is adopted to analyze the single-hop latency and to provide a theoretical basis for our cross-layer routing protocol design.
- (ii)
A quasi-cooperative multi-agent learning framework is presented to solve the cross-layer design problem, in which every SU node conjectures the other nodes' strategies from finite information exchange with its previous-hop nodes. The convergence of the quasi-cooperative learning scheme is proven.
- (iii)
To further enhance performance, experience replay is applied to the update of the conjecture belief, which allows greater data efficiency by reusing historical conjectures and breaks correlations to reduce the variance of the updates.
The remainder of this paper is organized as follows. The system model is presented in Section 2. Section 3 models the cross-layer design problem as an SG. The quasi-cooperative multi-agent learning scheme for the cross-layer routing protocol is proposed in Section 4. Section 5 presents the simulation results. Finally, the paper is concluded in Section 6. In addition, a summary of the acronyms used in this paper is listed in Table 1.
2. System Model
A multi-hop CRN comprising multiple PUs and SUs is considered. Specific spectrum bands are assigned to the PUs according to a fixed spectrum allocation regulation. SUs hold no licensed channels and transmit data opportunistically when they find that the frequency bands are not occupied by the PUs. Every SU node has a set of available channels consisting of Data Transmission Channels (DTCs) and a Common Control Channel (CCC). The DTCs are used for data transmission, each SU maintaining its own DTC set, whereas the CCC is used by the SUs to exchange negotiation messages. At any time, a directed communication link can be constructed between two SUs if at least one common DTC exists between them. The network model is shown in Figure 1. In this networking scenario, the multi-hop CRN coexists with two centralized PU networks. As Figure 1 shows, the source SU generates data packets and sends them to the destination node in a multi-hop manner through intermediate SUs. Each PU communicates with its PU base station using a licensed frequency, and the intermediate SU at each hop transmits data via a PU channel when that spectrum band is idle.
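To make the link-construction rule concrete, the following sketch builds a directed link between two SUs whenever they share at least one DTC, using the NetworkX package mentioned in Section 5; the node names and channel sets are illustrative assumptions, not the paper's scenario.

```python
import networkx as nx

# Illustrative DTC sets per SU; in the paper these depend on PU activity.
dtc = {
    "SU1": {1, 2},
    "SU2": {2, 3},
    "SU3": {3, 4},
}

G = nx.DiGraph()
G.add_nodes_from(dtc)

# A directed communication link exists between two SUs
# if they have at least one common DTC.
for u in dtc:
    for v in dtc:
        if u != v and dtc[u] & dtc[v]:
            G.add_edge(u, v, channels=sorted(dtc[u] & dtc[v]))

print(list(G.edges(data=True)))
# e.g., ('SU1', 'SU2', {'channels': [2]}), ('SU2', 'SU1', {'channels': [2]}), ...
```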
We assume that every node in the multi-hop CRN maintains a queuing buffer for the storage of SU packets. For each data flow generated at a source node, packet arrivals are modeled as a stationary Bernoulli process whose mean arrival rate is independent and identically distributed across time slots [18]. In addition, every SU node keeps a separate queue for each traffic flow, and the packet arrival processes of different data streams are mutually independent.
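As a minimal illustration of this arrival model (the arrival probability and number of slots below are arbitrary assumptions, not values from the paper), Bernoulli arrivals into a per-flow FIFO queue can be simulated as follows.

```python
import random
from collections import deque

def simulate_arrivals(arrival_prob=0.3, num_slots=1000, seed=0):
    """Stationary Bernoulli arrivals: one packet per slot with probability arrival_prob."""
    rng = random.Random(seed)
    queue = deque()              # per-flow FIFO buffer at an SU node
    for t in range(num_slots):
        if rng.random() < arrival_prob:
            queue.append(t)      # record the arrival slot of the packet
    return len(queue) / num_slots, queue

rate, queue = simulate_arrivals()
print("empirical arrival rate:", rate)   # close to 0.3 for long runs
```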
The PUs' channel occupancy is modeled as an ON/OFF process [19], in which the OFF periods (when a PU does not occupy the channel) follow an exponential distribution parameterized by the PU departure rate, which also determines the idle probability of the PU channel at each time step t. Accordingly, the probability that the idle period of a PU channel is longer than a given duration decays exponentially with that duration, and the probability of a collision between a PU and an SU within that duration (i.e., the probability that the PU reoccupies the spectrum band during it) is the complement of this quantity. We use the analysis described in [20] to calculate the PU departure rate from its expected mean and deviation. The spectrum statistics parameterized as in [21] change slowly, so they are assumed to be almost static in this work. Every SU can only determine its own position through positioning equipment.
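Assuming the common exponential OFF-period model for ON/OFF PU activity (the rate and duration values below are purely illustrative), the idle-period survival probability and the collision probability described above can be computed as follows.

```python
import math

def idle_longer_than(tau, departure_rate):
    """P(idle period > tau) under an exponentially distributed OFF period."""
    return math.exp(-departure_rate * tau)

def collision_probability(tau, departure_rate):
    """P(the PU reoccupies the band within an SU transmission of duration tau)."""
    return 1.0 - idle_longer_than(tau, departure_rate)

# Illustrative values: departure rate 0.2 per slot, SU transmission of 3 slots.
print(collision_probability(tau=3, departure_rate=0.2))   # ~0.45
```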
4. Joint Routing and Resource Management with Conjecture Based Multi-Agent Q-Learning
To introduce the quasi-cooperative multi-agent learning scheme, a brief overview of multi-agent Q-learning is provided in Section 4.1. In Section 4.2, the Equal Reward Time-slots based Conjectural Multi-Agent Q-Learning (ERT-CMAQL) scheme is presented to solve the cross-layer routing problem. The analysis and proof of its convergence are then outlined in Section 4.3.
4.1. Multi-Agent Q-Learning
Among the various algorithms in the RL framework, Q-learning is a practical approach built on the Q-value. The Q-value is the total expected discounted reward of a state-action pair and describes the value of choosing a particular action in a given state. It weights and ranks the selection probabilities of the different actions: an action with a higher Q-value is more valuable and is given a higher selection probability, and vice versa. To achieve this, the Boltzmann distribution of Equation (19) is used to calculate the probability of choosing an action at time slot t, in which the Q-value of each state-action pair at time step t is scaled by a positive number called the temperature. The larger the temperature, the more uniform the action-selection probabilities become, and vice versa.
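A minimal sketch of Boltzmann (softmax) action selection is given below, using NumPy; the Q-values and temperature are illustrative, and the helper is reused in later sketches.

```python
import numpy as np

def boltzmann_select(q_row, temperature=1.0, rng=np.random):
    """Pick an action index with probability proportional to exp(Q / temperature)."""
    q_row = np.asarray(q_row, dtype=float)
    # Subtract the maximum Q-value for numerical stability before exponentiating.
    logits = (q_row - q_row.max()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_row), p=probs), probs

# Illustrative Q-values for three (channel, power) actions of one SU.
action, probs = boltzmann_select([0.8, 0.5, 0.2], temperature=0.5)
print(action, probs)   # higher-Q actions get larger selection probability
```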
M players are considered, and every player is fitted with a Q-learning agent that learns its own strategy through limited cooperation with the other agents. The Q-value is therefore updated according to the multi-agent Q-learning rule of Equation (20), which involves the learning rate, the discount factor, and the expected reward of an SU at time slot t with respect to its competing SUs; the change in the Q-value is proportional to the expected reward plus the difference between the target and the estimated Q-value. The expected reward is defined in Equation (21) through the joint probability of the competing SUs choosing their actions in their respective states. From Equations (20) and (21), in multi-agent Q-learning an agent needs not only its own transmission strategy but also complete information about the competing SUs' strategies to update its Q-value. However, it is not always practical to observe other SUs' private information in a multi-hop CRN with only finite cooperation. Therefore, designing a quasi-cooperative multi-agent learning scheme that needs only the agent's private strategy and information exchange with its previous-hop nodes is challenging.
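The sketch below shows the shape of such a multi-agent Q-update; the state/action encoding and the expected-reward argument (standing in for the quantity of Equation (21)) are placeholders, not the paper's exact implementation.

```python
import numpy as np

def multi_agent_q_update(Q, s, a, expected_reward, s_next,
                         learning_rate=0.1, discount=0.9):
    """One update of an SU's Q-table in the style of Equation (20).

    `expected_reward` stands for the reward averaged over the (observed or
    conjectured) joint strategy of the competing SUs, as in Equation (21)."""
    td_target = expected_reward + discount * np.max(Q[s_next])
    Q[s, a] += learning_rate * (td_target - Q[s, a])
    return Q

# Illustrative 4-state, 3-action Q-table of one SU agent.
Q = np.zeros((4, 3))
Q = multi_agent_q_update(Q, s=0, a=1, expected_reward=0.6, s_next=2)
print(Q[0, 1])   # 0.06
```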
4.2. Equal Reward Time-Slots Based Conjectural Multi-Agent Q-Learning
A multi-agent learning strategy is more practical for solving the joint design problem in a multi-hop CRN. The main obstacle to establishing a multi-agent learning framework is the demand for complete information about the competing SUs. Owing to the high communication overhead and topology complexity, it is impractical for SU nodes to cooperate with competing SUs and share their private information in a multi-hop CRN. To resolve this contradiction, a conjecture-based multi-agent learning scheme for a quasi-cooperative scenario is proposed, in which each SU node conjectures the other SUs' behavior strategies without full coordination among agents.
Specifically, from Equations (20) and (21), the mixed strategy of the other competing SUs is defined as the joint probability that they perform a given action vector at time slot t. In other words, estimating this joint probability becomes the key challenge when applying the multi-agent Q-learning framework. To address this, a conjecture belief is introduced to approximate it, and ERT-CMAQL is proposed to determine the conjecture asymptotically without complete network information. The probability that the agent chooses a particular action in its state while the other competing SUs execute a particular action vector is given in Equation (22) as the product of the agent's own strategy and the competing SUs' joint strategy. The SU receives a given expected reward exactly when it performs that action while the other nodes select that action vector in the corresponding state; that is, the probability of acquiring this reward is the quantity in Equation (22). Now consider the number of time steps between any two moments at which the SU achieves the same return: each such interval is independent of the others and follows the same distribution, and its average value can be obtained from the SU's private historical observations. Following [15], the probability of this joint event is then approximately the reciprocal of the average interval. Since every SU knows its own transmission strategy, the agent can estimate the conjecture belief from local information alone, as given in Equation (23).
After obtaining the expression for the conjecture belief from local information in Equation (23), the updating rule of the conjecture belief is explored. In quasi-cooperative learning scenarios, agents update their conjecture beliefs based on new observations. Since the observed process is stationary in the time dimension, its mean value is a constant. Specifically, the quotient of the conjecture beliefs at time slots t − 1 and t can be calculated as in Equation (24).
The conjecture belief is then updated as in Equation (25). Expressing the expected reward through this conjecture, the updating rule in Equation (20) can be rewritten as Equation (26). Equation (26) shows that every SU node uses only its private strategy and limited information exchange with its previous-hop nodes to update its Q-value; each SU conjectures the mixed strategies of the other competing SUs on the basis of their variations in response to its own strategy.
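As an illustration of the equal-reward-interval idea behind Equation (23) (the class, variable names, and bookkeeping below are assumptions for the sketch, not the paper's notation), the conjecture can be estimated from the average number of slots between occurrences of the same reward and the agent's own action probability.

```python
from collections import defaultdict

class ConjectureEstimator:
    """Estimate the competing SUs' joint action probability from the average
    interval between time slots that yield the same reward (Equation (23) idea)."""

    def __init__(self):
        self.last_seen = {}                      # reward value -> last slot it occurred
        self.interval_sum = defaultdict(float)
        self.interval_cnt = defaultdict(int)

    def observe(self, t, reward):
        if reward in self.last_seen:
            self.interval_sum[reward] += t - self.last_seen[reward]
            self.interval_cnt[reward] += 1
        self.last_seen[reward] = t

    def conjecture(self, reward, own_action_prob):
        """P(joint event) ~ 1 / mean interval; dividing out the agent's own
        action probability leaves the competing SUs' joint strategy."""
        if self.interval_cnt[reward] == 0:
            return None                          # not enough observations yet
        mean_gap = self.interval_sum[reward] / self.interval_cnt[reward]
        return 1.0 / (mean_gap * own_action_prob)

est = ConjectureEstimator()
for t, r in enumerate([1.0, 0.5, 1.0, 1.0, 0.5, 1.0]):
    est.observe(t, r)
print(est.conjecture(1.0, own_action_prob=0.6))
```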
However, strong correlations exist between consecutive conjectures, which may cause the parameters to become stuck in a poor local optimum and drive the update ratio arbitrarily close to 1. Since the updating rule is fractional and relies strongly on the previous conjecture, the conjecture belief is also prone to falling into a locally optimal solution. To avoid this shortcoming and further improve system performance, experience replay is applied to the conjecture-based multi-agent learning scheme. Over long-term observations, the mean interval between equal rewards is constant owing to time stationarity, so the probability that the SU receives a given expected reward (i.e., that the agents perform a particular action vector in a particular state) is approximately equal to the reciprocal of the mean time interval regardless of the time step (Equation (27)), where any two time slots t and v can be used for the estimate. The relevant quantities at each time step are therefore stored as the agent's experience and pooled over many episodes into a replay memory [27]. During learning, experiences are randomly sampled from the memory pool to update the conjecture, and the update of the conjecture belief at time step t is then given by Equation (28).
This approach has several advantages over the consecutive updating rule in Equation (25). First, each time step of the strategy is potentially used in the update of the conjecture, which improves data efficiency compared with updating directly from consecutive samples. Second, strong correlations between consecutive samples may result in a local optimum; randomizing the samples breaks these correlations and reduces the variance of the updates.
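A minimal sketch of this replay mechanism (the memory size, sampling scheme, and stored tuple are illustrative assumptions) is shown below; the sampled experiences would feed the conjecture update of Equation (28).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size pool of past experiences, sampled uniformly at random."""

    def __init__(self, capacity=500, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded
        self.rng = random.Random(seed)

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size=1):
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=500)
for t in range(100):
    # Illustrative experience tuple: (slot, state, action, conjecture value).
    memory.store((t, 0, 1, 0.4))

batch = memory.sample(batch_size=8)   # decorrelated past conjectures
```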
The details of ERT-CMAQL are given in Algorithm 1.
Algorithm 1 Equal Reward Time-Slots Based Conjectural Multi-Agent Q-Learning
1: Initialize:
2: Set the learning parameters and the replay memory size.
3: For each SU agent Do
4: For each state–action pair Do
5: Initialize the transmission strategy, the conjecture belief, the Q-value, and the replay memory.
6: End For
7: End For
8: Repeated Learning:
9: For each episode Do
10: For each data flow do
11: Initialize the state.
12: Loop
13: Select an action according to the strategy.
14: Execute the action, and obtain the strategies of the previous-hop nodes and the SINR indicator.
15: Observe the reward and the next state according to (14) and (16).
16: Update the Q-value according to (26).
17: Update the strategy according to (19).
18: Sample experiences from the replay memory.
19: Update the conjecture belief according to (28).
20: Store the new experience in the replay memory.
21: Update the time step.
22: Move to the next state.
23: Until the terminal state is reached
24: End For
25: End For
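The skeleton below sketches how the per-episode loop of Algorithm 1 (steps 11-23) might be organized in Python; it reuses the illustrative `boltzmann_select` and `ReplayMemory` helpers sketched earlier, and the toy environment, reward, and simplified conjecture update are placeholders rather than the paper's Equations (26) and (28).

```python
import numpy as np

class ToyEnv:
    """Toy stand-in for the multi-hop CRN environment (not the paper's model):
    four states, three actions, random rewards, state 3 is terminal."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s = min(self.s + 1, 3)
        reward = float(np.random.rand())
        return self.s, reward, self.s == 3

def ert_cmaql_episode(env, Q, conjecture, memory, alpha=0.1, gamma=0.9, temp=0.5):
    """One learning episode for a single SU agent, mirroring steps 11-23 of Algorithm 1."""
    s, t = env.reset(), 0
    while True:
        a, _ = boltzmann_select(Q[s], temperature=temp)           # step 13
        s_next, reward, done = env.step(a)                        # steps 14-15
        # Step 16: Q-update with a conjecture-weighted expected reward
        # (a simplified stand-in for Equation (26)).
        expected_reward = reward * conjecture.get((s, a), 1.0)
        Q[s, a] += alpha * (expected_reward + gamma * np.max(Q[s_next]) - Q[s, a])
        # Steps 18-20: sample a past conjecture, update, and store the new one
        # (a crude stand-in for the replay-based update of Equation (28)).
        sampled = memory.sample(batch_size=1)
        if sampled:
            conjecture[(s, a)] = 0.5 * (conjecture.get((s, a), 1.0) + sampled[0][-1])
        memory.store((t, s, a, conjecture.get((s, a), 1.0)))
        s, t = s_next, t + 1                                      # steps 21-22
        if done:                                                  # step 23
            return Q, conjecture

Q = np.zeros((4, 3))
Q, conj = ert_cmaql_episode(ToyEnv(), Q, {}, ReplayMemory())
```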
4.3. Analysis of ERT-CMAQL
Littman [28] provided the convergence proof of standard Q-learning. Based on this theory, the convergence of ERT-CMAQL is investigated in this section.
Lemma. Suppose there is a mapping defined on the set of all SUs' Q-functions. The updating rule converges to the optimal Q-function with probability 1 if the following two conditions hold (a restatement in standard notation is sketched after this list):
- (1)
The optimal Q-function is a fixed point of the mapping in expectation.
- (2)
A number smaller than one exists such that the mapping contracts the distance between Q-functions by at least that factor for all Q-functions.
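For reference, a standard form of these two conditions (in the spirit of the convergence result in [28]; the symbols $\mathcal{T}_t$, $\mathcal{Q}$, $Q^{*}$, and $\beta$ are our own notation, not necessarily the paper's) can be sketched as:

```latex
% Condition (1): the optimal Q-function is a fixed point of the mapping in expectation.
\mathbb{E}\big[\mathcal{T}_t Q^{*}\big] = Q^{*}, \qquad Q^{*} \in \mathcal{Q}.
% Condition (2): the mapping is a contraction with some modulus \beta < 1.
\exists\, \beta \in (0,1):\quad
\big\lVert \mathcal{T}_t Q_1 - \mathcal{T}_t Q_2 \big\rVert
\le \beta \,\big\lVert Q_1 - Q_2 \big\rVert
\qquad \forall\, Q_1, Q_2 \in \mathcal{Q}.
```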
To apply the Lemma to the convergence proof of the proposed ERT-CMAQL, the following definitions are given.
Definition. The mapping over the set of Q-functions is defined as in Equation (29), and the distance between any two Q-functions is defined as in Equation (30). We first prove that ERT-CMAQL satisfies the first condition of the Lemma.
Proposition 1. The optimal Q-function is equal to the expectation of its map, i.e., it satisfies condition (1) of the Lemma.
Proof. According to the Bellman optimality equation [29], we have Equation (31). Since the reward does not depend on the remaining terms, Equation (31) can be modified into Equation (32). Based on the analysis in Section 4.2, the conjecture belief approximates the true joint strategy, and therefore the optimal Q-function equals the expectation of its map. □
Proposition 2. There is a number smaller than one such that the mapping is a contraction with respect to the distance defined in Equation (30).
Proof. In accordance with the definition of the distance between Q-values, the distance between the images of two Q-functions under the mapping can be bounded as in Equation (33). The second equality follows from Equation (29), the first inequality follows from a standard transformation of the maximum operator, and the second inequality can easily be proven; since the reward is unrelated to the remaining terms, the expression can be rearranged so that the last equality follows from Equation (30). Next, applying Equation (23) to the resulting term gives Equation (34), whose quantities are the reward obtained when the SU selects a given action in a given state while the other nodes select a given action vector, the average number of time steps between any two moments at which the SU achieves the same return, and the SU's strategies in that state at different time steps. When the temperature is sufficiently large, the bounds in Equations (35) and (36) hold, with bounding terms that are polynomials of the corresponding order. Applying Equations (35) and (36) to Equation (19), Equations (37) and (38) can be verified, where the remainder is a polynomial of lower order. Substituting Equations (37) and (38) into Equation (34) and taking the temperature sufficiently large yields an inequality whose contraction modulus is smaller than one. Consequently, condition (2) of the Lemma is satisfied, and ERT-CMAQL is proven to converge provided the temperature is large enough for all agents. □
5. Simulation Results
In this section, the performance of our quasi-cooperative multi-agent learning scheme is evaluated using an event-driven simulator coded in Python 3.5. The network model and the learning framework are built on the Python packages NetworkX and NumPy, respectively. The results of the proposed ERT-CMAQL are compared with (1) Cooperative Multi-Agent Q-Learning (CMAQL), the ideal scheme, which has complete information about the competing SUs; (2) Conjectural Multi-Agent Q-Learning without Experience Replay (CMAQL-ER); (3) Fixed Power-based Conjectural Multi-Agent Q-Learning (FP-CMAQL) proposed in [16], which transmits data with a fixed power level; (4) a single-agent Q-learning scheme called Q-routing presented in [14]; and (5) the Prioritized Memories Deep Q-Network (PM-DQN) based joint design scheme proposed in our previous work [17].
In the multi-agent learning framework, the conjecture belief, the Q-value, and the transmission strategy are initialized for each SU agent. The other system parameters are given in Table 2.
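A sketch of this per-agent initialization (the dimensions and initial values are illustrative assumptions; the paper's actual parameter values are listed in Table 2) might look as follows, reusing the `ReplayMemory` helper from the earlier sketch.

```python
import numpy as np

NUM_STATES, NUM_ACTIONS, MEMORY_SIZE = 16, 5, 500

def init_agent(num_states=NUM_STATES, num_actions=NUM_ACTIONS):
    """Per-SU learning structures: Q-table, uniform strategy, neutral conjecture, replay memory."""
    q_table = np.zeros((num_states, num_actions))
    strategy = np.full((num_states, num_actions), 1.0 / num_actions)   # uniform start
    conjecture = np.ones((num_states, num_actions))                    # initial belief
    memory = ReplayMemory(capacity=MEMORY_SIZE)
    return q_table, strategy, conjecture, memory

agents = {su: init_agent() for su in range(10)}   # e.g., the 10-SU scenario of Figure 2
```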
To verify the performance of the algorithms, a small CRN containing 10 SUs and 4 PUs is simulated first. The SUs are uniformly deployed in a 300 × 300 m region, and the available transmit power is chosen from five levels. The network topology is shown in Figure 2.
Without loss of generality, SU 6 is taken as an example. The single-node performance of SU 6 for the different algorithms is shown in Figure 3 and Figure 4.
Figure 3 illustrates the average reward of SU 6 versus the iteration index. The expected reward first rises and then stays almost steady for all schemes. Furthermore, when converged, CMAQL outperforms all other algorithms. The reward of ERT-CMAQL is slightly lower than that of CMAQL, followed by the CMAQL-ER scheme, and FP-CMAQL obtains the lowest reward. This occurs mainly because, in the CMAQL scheme, agents have the true strategies of the competing SUs through global information exchange, whereas in ERT-CMAQL each agent approximates the mixed strategies of the other SUs via the conjecture belief, which may not be sufficiently accurate. We can also see that the reward of CMAQL-ER is slightly higher than that of ERT-CMAQL before 200 iterations, after which ERT-CMAQL is superior to CMAQL-ER. The reason is that, at first, few samples are stored in the replay memory, so experience replay is inefficient compared with the consecutive updating rule; at the later stage, when samples are abundant, the advantage of experience replay is fully demonstrated. In addition, FP-CMAQL obtains the lowest average reward, which illustrates the importance of power allocation.
The performance in terms of transmission latency and power efficiency is examined next. In Figure 4a, the single-hop latency declines at the beginning and flattens after about 700 iterations for all protocols. CMAQL achieves the lowest transmission latency, slightly shorter than that of ERT-CMAQL. The expected delay of ERT-CMAQL is about 32% lower than that of CMAQL-ER, thanks to experience replay, which avoids poor local minima and enhances data efficiency. The transmission delay of FP-CMAQL is much longer than that of the other three schemes because it fails to adjust the transmission power to the channel status, causing larger overall latency.
Figure 4b shows the average power consumption versus the iteration index. The PCR of the four algorithms, in increasing order, is: CMAQL, ERT-CMAQL, CMAQL-ER, and FP-CMAQL. The reason is the same as for Figure 4a. Consequently, the results illustrate that the energy efficiency of the proposed ERT-CMAQL is close to that of the optimal scheme and holds a clear advantage over the other algorithms.
Figure 5a shows the expected one-hop delay of the different SU nodes for the CMAQL, ERT-CMAQL, and CMAQL-ER schemes. The transmission delay of CMAQL is the lowest of the three algorithms, the latency of ERT-CMAQL is slightly higher than that of CMAQL, and CMAQL-ER yields the longest one-hop delay for all SU nodes in the network. This illustrates the effect of experience replay, which brings the performance of ERT-CMAQL close to the optimum and gives it a clear advantage over the scheme applying the consecutive updating rule.
In addition, we find that SU 6 has the longest expected one-hop delay of all SUs, while the transmission latency of SU 4 and SU 7 is relatively low on average. This is because SU 6 is the SU closest to the destination node, so data flows pass through it with higher probability, whereas the locations of SU 4 and SU 7 are relatively isolated, so arriving packets are scarce. SU 5 is the destination node and forwards no packets, so its transmission delay is 0. A comparison of the PCR of the three protocols over the SU index is shown in Figure 5b. The PCR of ERT-CMAQL is close to that of CMAQL for all SUs. CMAQL-ER has the highest PCR of the three schemes for all SU nodes, consuming 46% more power per unit of throughput on average than ERT-CMAQL. The reason is similar to that for the results in Figure 5a. Therefore, the proposed ERT-CMAQL achieves relatively low transmission latency and near-optimal power consumption for every SU node in the network.
Figure 6 illustrates the effect of the PU arrival probability on the expected end-to-end delay and the system PCR. As shown in Figure 6a, the transmission latency of all four algorithms grows as the PU arrival probability increases, because a larger PU arrival probability results in more interruptions and transmission failures, which cause longer delays for data retransmission. CMAQL achieves the lowest transmission delay, followed by ERT-CMAQL; the latency of CMAQL-ER is higher than that of ERT-CMAQL, and FP-CMAQL has the longest expected end-to-end latency. This demonstrates the advantage of the proposed ERT-CMAQL, whose performance is closest to the ideal value. Furthermore, the transmission latencies of the four algorithms are relatively close when the PU arrival probability is low but differ considerably as it increases. When the probability of PU arrival is low, there is little conflict between PUs and SUs, so the four algorithms achieve almost the same latency. Since CMAQL and ERT-CMAQL can better avoid conflicts with the PUs, they maintain relatively low latency as the PU arrival probability increases, whereas in CMAQL-ER and FP-CMAQL the data transmission of the SUs is often interrupted by PU arrivals, so the transmission latency is high. The effect of the PU arrival probability on the PCR, shown in Figure 6b, follows a similar trend to Figure 6a and is not detailed here.
Next, for a more general case, a networking scenario comprising 20 SUs and 10 PUs uniformly deployed in a 500 × 500 m area is considered, and the available transmit power is chosen from ten levels. The network topology of the second experimental scenario is shown in Figure 7.
The system performance in the two experimental environments is compared next. Figure 8 depicts the expected end-to-end latency of the six algorithms in the networks with 10 SUs and with 20 SUs. With an increasing number of routes, the end-to-end delay sharply declines and then remains steady for all algorithms in both networking scenarios. When converged, CMAQL achieves the lowest end-to-end delay thanks to its complete information, which helps agents make more accurate and comprehensive decisions. Given the conjecture belief and experience replay, the total transmission delay of ERT-CMAQL is close to that of CMAQL. CMAQL-ER, with its consecutive updating rule, takes more time to transmit packets from the source to the destination than ERT-CMAQL, followed by FP-CMAQL. The transmission latency of the two single-agent schemes is considerably longer because all information and computation are processed by a single agent, which is inherently less efficient than multi-agent schemes. PM-DQN produces a longer end-to-end delay than Q-routing in the network with 10 SUs, but its performance is superior to Q-routing in the larger network, which illustrates the advantage of PM-DQN in networking scenarios with a large state space.
By comparing the two networking scenarios, we find that the end-to-end latency of all protocols in the network with 20 SUs is longer than in the small-scale network, and the latency of the single-agent schemes increases more markedly than that of the other algorithms. The reason is that the number of links along a route increases with the number of SUs in the network, so the accumulated single-hop latency along the route, i.e., the end-to-end delay, grows as well; single-agent schemes are more sensitive to the number of SUs, which leads to longer total latency. Furthermore, the convergence speed of the multi-agent schemes remains almost the same in both networking scenarios, whereas Q-routing and PM-DQN converge at around 1700 routes in the first experiment and nearly 3000 routes in the second. From the theoretical analysis, multi-agent learning schemes are not affected by network scale because each SU is equipped with its own agent and follows the same learning rule, whereas the computational load rises as the number of SUs grows, which heavily affects single-agent schemes with only one agent in the network.
We further investigate the packet loss ratio (PLR) of the six protocols in the different networking scenarios in Figure 9. In both experimental environments, CMAQL has the lowest PLR, followed by the proposed ERT-CMAQL. The PLR of Q-routing and PM-DQN is noticeably higher than that of the multi-agent learning schemes, which illustrates the reliability benefit of using multiple agents. Comparing Figure 9a,b, we find that the PLR of Q-routing is larger than that of PM-DQN in the first network, whereas PM-DQN is more robust than Q-routing in the large-scale network, because PM-DQN is more efficient in networks with a large state space owing to the capability of its neural network. In addition, the PLR of all multi-agent schemes remains almost the same, which demonstrates that routing reliability is not affected by network scale for multi-agent learning algorithms. The reason is that, in multi-agent collaboration schemes, every SU node is equipped with an agent regardless of the network size, which preserves robustness as the number of SUs increases.