1. Introduction
The implementation of a charging infrastructure and networks for electric vehicles (EVs) encounters numerous challenges [1], particularly when serving distributed EV charging networks with limited wireless network resources [2,3]. Factors such as congestion during EV travel, diverse preferences among EV users, and uncertainties in decision-making regarding charging station (CS) resources profoundly impact system operation and network resource allocation. Meanwhile, with the development of smart grids, electric vehicle charging networks have also experienced rapid growth [4]. In addition, current and upcoming wireless networks, including 5G/6G and their successors, are expected to provide significantly higher data rates, lower latency, and wider network coverage than previous generations [5,6]. This progress stems from new design principles that enable such networks to support a large number of connected devices simultaneously. Reliable connectivity and seamless information sharing are therefore essential, especially for the growing number of Social Internet of Things (SIoT) applications [7,8], which depend on smooth interactions to keep improving.
In smart electric vehicle charging networks, the communication system adjusts multiple features to achieve desirable communication outcomes. For example, factors such as communication power allocation at charging stations, the presence of active or passive relays [9], the communication channel quality, and the presence of obstacles may affect the overall communication conditions [10]. As actions are taken to optimize the communication network, the system’s communication state changes accordingly. In [11], this is represented using Markov Decision Processes (MDPs), with the communication quality serving as the reward factor. In this scenario, the numbers of interventions and states both grow exponentially.
To address these challenges, this paper introduces a novel framework leveraging emerging technologies, specifically reconfigurable intelligent surfaces (RISs) and causal-structure-based reinforcement learning techniques.
1.1. Background
Reconfigurable intelligent surfaces (RISs) are a revolutionary technology in the field of wireless communication and signal propagation [12]. The structure of RISs typically includes a dielectric surface panel, which is a subgroup of periodic structures [13] composed of repeating minimal geometric shapes called unit cells. Each unit cell contains conductive printed patches, also known as scatterers, whose sizes are a small fraction of the operating wavelength. The macroscopic effect of these scatterers defines a specific impedance surface [14], which, when controlled, can manipulate the waves reflected from the dielectric surface panel. Each scatterer or cluster of scatterers can be adjusted to reconstruct electromagnetic waves with desired characteristics across the entire surface. By intelligently controlling the phase, magnitude, and polarization of reflected or transmitted waves, RISs can enhance wireless communication performance, improve signal coverage, and reduce interference in various wireless systems.
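As a concrete illustration of this phase control, the short sketch below compares the power received over a set of scattered paths when the RIS phases are random versus when they are co-phased so the cascaded components add constructively; the single-reflection geometry and all values are illustrative assumptions, not parameters from this paper.

import numpy as np

rng = np.random.default_rng(5)
M = 32                                               # number of RIS unit cells (illustrative)

h_in = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)   # incident channel
h_out = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)  # reflected channel

def received_power(phases):
    """Combine the M scattered components after applying the RIS phase shifts."""
    return np.abs(np.sum(h_out * np.exp(1j * phases) * h_in)) ** 2

random_phases = rng.uniform(0, 2 * np.pi, M)
aligned_phases = -np.angle(h_in * h_out)              # co-phase every element's cascaded path

print(f"random phases : {received_power(random_phases):.2f}")
print(f"aligned phases: {received_power(aligned_phases):.2f}")   # much larger combined power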
Electric vehicles (EVs) are considered an emerging strategy to reduce the dependence on oil and provide opportunities to reduce carbon emissions [15]. The main elements of an EV system include EVs, charging stations equipped with charging points, and associated communication systems. Managing EV charging and optimizing their interaction with the power grid relies on appropriate communication infrastructure between EVs, charging stations, and the power grid. EV charging communication networks play a crucial role in promoting the widespread adoption of electric vehicles by providing reliable communication connections for EV owners, thereby contributing to the transition to sustainable transportation.
In [16], social networks are utilized for searching internet resources, routing traffic, or selecting effective content distribution strategies. The Internet of Things (IoT) [17] integrates a vast array of technologies, envisioning various things or objects interacting and cooperating with each other through a series of communication protocols to achieve common goals. The convergence of the Internet of Things and social networks into the Social Internet of Things (SIoT) [18] is anticipated to have many desirable impacts on the future world. The SIoT aims to enhance the functionality, usability, and effectiveness of IoT systems by leveraging social relationships. By enabling IoT devices to collaborate, share information, and interact based on social context, the SIoT seeks to create more intelligent and adaptable IoT environments capable of addressing diverse user needs and preferences.
Meanwhile, causal reinforcement learning (CRL) [19,20] is a branch of reinforcement learning (RL) that incorporates causal reasoning into the decision-making process. Causal structures in machine learning refer to the graphical representation of causal relationships among variables in a given system. These structures capture the cause–effect relationships between different variables, enabling the identification of causal factors and the prediction of system behavior. Understanding causal structures is essential for making informed decisions, conducting causal inference, and designing effective machine learning models that can accurately capture and leverage causal relationships. In the complex communication environment considered in this paper, actions may not directly lead to observed outcomes; instead, they may influence outcomes through intermediate variables, allowing CRL algorithms to make wiser decisions.
Furthermore, asynchronous advantage actor–critic (A3C) [21] is a reinforcement learning algorithm that combines the advantages of both policy-based and value-based methods. A3C uses asynchronous training to update multiple agents concurrently, allowing for more efficient exploration of the action space and faster convergence to optimal policies. By incorporating an actor–critic architecture, A3C can learn both action policies and value functions simultaneously, leading to more stable and effective learning in complex environments.
The framework proposed in this paper aims to optimize resource allocation, thereby enhancing SIoT support within EV charging networks. By integrating RIS technology for electromagnetic wave control and applying causal RL algorithms, the framework dynamically adjusts resource allocation strategies to adapt to changing conditions in EV charging networks.
1.2. Limitations of RISs and CRL
However, both RIS technology and causal reinforcement learning have limitations.
Firstly, regarding RIS technology, its practical application may encounter some limitations. For example, the deployment of RISs may require a substantial amount of hardware equipment and a complex installation process, which could increase system costs and deployment difficulties. Additionally, the performance of RISs may be influenced by environmental conditions, such as building structures or weather conditions, which could affect the effectiveness of the RIS and consequently degrade the communication quality and reliability.
As for causal reinforcement learning, its limitations primarily manifest in the model complexity and training time. Causal reinforcement learning may necessitate a large amount of data for training, and in complex environments, it may require significant time to converge to optimal solutions. Furthermore, the design and optimization of causal reinforcement learning algorithms may require specialized knowledge and expertise, potentially limiting their application in practical systems.
Moreover, challenges may arise in the interaction and integration of both technologies in practical applications. For instance, effectively integrating RIS technology and causal reinforcement learning algorithms into smart electric vehicle charging communication networks to achieve synergistic effects would require further research and optimization.
In summary, despite the potential advantages of RIS technology and causal reinforcement learning in smart electric vehicle charging communication networks, they also face practical limitations that need to be considered and addressed in real-world applications.
1.3. Related Studies
In a significant study [22], deep reinforcement learning (DRL) was utilized to dynamically configure phase shifts in reconfigurable intelligent surfaces (RISs), leading to enhanced signal coverage, reduced interference, and improved spectral efficiency. Moreover, in [23], an exploration was conducted to leverage deep Q-networks (DQNs) for enhancing RIS-supported massive multi-input multi-output (MIMO) setups. This proposition centered on an adaptable control strategy that dynamically adjusts the phase shifts and beamforming weights associated with the RIS, resulting in notable enhancements in the system’s capacity, coverage, and energy efficiency. In [24], the authors tackle resource allocation hurdles in vehicular communications by employing a multi-agent deep deterministic policy gradient (DDPG) method, where vehicle-to-vehicle (V2V) communications serve as agents utilizing non-orthogonal multiple access (NOMA) [25] technology for spectrum sharing. By approaching the problem as a decentralized discrete-time and finite-state Markov Decision Process (DFMDP) and implementing the DDPG method, the suggested approach optimizes the sum-rate of vehicle-to-infrastructure (V2I) communications while guaranteeing that the latency and reliability requirements are met for safety-critical V2V transmissions in a dynamic vehicular setting.
In recent years, traditional social networking has evolved into more intricate social internetworking, extending beyond human users to objects. Ref. [26] explored the Social Internet of Things (SIoT) and Multiple IoT (MIoT) paradigms, with the SIoT focusing on the technological challenges of interacting IoT devices, while the MIoT delves into the data-driven and semantics-based aspects of smart object interactions; that work investigates the concept of scope in multi-IoT scenarios, proposing formalizations and applications, followed by experiments evaluating its effectiveness compared to existing parameters such as the diffusion degree and influence degree. In [27], the authors propose a symbiotic radio (SR) system that supports both Internet of Things (IoT) and cellular networks, allowing multiple users to receive information from the base station while multi-IoT devices backscatter their data via the same signal. Leveraging robust design methods, the system minimizes the transmit power under cellular outage probability and multi-IoT transmission rate constraints, addressing channel uncertainty and demonstrating effectiveness through simulation results. Ref. [28] proposes an energy- and trust-aware opportunistic routing approach for the cognitive radio Social Internet of Things (CR-SIoT), leveraging network coding and game-theoretic allocation of trusted channels to enhance the network performance, as validated by extensive simulation results. To improve the social edge service (SES) in the Social Internet of Things (SIoT), Ref. [29] proposed a hybrid graph deep learning (HAD) approach that employs an adaptive trust weight (ATW) model and a quotient user-centric coeval learning (QUCL) mechanism, achieving improved communication and computation performance and enhancing SES reliability.
Ref. [30] presents structural causal modeling (SCM) as a method for ecologists to discern cause-and-effect relationships from observational data, overcoming biases common in traditional statistical analyses, such as confounding. Utilizing directed acyclic graphs (DAGs) and graphical rules such as the backdoor and frontdoor criteria, SCM systematically estimates causal effects between variables of interest in ecological studies, showing promise for advancing causal inference without the need for randomized experiments.
However, there are few works that apply the causal structure in the field of communication. The application of causal reinforcement learning in RIS-assisted SIoT communication systems is a promising research direction. RISs can be used to adjust the transmission characteristics of signals to adapt to different communication environments and requirements. Causal reinforcement learning can utilize historical data and environmental feedback to provide intelligent decision support for an RIS, enabling it to adjust its operating mode and parameter settings according to real-time demands and network conditions. By learning based on causal relationships, the system can better understand the impact of RISs and make decisions based on these causal relationships, thereby improving the performance of the communication system.
1.4. Our Contribution
Compared to a previous work [31], this paper differs mainly in the following aspects:
Focus and Background:
This paper focuses on dynamic resource allocation in RIS-assisted electric vehicle charging communication networks under cellular networks, with base stations as the core, especially addressing the wireless communication environment within electric vehicle charging networks. In contrast, the previous work [31] emphasizes dynamic resource allocation in RIS-assisted mobile ad hoc networks (MANETs), particularly in addressing time-varying and uncertain wireless communication environments within multi-mobile ad hoc wireless networks.
Optimization Methods:
This paper proposes an asynchronous advantage actor–critic (A3C) algorithm based on causal factors to optimize communication network resource allocation control. It learns feature representations from incomplete communication environment states to accelerate training, understand the causal relationships in the environment, and transfer training results to similar communication environments. In contrast, the previous work [31] introduces an inner–outer joint online optimization algorithm for RIS-assisted MANETs, utilizing the D-UCB algorithm for RIS and spectrum selection in the outer network and employing the TD3 algorithm to gain decentralized insights into RIS phase shifts and power allocation strategies in the inner network.
Algorithmic Structure:
The CF-A3C algorithm in this paper first acquires causal factors, uses them as the state inputs of the A3C network, and updates the global network using experiences collected by multiple worker threads, eliminating the need for a replay buffer and promoting efficient exploration in resource allocation tasks. Conversely, the TD3 algorithm in the previous work [31] adopts an actor–critic structure with three target networks and two hidden-layer streams in each neural network to separate the state-value and action-value distribution functions, accelerating convergence and enhancing learning efficiency.
This paper presents a novel framework aimed at addressing the challenges associated with integrating the Social Internet of Things (SIoT) with connected electric vehicle (EV) charging networks. This framework harnesses emerging technologies, including reconfigurable intelligent surfaces (RISs), causal structures, and A3C-based reinforcement learning techniques, to optimize resource allocation and enhance SIoT support within EV charging networks.
By integrating RIS technology, which enables control over electromagnetic waves, and applying causal RL algorithms, the framework dynamically adjusts resource allocation strategies to accommodate the evolving conditions in distributed EV charging networks. Importantly, this framework aims to simultaneously meet real-world social requirements, such as fulfilling EV user charging needs, while ensuring efficient utilization of network resources, thereby enhancing communication performance.
The primary achievements elucidated within this manuscript are as follows:
Establishment of a model to represent the fluctuating and uncertain wireless communication setting for managing dynamic resource allocation in RIS-assisted electric vehicle charging communication networks. The model depicts the dynamic resource allocation system operating within an electric vehicle charging network.
Design of a causal inference model capable of reasoning about and addressing causal relationships in the electric vehicle charging communication network by acquiring effective representation distributions.
Proposal of a causal-factor-based asynchronous advantage actor–critic (A3C) algorithm based on the designed causal factor model for optimizing communication network resource allocation control. The feature representations are derived from learning the incomplete communication environment states. This method introduces a novel approach to training actor–critic networks, known as A3C, by directly updating global networks using experiences collected by multiple worker threads. By eliminating the need for a replay buffer, the method streamlines training and promotes efficient exploration in resource allocation tasks. This advancement accelerates learning while enhancing overall performance within the A3C framework.
The advantages of our work over existing contributions are presented in Table 1.
The results of experiments conducted in different environments demonstrate that the CF-A3C algorithm is highly competitive with state-of-the-art resource optimization algorithms across multiple evaluation metrics.
2. System and Channel Model
2.1. Scenario Overview
An electric vehicle charging communication network comprises two parts: uplink and downlink communication between the base station and vehicles, and uplink and downlink communication between the base station and charging stations.
User demands are the input information of the network, and the algorithm aims to optimize the channel capacity between base stations and vehicles while maximizing energy efficiency. By dynamically adjusting communication resource allocation strategies to meet the changing needs of different users, and by monitoring and analyzing changes in user demands together with the real-time status of network resources, the algorithm achieves more flexible and efficient resource utilization. Furthermore, RIS technology makes the utilization of network resources more flexible and efficient by adjusting the direction and intensity of electromagnetic wave transmission, enhancing signal coverage and transmission accuracy, and thus improving the network resource utilization efficiency while meeting users’ charging service needs. Additionally, by integrating techniques such as causal reinforcement learning, the algorithm optimizes resource allocation strategies to avoid resource waste and overuse, maximizing network resource utilization while meeting user demands and thereby enhancing overall network efficiency and performance. In summary, the RIS-assisted A3C resource allocation control algorithm effectively balances user demands (electric vehicle charging) with network resource utilization, thereby improving the performance and efficiency of smart electric vehicle charging communication networks.
This paper focuses on the downlink communication process between the base station and vehicles. In this network, RIS-assisted communication can optimize the channel, enhance communication reliability, and improve performance. The communication process in electric vehicle charging involves uplink and downlink transmissions between vehicles and base stations and between charging stations and base stations, with reconfigurable intelligent surface (RIS) assistance. We introduce the communication between vehicles and the base station in the following.
Vehicles Send Data to the Base Station: Vehicles transmit uplink data to the base station through antennas based on their needs. These data may include the current status of the vehicle, charging demands, vehicle location, etc.
The Base Station Sends Data to Vehicles: The base station transmits downlink data to vehicles through antennas. These data may include the status information of charging stations, charging plans, traffic information, etc. If RIS-assisted communication is available, the base station can improve the channel quality and enhance the strength and reliability of signals reaching the vehicles through RISs.
RIS Processing of Downlink Signals: An RIS receives downlink signals sent by the base station and reflects the signals towards the direction of the vehicles based on pre-designed reflection coefficients and phase adjustments to enhance signal reception.
Vehicles Receive Signals: Vehicles receive downlink signals from both the base station and the RIS and utilize the received information to perform corresponding operations, such as adjusting charging behavior, updating charging plans, etc.
During the communication process in both uplink and downlink transmissions, there is interaction and processing of information among the base station, vehicles, and RISs. The base station is responsible for scheduling charging stations, processing information uploaded by vehicles, issuing charging plans, etc. Vehicles are responsible for uploading their own status, location, and other information, as well as receiving charging plans issued by the base station. RISs serve as an intermediary node, responsible for optimizing the channel, enhancing communication reliability and performance, and reflecting signals from the base station to vehicles or from vehicles to the base station. Through the above communication process and information interaction, the electric vehicle charging communication network needs to achieve efficient communication between charging stations and vehicles, providing support and optimization for the charging behavior of electric vehicles.
2.2. System Model
Exploring the RIS-enhanced electric vehicle charging network downlink procedure depicted in Figure 1, we observe base stations (BSs) acting as transmitters, equipped with $N$ antennas, and one RIS composed of $M$ element units for support, alongside $L$ single-antenna electric vehicle users (VUs). The communication landscape is challenging, characterized by traffic congestion, buildings, and various obstacles, leading to blocked direct signal links from BSs to electric vehicle users. Consequently, a two-hop communication system is established, necessitating a BS to relay signals through an RIS to reach the users. For user $k$, the received signal at time $t$ is presented as:

$y_k(t) = \mathbf{h}_{r,k}^{H}(t)\,\boldsymbol{\Theta}_k(t)\,\mathbf{G}(t)\,x_k(t) + n_k(t)$  (1)

In this scenario, the transmitted signal on the $k$-th subcarrier is represented as $x_k(t)$, the received signal is denoted by $y_k(t)$, and the additive white noise is represented as $n_k(t)$, following a normal distribution $\mathcal{CN}(0,\sigma^2)$. At time $t$, the line-of-sight channel gain is presented as $\mathbf{h}_{d,k}(t)$, and the non-line-of-sight channel gains are presented as the channel gain matrices from the base station to the RIS relay and from the RIS relay to the vehicle, $\mathbf{G}(t)\in\mathbb{C}^{M\times N}$ and $\mathbf{h}_{r,k}(t)\in\mathbb{C}^{M\times 1}$, respectively. As Figure 1 shows, the direct link is blocked by buildings or other obstacles, so the communication between the base station and vehicle users is through the non-line-of-sight channel. Additionally, for user $k$ at time $t$, the RIS comprises $M$ reflecting elements, represented by the diagonal matrix $\boldsymbol{\Theta}_k(t)$, indicating their corresponding phases. Specifically, it is defined as $\boldsymbol{\Theta}_k(t)=\mathrm{diag}\big(e^{j\theta_1(t)},\ldots,e^{j\theta_M(t)}\big)$. Considering the transmit power $p_k(t)$ from the base station to user $k$, the transmitted data $s_k(t)$, and the beamforming vector $\mathbf{w}_k(t)$ at the base station antennas, the term $x_k(t)$ is expressed as $x_k(t)=\sqrt{p_k(t)}\,\mathbf{w}_k(t)\,s_k(t)$, which is the transmitted signal at time $t$. The following constraints are applied to the transmit power at the base station:

$\sum_{k=1}^{L} p_k(t) \le P_{\max},\quad p_k(t)\ge 0$  (2)

where $k\in\{1,\ldots,L\}$, the total transmit power is represented as $\sum_{k=1}^{L} p_k(t)$, and $P_{\max}$ represents the maximum transmit power.
2.3. RIS-Assisted Wireless Channel
We need to model two types of dynamic wireless channels in the system: one is the channel from the base station to the RIS relay, denoted as $\mathbf{G}(t)$, and the other is the channel from the RIS relay to individual vehicle users (VUs), denoted as $\mathbf{h}_{r,k}(t)$. The base station to RIS channel model can be shown as:

$\mathbf{G}(t) = g_{0}(t)\,\mathbf{a}_{R}(\phi_{R})\,\mathbf{a}_{B}^{H}(\phi_{B})$  (3)

Here, $g_{0}(t)$ represents the time-varying channel gain from the base station to the RIS relay. For the transmission data process from the BS to the RIS relay, the array response vectors for the multi-element RIS and the multi-antenna BS are denoted as $\mathbf{a}_{R}(\phi_{R})\in\mathbb{C}^{M\times 1}$ and $\mathbf{a}_{B}(\phi_{B})\in\mathbb{C}^{N\times 1}$, respectively, where $\phi_{R}$ and $\phi_{B}$ are the corresponding angles of arrival and departure. Next, the wireless channel model from the RIS relay to the user equipment ($\mathbf{h}_{r,k}(t)$) is described as follows:

$\mathbf{h}_{r,k}(t) = g_{k}(t)\,\mathbf{a}_{R}(\phi_{k})$  (4)

Here, $g_{k}(t)$ characterizes the time-varying channel gain from the RIS relay to vehicle user $k$ at time $t$ ($k\in\{1,\ldots,L\}$), and $\mathbf{a}_{R}(\phi_{k})$ represents the multi-antenna array response vector from the RIS relay to vehicle user $k$, with $\phi_{k}$ the corresponding angle of departure.

In the context of the non-line-of-sight (NLOS) situation of the communication systems, the time-varying signal-to-interference-plus-noise ratio (SINR) for user $k$ (where $k\in\{1,\ldots,L\}$) can be obtained as follows:

$\gamma_{k}(t)=\dfrac{p_{k}(t)\,\big|\mathbf{h}_{r,k}^{H}(t)\,\boldsymbol{\Theta}_{k}(t)\,\mathbf{G}(t)\,\mathbf{w}_{k}(t)\big|^{2}}{\sum_{j\neq k} p_{j}(t)\,\big|\mathbf{h}_{r,k}^{H}(t)\,\boldsymbol{\Theta}_{k}(t)\,\mathbf{G}(t)\,\mathbf{w}_{j}(t)\big|^{2}+\sigma^{2}}$  (5)

Moreover, the spectral efficiency (SE) of the real-time system, measured in bps/Hz, can be expressed as:

$\mathrm{SE}(t)=\sum_{k=1}^{L}\log_{2}\big(1+\gamma_{k}(t)\big)$  (6)
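To make the roles of the RIS phase matrix, the beamformers, and the cascaded channel concrete, a minimal numerical sketch is given below. It assumes randomly generated channels, the single-reflection model of (1)–(6), and illustrative values for $N$, $M$, $L$, the transmit powers, and the noise power; none of these values come from the paper’s simulation setup.

import numpy as np

rng = np.random.default_rng(0)
N, M, L = 8, 32, 4            # BS antennas, RIS elements, vehicle users (illustrative)
sigma2 = 1e-3                 # noise power (assumed)
p = np.full(L, 0.5)           # per-user transmit power in W (assumed)

G = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)    # BS -> RIS
h_r = (rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))) / np.sqrt(2)  # RIS -> VUs
theta = rng.uniform(0, 2 * np.pi, M)                    # RIS phase shifts
Theta = np.diag(np.exp(1j * theta))                     # diagonal phase-shift matrix
W = (rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))) / np.sqrt(2)
W /= np.linalg.norm(W, axis=0, keepdims=True)           # unit-norm beamformers

H_eff = h_r.conj() @ Theta @ G                          # effective cascaded channel, L x N
gains = np.abs(H_eff @ W) ** 2                          # |h_k^H Theta G w_j|^2 for all (k, j)

sinr = np.empty(L)
for k in range(L):
    desired = p[k] * gains[k, k]
    interference = np.sum(np.delete(p * gains[k, :], k))
    sinr[k] = desired / (interference + sigma2)         # Eq. (5)

se = np.sum(np.log2(1.0 + sinr))                        # sum spectral efficiency, Eq. (6)
print(f"per-user SINR: {sinr}")
print(f"sum SE: {se:.2f} bps/Hz")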
2.4. Causal Factors in RL Structure
2.4.1. Causal Graph
A directed acyclic graph (DAG) [32] is a finite graph $G=(V,E)$ consisting of a set of vertices $V$ and a set of directed edges $E$, where each edge $e\in E$ is an ordered pair $(u,v)$ indicating a direct connection from vertex $u$ to vertex $v$. A DAG does not contain any directed cycles, meaning there is no sequence of edges that starts and ends at the same vertex by following the direction of the edges. Assigning a value to a particular variable $X$ is denoted as an action or intervention. Let $\mathrm{Pa}(X)$ denote the parent nodes of variable $X$; if variable $X$ undergoes an intervention, according to the backdoor criterion, all edges from $\mathrm{Pa}(X)$ to $X$ are eliminated.
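As an illustration of this graph-surgery view of interventions, the short sketch below encodes a toy DAG as a parent dictionary and removes the incoming edges of an intervened variable; the variable names are purely illustrative and are not part of the paper’s causal model.

# Toy DAG: Pa(channel) = {obstacle}, Pa(reward) = {channel, power} (illustrative names)
dag = {
    "obstacle": [],
    "power": [],
    "channel": ["obstacle"],
    "reward": ["channel", "power"],
}

def intervene(graph, variable):
    """Return the mutilated graph after do(variable): drop the edges Pa(variable) -> variable."""
    mutilated = {v: list(parents) for v, parents in graph.items()}
    mutilated[variable] = []          # the intervened variable no longer depends on its parents
    return mutilated

print(intervene(dag, "channel"))      # 'channel' keeps no parents after the intervention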
2.4.2. Structural Causal Model
The wireless environment exhibits ubiquitous causality, leading to causal changes in the wireless channel over time. Obtaining the causality of the time-varying wireless channel enables efficient modeling even with limited channel measurements. The key to representing wireless channel causality lies in developing suitable structural causal models (SCMs) [33]. In this paper, we denote the SCM as $\mathcal{M}$, which is a tuple $\langle \mathcal{U},\mathcal{V},\mathcal{F},P(\mathcal{U})\rangle$. $\mathcal{V}$ represents a set of endogenous variables, i.e., variables influenced by other variables in the study. $\mathcal{U}$ represents a set of exogenous variables, i.e., variables in the study not influenced by other variables. A set of structural functions determining $\mathcal{V}$ is defined as $\mathcal{F}$. $P(\mathcal{U})$ represents the distribution over $\mathcal{U}$.
Next, we formalize the multi-RIS assisted wireless system in the causal domain as Figure 2 shows.
2.5. Analysis of Reinforcement Learning and the SCM
Structural causal models (SCMs) play a crucial role in causal reinforcement learning (CRL) by providing a formal framework for representing the causal mechanisms underlying the environment. SCMs encode how variables in the environment interact with each other to produce observed outcomes, allowing agents to reason about causal relationships and make informed decisions, as in Figure 3. The following is a detailed explanation of the role of SCMs in CRL.
Representation of Causal Mechanisms: SCMs define the structural relationships between variables in the environment, including actions, states, and rewards. They specify how changes in one variable affect other variables, capturing the causal mechanisms that govern the dynamics of the environment. For example, an SCM might describe how taking certain actions influences the subsequent states of the environment and the resulting rewards received by the agent.
Causal Graph Construction: SCMs provide the foundation for constructing causal graphs, which represent the causal relationships between variables in the environment. Each node in the causal graph corresponds to a variable, and directed edges indicate causal influences between variables. SCMs specify the structure of the causal graph by defining the parents of each variable, reflecting the direct causal dependencies between variables.
Counterfactual Reasoning: SCMs enable counterfactual reasoning, allowing agents to reason about alternative scenarios and assess the causal effects of different actions. By manipulating the structural equations in an SCM, agents can simulate hypothetical interventions and predict how the environment would have behaved under different conditions. This allows agents to evaluate the causal consequences of their actions and make decisions that maximize expected rewards.
Policy Evaluation: SCMs facilitate the evaluation of policies by estimating the expected rewards associated with different action sequences. By simulating the causal mechanisms specified in an SCM, agents can compute the expected cumulative reward obtained by following a particular policy in a given environment. This allows agents to compare the effectiveness of different policies and select the one that maximizes long-term rewards.
Causal Inference: SCMs support causal inference by providing a formal framework for estimating causal effects from observational data or interventions. Agents can use techniques such as do-calculus or structural equation modeling to infer causal relationships from observed data and learn the structural parameters of the environment. This allows agents to build accurate causal models of the environment and make better decisions based on causal understanding.
In summary, SCMs play a central role in CRL by formalizing the causal relationships between variables in the environment, guiding decision-making through counterfactual reasoning and policy evaluation and facilitating causal inference from observational data. By leveraging SCMs, agents can acquire a deeper understanding of the causal structure of their environment and make more informed and effective decisions in complex and uncertain scenarios.
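To ground these points, the following sketch builds a two-variable SCM with explicit structural functions, samples observational data, and compares it with data generated under a do-intervention; all structural equations and noise distributions here are invented for illustration and do not correspond to the paper’s wireless SCM.

import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n, do_x=None):
    """Sample a toy SCM: U ~ N(0,1), X = U + noise (or do(X) = do_x), R = 2*X - U + noise."""
    u = rng.standard_normal(n)                       # exogenous (confounding) variable
    x = do_x * np.ones(n) if do_x is not None else u + 0.1 * rng.standard_normal(n)
    r = 2.0 * x - u + 0.1 * rng.standard_normal(n)   # reward depends on X and on the confounder U
    return x, r

x_obs, r_obs = sample_scm(10_000)                    # observational regime
x_int, r_int = sample_scm(10_000, do_x=1.0)          # interventional regime: do(X = 1)

# Naively conditioning on X ~ 1 is biased by the confounder U; the intervention is not.
mask = np.abs(x_obs - 1.0) < 0.05
print(f"E[R | X ~ 1] (observational): {r_obs[mask].mean():.2f}")
print(f"E[R | do(X = 1)]            : {r_int.mean():.2f}")   # close to the causal value 2.0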
3. Problem Formulation
Developing an efficient resource allocation algorithm presents a considerable challenge, primarily attributed to the mobility of users and the inherent uncertainty of the wireless channel. However, by integrating causality into the framework, we can potentially alleviate these challenges, as causality provides a means to better capture and understand uncertainties within the system.
Initially, we conduct an analysis of power consumption throughout the resource allocation process and formulate the optimization problem with a focus on enhancing energy efficiency. Subsequently, leveraging the causal structure within reinforcement learning (RL), we reframe the problem into a causal Markov Decision Process (MDP). This approach enables us to incorporate causal relationships among variables, facilitating more informed decision-making in dynamic environments.
To address the resource allocation optimization dynamically, we propose an actor–critic reinforcement learning algorithm tailored to the causal MDP framework. By iteratively refining the policy through actor–critic updates, our algorithm aims to learn optimal resource allocation strategies that balance energy efficiency and performance.
Furthermore, we delve into the intricacies of our proposed algorithm, providing detailed explanations of its components, such as actor and critic networks, reward functions, and exploration strategies. Additionally, we discuss the training procedure and potential extensions or enhancements to our approach.
3.1. Power Consumption
The total power dissipated in the system, encompassing $K$ users, comprises various components, including the base station transmit power ($p_k$), hardware static power at the base station ($P_{\mathrm{BS}}$), power consumed by the RIS relay ($P_{\mathrm{RIS}}$), and power consumption at the user equipment ($P_{\mathrm{UE}}$). With these components considered, the total power operating on the RIS-assisted wireless network can be defined as follows:

$P_{\mathrm{total}}(t)=\dfrac{1}{\nu}\sum_{k=1}^{K}p_{k}(t)+P_{\mathrm{BS}}+P_{\mathrm{RIS}}+K\,P_{\mathrm{UE}}$  (7)

where $\nu$, with $0<\nu\le 1$, evaluates the ability to effectively convert input electrical power into output radio frequency (RF) power by the power amplifier.
Considering (7) as the denominator of the energy efficiency (EE) function, the EE performance $\eta_{\mathrm{EE}}$, with $B$ presenting the bandwidth, can be obtained using (6) and (7) as

$\eta_{\mathrm{EE}}(t)=\dfrac{B\,\mathrm{SE}(t)}{P_{\mathrm{total}}(t)}$  (8)
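The energy efficiency of Equation (8) can be evaluated as in the short sketch below; the bandwidth, static power terms, amplifier efficiency, and the stand-in spectral efficiency value are placeholder assumptions chosen for illustration only.

import numpy as np

p = np.full(4, 0.5)       # per-user transmit powers in W (assumed, as in the earlier sketch)
se = 10.0                 # sum spectral efficiency in bps/Hz (stand-in for Eq. (6))
B = 1e6                   # bandwidth in Hz (assumed)
nu = 0.8                  # power amplifier efficiency (assumed)
P_bs, P_ris, P_ue = 1.0, 0.5, 0.1   # static power terms in W (assumed)

P_total = p.sum() / nu + P_bs + P_ris + p.size * P_ue   # Eq. (7)
ee = B * se / P_total                                   # Eq. (8), in bit/J
print(f"total power: {P_total:.2f} W, energy efficiency: {ee:.1f} bit/J")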
3.2. Optimal Problem Formulation
As depicted in Figure 2, the goal is to maximize the energy efficiency $\eta_{\mathrm{EE}}$ by jointly optimizing the transmit power $p_k$ from the BS, the RIS selection, and the phase shift matrix $\boldsymbol{\Theta}$ from the RIS.
The Markov Decision Process (MDP) formulation encompasses essential components, including the state, action, transition probability function, reward, and environment. These components are elucidated as follows:
State space: Consider $S$ as the state space, which comprises the following constituents: (i) the channel gains for the communication links, $\mathbf{G}(t)$ and $\mathbf{h}_{r,k}(t)$; (ii) the velocity and position of the intelligent vehicle agents, $v_k(t)$ and $q_k(t)$; (iii) the actions involving the configuration of the phase shifting for the RIS components and the power distribution $p_k$ implemented at time $t-1$; and (iv) the energy efficiency at time $t-1$. Hence, $S$ encompasses:

$s(t)=\big\{\mathbf{G}(t),\,\mathbf{h}_{r,k}(t),\,v_k(t),\,q_k(t),\,a(t-1),\,\eta_{\mathrm{EE}}(t-1)\big\}$  (9)
Action space: Symbolized as $A$, the action space encompasses the array of actions available to the agent. It includes the manipulation of the phase shifting for the individual RIS components and the adjustment of the transmission power at the base station. The action $a(t)$ is expressed as:

$a(t)=\big\{\theta_1(t),\ldots,\theta_M(t),\,p_1(t),\ldots,p_L(t)\big\}$  (10)
Transition Probability Function ($P$): This characterizes the likelihood of transitioning between states under a particular action. Formally, it is represented as $P(s'\mid s,a)$, where $s'$ denotes the subsequent state, $s$ denotes the current state, and $a$ signifies the action.
Reward function: The agent is provided with an immediate reward $r(t)$, representing the energy efficiency as defined in Equation (8).
Value Function: This indicates the anticipated cumulative reward, originating from a specific state according to a predetermined policy $\pi$. Employing $V^{\pi}(s_t)$ to symbolize the value function at time step $t$ under policy $\pi$, it signifies the anticipated total of rewards beginning from $s_t$ and extending to the conclusion of the episode:

$V^{\pi}(s_t)=\mathbb{E}_{\pi}\Big[\sum_{i=t}^{T}\gamma^{\,i-t}\,r_i \,\Big|\, s_t\Big]$  (11)
The Q-value function denotes the anticipated return, commencing from a designated state $s_t$, executing a particular action $a_t$, and subsequently adhering to policy $\pi$. It is articulated as follows:

$Q^{\pi}(s_t,a_t)=\mathbb{E}_{\pi}\Big[\sum_{i=t}^{T}\gamma^{\,i-t}\,r_i \,\Big|\, s_t,\,a_t\Big]$  (12)
In line with the foundational principles of optimal control theory [34], the optimal value function, along with the optimal policies for optimal resource allocation modulation, can be derived as follows:

$V^{*}(s_t)=\max_{\pi_{p},\,\pi_{\Theta}}\,Q^{\pi}(s_t,a_t)$  (13)

Here, $\pi_{p}$ and $\pi_{\Theta}$ represent the resource allocation control policies. Specifically, $\pi_{p}$ pertains to the transmit power control policy, while $\pi_{\Theta}$ corresponds to the RIS phase shift control policy.
Additionally, adhering to Bellman’s principle of optimality [35], the dynamic representation of the finite horizon optimal cost function unfolds as follows:

$V^{*}(s_t)=\max_{a_t}\Big\{r(s_t,a_t)+\gamma\sum_{s_{t+1}}P(s_{t+1}\mid s_t,a_t)\,V^{*}(s_{t+1})\Big\}$  (14)
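For a small, fully discretized version of such an MDP, the Bellman recursion in (14) can be evaluated directly by value iteration; the sketch below uses a randomly generated transition tensor and reward table as stand-ins for the actual RIS/power dynamics, purely to illustrate the backup.

import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 6, 4, 0.9          # illustrative sizes and discount factor

P = rng.random((n_states, n_actions, n_states)) # stand-in transition probabilities P(s'|s,a)
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))           # stand-in rewards r(s,a), e.g. energy efficiency

V = np.zeros(n_states)
for _ in range(500):                            # value iteration: repeated Bellman backups, Eq. (14)
    Q = r + gamma * P @ V                       # Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                       # greedy policy over the discretized actions
print("optimal values:", np.round(V, 3))
print("greedy actions:", policy)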
3.3. Causal MDP Formulation
In this section, we introduce causal factors that influence the state variables within the framework. Actions in causal MDPs are depicted as interventions. At each state $s$, we construct the reward graph $\mathcal{G}_{s}$. The variables representing rewards are $R$, and the states are denoted by $S$. The agent can modify the variables $X=\{X_1,\ldots,X_N\}$, but it is not allowed to intervene on the parent variables of $R$ (represented as $\mathrm{Pa}(R)$) or the parent variables of $S$ (represented as $\mathrm{Pa}(S)$). As described in Section 2.4, an intervention applied to an action of size $m$ is symbolized as $do(\mathbf{X}=\mathbf{x})$, where $\mathbf{X}\subseteq\{X_1,\ldots,X_N\}$ and $|\mathbf{X}|=m$.
At each state $s$, the causal graph $\mathcal{G}_{s}$ includes the variables $\{X,\,\mathrm{Pa}(R),\,R,\,\mathrm{Pa}(S),\,S\}$. It is important to note that the identity of the variables in the causal graphs remains consistent across states, although the underlying distributions may vary. Detailed explanations of these notations are provided in Figure 2. Utilizing this causal knowledge, we define the transition probability function as $P\big(s'\mid s,\,do(\mathbf{X}=\mathbf{x})\big)$, and the reformulation of the reward function becomes:

$r(s,a)=\mathbb{E}\big[R \,\big|\, s,\,do(\mathbf{X}=\mathbf{x})\big]$  (15)
Consider the function $\bar{r}(s,z)=\mathbb{E}\big[R \mid s,\,\mathrm{Pa}(R)=z\big]$, which gives the expected reward for a given state and parent pair. We then introduce the Q-value function, denoted as $Q^{\pi}(s,z)$, which gives the total rewards we can expect under a policy $\pi$, starting from a given state $s$ and a parent state $z$ and continuing until the episode ends:

$Q^{\pi}(s,z)=\mathbb{E}_{\pi}\Big[\sum_{i=t}^{T}\gamma^{\,i-t}\,r_i \,\Big|\, s_t=s,\ \mathrm{Pa}(R)=z\Big]$  (16)

According to the law of total probability, $r(s,a)$ can be expressed as the sum, over all possible values $z$ of the causal factor $Z$, of the conditional probability of $Z$ being $z$ given the current state $s$ and action $a$, denoted by $P(Z=z\mid s,a)$, multiplied by the expected reward $\bar{r}(s,z)$:

$r(s,a)=\sum_{z\in Z}P\big(Z=z\mid s,a\big)\,\bar{r}(s,z)$  (17)

Expressing $r(s,a)$ in this way is a consequence of considering an MDP enhanced with the dynamic causal graphs $\mathcal{G}_{s}$ and the causal factor $Z$. This formulation can be captured by the tuple $\langle S, A, Z, P, r, \gamma\rangle$.
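A small numerical sketch of Equation (17) is given below: it assumes the causal factor Z takes a handful of discrete values, with an invented conditional distribution P(Z = z | s, a) and expected-reward table r̄(s, z), and simply marginalizes over Z.

import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, n_z = 5, 3, 4               # illustrative sizes

p_z = rng.random((n_states, n_actions, n_z))     # stand-in P(Z = z | s, a)
p_z /= p_z.sum(axis=2, keepdims=True)
r_bar = rng.random((n_states, n_z))              # stand-in expected reward r_bar(s, z)

def causal_reward(s, a):
    """Eq. (17): marginalize the parent-conditioned reward over the causal factor Z."""
    return np.dot(p_z[s, a], r_bar[s])

print(f"r(s=0, a=1) = {causal_reward(0, 1):.3f}")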
4. Causal MDP-Based RIS-Assisted Resource Allocation Optimization with Online Reinforcement Learning
An ensemble DNN algorithm is provided to address the problem of estimating the probability distribution of the causal factor Z required in Equation (17). Then, based on the obtained causal information, to tackle Equation (15), incorporating the intervened actions from (10), we employ the asynchronous advantage actor–critic (A3C) algorithm to optimize the entire network output. The structure of our proposed algorithm is illustrated in Figure 4.
4.1. Causal Factor Encoder and Decoder with a Deep Neural Network
There are $L$ electric vehicle (EV) users in the charging network and $L$ corresponding subcarriers in the network, where each subcarrier’s state information is represented as the concatenation of the state $s$ described in Equation (9) and the action $a$ described in Equation (10), $x_i=[s_i,a_i]\in\mathbb{R}^{d}$, where $i\in\{1,\ldots,L\}$ and $d$ is the dimensionality of the state features. The input sequence is represented as $X=[x_1,\ldots,x_L]$. The corresponding feature representation $z_i$ can be regarded as the global environmental information for subcarrier $i$, where $x_j$ is subcarrier $j$’s state and $\alpha_{ij}$ is a weight for $x_j$:

$z_i=\sum_{j\in\mathcal{N}_i}\alpha_{ij}\,W x_j$

where the attention weights $\alpha_{ij}$ are computed from the embeddings $W x_i$ and $W x_j$ using the attention vector $l$.
For the state set of subcarrier $i$’s neighborhood $\mathcal{N}_i$, i.e., $\{x_j : j\in\mathcal{N}_i\}$, we employ fully connected (FC) layers (i.e., a shared weight matrix $W$, where $d$ is set to 256 in the simulation section) to transform the input state $x_j$ of each subcarrier $j$ and then obtain the embedding $e_j=W x_j$. Thus, there will be a total of $|\mathcal{N}_i|$ embeddings, i.e., $\{e_j : j\in\mathcal{N}_i\}$. Finally, an attention mechanism layer aggregates all these embeddings to obtain $z_i$, which can be considered as the current global embedding of subcarrier $i$. The network architecture is shown in Figure 5. The implementation is provided in Algorithm 1 below.
Algorithm 1 Causal factor extraction based on a self-attention mechanism
1: Input: State set of each subcarrier $x_i$, $i\in\{1,\ldots,L\}$; subcarrier neighborhood sets $\mathcal{N}_i$; weight matrix $W$; attention vector $l$
2: Position Encoding:
3: Initialize the position encoding for each state vector to consider the sequence order.
4: Encoder:
5: for each time step $t$ do
6:   Concatenate the state vectors of all subcarriers into a matrix $X$.
7:   for each encoder layer do
8:     Compute the self-attention scores for each subcarrier’s state vector.
9:     Apply the self-attention mechanism to obtain the weighted sum of each subcarrier’s state representation.
10:    Pass the output through a feed-forward neural network layer to capture non-linear relationships.
11:   end for
12:   Obtain the causal factor $z_i$ for each subcarrier from the final encoder layer output.
13: end for
14: return Causal factor $z_i$ for each subcarrier.
After the last hidden layer, the extracted causal factor representation for the system can be utilized for subsequent tasks, which is the resource allocation optimization of the communication network. The algorithm is provided in the following.
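A compact NumPy sketch of the causal factor extraction step described above is shown below. It implements the shared linear embedding, an attention-vector-based scoring of neighbor embeddings, and the weighted aggregation into z_i; the dimensions, the neighborhood definition (every subcarrier attends to all others), and the exact scoring function are illustrative assumptions rather than the paper’s exact architecture.

import numpy as np

rng = np.random.default_rng(4)
L_users, d, d_out = 4, 16, 8                 # subcarriers, state dim, embedding dim (illustrative)

X = rng.standard_normal((L_users, d))        # per-subcarrier states x_i = [s_i, a_i]
W = rng.standard_normal((d_out, d)) * 0.1    # shared FC weight matrix
l = rng.standard_normal(2 * d_out) * 0.1     # attention vector over concatenated embeddings

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

E = X @ W.T                                  # embeddings e_j = W x_j, shape (L, d_out)

Z = np.empty_like(E)
for i in range(L_users):                     # here every subcarrier attends to all subcarriers
    scores = np.array([l @ np.concatenate([E[i], E[j]]) for j in range(L_users)])
    alpha = softmax(scores)                  # attention weights alpha_ij
    Z[i] = alpha @ E                         # causal factor z_i = sum_j alpha_ij * e_j

print("causal factors Z shape:", Z.shape)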
4.2. An Algorithm Utilizing Causal Factor-A3C for RIS Phase Shifting and Power Allocation
The overall concept is shown in Figure 6. The causal factor extraction network learns from the SCM of the environment, and its output is used to train the agent networks.
The A3C algorithm utilizes an asynchronous architecture, enabling parallel training across multiple worker threads: each worker thread interacts with its own copy of the environment and updates the global actor and critic networks asynchronously.
At the core of the A3C method lies the central actor network, responsible for determining which actions to take based on the current situation represented by the causal factor $z$. Guided by its parameters $\theta_{a}$, this network is fine-tuned to select the best possible action given the state, denoted as $\pi(a\mid z;\theta_{a})$. Meanwhile, the critic network evaluates the value of the chosen actions by estimating the Q-value function $Q(z,a;\theta_{c})$, where $\theta_{c}$ represents the critic network’s parameters. These parameters are adjusted during training to capture the overall long-term rewards.
In our A3C approach (Algorithm 2), we directly update the global actor and critic networks based on experiences collected by multiple worker threads. The following describes how the A3C network is trained without a replay buffer.
Algorithm 2 CF-A3C-based resource allocation optimization algorithm
1: Input: Global network parameters $\theta_{a}$, $\theta_{c}$; number of threads $N_{th}$
2: Initialize global network parameters $\theta_{a}$, $\theta_{c}$
3: for thread = 1 to $N_{th}$ in parallel do
4:   Initialize thread-specific environment and network parameters
5:   Initialize episode counter
6:   while not done do
7:     Receive initial observation state $s_1$
8:     Put state $s_1$ into the causal factor extraction network to obtain causal factor $z_1$
9:     for t = 1 to $T$ do
10:      Select action $a_t$ using policy network $\pi(a_t\mid z_t;\theta_{a})$
11:      Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
12:      Put state $s_{t+1}$ into the causal factor extraction network to obtain causal factor $z_{t+1}$
13:      Perform gradient ascent on the global network parameters using the observed transition
14:      Synchronize the local network parameters with the global network parameters
15:    end for
16:    Increment episode counter
17:  end while
18: end for
Global actor and critic networks are initialized with parameters $\theta_{a}$ and $\theta_{c}$. Multiple worker threads are created, each with its own instance of the environment and local actor and critic networks. Each worker thread interacts asynchronously with its environment. At each time step $t$, the worker receives the current state $s_t$ from the environment and puts it into the causal factor extraction network to obtain the causal output $z_t$. It then selects an action $a_t$ according to the policy $\pi(a_t\mid z_t;\theta_{a})$, executes the action, and observes the reward $r_t$ and the next state $s_{t+1}$; it also obtains $z_{t+1}$ through the same process. The worker stores the experience tuple $(z_t, a_t, r_t, z_{t+1})$, calculates the advantage function using the collected batch, updates the actor parameters $\theta_{a}$, and calculates the loss for the critic network using the collected batch to update the critic parameters $\theta_{c}$.
The gradients computed by each worker are applied asynchronously to the global actor and critic networks. The global networks are updated using the gradients from each worker, ensuring that all workers are trained on the most recent version of the networks.
This approach eliminates the need for a replay buffer, allowing for direct updates to the global networks based on fresh experiences collected by multiple workers. It promotes efficient exploration, accelerates training, and improves the overall performance by leveraging the A3C framework in optimizing resource allocation tasks.
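To illustrate the per-worker update loop, a minimal single-thread sketch of the CF-A3C idea is given below (PyTorch). It assumes a stand-in environment, a stand-in causal factor encoder (not trained here), a discrete action space, and arbitrary hyperparameters; the real algorithm runs many such workers asynchronously against the global parameters, and none of the dimensions or values here come from the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, causal_dim, n_actions, gamma, T = 12, 8, 5, 0.99, 20   # illustrative

encoder = nn.Sequential(nn.Linear(state_dim, causal_dim), nn.ReLU())  # stand-in causal factor extractor

class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(causal_dim, 64), nn.ReLU())
        self.policy = nn.Linear(64, n_actions)    # actor head: pi(a | z; theta_a)
        self.value = nn.Linear(64, 1)             # critic head: V(z; theta_c)
    def forward(self, z):
        h = self.body(z)
        return torch.distributions.Categorical(logits=self.policy(h)), self.value(h).squeeze(-1)

global_net = ActorCritic()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)

def env_step(state, action):
    """Stand-in environment: random next state, reward mimicking an energy-efficiency signal."""
    next_state = torch.randn(state_dim)
    reward = torch.rand(()).item()
    return next_state, reward

# One worker rollout followed by a direct (replay-buffer-free) update of the global network.
state = torch.randn(state_dim)
log_probs, values, rewards = [], [], []
for t in range(T):
    z = encoder(state)                           # causal factor z_t from the extraction network
    dist, v = global_net(z)
    action = dist.sample()
    state, reward = env_step(state, action.item())
    log_probs.append(dist.log_prob(action))
    values.append(v)
    rewards.append(reward)

returns, R = [], 0.0
for r in reversed(rewards):                      # discounted returns for the fresh rollout
    R = r + gamma * R
    returns.insert(0, R)
returns = torch.tensor(returns)
values = torch.stack(values)
advantages = returns - values.detach()           # advantage estimate A_t = R_t - V(z_t)

actor_loss = -(torch.stack(log_probs) * advantages).mean()
critic_loss = (returns - values).pow(2).mean()
loss = actor_loss + 0.5 * critic_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # asynchronous workers apply such gradients to the global net
print(f"actor loss {actor_loss.item():.3f}, critic loss {critic_loss.item():.3f}")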