Article

Machine-Learning-Based Optimization for Multiple-IRS-Aided Communication System

School of Information and Electronics Engineering, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(7), 1703; https://doi.org/10.3390/electronics12071703
Submission received: 15 March 2023 / Revised: 26 March 2023 / Accepted: 28 March 2023 / Published: 4 April 2023

Abstract
Due to their benefits in spectrum and energy efficiency, intelligent reflecting surfaces (IRSs) are regarded as a promising technology for future networks. In this work, we consider a single cellular network where multiple IRSs are deployed to assist the downlink transmissions from the base station (BS) to multiple user equipment (UE). Hence, we aim to jointly optimize the configuration of the BS active beamforming and the reflection beamforming of the IRSs so that the UEs' QoS requirements are met with the lowest transmit power consumption at the BS. Although the conventional alternating approach is widely used to find converged solutions, its applicability is restricted by high complexity, which is more severe in a dynamic environment. Consequently, an alternative approach, i.e., machine learning (ML), is adopted to find the optimal solution with lower complexity. For the static UE scenario, we propose a low-complexity optimization algorithm based on the new generalized regression neural network (GRNN). Meanwhile, for the dynamic UE scenario, we propose a deep reinforcement learning (DRL)-based optimization algorithm. Specifically, a deep deterministic policy gradient (DDPG)-based algorithm is designed to address the GRNN algorithm's restrictions and efficiently handle the dynamic UE scenario. Simulation results confirm that the proposed algorithms achieve better power-saving performance and convergence with a noteworthy reduction in computation time compared to the alternating optimization-based approaches. In addition, our results show that the total transmit power at the BS decreases with an increasing number of reflecting units at the IRSs.

1. Introduction

With the increasing demands for higher data rates, higher spectrum and energy efficiency, and ubiquitous connectivity for the beyond-fifth-generation and sixth-generation wireless communication networks [1], many improved wireless technologies have been proposed (e.g., unmanned aerial vehicle communication, satellite communication, Terahertz communication, etc.). Among them, the intelligent reflecting surface (IRS) is regarded as a promising and efficient solution, which uses software-controlled signal reflection to reconfigure the wireless propagation environment [2]. Specifically, an IRS is made up of many reflectors with reflecting passive elements, each of which can independently change the corresponding amplitude and phase angles of the incident signals [3]. Via the proper design, the reflected signals can be combined coherently at the intended UE, thereby enhancing the desired signal, and/or destructively at non-intended UEs, thereby eliminating co-channel interference. Thus, the quality of the communication link between the transmitter and the intended receiver can be enhanced. Furthermore, because of the IRSs’ fundamental characteristics and competitive advantages in terms of optimizing energy and spectrum efficiency, as well as deployment costs, the IRS is expected to outperform other related technologies, such as relays, backscatters, and massive MIMO-based active surface systems [4].
The IRS can provide an additional link between the source and end-users, avoiding communication blockages, which generally requires the careful design of the IRS passive beamforming. In other words, the IRS-aided communication system benefits from the passive beamforming gain to maximize and boost the received desired signal power at the UE [5]. Moreover, the BS is generally equipped with multiple antennas, which allows an adequate and proper design of the active beamforming. Hence, the joint design of the beamforming at the BS and the IRS can further enhance the network performance. Many recent studies have adopted conventional optimization methods for the joint beamforming design of IRS-aided wireless networks. More specifically, in [5], the semidefinite relaxation (SDR) technique and an alternating procedure were applied to solve the joint optimization problem of the transmit beamforming and reflecting phase shifts of a single IRS. The joint problem was then addressed in an IRS-assisted NOMA system [6], where the authors presented a difference-of-convex-based approach that overcame the drawbacks of SDR and handled the resulting optimization problem efficiently. In the same context, the authors in [7] jointly optimized the RIS phase shift elements and the corresponding transmit power allocation, where the designed framework was developed based on fractional programming, gradient descent, and alternating maximization methods. In [8], the BS beamforming matrix and IRS phases were optimized through fixed-point iteration and manifold approaches to maximize the received signal power of the MISO system. RIS-aided broadcasting in the physical layer was investigated in [9], where the BS transmit power was minimized while satisfying the UEs' quality-of-service (QoS) requirements. In [10], the authors investigated the problem of maximizing the weighted sum rate in an IRS-enhanced MISO system, where the Lagrangian dual transform and an alternating approach were adopted. In [11], the active precoding of the transmit source and the reflection phases of the IRS were jointly tuned in the ideal CSI scenario using an iterative process, aiming to maximize the total rate. Besides all the above systems, which focus on a single-IRS-assisted scenario, some works have considered multiple-IRS-assisted systems. A closed-form solution and an SDR-based approach were proposed for the joint beamforming design in [12], while the work in [13] utilized Lagrangian and Riemannian manifold techniques to solve for the BS precoding matrix and the phase shifts of the deployed IRSs. Additionally, the authors in [14] introduced a variety of system scenarios, including single- and multi-IRS-assisted systems, and used alternating optimization to solve the designed joint beamforming problems under the presented traffic patterns.
Although the above-presented approaches and algorithms, which adopt traditional optimization-based methods, provide reasonable performance, they are generally constrained in practice by system complexity [7]. Thus, the attention of researchers has been drawn to machine learning (ML) methods because of their exceptional abilities in handling complex practical communication systems [15]. From the same perspective, ML is recognized as a vital enabler for implementing IRS-aided systems because of its critical role in decreasing the optimization complexity [2,16]. Generally, ML learns from fresh data and refines its operations to increase the overall performance, gaining "intelligence" over time to anticipate output values within an acceptable range. A comprehensive survey of recent studies on ML-based IRS-assisted systems can be found in [16].
Among the ML approaches, deep learning (DL) has shown tremendous achievements in current communication systems. Due to its exceptional learning abilities, it has been primarily applied in IRS-aided communication systems to study channel estimation (e.g., [17,18]) and beamforming optimization (e.g., [19,20,21,22]). In terms of channel estimation, the work in [17] presented a DL framework for direct and cascaded channel estimation, where a convolutional neural network-based architecture is modeled and fed by the received pilots. Meanwhile, the authors in [18] proposed a deep denoising neural network for uplink channel estimation in a mmWave IRS-assisted system. Note that channel estimation approaches are strongly related to the channel model of the constructed system, which further influences the prediction result; recent work on channel models suitable for IRS-aided communication systems can be found in [23,24]. In terms of beamforming optimization, the work in [19] designed a deep neural network for the online setting and optimized the beamforming in an indoor environment. In that work, the training model was developed to maximize the final received signal strength by mapping the UE's position data to the optimal IRS phases, and the database used was collected offline before training the designed model. A model based on unsupervised learning was developed for the IRS passive beamforming configuration [20], which had lower computational complexity than the SDR-based model. The authors of [21] used supervised learning to develop an interaction mechanism between the IRS and the incident signals by leveraging the information of sampled channel vectors at the IRS's active units. A deep learning model was presented in [22] for an IRS-assisted system, where the reflection angles of the IRS and the channels were estimated based on the signal received at the IRS, aiding in detecting and estimating symbols through the system. This model can enhance the bit error rate of the system compared to standard detectors.
Besides DL, deep reinforcement learning (DRL), which combines DL and reinforcement learning, is widely adopted in IRS-aided systems for optimization design. Unlike DL, it learns through a trial-and-error mechanism and does not need labeled data for outcome prediction. In the context of the optimization of IRS reflection phases, the authors in [25] proposed a DRL algorithm for predicting the IRS reflection phases with minimal beam training overhead compared to the supervised approach proposed in [21]. In [11], a DRL approach for IRS reflection was presented to optimize the throughput in the imperfect CSI scenario, where the learning model was trained to predict the reflection beamforming based on a quantile regression approach for efficient convergence. Regarding the joint design of active and passive beamforming, in [26], a DRL model was built to estimate the beamforming vectors of the transmit BS and design the IRS reflect beamforming, intending to increase the MISO system sum rate at minimal complexity. The work in [27] focused on a wireless secure communication system aided by an IRS, where the system secrecy rate was improved by optimizing the joint beamforming under different UEs' QoS requirements and time-varying channels. Additionally, prioritized experience replay and post-decision state methods were applied to enhance the learning and secrecy performance.
Note that most of the above studies were conducted under a basic communication system model with only a single reflecting surface deployed or a single active user served, and they neglected the direct communication links between the source and the end-users. Moreover, the applicability of the conventional alternating algorithms to find the optimal beamforming configuration is restricted by their high complexity. Furthermore, some works built and trained their ML-based models to predict outputs based on fixed channel coefficients, or selected their estimated beamforming, active or passive, from a pre-defined codebook; this can be unsuitable for the deployed systems and can even limit the system's performance when multiple antennas and IRSs are involved. Thus, in this paper, we study a wireless communication system aided by multiple IRSs in which the deployed surfaces assist the communication between a source BS and multiple end-users, considering both the direct and indirect links. In wireless cellular networks, transmit power optimization is regarded as one of the most important factors in managing interference, extending the network lifetime, and maintaining connectivity [28]. Due to the battery life limitation at mobile stations, the transmit power has to be controlled to guarantee connectivity with the transmitting source while maintaining the UEs' QoS requirements. Hence, we aim to achieve the optimal configuration of the BS active beamforming and the reflection beamforming of the IRSs that meets the UEs' QoS requirements while allowing the lowest transmit power consumption at the BS. Moreover, we design machine-learning-based approaches to find the optimal solution with lower complexity in both the static UE and dynamic UE scenarios. The main contributions of this work are as follows.
  • To maintain a green wireless communication system, we formulate a minimization problem for the transmit power of the BS by jointly optimizing the beamforming of the BS and reflection phases of multiple IRSs under the QoS constraints of multiple UEs, while considering two scenarios: a static UE scenario and a dynamic UE scenario.
  • Under the static UE scenario, we propose a new GRNN-based algorithm. Under the dynamic UE scenario, since the new GRNN-based algorithm is a supervised-learning-based approach that needs a large data set to reach an optimal regression surface and predicts solutions only for static UE locations, we propose a DDPG-based algorithm that learns a policy for the joint beamforming prediction based on the dynamic surrounding environmental states, which is robust in the wireless environment.
  • Our numerical results demonstrate a power-saving system by adopting the proposed algorithms compared to the benchmark schemes. The time complexity of the algorithms is also presented, which shows the superiority of the proposed ones. Additionally, we show the effectiveness of increasing the number of reflecting elements in reducing the transmit power for the multiple-IRS-aided system.
The remainder of this work is organized as follows: Section 2 describes the system model and the formulated optimization problem. Section 3 presents in detail the proposed algorithm for the static UE scenario, while Section 4 explains the proposed algorithm for the dynamic UE scenario. Then, Section 5 presents the simulation results. Finally, this work is concluded in Section 6.
Notation: Scalars are denoted by italic letters, while vectors and matrices are denoted by boldface. The superscripts $(\cdot)^H$ and $(\cdot)^T$ stand for the Hermitian and transpose operations, respectively, while $\mathbb{C}$ denotes the complex domain. $\mathcal{CN}(\mu, \sigma^2)$ represents a circularly symmetric complex Gaussian (CSCG) distribution with mean $\mu$ and variance $\sigma^2$. The symbols $\mathbb{E}[\cdot]$, $|\cdot|$, and $\|\cdot\|$ represent the expectation, determinant, and Euclidean norm, respectively. Moreover, $\log_2(\cdot)$ denotes the base-2 logarithm, $\mathrm{diag}(\mathbf{x})$ represents a diagonal matrix whose diagonal entries are given by the vector $\mathbf{x}$, $\exp(\cdot)$ is the exponential function, $\nabla f$ stands for the gradient of a function $f$, and $\mathcal{O}(\cdot)$ is Landau's symbol for the complexity order.

2. System Model

2.1. Network Model

As shown in Figure 1, we consider a multiple-IRS-assisted downlink communication system, where multiple IRSs are deployed to aid transmission from the base station (BS) to the user equipment (UE). In this work, we focus on the single-cell scenario. Similar to [5], the interference from other cells is ignored, and the consideration of multiple cells is left for future work. Let the set $\mathcal{L} = \{1, \ldots, L\}$ denote the multiple IRSs and the set $\mathcal{K} = \{1, \ldots, K\}$ denote the multiple UEs, respectively. The BS is assumed to be equipped with $M$ transmit antennas, while each UE is equipped with a single antenna. In addition, each IRS is equipped with $N$ reflecting units that are configured via a smart controller that is wirelessly linked to the BS [29]. Note that these controllers can serve as either receivers for channel estimation or reflectors for relaying signals to UEs [30]. In this paper, similar to [5,6,7,8,9,10,11], all channels are considered to be quasi-static flat-fading, and full CSI is assumed. The CSI can be obtained by using a reduced-length pilot sequence, as in [2]. The consideration of imperfect CSI is left for our future work. Let $\mathbf{h}_{d,k}^H \in \mathbb{C}^{1 \times M}$ denote the direct channel from the BS to the k-th UE, while $\mathbf{G}_l \in \mathbb{C}^{N \times M}$ and $\mathbf{h}_{r,k,l}^H \in \mathbb{C}^{1 \times N}$ are the indirect channels from the BS to the l-th IRS and from the l-th IRS to the k-th UE, respectively.
At the transmitter side, the BS performs a linear precoding transmission with one beamforming vector assigned to each served UE. Thus, the signal transmitted by the BS is expressed as
$$\mathbf{x} = \sum_{k=1}^{K} \mathbf{w}_k s_k,$$
where $\mathbf{w}_k \in \mathbb{C}^{M \times 1}$ denotes the transmit beamforming vector of the k-th UE with data symbol $s_k$. It is assumed that the $s_k$'s are modeled as random variables with zero mean and unit variance, i.e., $\mathbb{E}[|s_k|^2] = 1, \forall k$. Thus, the total power transmitted at the BS is calculated as
$$P_T = \sum_{k=1}^{K} \|\mathbf{w}_k\|^2.$$
At the reflecting side, each IRS receives the signals from the transmitting BS and reflects the tuned signals. Specifically, the incident signal, denoted by $\hat{x}_n$ at the n-th element of the l-th IRS, is multiplied by a constant-modulus reflection coefficient. Thus, the reflected signal of the n-th element at the l-th IRS can be expressed as
$$\hat{y}_{l,n} = e^{j\theta_{l,n}} \hat{x}_n, \quad \forall l, n,$$
where $\theta_{l,n} \in [0, 2\pi)$ denotes the phase angle of the n-th unit at the l-th IRS. As such, the reflected signal at the l-th IRS is expressed as
$$\hat{\mathbf{y}}_l = \mathbf{\Theta}_l \hat{\mathbf{x}}, \quad \forall l,$$
where $\mathbf{\Theta}_l = \mathrm{diag}(e^{j\theta_{l,1}}, \ldots, e^{j\theta_{l,N}}) \in \mathbb{C}^{N \times N}$, $\hat{\mathbf{y}}_l = [\hat{y}_{l,1}, \ldots, \hat{y}_{l,N}]^T$, and $\hat{\mathbf{x}} = [\hat{x}_1, \ldots, \hat{x}_N]^T$.
At the receive side, the k-th UE at location $(x_k, y_k, z_k)$ receives signals from both the BS and the IRSs. Due to the high path loss, only signals reflected once by the IRSs are considered, while signals reflected multiple times are neglected. Hence, the total received signal at the k-th UE is given by
$$y_k = \underbrace{\Big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \Big) \mathbf{w}_k s_k}_{\text{Desired signal}} + \underbrace{\sum_{j \neq k} \Big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \Big) \mathbf{w}_j s_j}_{\text{Interference}} + \underbrace{n_k}_{\text{Noise}}, \quad \forall k \in \mathcal{K},$$
where $n_k \sim \mathcal{CN}(0, \sigma_k^2)$ is additive white Gaussian noise with zero mean and variance $\sigma_k^2$, $\forall k \in \mathcal{K}$. Hence, the signal-to-interference-plus-noise ratio (SINR) measured at the k-th UE is calculated as
$$\mathrm{SINR}_k = \frac{\big| \big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \big) \mathbf{w}_k \big|^2}{\sum_{j \neq k} \big| \big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \big) \mathbf{w}_j \big|^2 + \sigma_k^2}, \quad \forall k \in \mathcal{K}.$$
Additionally, the achievable rate is given by
$$R_k = \log_2(1 + \mathrm{SINR}_k), \quad \forall k \in \mathcal{K}.$$
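To make the model concrete, the following is a minimal NumPy sketch of the received-SINR and achievable-rate computation above, using randomly generated placeholder channels, IRS phases, and beamformers; all dimensions and values are illustrative and are not the paper's simulation settings.

```python
# A minimal sketch of the effective channel, per-UE SINR, and achievable rate.
import numpy as np

M, N, L, K = 4, 8, 3, 4                     # BS antennas, IRS elements, IRSs, UEs (example values)
rng = np.random.default_rng(0)
G  = rng.standard_normal((L, N, M)) + 1j * rng.standard_normal((L, N, M))   # BS -> l-th IRS
hr = rng.standard_normal((L, K, N)) + 1j * rng.standard_normal((L, K, N))   # l-th IRS -> k-th UE
hd = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))         # BS -> k-th UE (direct)
theta = rng.uniform(0.0, 2.0 * np.pi, (L, N))                               # IRS phase shifts
W = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))          # beamformers w_k as columns
sigma2 = 1e-11                                                              # noise power (illustrative)

# Effective channel: h_k^H = sum_l h_{r,k,l}^H diag(e^{j theta_l}) G_l + h_{d,k}^H
H_eff = hd.copy()
for l in range(L):
    H_eff += hr[l] @ np.diag(np.exp(1j * theta[l])) @ G[l]

sinr = np.empty(K)
for k in range(K):
    powers = np.abs(H_eff[k] @ W) ** 2            # |h_k^H w_j|^2 for all j
    sinr[k] = powers[k] / (powers.sum() - powers[k] + sigma2)
rate = np.log2(1.0 + sinr)                        # achievable rate per UE
print(sinr, rate)
```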

2.2. Formulation and Analysis of the Optimization Problem

The focus of this work is to minimize the total power transmitted by the BS through jointly optimizing the active transmit beamforming matrix, $\mathbf{W}$, and the passive beamforming matrices, $\{\mathbf{\Theta}_l\}_{l=1}^{L}$, subject to the users' SINR constraints, which is formulated as follows.
$$(\mathrm{P1}): \quad \min_{\mathbf{W}, \{\mathbf{\Theta}_l\}_{l=1}^{L}} \; \sum_{k=1}^{K} \|\mathbf{w}_k\|^2$$
$$\mathrm{s.t.} \quad \mathrm{SINR}_k \geq \gamma_k, \quad \forall k \in \mathcal{K},$$
$$\phantom{\mathrm{s.t.}} \quad 0 \leq \theta_{l,n} \leq 2\pi, \quad \forall l, n,$$
where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_K] \in \mathbb{C}^{M \times K}$, $\{\mathbf{\Theta}_l\}_{l=1}^{L} = \{\mathbf{\Theta}_1, \ldots, \mathbf{\Theta}_L\}$, and $\gamma_k$ is the SINR threshold of the k-th UE. The problem is non-convex due to the coupling between the optimization variables in the UEs' SINR constraints (8b).
One conventional method for solving the above problem is the alternating optimization (AO) approach, as in [5,14]. Generally, AO separates the main problem into two sub-problems, i.e., one with the IRSs' beamforming fixed and another with the BS beamforming fixed, and iteratively optimizes the solutions until convergence is met. Specifically, by initializing the phases of all IRS matrices, $\{\mathbf{\Theta}_l\}_{l=1}^{L}$, in $(\mathrm{P1})$, the corresponding BS beamforming, $\mathbf{W}$, can be found by solving the following sub-problem $(\mathrm{P2})$:
$$(\mathrm{P2}): \quad \min_{\mathbf{W}} \; \sum_{k=1}^{K} \|\mathbf{w}_k\|^2$$
$$\mathrm{s.t.} \quad \frac{|\mathbf{h}_k^H \mathbf{w}_k|^2}{\sum_{j \neq k} |\mathbf{h}_k^H \mathbf{w}_j|^2 + \sigma_k^2} \geq \gamma_k, \quad \forall k \in \mathcal{K},$$
where $\mathbf{h}_k^H = \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H$ is the total combined channel between the source BS and the k-th UE. A second-order cone program (SOCP) [31] can be used to solve this convex problem.
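As an illustration of how $(\mathrm{P2})$ can be handled in practice, the following is a minimal CVXPY sketch of the standard SOCP reformulation of the per-UE SINR constraints (rotating each $\mathbf{w}_k$ so that $\mathbf{h}_k^H \mathbf{w}_k$ is real), using placeholder channels; the variable names, dimensions, and availability of a suitable conic solver are assumptions, and this is not the exact implementation used in the paper.

```python
# A minimal CVXPY sketch of the SOCP form of (P2) with the IRS phases fixed.
# The combined channels h_k^H are random placeholders; in the paper they would
# be computed from the direct and reflected links.
import numpy as np
import cvxpy as cp

M, K = 4, 4                                   # BS antennas, UEs (example values)
gamma = 10 ** (0 / 10)                        # SINR threshold of 0 dB
sigma = 1e-5                                  # noise standard deviation (illustrative)
rng = np.random.default_rng(0)
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

W = cp.Variable((M, K), complex=True)         # columns are the beamformers w_k
constraints = []
for k in range(K):
    hk = H[k]                                 # effective channel h_k^H (1 x M)
    desired = cp.real(hk @ W[:, k])           # w.l.o.g. rotate w_k so h_k^H w_k is real
    # Stack real/imag parts of the interference terms and the noise std into one vector.
    terms = []
    for j in range(K):
        if j != k:
            terms += [cp.real(hk @ W[:, j]), cp.imag(hk @ W[:, j])]
    terms.append(sigma)
    # Second-order-cone form of the SINR constraint.
    constraints.append(desired >= np.sqrt(gamma) * cp.norm(cp.hstack(terms), 2))

prob = cp.Problem(cp.Minimize(cp.sum_squares(W)), constraints)
prob.solve()
print("minimum BS transmit power:", prob.value)
```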
On the other hand, after optimizing the BS beamforming matrix, $\mathbf{W}$, the IRSs' reflection phase matrices are optimized using the following reduced feasibility check problem $(\mathrm{P3})$:
$$(\mathrm{P3}): \quad \mathrm{Find} \; \{\mathbf{\Theta}_l\}_{l=1}^{L}$$
$$\mathrm{s.t.} \quad \frac{\big| \big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \big) \mathbf{w}_k \big|^2}{\sum_{j \neq k} \big| \big( \sum_{l \in \mathcal{L}} \mathbf{h}_{r,k,l}^H \mathbf{\Theta}_l \mathbf{G}_l + \mathbf{h}_{d,k}^H \big) \mathbf{w}_j \big|^2 + \sigma_k^2} \geq \gamma_k, \quad \forall k \in \mathcal{K},$$
$$\phantom{\mathrm{s.t.}} \quad 0 \leq \theta_{l,n} \leq 2\pi, \quad \forall l, n.$$
It is possible to relax this problem into a convex semidefinite programming (SDP) feasibility check, which can then be handled by an efficient convex solver (CVX) [32], or to improve the convergence level by transforming it into an optimization problem with an explicit objective function [5,14].
However, the iterative process of AO is time-consuming, as it needs to solve many sub-optimization problems until reaching a defined acceptable level of convergence, which makes it impractical in a highly dynamic environment with mobile UEs. Motivated by these aspects, we develop machine-learning-based algorithms for the online implementation of the active and passive beamforming by considering the static UE scenario and the dynamic UE scenario in the following sections.

3. Proposed Algorithm for Static UE Scenario

In this section, we first introduce our designed new GRNN-based algorithm for the static UE scenario. Then, the model training and validation are described.

3.1. The Proposed New GRNN-Based Algorithm

We design a new GRNN-based algorithm for the joint optimization of active and passive beamforming. The model structure is based on a new type of radial basis network (RBN) called new GRNN [33] that can effectively handle our non-linear regression problem with its continuous optimization variables. A key feature of the network is its ability to design and learn quickly. Moreover, the model’s performance is further enhanced by collecting large data sets and fine-tuning its free parameters in order to converge to an optimal regression surface and gain superior training efficiency. This proposed algorithm mainly contains three parts, namely data preprocessing, data processing, and data representation. The details of each phase are described below.

3.1.1. Data Preprocessing

To solve $(\mathrm{P1})$ based on the new GRNN model, the direct and indirect path channel matrices between all nodes in our presented system, i.e., $\{\mathbf{G}_l\}_{l=1}^{L}$, $\{\mathbf{H}_{r,l}\}_{l=1}^{L}$, and $\mathbf{H}_d$, $\forall l, k$, can logically be used as the model inputs, where $\mathbf{H}_{r,l} = [\mathbf{h}_{r,1,l}, \mathbf{h}_{r,2,l}, \ldots, \mathbf{h}_{r,K,l}] \in \mathbb{C}^{N \times K}$ and $\mathbf{H}_d = [\mathbf{h}_{d,1}, \mathbf{h}_{d,2}, \ldots, \mathbf{h}_{d,K}] \in \mathbb{C}^{M \times K}$. However, these would induce many features to be inputted and processed through the model layers, as they are proportional to the number of reflecting elements, which may affect the model's performance. Hence, a more appropriate input is to use only one matrix for the indirect path channel between the BS and UEs through each IRS. In other words, the products $\mathbf{G}_l^H \mathbf{H}_{r,l}$ are adopted as model inputs, rather than $\mathbf{G}_l$ and $\mathbf{H}_{r,l}$ for each l-th connected IRS. As shown in Figure 2, $\{\mathbf{G}_{R_l}\}_{l=1}^{L}$ and $\mathbf{H}_d$ represent the total model inputs, where $\mathbf{G}_{R_l} = \mathbf{G}_l^H \mathbf{H}_{r,l}$. Additionally, the real and imaginary values of the model inputs are treated as two separate features. By applying this preprocessing step, the input dimension is reduced while the internal structural information of the indirect paths is exploited, which improves the model's efficiency. It can be calculated that the final dimension of the input vector is $2(LMK + MK)$. To ensure reliability and stability, all data are normalized using Min–Max normalization, calculated as
$$a = (a_{\max} - a_{\min}) \left[ \frac{b - b_{\min}}{b_{\max} - b_{\min}} \right] + a_{\min},$$
where $a$ and $b$ denote the normalized and corresponding non-normalized values in our model, respectively, while $a_{\min}$ and $a_{\max}$ are constants that determine the range of the normalized values; we set $a_{\max} = 1$ and $a_{\min} = 0$, which fits the distribution of the collected data set well after various tests.
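The following is a minimal NumPy sketch of this preprocessing step: forming the products $\mathbf{G}_l^H \mathbf{H}_{r,l}$, stacking them with $\mathbf{H}_d$, splitting real and imaginary parts, and applying the Min–Max normalization above. The function name and the global (rather than per-feature) normalization are illustrative assumptions.

```python
# A minimal sketch of the GRNN input preprocessing described above.
import numpy as np

def preprocess_inputs(G, Hr, Hd, a_min=0.0, a_max=1.0):
    """G: (L, N, M) BS->IRS channels, Hr: (L, N, K) IRS->UE channels, Hd: (M, K) direct channels."""
    GR = np.stack([G[l].conj().T @ Hr[l] for l in range(G.shape[0])])  # products G_l^H H_{r,l}, shape (L, M, K)
    feats = np.concatenate([GR.ravel(), Hd.ravel()])                   # LMK + MK complex features
    x = np.concatenate([feats.real, feats.imag])                       # input dimension 2(LMK + MK)
    b_min, b_max = x.min(), x.max()
    return (a_max - a_min) * (x - b_min) / (b_max - b_min) + a_min     # Min-Max normalization
```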

3.1.2. Data Processing

After the preprocessing stage, the data are processed through the model layers: an input layer, an output layer, and, in between, two processing layers, namely a radial basis (pattern) layer and a regression (summation) layer, as illustrated in Figure 2.
  • The input layer, which has the same number of neurons as the number of inputs, receives channel instances of the learning samples of both the direct and indirect channel path coefficients after preprocessing, as described in the previous section.
  • The pattern layer comprises radbas (Gaussian) neurons, with a total number of neurons equal to the number of training samples received by the input layer. The layer units store the non-linear relationship constructed between the generated response and the corresponding input, i.e., by passing the input through each unit in the pattern layer. The weights of this layer are computed as the transpose of the input channel coefficient vectors, $\mathbf{P}$, i.e., $(\mathbf{P})^T$, while the biases are set to $(0.8326/\mathrm{spread})$, where the spread represents the width of the radial basis activation function used, i.e., spread ≤ 1. Then, the i-th pattern neuron's equivalent transfer function is given by
    $$P_i = \exp\left( -\frac{D_i^2}{2(\mathrm{Spread})^2} \right), \quad i = 1, 2, \ldots, m,$$
    where $D_i^2 = (\mathbf{I} - \mathbf{I}_i)^T (\mathbf{I} - \mathbf{I}_i)$ denotes the squared distance between the input vector $\mathbf{I}$ and the learned vector $\mathbf{I}_i$ of the i-th neuron, and $m$ is the number of pattern layer neurons.
  • The regression layer consists of linear neurons whose weights are set to the target values, denoted by $\mathbf{T}$. Two types of summation operations are performed in this layer, symbolized as $S_d$ and $S_k$ and defined as
    $$S_d = \sum_{i=1}^{m} P_i,$$
    $$S_k = \sum_{i=1}^{m} R_{ij} P_i,$$
    where $R_{ij}$ is the training response for the corresponding i-th input and j-th output.
  • In the end, the output layer uses a purelin (linear) activation function to formulate the final optimized output. For each output neuron, the predicted output of the model is calculated as
$$v_k = \frac{S_k}{S_d} = \frac{\sum_{i=1}^{m} R_{ij} \exp\left( -\frac{(\mathbf{I} - \mathbf{I}_i)^T (\mathbf{I} - \mathbf{I}_i)}{2(\mathrm{Spread})^2} \right)}{\sum_{i=1}^{m} \exp\left( -\frac{(\mathbf{I} - \mathbf{I}_i)^T (\mathbf{I} - \mathbf{I}_i)}{2(\mathrm{Spread})^2} \right)}, \quad k = 1, 2, \ldots, n,$$
where $n$ is the number of output layer neurons; these outputs represent our optimization variables in $(\mathrm{P1})$, with the first neurons representing the active beamforming and the remaining ones representing the passive beamforming phases, as will be described below.
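For concreteness, the following is a minimal NumPy sketch of this forward pass: each output is a Gaussian-kernel-weighted average of the stored training responses. The stored inputs, stored responses, and spread value are placeholders for the learned pattern- and regression-layer contents, not the trained model itself.

```python
# A minimal sketch of the GRNN prediction step described above.
import numpy as np

def grnn_predict(x, stored_inputs, stored_responses, spread=0.3):
    """x: (d,) query; stored_inputs: (m, d); stored_responses: (m, n_outputs)."""
    d2 = np.sum((stored_inputs - x) ** 2, axis=1)        # squared distances D_i^2
    P = np.exp(-d2 / (2.0 * spread ** 2))                # pattern-layer outputs
    S_d = P.sum()                                        # denominator summation
    S_k = P @ stored_responses                           # numerator summation, one per output neuron
    return S_k / S_d                                     # predicted beamforming outputs
```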

3.1.3. Data Representation

At the output, the model is trained to predict the active and passive beamforming of the corresponding BS and the $L$ connected IRSs. Thus, the final total output can be represented as $\{\mathbf{W}, \{\boldsymbol{\theta}_l\}_{l=1}^{L}\}$, where $\boldsymbol{\theta}_l$ contains the $N$ phases of the l-th IRS, i.e., $\boldsymbol{\theta}_l = \mathrm{diag}(\mathbf{\Theta}_l)$. Similarly, the real and imaginary coefficients of the optimized beamforming matrices are considered. Then, the proposed model has an output vector with a dimension of $2(MK + 3N)$.

3.2. Model Training and Validation

The network is trained in offline mode using a large data set collected from a diverse range of channel instances under various end-user requirements. It is designed to minimize the mean absolute error (MAE) and the mean square error (MSE) between the optimized vectors derived from the alternating approach, $\mathbf{V}$, and the estimated vectors derived from the trained model, $\hat{\mathbf{V}}$. The mathematical definitions of the MAE and MSE for the proposed model are, respectively, given by
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| V_i - \hat{V}_i \right|,$$
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( V_i - \hat{V}_i \right)^2.$$
During training, cross-validation is adopted to fine-tune the network parameters and evaluate the model's performance. Cross-validation is an important evaluation method for assessing the performance of ML models and avoiding overfitting on new data sets [34]. In this work, K-fold cross-validation is employed, which is the most widely used technique and fits well with the collected data. Under K-fold cross-validation, the entire data set is divided randomly into K subsets (folds), one of which is used for validation, while the K−1 remaining folds are used for training the model. The test MSE or MAE is calculated on the validation subset as a performance indicator. This procedure is repeated K times, validating each fold exactly once. Finally, the model with the fewest validation errors is chosen.
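A minimal sketch of this model-selection loop is given below, assuming a generic train/predict interface for the GRNN; the fold handling uses scikit-learn's KFold, and the metric is the MAE defined above. The function names and arguments are illustrative assumptions.

```python
# A minimal sketch of K-fold cross-validation for choosing the best-performing model.
import numpy as np
from sklearn.model_selection import KFold

def kfold_validation(X, V, train_fn, predict_fn, n_splits=10):
    """X: preprocessed inputs, V: target beamforming vectors from the AO solver (NumPy arrays)."""
    fold_mae = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = train_fn(X[train_idx], V[train_idx])           # fit on the K-1 training folds
        pred = predict_fn(model, X[val_idx])                   # predict on the held-out fold
        fold_mae.append(np.mean(np.abs(V[val_idx] - pred)))    # validation MAE for this fold
    return np.array(fold_mae)                                  # keep the model with the lowest error
```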

4. Proposed Algorithm for Dynamic UE Scenario

In this section, we first clarify the MDP formulation for the problem. Then, a detailed description of the proposed DDPG-based algorithm for the scenario with dynamic UEs is presented.

4.1. MDP Formulation

As can be noted, solving the formulated optimization problem $(\mathrm{P1})$ with the traditional alternating approach suffers from high complexity, since it needs to alternate over the relaxed sub-problems with the derived convex formulations to reach converged solutions for the coupled beamforming decision variables. Moreover, solving the optimization problem with the algorithm proposed in Section 3 eliminates the AO complexity at run time, but it is still restricted to the static UE scenario. This is because the algorithm developed in Section 3 is a supervised-learning-based approach that needs to collect a large data set to reach an optimal regression surface and predicts solutions based on static UE locations. Thus, when considering mobile UEs in a dynamic scenario, this algorithm cannot work efficiently. Motivated by DRL, the optimal beamforming strategy can instead be formed by observing the network states and the returned reward in the case of dynamic UEs. Thus, we propose a DRL-based approach to optimally solve the optimization problem and efficiently handle the dynamic UE scenario. Note that although the algorithm proposed in this section can also work for the static UE case, it is still meaningful to adopt the algorithm in Section 3, because the GRNN model can be deployed more quickly with a ready data set of feasible samples, whereas the DRL-based model needs sufficient training to form its optimal decision policy.
To utilize DRL, the first step is to convert the optimization problem into an MDP form. The MDP generally consists of five fundamental elements, i.e., the agent, state $s_t$, action $a_t$, reward function $r_t$, and policy $\pi_t$ [35]. As shown in Figure 3a, at each time step $t$, the agent receives an instant observation or state $s_t \in \mathcal{S}$ from the environment, where $\mathcal{S}$ is the set of possible states, and responds to that state by selecting an action $a_t \in \mathcal{A}$, where $\mathcal{A}$ is the set of possible actions. Then, the environment updates its state to the next state $s_{t+1}$ and provides a performance-metric reward $r_t \in \mathcal{R}$ for this state–action pair $(s_t, a_t)$. The mapping from states to actions based on the agent's policy, $\pi_t(s_t, a_t)$, denotes the probability of choosing action $a_t$ in the current state $s_t$. Over time, the agent aims to learn an optimal policy by maximizing the received rewards. Under our considered optimization problem, since the source BS decides on the optimal beamforming based on the surrounding environmental state, it functions as the agent. The state, action, reward, and policy are described below.
  • State: The state $s_t$ consists of all the currently observed channel matrices of the direct and indirect paths at time step $t$, i.e., $(\{\mathbf{G}_{R_l}\}_{l=1}^{L})_t$ and $(\mathbf{H}_d)_t$; the previous transmission power from the source to each k-th active UE at time step $t-1$, i.e., $(P_k^{\mathrm{trans}})_{t-1}$; and the previously received powers at each k-th active UE at time step $t-1$, i.e., $(P_k^{\mathrm{rec}})_{t-1}$. Here, the transmission power for each k-th UE is defined as $(P_k^{\mathrm{trans}})_{t-1} = (\|\mathbf{w}_k\|^2)_{t-1}$, which leads to $K$ inputs stacked into the state. As each k-th UE in the system receives its desired signal and the interference signals due to the other $K-1$ UEs, the received power for each k-th UE is defined as $(P_k^{\mathrm{rec}})_{t-1} = |\mathbf{h}_k^H \mathbf{W}|^2$. This induces $K$ inputs for each UE; consequently, $K^2$ values in total are stacked into the state. Mathematically, the state at time step $t$ is given by
    $$s_t = \left\{ \big(\{\mathbf{G}_{R_l}\}_{l=1}^{L}\big)_t, \; (\mathbf{H}_d)_t, \; \big(\{P_k^{\mathrm{trans}}\}_{k=1}^{K}\big)_{t-1}, \; \big(\{P_k^{\mathrm{rec}}\}_{k=1}^{K}\big)_{t-1} \right\} \in \mathcal{S}.$$
    Note that since the complex-valued channel inputs are treated as two separate features, the instant state dimension is $2(LMK + MK) + K + K^2$.
  • Action: The action $a_t$ at time step $t$ is constructed from the optimization variables of problem $(\mathrm{P1})$, i.e., the active beamforming, $(\mathbf{W})_t$, and the passive beamforming vectors of all deployed IRSs, $(\{\boldsymbol{\theta}_l\}_{l=1}^{L})_t$. Thus, the action can be defined as
    $$a_t = \left\{ (\mathbf{W})_t, \; \big(\{\boldsymbol{\theta}_l\}_{l=1}^{L}\big)_t \right\} \in \mathcal{A}.$$
    Correspondingly, the instant action dimension is simply $2(MK + LN)$, where the complex-valued form is considered.
  • Reward: Since the total transmit power minimization at the source is the objective of the presented optimization problem, linking the objective function to the reward allows the MDP goal of maximizing long-term rewards to be achieved. Accordingly, the instant reward fed back by the environment in this model is defined as the overall energy efficiency, i.e., the ratio between the total achieved rate and the total transmitted power. Considering the SINR constraints, the SINR of each UE should be above the SINR threshold to satisfy the required QoS. When a UE's resulting SINR is below the threshold, a penalty is applied to help the agent correct the inappropriate beamforming. In our designed reward function, we set the penalty value to zero, which is equivalent to assuming that an unsuitable action leads to an infinite total transmit power. Hence, the instant reward at each time step $t$ is calculated as follows (a minimal sketch of this reward computation is given after this list):
    $$r_t = \begin{cases} 0, & \text{if } \mathrm{SINR}_k < \gamma_k \text{ for any } k \in \mathcal{K}, \\ \dfrac{\sum_{k=1}^{K} R_k}{\sum_{k=1}^{K} \|\mathbf{w}_k\|^2}, & \text{otherwise}. \end{cases}$$
  • Policy: The policy adopted in this work is a deterministic policy that learns to select the best possible actions of active and passive beamforming depending on different states, as will be discussed in the following section.
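As referenced above, the following is a minimal NumPy sketch of the reward computation: the per-UE SINRs are assumed to be computed from the channels and beamformers as in the system model, and the reward is zero if any UE misses its threshold and the energy efficiency (sum rate over total transmit power) otherwise. All names are illustrative.

```python
# A minimal sketch of the instant reward used by the DRL agent.
import numpy as np

def instant_reward(sinr, gamma, W):
    """sinr: (K,) per-UE SINRs; gamma: (K,) SINR thresholds; W: (M, K) beamformers."""
    if np.any(sinr < gamma):
        return 0.0                                   # QoS penalty for an unsuitable action
    sum_rate = np.sum(np.log2(1.0 + sinr))           # total achievable rate
    total_power = np.sum(np.abs(W) ** 2)             # sum_k ||w_k||^2
    return sum_rate / total_power                    # energy efficiency
```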

4.2. The Proposed DDPG-Based Algorithm

Note that value-based algorithms (e.g., Q-learning) can only train over discrete action spaces; however, in this work, the action space is continuous, so these algorithms cannot be applied. Moreover, although policy-based algorithms (e.g., the policy gradient) can handle continuous or discrete action spaces, they still have limited convergence performance. Thus, to overcome these limitations, the actor–critic deep deterministic policy gradient (DDPG) algorithm is adopted in this work to handle the designed optimization problem efficiently with continuous spaces.
As a model-free, online, off-policy reinforcement learning approach, the DDPG algorithm learns and trains two main deep neural networks, namely the actor network and the critic network, as shown in Figure 3a. Specifically, the actor network is a deterministic policy network, $\mu(s; \theta^{\mu})$, trained to choose the optimal action for the current state with parameters $\theta^{\mu}$. The critic network $Q(s, a; \theta^{q})$ evaluates the performance of the actor's output through the estimated Q-value. Therein, the network parameters $\theta^{\mu}$ and $\theta^{q}$ are updated during training by a deterministic policy gradient algorithm and by gradients obtained from the temporal difference (TD) error signal, respectively. Since the DDPG aims to approach the optimal output Q-value without knowing the actual target value, it may cause divergence in the Q-network. To tackle this issue, two copies of the networks are created for the actor and critic, respectively, which are called target networks and are responsible for the target values, denoted as $\mu'(s; \theta^{\mu'})$ and $Q'(s, a; \theta^{q'})$, respectively. The target networks share an identical structure with the original training networks, but their parameters differ. Additionally, the DDPG algorithm uses a replay memory to store all trained experiences in the form $\{s_t, a_t, r_t, s_{t+1}\}$. This memory is accessed during network updates by sampling a random mini-batch, which helps to eliminate the correlation between samples. Moreover, to guarantee exploration performance with the learned deterministic policy, the DDPG employs an exploration policy on the selected actions during the training process. Generally, to construct the exploration policy $\tilde{\mu}$, noise sampled from a noise process $\mathcal{N}$ is added to the chosen action, as follows:
$$\tilde{\mu}(s_t) = \mu(s_t; \theta^{\mu}) + \mathcal{N}_t.$$
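A minimal sketch of this exploration step is shown below, using zero-mean Gaussian noise scaled to a small fraction of the action range; the 5% scale is an illustrative value within the 1–10% range mentioned later for the simulation settings.

```python
# A minimal sketch of the exploration policy: actor output plus sampled noise.
import numpy as np

def explore(actor_action, action_range, noise_scale=0.05, rng=np.random.default_rng()):
    """actor_action: deterministic output mu(s_t); action_range: scalar span of valid actions."""
    noise = rng.normal(0.0, noise_scale * action_range, size=actor_action.shape)
    return actor_action + noise
```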

4.2.1. Description of the Proposed DDPG-Based Algorithm

Based on the aforementioned structure of the DDPG, we propose a DDPG-based algorithm to solve the optimization problem $(\mathrm{P1})$ in the dynamic UE scenario, as summarized in Algorithm 1. At the start, the main and target networks are initialized; specifically, the four training and target networks are initialized with uniformly distributed parameters such that $\theta^{\mu'} = \theta^{\mu}$ and $\theta^{q'} = \theta^{q}$. Additionally, all model hyper-parameters for training are initialized. Then, at the training stage, the agent's interactions with the surrounding environment are split into $I$ episodes with a finite number of $T$ steps per episode. During each training episode, the environment is reset to generate new UE positions and channel gains. Then, all direct and indirect channels between the network nodes are calculated, and the beamforming vectors are randomly initialized. Accordingly, all the transmission and received powers are calculated, and the state $s_t$ is formulated by (18) and fed as input to the actor. The actor feeds back the decision $a_t$ by applying the current policy and adding the sampled noise $\mathcal{N}_t$; the BS active beamforming and the deployed IRSs' passive beamforming are extracted from this action, the current reward $r_t$ is calculated by (20), and the next state $s_{t+1}$ is formulated. The transition $\{s_t, a_t, r_t, s_{t+1}\}$ is then stored in the experience buffer. The algorithm needs at least $N_B$ experiences to start updating and learning the networks; thus, the previous steps are repeated $N_B$ times to form the starting mini-batch in the first episode only. Once the $N_B$ experiences are created, the algorithm starts training.
At each time step $t$, a random mini-batch of $N_B$ samples is chosen from the replay memory. Accordingly, the target value function is calculated by the target networks as the sum of the instantaneous reward and the discounted future reward:
$$y_t = r_t + \gamma Q'\big(s_{t+1}, \mu'(s_{t+1}; \theta^{\mu'}); \theta^{q'}\big).$$
Note that if $s_{t+1}$ is a terminal state, then $y_t = r_t$. Then, the critic network is trained to minimize the critic loss function by applying the gradient descent approach, which is defined as
$$L(\theta^{q}) = \mathbb{E}\left[ \big( y_t - Q(s_t, a_t; \theta^{q}) \big)^2 \right].$$
By using the replay memory and randomly selecting a mini-batch of $N_B$ samples during training, the critic network is updated by minimizing the critic loss function across all sampled experiences as follows:
$$L = \frac{1}{N_B} \sum_{t=1}^{N_B} \big( y_t - Q(s_t, a_t; \theta^{q}) \big)^2.$$
Afterward, the actor network is updated using a sampled policy gradient by applying the derivative of the policy's performance $J$ with respect to the actor parameters $\theta^{\mu}$, which is calculated as
$$\nabla_{\theta^{\mu}} J = \frac{1}{N_B} \sum_{t=1}^{N_B} \underbrace{\nabla_{a} Q\big(s_t, \mu(s_t; \theta^{\mu}); \theta^{q}\big)}_{G_{a_t}} \; \underbrace{\nabla_{\theta^{\mu}} \mu(s_t; \theta^{\mu})}_{G_{\mu_t}},$$
where the $G_{a_t}$ term represents the gradient of the critic output with respect to the actor action, and $G_{\mu_t}$ is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation $s_t$. Then, the target actor and critic parameters are updated by using a smoothing update, which is mathematically calculated as follows:
$$\theta^{\mu'} = \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'},$$
$$\theta^{q'} = \tau \theta^{q} + (1 - \tau) \theta^{q'},$$
where $\tau \ll 1$ is the target smoothing factor.
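To make these update steps concrete, the following is a minimal PyTorch sketch of one DDPG training iteration: the TD target from the target networks, the mini-batch critic loss, the policy-gradient actor update, and the smoothed target update. The actor/critic modules, their optimizers, and the sampled mini-batch tensors (s, a, r, s_next) are assumed to exist (a matching network sketch is given in Section 4.2.2); this is an illustrative sketch, not the authors' implementation.

```python
# A minimal sketch of one DDPG update step (TD target, critic loss, actor policy
# gradient, and soft target update) on a sampled mini-batch.
import torch

def ddpg_update(actor, critic, actor_t, critic_t, opt_actor, opt_critic,
                s, a, r, s_next, gamma=0.99, tau=0.001):
    # Target value y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))

    # Critic update: minimize the mini-batch mean-squared TD error.
    critic_loss = torch.mean((y - critic(s, a)) ** 2)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor update: ascend the sampled policy gradient (descend its negative).
    actor_loss = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft (smoothed) target updates with factor tau << 1.
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```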
Algorithm 1 The Proposed DDPG-Based Algorithm
1: Initialize the training networks for the actor and critic, $\mu(s; \theta^{\mu})$ and $Q(s, a; \theta^{q})$, with initial parameters $\theta^{\mu}$ and $\theta^{q}$, respectively.
2: Initialize the target networks for the actor and critic, $\mu'(s; \theta^{\mu'})$ and $Q'(s, a; \theta^{q'})$, with copied parameters $\theta^{\mu'} \leftarrow \theta^{\mu}$ and $\theta^{q'} \leftarrow \theta^{q}$, respectively.
3: Initialize the actor and critic learning rates $\mu_a$ and $\mu_c$, an empty experience replay buffer $D$, the mini-batch size $N_B$, the target smoothing factor $\tau$, and the discount factor $\gamma$.
4: for episode $i = 1, 2, \ldots, I$ do
5:    Randomly set $(x_k, y_k, z_k), \forall k$, and obtain $\{\mathbf{G}_l\}_{l=1}^{L}$, $\{\mathbf{H}_{r,l}\}_{l=1}^{L}$, and $\mathbf{H}_d$, $\forall l, k$.
6:    Initialize $\{\boldsymbol{\theta}_l\}_{l=1}^{L}$ and $\mathbf{W}$.
7:    Form the initial state $s_1$ using (18).
8:    for $t = 1, 2, \ldots, T$ do
9:       Select an action as $a_t = \mu(s_t; \theta^{\mu}) + \mathcal{N}_t$.
10:      Form the BS beamforming $\mathbf{W}$ and the IRS beamforming matrices $\{\mathbf{\Theta}_l\}_{l=1}^{L}$ from $a_t$.
11:      Obtain the instant reward $r_t$ using (20).
12:      Obtain the next state $s_{t+1}$ using (18).
13:      Store the experience $\{s_t, a_t, r_t, s_{t+1}\}$ in buffer $D$.
14:      if update then
15:         Sample a random mini-batch of $N_B$ experiences $\{s_t, a_t, r_t, s_{t+1}\}$ from the buffer.
16:         Set the value function target as
            $$y_t = \begin{cases} r_t, & \text{if } s_{t+1} \text{ is a terminal state}, \\ r_t + \gamma Q'\big(s_{t+1}, \mu'(s_{t+1}; \theta^{\mu'}); \theta^{q'}\big), & \text{otherwise}. \end{cases}$$
17:         Update the main critic parameters by minimizing the loss in (24).
18:         Update the main actor parameters by the sampled policy gradient in (25).
19:         Update the target network parameters $\theta^{\mu'}$ and $\theta^{q'}$ every $U$ steps by smoothing in (26) and (27).
20:      end if
21:      Set the new input state as $s_{t+1}$.
22:   end for
23: end for
In the implementation stage, the agent observes the current state of the surrounding environment and chooses the active and passive beamforming based on the trained model. After this, it observes the new state for a new decision. Here, we assume that the UEs' locations are known to the BS; this location information can be predicted by processing the estimated channels and received signals from the users in the uplink and extracting location-based information, such as the angles of arrival (AoA) and departure (AoD), path loss coefficients, and time of arrival (ToA) [36]. Note that the training stage is performed offline using a high-performance computational server, while the implementation is done online. Thus, we can focus on the implementation complexity, as will be analyzed in the following section.

4.2.2. Structure of the Actor and Critic Networks

In the proposed DDPG algorithm, both the actor and critic networks are fully connected deep neural networks (DNNs) with the same structure. Each network is composed of five layers, as shown in Figure 3b, i.e., one input layer, one output layer, two hidden layers, and a normalization layer in between. The input and output dimensions of the actor are equal to the state cardinality and the action set cardinality, respectively. Meanwhile, in the critic network, these dimensions change to the cardinality of the state plus the action for the input and one for the output (i.e., representing the Q-value function). Note that the state and action are treated independently and then concatenated within the critic to be processed as one input. The number of hidden-layer neurons is set to 400 for each layer, which is found to be suitable for the number of BS transmit antennas, the number of active UEs, and the number of IRS elements considered in our system model [26]. Between the hidden layers, a normalization layer is used to speed up the algorithm's convergence and reduce the training time. As for the activation functions, the tanh and ReLU functions are utilized, which facilitate backpropagation and gradient descent. In addition, the networks are trained according to the Adam optimizer, with the decay rates of the gradient and squared-gradient moving averages set to 0.9 and 0.999, respectively. To ensure the adequate exploration of the action set, the noise model $\mathcal{N}$ added to the action is chosen as complex Gaussian noise with zero mean and a variance satisfying $\sqrt{\mathrm{Variance} \cdot T_s}$ = (1% to 10%) of the action range, where $T_s$ is the sampling time. Other hyper-parameters used in the model are listed in Table 1, which were found to be the best tuning values for our optimization problem after some trials.
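The following is a minimal PyTorch sketch of the actor and critic architectures just described: two 400-neuron hidden layers with a normalization layer in between, tanh on the actor output, ReLU elsewhere, and Adam with decay rates 0.9 and 0.999. The use of LayerNorm, the exact layer ordering, and the example dimensions are assumptions for illustration.

```python
# A minimal sketch of the fully connected actor and critic networks.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.LayerNorm(hidden),                       # normalization layer between the hidden layers
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())   # bounded beamforming action

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                       # scalar Q-value

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))      # state and action concatenated

state_dim, action_dim = 68, 56                          # example dimensions only
actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3, betas=(0.9, 0.999))
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3, betas=(0.9, 0.999))
```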

5. Simulation Results

In this section, we present the numerical results for both the static UE scenario and the dynamic UE scenario. The considered system's 2D deployment is depicted in Figure 4, where the BS is positioned at $(x_B, y_B, z_B) = (-200, 0, 10)$, and three IRSs are deployed at the position coordinates $(x_l, y_l, z_l)$ of $(0, -100, 10)$, $(100, 0, 10)$, and $(0, 100, 10)$, respectively, i.e., at the same height as the BS. We consider that each IRS surface, deployed on a 2D rectangular grid, has $N = N_H N_V$ reflecting elements, where $N_H$ and $N_V$ denote the number of elements per row and column, respectively. There are $K = 4$ UEs randomly placed in a circle with a radius of $R = 80$ m centered at the origin.
As for the channel model, the BS–UE channels are assumed to follow Rayleigh fading, i.e., $\mathbf{h}_{d,k}^H = \sqrt{\beta_{0,k}} \tilde{\mathbf{h}}_{d,k}^H$, where the entries of $\tilde{\mathbf{h}}_{d,k}^H$ are modeled as $\mathcal{CN}(0, 1)$, and $\beta_{0,k}$ is the corresponding path loss between the BS and the k-th UE, modeled as $32.6 + 36.7 \log_{10}(d_{\mathrm{direct}})$ [dB] [10], where $d_{\mathrm{direct}}$ is the distance of the direct link between the BS and the k-th UE. The IRSs are deployed so as to maintain line-of-sight (LOS) links with the BS and with the UEs. Thus, the BS–IRS and IRS–UE channels are modeled as Rician fading, i.e., $\mathbf{G}_l = \sqrt{\beta_{1,l}} \big( \sqrt{\tfrac{\varepsilon}{1+\varepsilon}} \tilde{\mathbf{G}}_l^{(LOS)} + \sqrt{\tfrac{1}{1+\varepsilon}} \tilde{\mathbf{G}}_l^{(NLOS)} \big)$ and $\mathbf{h}_{r,k,l}^H = \sqrt{\beta_{2,k,l}} \big( \sqrt{\tfrac{\varepsilon}{1+\varepsilon}} \tilde{\mathbf{h}}_{r,k,l}^{H(LOS)} + \sqrt{\tfrac{1}{1+\varepsilon}} \tilde{\mathbf{h}}_{r,k,l}^{H(NLOS)} \big)$, respectively, where $\beta_{1,l}$ and $\beta_{2,k,l}$ are the path losses between the BS and the l-th IRS, and between the l-th IRS and the k-th UE, respectively. The path loss of the indirect channels is modeled as $35.6 + 22 \log_{10}(d_{\mathrm{indirect}})$ [dB] [10], where $d_{\mathrm{indirect}}$ is the corresponding distance between nodes in the indirect links, i.e., from the BS to the l-th IRS or from the l-th IRS to the k-th UE. Here, $\varepsilon$ is the Rician factor, set as 10 in the simulations. The elements of the NLOS components $\tilde{\mathbf{G}}_l^{(NLOS)}$ and $\tilde{\mathbf{h}}_{r,k,l}^{H(NLOS)}$ follow the distribution $\mathcal{CN}(0, 1)$. The LOS components can be constructed as in [37], i.e., $\tilde{\mathbf{G}}_l^{(LOS)} = \mathbf{a}_{IRS}(\vartheta_{1,l}, \psi_{1,l}) \mathbf{a}_{BS}(\vartheta_1, \psi_1)^H$ and $\tilde{\mathbf{h}}_{r,k,l}^{H(LOS)} = \mathbf{a}_{IRS}(\vartheta_{l,k}, \psi_{l,k})^H$, where $\mathbf{a}_{IRS}$ and $\mathbf{a}_{BS}$ are the IRS and BS steering vectors, respectively, with the corresponding azimuth and elevation angles $\vartheta$ and $\psi$. The n-th element of the l-th IRS steering vector is $[\mathbf{a}_{IRS}(\vartheta_{l,k}, \psi_{l,k})]_n = e^{j 2\pi \frac{d_l}{\lambda_c} \{ i_1(n) \sin(\vartheta_{l,k}) \cos(\psi_{l,k}) + i_2(n) \sin(\psi_{l,k}) \}}$, where $d_l$ is the distance separating adjacent elements of the l-th IRS, $\lambda_c$ is the carrier wavelength, $i_1(n) = \mathrm{mod}(n-1, N_H)$, and $i_2(n) = \lfloor (n-1)/N_H \rfloor$. In the simulations, we set $2 d_l = \lambda_c$, fix $N_H = 2$, and increase $N_V$ for the deployed IRSs. Additionally, $\sin(\vartheta_{l,k}) \cos(\psi_{l,k}) = (y_k - y_l)/d_{kl}$ and $\sin(\psi_{l,k}) = (z_k - z_l)/d_{kl}$, where $d_{kl}$ is the distance between the l-th IRS and the k-th UE. The steering vector $\mathbf{a}_{BS}(\vartheta_1, \psi_1)$ is calculated as $\mathbf{a}_{BS}(\vartheta_1, \psi_1) = [1, \ldots, e^{j 2\pi (M-1) \frac{d_B}{\lambda_c} \cos(\vartheta_1) \cos(\psi_1)}]$, where $d_B$ is the distance separating adjacent BS antennas; we set $2 d_B = \lambda_c$, and $\cos(\vartheta_1) \cos(\psi_1) = (x_l - x_B)/d_{Bl}$, where $d_{Bl}$ is the distance between the l-th IRS and the BS. Similarly, the steering vector $\mathbf{a}_{IRS}(\vartheta_{1,l}, \psi_{1,l})$ can be calculated, in which $\sin(\vartheta_{1,l}) \cos(\psi_{1,l}) = (y_B - y_l)/d_{Bl}$ and $\sin(\psi_{1,l}) = (z_B - z_l)/d_{Bl}$. Note that channel reciprocity is assumed between the downlink and uplink channels. Other simulation parameters include the noise power $\sigma_k^2 = -80$ dBm, the AO algorithm threshold $\epsilon = 10^{-1}$, and 1000 random vectors in the Gaussian randomization.
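The following is a minimal NumPy sketch of generating one such Rician-fading link: a path-loss factor times a weighted combination of a LOS steering-vector component and a Rayleigh NLOS component. The steering vectors here are simplified placeholders with random angles rather than the exact deployment geometry, and the function name and arguments are illustrative.

```python
# A minimal sketch of a Rician-fading channel draw for an indirect (BS-IRS or IRS-UE) link.
import numpy as np

def rician_channel(n_rx, n_tx, distance_m, rician_factor=10.0, rng=np.random.default_rng()):
    path_loss_db = 35.6 + 22.0 * np.log10(distance_m)          # indirect-link path loss model
    path_loss = 10.0 ** (-path_loss_db / 10.0)
    # Simplified LOS component from two random steering vectors (half-wavelength spacing).
    a_rx = np.exp(1j * np.pi * np.arange(n_rx) * np.sin(rng.uniform(0.0, np.pi)))
    a_tx = np.exp(1j * np.pi * np.arange(n_tx) * np.sin(rng.uniform(0.0, np.pi)))
    los = np.outer(a_rx, a_tx.conj())
    nlos = (rng.standard_normal((n_rx, n_tx)) + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
    w_los = np.sqrt(rician_factor / (1.0 + rician_factor))
    w_nlos = np.sqrt(1.0 / (1.0 + rician_factor))
    return np.sqrt(path_loss) * (w_los * los + w_nlos * nlos)
```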
To demonstrate the effectiveness of our proposed algorithms, we compare our algorithms with the benchmark schemes. The considered benchmark schemes and scenarios are as follows.
  • Without IRSs: no IRSs are deployed in the system, i.e., $(\mathrm{P1})$ is solved with $L = 0$.
  • Random Phases: all IRS elements are set to random phases, and then $\mathbf{W}$ is optimized by solving $(\mathrm{P2})$.
  • Iterative Approach: $(\mathrm{P2})$ is solved iteratively with random phases, and across iterations the variables leading to the minimum transmit power are kept until the stopping criteria are met.
  • Alternating Optimization: $(\mathrm{P1})$ is solved in the conventional alternating manner until convergence is achieved [5,14].
  • AO Heuristic: $(\mathrm{P1})$ is solved via the AO approach initialized with near-optimal phases rather than random ones [38].
Simulation results are generated using the MATLAB environment R2021b. Moreover, the ML and DRL toolboxes, the CVX package (version 2.2), and the SeDuMi solver are used.

5.1. Performance Analysis for the Static UE Scenario

In this subsection, we present the numerical results for the static UE scenario, where the location of UEs is generated from one random realization, as depicted in Figure 4.
For our proposed new GRNN-based algorithm, training and validation are key to performance; hence, we first present the effects of the key parameter (i.e., the spread factor) to facilitate the selection of the training and validation sets. Here, supervised offline learning is used to train the proposed algorithm's network on a generalized dataset of 220,000 samples collected from AO results, where the SINR requirements are in the range of [−10 dB, 0 dB]. Figure 5 plots the mean absolute error versus the spread factor, considering 10-fold cross-validation, which means that the model is trained with 90% of the data set while 10% is used for validation testing. From the figure, we can see that all folds achieve the lowest validation errors in the spread factor range of [0.28, 0.34], while spread values greater than this range induce model overfitting. Moreover, the curve for fold 1 achieves the best MAE among the different folds. This is because the network trained on fold 1 outperforms the other folds in the validation performance.
Then, we study the training and validation performance in the validated range of [0.28, 0.34], which has the lowest MAE. Table 2 shows the training and validation mean error values under different spread factors, where $V_{MAE}$ and $V_{MSE}$ denote the validation mean absolute and mean square errors, respectively, and $T_{MAE}$ and $T_{MSE}$ denote the training mean absolute and mean square errors, respectively. From the table, it can be seen that as the spread factor increases, the validation errors decrease, but the training error increases. Nonetheless, when the spread factor equals 0.3, the validation error remains constant and then increases. Thus, to compromise between the validation and training errors, a spread factor of 0.3 is chosen to train the model, which corresponds to the best performance.
We then compare our proposed GRNN-based algorithm with the considered benchmark schemes in terms of the optimized total transmit power. Figure 6 plots the optimized transmit power at the source BS versus the SINR threshold of the UEs under the different schemes, where each simulated point is averaged over 200 random initializations. From the figure, it is clear that the optimized transmit power from our proposed GRNN-based algorithm is the lowest compared to the other benchmark schemes, demonstrating our algorithm's superiority. Compared to the scenarios with deployed IRSs, the power consumption for the scenario without IRSs is the highest, indicating the energy performance gain brought by the deployed IRSs. The random phase scheme has the second-worst performance, which implies that the phases of the IRS elements should be optimized rather than randomly chosen in order to achieve the maximum gain from their deployment. Moreover, when comparing the iterative scheme with the AO algorithm, we can see that the performance of the AO algorithm is better than that of the iterative algorithm. This is because the performance of both approaches depends on the starting phases and stopping criteria. The iterative approach updates its beamforming during iterations with the values that lead to the minimum transmit power until 200 iterations are reached, whereas the AO stopping criterion is that the fractional decrease in the objective value is less than the AO threshold level $\epsilon$, as defined before. Clearly, AO optimizes the beamforming at every iteration, leading to less power than the optimized value in the previous iteration; the iterative approach, in contrast, depends on the random phases generated at every iteration, which explains the power reduction noticed only after some iterations. Finally, at the same convergence level, the AO heuristic algorithm can find better-optimized solutions than the AO algorithm, which highlights the effect of the initial starting phases on alternating approaches.

5.2. Performance Analysis for the Dynamic UE Scenario

In this section, we concentrate on the dynamic UE scenario. In this case, a random walk of the UEs is adopted to imitate the change in the UEs' locations. Each UE is assumed to move at a speed $v_k$ ($0 \leq v_k \leq 1$ m/s) with a moving angle $g_k$ ($0 \leq g_k \leq 2\pi$), where the UEs are confined to the region defined by the right half-circle in front of the IRSs, as in Figure 4.
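A minimal sketch of this mobility model is given below: at each step, every UE draws a speed in [0, 1] m/s and a heading in [0, 2π) and moves accordingly; the confinement to the serving region is omitted for brevity, and the function name and step duration are illustrative.

```python
# A minimal sketch of the UE random-walk mobility model.
import numpy as np

def random_walk_step(positions, step_seconds=1.0, rng=np.random.default_rng()):
    """positions: (K, 2) array of UE (x, y) coordinates in meters."""
    v = rng.uniform(0.0, 1.0, len(positions))            # speed v_k in [0, 1] m/s
    g = rng.uniform(0.0, 2.0 * np.pi, len(positions))    # moving angle g_k in [0, 2*pi)
    step = step_seconds * np.column_stack([v * np.cos(g), v * np.sin(g)])
    return positions + step
```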
The proposed DDPG-based algorithm should be trained in such a way that it yields high output performance. The network is monitored over the episode steps to validate the algorithm's performance while training and updating the network parameters. Figure 7 plots the instant and average rewards over the episode steps, where the average rewards are calculated cumulatively from the instant rewards during the episode steps. It is shown that the model starts by receiving low reward values from the environment in the early steps, and as the number of steps increases, these rewards increase until convergence occurs. This is mainly because the training samples are selected from the replay buffer: in the early steps, the buffer contains relatively few experiences, which may lead to a high correlation between samples and, thus, reduced training performance. Nonetheless, as the experiences in the buffer increase, the model trains towards increasing the long-term received rewards.
Under our proposed algorithm, the network parameters directly impact the performance and convergence rate; hence, we investigate the effects of two main critical hyper-parameters, i.e., the actor and critic learning rates and target smoothing factors, on our designed model in the following.
Figure 8 plots the average reward versus the episode steps under different learning rates. The figure shows that the curve with a learning rate of 0.001 achieves the highest average rewards. The performance gain obtained from the smaller learning rate comes at the expense of a longer convergence time. In addition, it is worth noting that the learning rate of 0.1 has the worst performance, as, for excessively high rates, the oscillations increase and degrade the performance. This implies that the learning rate should be neither too low nor too high, so as to balance performance and convergence. Additionally, Figure 9 plots the optimized transmit power at the source BS versus the episode steps under the different learning rates. The figure shows that the model achieves the best-optimized transmit power under the learning rate of 0.001, where the transmit power decreases over the steps and reaches optimal values.
Additionally, applying high learning rate values leads to high transmit power values or increases the oscillation values, as in the case of 0.1 rates. Accordingly, we set the learning rates for both the actor and the critic in our model to be 0.001, which ensures long-term rewards and satisfies our objective of minimum transmit power.
Another important parameter that should be tuned for the model is the smoothing factor for the target actor and critic parameters updates. Generally, these values need to be set to less than 1 to guarantee smoothness. Figure 10 plots the average rewards versus the episode steps under different smoothing factors. We consider both low- and high-value ranges. It is observed that applying 0.001 as a smoothing value can achieve the best performance compared to other values. Furthermore, we observe that increasing this value reduces the average rewards and worsens the performance. As shown, applying 0.5 as a smoothing value degrades the model’s performance and decreases the received rewards. This implies that applying low smoothing factor values to the model is well suited for updating the target values and converging to optimal active and passive beamforming for the designed system. Accordingly, we apply 0.001 as a smoothing factor update for both the actor and the critic.
Finally, Figure 11 plots the average total transmit power over the learning episodes with the tuned parameters, where the average is computed cumulatively over the episodes elapsed so far. Note that at the start of each training episode, the environment is reset to new UE positions with new channel gains. It can be seen that the average total transmit power of the BS decreases as the training episodes accumulate, until convergence occurs. In other words, after each episode the algorithm updates its policy towards achieving its objective with the minimum transmit power under different UE positions and channel gains.
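Putting the pieces together, a high-level training skeleton consistent with this description might look as follows; `env`, `agent`, and their methods are placeholders for the environment and the DDPG agent, not an API defined in the paper.

```python
def train(env, agent, num_episodes=500, steps_per_episode=100):
    """Illustrative outer loop: reset UE positions and channels each episode,
    then let the DDPG agent refine its joint beamforming policy."""
    avg_power_per_episode = []
    for episode in range(num_episodes):
        state = env.reset()                  # new UE positions and channel gains
        episode_power = []
        for _ in range(steps_per_episode):
            action = agent.act(state)        # active + passive beamforming decision
            next_state, reward, tx_power = env.step(action)
            agent.remember(state, action, reward, next_state)
            agent.learn()                    # sample a mini-batch and update networks
            episode_power.append(tx_power)
            state = next_state
        # Track the average total transmit power, as plotted in Figure 11.
        avg_power_per_episode.append(sum(episode_power) / len(episode_power))
    return avg_power_per_episode
```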
We then compare our proposed DDPG-based algorithm with the considered benchmark schemes in terms of the optimized total transmit power. Figure 12 plots the optimized total transmit power versus the SINR threshold under the different schemes, where each simulation point is averaged over 200 random initializations. As the figure demonstrates, by adopting the proposed DDPG-based algorithm, the source BS transmits the lowest power among all schemes, which indicates the superiority of our algorithm and its ability to reach an optimal beamforming configuration. Similar to Figure 6, under the dynamic UE scenario, the system without IRS assistance forces the source BS to transmit the highest power. Moreover, while choosing random phases for the reflecting elements requires less average power than deploying no IRS at all, its performance is still limited by the weak passive beamforming gain that random phases provide. The remaining approaches, i.e., the iterative, AO, and AO heuristic schemes, yield sub-optimal solutions that depend on their iterative convergence level.
Figure 13 plots the optimized total transmit power versus the number of reflecting units per IRS under the different schemes. Each simulated point is averaged over 200 random initializations with an SINR threshold of 0 dB for each UE. The required minimum transmit power of the considered schemes decreases as the number of reflecting units increases, except for the system without IRSs. This is because increasing the number of reflecting elements on the deployed surfaces, with well-optimized phases, increases the gains brought by these surfaces, which allows the source BS to transmit less power while still satisfying the UEs' constraints. Moreover, the random phase scheme has the second-worst performance, ahead only of the scheme without IRSs, which reflects the gain brought by the IRSs through the indirect channel paths. The iterative, AO, and AO heuristic approaches again yield sub-optimal solutions that depend on their iterative convergence level, implying that the phases of the IRS elements should be optimized rather than chosen randomly in order to obtain the maximum gain from additional IRS elements. As shown in the figure, the proposed DDPG-based algorithm achieves the lowest transmit power among all benchmark schemes.

5.3. Complexity Analysis

One of the main motivations for incorporating ML models into IRS-aided systems, as in our work, is their major advantage of reducing the run-time complexity. In this section, we evaluate the time complexity of the proposed algorithms compared to the considered benchmark schemes.
Table 3 compares the average prediction run time of the different schemes under various SINR thresholds in the static UE scenario. The average prediction run time is measured from the online implementation of each algorithm when predicting the beamforming and is averaged over 200 random initializations. As shown in Table 3, under the static UE case, the average prediction run time of the proposed DDPG-based algorithm is far shorter than that of the other three schemes. This is because it does not need to solve any sub-optimization problems, as the alternating-based approaches do, and its structure does not depend on the size of a data set, as the GRNN-based algorithm does. The second-best algorithm in terms of complexity is the proposed GRNN-based algorithm, whose network structure and number of intermediate neuron units depend on the data set size, as described in the data processing stage of the algorithm. The AO algorithm and the AO heuristic method, in contrast, must solve many sub-problems alternately until convergence is reached; theoretically, the complexity of the alternating-based algorithms is O(2I_t), where I_t is the number of iterations needed to converge. For example, at 10 dB, the AO approach requires solving, on average, ten sub-optimization problems to converge, equivalent to five iterations. Furthermore, the proposed AO heuristic, which starts the AO method from near-optimal initial phases rather than random ones that may be far from the optimal solution, drops the number of convergent iterations to, on average, two. In conclusion, the DDPG-based algorithm outperforms the other schemes in online implementation, although it still requires adequate offline training.
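For completeness, the kind of measurement behind Table 3 and Table 4 can be sketched as follows; the 200 random initializations match the text, while `predict_beamforming` and `random_scenario` are hypothetical stand-ins for the online inference step and the scenario generator.

```python
import time

def average_prediction_time(predict_beamforming, random_scenario, trials=200):
    """Average online prediction run time over random initializations."""
    total = 0.0
    for _ in range(trials):
        scenario = random_scenario()         # random UE positions / channel gains
        start = time.perf_counter()
        predict_beamforming(scenario)        # one online beamforming prediction
        total += time.perf_counter() - start
    return total / trials
```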
Table 4 shows the average prediction run time of the different schemes under an SINR threshold of 0 dB and different numbers of reflecting elements per IRS in the dynamic UE scenario. The average prediction time of the alternating-based approaches increases with the number of reflecting elements, since these approaches obtain the passive beamforming within the joint design by solving the SDR sub-problem (P3), whose size grows directly with the number of reflecting elements. Table 4 also demonstrates that our proposed DDPG-based algorithm has the lowest complexity, around 0.01 s, because the number of hidden layers in the DDPG networks is designed to accommodate different numbers of deployed elements and thus remains unchanged as N per IRS grows. Moreover, as the number of reflecting elements rises, the number of sub-problems solved by the alternating approaches increases, since more iterations may be needed before convergence to a sub-optimal beamforming solution, while the DDPG-based algorithm remains unchanged and does not need to solve any sub-optimization problem. For example, at N = 10 elements, the AO approach consumes an average of 8.5039 s, while, at N = 20 elements, it consumes an average of 19.8383 s, equivalent to solving, on average, eight sub-problems. In comparison, the time consumed by the proposed DDPG-based approach remains around 0.01 s, which reflects its effectiveness in terms of implementation time.

6. Conclusions

In this work, we have investigated the joint optimization of the active and passive beamforming vectors for a single cellular network assisted by multiple IRSs, with the aim of minimizing the transmit power of the BS. Based on ML approaches, a new GRNN-based algorithm was developed for the static UE scenario, and a DDPG-based algorithm was proposed for the dynamic UE scenario. It was demonstrated that the proposed approaches can conduct real-time beamforming predictions under various channel conditions while maintaining performance according to the measured criteria. The hyper-parameters of both models were fine-tuned to provide robust stability and convergence and to mitigate overfitting in real-time predictions. Furthermore, the presented algorithms surpassed the traditional alternating optimization techniques in terms of both time complexity and optimized transmit power. Note that this work is restricted to a single-cell scenario; in future work, we will study large-scale systems in a multiple-cell scenario. Moreover, imperfect CSI and advanced IRS technologies such as STAR-RIS can also be considered.

Author Contributions

Conceptualization, Z.F.; Software, M.F.; Validation, M.S.A.; Formal analysis, M.F. and J.G.; Investigation, M.F.; Data curation, J.G.; Writing—original draft, M.F.; Writing—review and editing, J.G.; Visualization, M.S.A.; Supervision, Z.F.; Project administration, J.G.; Funding acquisition, Z.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by National Natural Science Foundation of China under Grant 62001029 and in part by Beijing Natural Science Foundation under Grant L202015.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Saad, W.; Bennis, M.; Chen, M. A vision of 6G wireless systems: Applications, trends, technologies, and open research problems. IEEE Netw. 2019, 34, 134–142.
2. Gong, S.; Lu, X.; Hoang, D.T.; Niyato, D.; Shu, L.; Kim, D.I.; Liang, Y.C. Toward smart wireless communications via intelligent reflecting surfaces: A contemporary survey. IEEE Commun. Surv. Tutor. 2020, 22, 2283–2314.
3. Di Renzo, M.; Debbah, M.; Phan-Huy, D.T.; Zappone, A.; Alouini, M.S.; Yuen, C.; Sciancalepore, V.; Alexandropoulos, G.C.; Hoydis, J.; Gacanin, H.; et al. Smart radio environments empowered by reconfigurable AI meta-surfaces: An idea whose time has come. EURASIP J. Wirel. Commun. Netw. 2019, 2019, 1–20.
4. Hu, S.; Rusek, F.; Edfors, O. Beyond massive MIMO: The potential of data transmission with large intelligent surfaces. IEEE Trans. Signal Process. 2018, 66, 2746–2758.
5. Wu, Q.; Zhang, R. Intelligent reflecting surface enhanced wireless network via joint active and passive beamforming. IEEE Trans. Wirel. Commun. 2019, 18, 5394–5409.
6. Fu, M.; Zhou, Y.; Shi, Y. Intelligent reflecting surface for downlink non-orthogonal multiple access networks. In Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6.
7. Huang, C.; Zappone, A.; Alexandropoulos, G.C.; Debbah, M.; Yuen, C. Reconfigurable intelligent surfaces for energy efficiency in wireless communication. IEEE Trans. Wirel. Commun. 2019, 18, 4157–4170.
8. Yu, X.; Xu, D.; Schober, R. MISO wireless communication systems via intelligent reflecting surfaces. In Proceedings of the 2019 IEEE/CIC International Conference on Communications in China (ICCC), Changchun, China, 11–13 August 2019; pp. 735–740.
9. Han, H.; Zhao, J.; Niyato, D.; Di Renzo, M.; Pham, Q.V. Intelligent reflecting surface aided network: Power control for physical-layer broadcasting. In Proceedings of the ICC 2020-2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–7.
10. Guo, H.; Liang, Y.C.; Chen, J.; Larsson, E.G. Weighted sum-rate maximization for reconfigurable intelligent surface aided wireless networks. IEEE Trans. Wirel. Commun. 2020, 19, 3064–3076.
11. Zhang, Q.; Saad, W.; Bennis, M. Millimeter wave communications with an intelligent reflector: Performance optimization and distributional reinforcement learning. IEEE Trans. Wirel. Commun. 2021, 21, 1836–1850.
12. Wang, P.; Fang, J.; Yuan, X.; Chen, Z.; Li, H. Intelligent reflecting surface-assisted millimeter wave communications: Joint active and passive precoding design. IEEE Trans. Veh. Technol. 2020, 69, 14960–14973.
13. Li, Z.; Hua, M.; Wang, Q.; Song, Q. Weighted sum-rate maximization for multi-IRS aided cooperative transmission. IEEE Wirel. Commun. Lett. 2020, 9, 1620–1624.
14. Zhao, J. Optimizations with intelligent reflecting surfaces (IRSs) in 6G wireless networks: Power control, quality of service, max-min fair beamforming for unicast, broadcast, and multicast with multi-antenna mobile users and multiple IRSs. arXiv 2019, arXiv:1908.03965.
15. Zappone, A.; Di Renzo, M.; Debbah, M. Wireless networks design in the era of deep learning: Model-based, AI-based, or both? IEEE Trans. Commun. 2019, 67, 7331–7376.
16. Liu, Y.; Liu, X.; Mu, X.; Hou, T.; Xu, J.; Di Renzo, M.; Al-Dhahir, N. Reconfigurable intelligent surfaces: Principles and opportunities. IEEE Commun. Surv. Tutor. 2021, 23, 1546–1577.
17. Elbir, A.M.; Papazafeiropoulos, A.; Kourtessis, P.; Chatzinotas, S. Deep channel learning for large intelligent surfaces aided mm-wave massive MIMO systems. IEEE Wirel. Commun. Lett. 2020, 9, 1447–1451.
18. Liu, S.; Gao, Z.; Zhang, J.; Di Renzo, M.; Alouini, M.S. Deep denoising neural network assisted compressive channel estimation for mmWave intelligent reflecting surfaces. IEEE Trans. Veh. Technol. 2020, 69, 9223–9228.
19. Huang, C.; Alexandropoulos, G.C.; Yuen, C.; Debbah, M. Indoor signal focusing with deep learning designed reconfigurable intelligent surfaces. In Proceedings of the 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, 2–5 July 2019; pp. 1–5.
20. Gao, J.; Zhong, C.; Chen, X.; Lin, H.; Zhang, Z. Unsupervised learning for passive beamforming. IEEE Commun. Lett. 2020, 24, 1052–1056.
21. Taha, A.; Alrabeiah, M.; Alkhateeb, A. Enabling large intelligent surfaces with compressive sensing and deep learning. IEEE Access 2021, 9, 44304–44321.
22. Khan, S.; Khan, K.S.; Haider, N.; Shin, S.Y. Deep-learning-aided detection for reconfigurable intelligent surfaces. arXiv 2019, arXiv:1910.09136.
23. Xu, G.; Zhang, N.; Xu, M.; Xu, Z.; Zhang, Q.; Song, Z. Outage probability and average BER of UAV-assisted dual-hop FSO communication with amplify-and-forward relaying. IEEE Trans. Veh. Technol. 2023, 1–16.
24. Cui, P.F.; Zhang, J.A.; Lu, W.J.; Guo, Y.J.; Zhu, H. Statistical sparse channel modeling for measured and simulated wireless temporal channels. IEEE Trans. Wirel. Commun. 2019, 18, 5868–5881.
25. Taha, A.; Zhang, Y.; Mismar, F.B.; Alkhateeb, A. Deep reinforcement learning for intelligent reflecting surfaces: Towards standalone operation. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020; pp. 1–5.
26. Huang, C.; Mo, R.; Yuen, C. Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning. IEEE J. Sel. Areas Commun. 2020, 38, 1839–1850.
27. Yang, H.; Xiong, Z.; Zhao, J.; Niyato, D.; Xiao, L.; Wu, Q. Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications. IEEE Trans. Wirel. Commun. 2020, 20, 375–388.
28. Chiang, M.; Hande, P.; Lan, T.; Tan, C.W. Power control in wireless cellular networks. Found. Trends® Netw. 2008, 2, 381–533.
29. Pan, C.; Ren, H.; Wang, K.; Elkashlan, M.; Nallanathan, A.; Wang, J.; Hanzo, L. Intelligent reflecting surface aided MIMO broadcasting for simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 2020, 38, 1719–1734.
30. Subrt, L.; Pechac, P. Intelligent walls as autonomous parts of smart indoor environments. IET Commun. 2012, 6, 1004–1010.
31. Wiesel, A.; Eldar, Y.C.; Shamai, S. Linear precoding via conic optimization for fixed MIMO receivers. IEEE Trans. Signal Process. 2005, 54, 161–176.
32. CVX: Matlab Software for Disciplined Convex Programming, Version 2.2. 2020. Available online: http://cvxr.com/cvx/ (accessed on 30 November 2022).
33. Wasserman, P.D. Advanced Methods in Neural Computing; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1993.
34. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.
35. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 2018.
36. Sur, S.N.; Singh, A.K.; Kandar, D.; Silva, A.; Nguyen, N.D. Intelligent reflecting surface assisted localization: Opportunities and challenges. Electronics 2022, 11, 1411.
37. Jiang, T.; Cheng, H.V.; Yu, W. Learning to reflect and to beamform for intelligent reflecting surface with implicit channel estimation. IEEE J. Sel. Areas Commun. 2021, 39, 1931–1945.
38. Fathy, M.; Abood, M.S.; Guo, J. A generalized neural network-based optimization for multiple IRSs-aided communication system. In Proceedings of the 2021 IEEE 21st International Conference on Communication Technology (ICCT), Tianjin, China, 13–16 October 2021; pp. 480–486.
Figure 1. Architecture of IRS-aided system.
Figure 2. The designed GRNN structure, where θ_l = diag(Θ_l) denotes the N phase shifts of the l-th IRS.
Figure 3. Designed model-based framework: (a) DRL framework with DDPG agent; (b) proposed DDPG actor–critic layers.
Figure 4. 2D deployment.
Figure 5. Measured MAE versus spread factor values for 10 folds.
Figure 6. Optimized total transmit power of BS versus the SINR threshold under different schemes.
Figure 7. Received rewards versus episode steps.
Figure 8. Average rewards versus episode steps under different learning rates.
Figure 9. Total transmit power of BS versus episode steps under different learning rates.
Figure 10. Average rewards versus episode steps under different target smoothing factors.
Figure 11. Average total transmit power of BS versus learning episodes.
Figure 12. Optimized total transmit power versus SINR threshold under different schemes.
Figure 13. Optimized transmit power versus the number of reflection units for each IRS under different schemes.
Table 1. Proposed DDPG hyper-parameter settings.
Hyper-Parameter                      Setting
Actor learning rate μ_a              0.001
Critic learning rate μ_c             0.001
Target smooth factor τ               0.001
Discount factor γ                    0.95
Mini-batch size N_B                  16
Experience replay buffer size C      200,000
Target update frequency U            1
Sampling time T_s                    1
Table 2. Measured training and validation performance.
SF      V_MAE     T_MAE        V_MSE     T_MSE
0.28    0.2308    2.5185e-6    0.0859    1.0752e-10
0.29    0.2305    7.6251e-6    0.0852    8.2084e-10
0.30    0.2303    2.1302e-5    0.0848    5.4173e-9
0.31    0.2303    5.5296e-5    0.0845    3.1258e-8
0.32    0.2303    1.3417e-4    0.0843    1.5922e-7
0.33    0.2304    3.0588e-4    0.0842    7.215e-7
0.34    0.2306    6.5791e-4    0.0842    2.9245e-6
Table 3. Average prediction run time (seconds) for different SINR thresholds in static UE scenario.
Approach        −10 dB     0 dB       10 dB
AO              10.6562    10.9562    12.9007
AO Heuristic    3.8671     3.9377     4.5948
GRNN-Based      0.2237     0.2366     0.2392
DDPG-Based      0.0091     0.0092     0.0092
Table 4. Average prediction run time (seconds) for different numbers of reflecting elements in dynamic UE case.
Approach        N = 10     N = 12     N = 14     N = 16     N = 18     N = 20
AO              8.5039     9.2905     11.0527    13.2385    15.3093    19.8383
AO Heuristic    5.5492     6.5612     6.5886     8.1265     9.9036     12.7005
DDPG-Based      0.0092     0.0094     0.0095     0.0096     0.0097     0.0098