1. Introduction
The rapid and continuous growth in the user population, coupled with the dramatic increase in demand per user, has significantly heightened the pressure on air time, turning the wireless channel into a bottleneck and making resource allocation one of the most critical challenges in wireless communication. Resource allocation in wireless communication typically refers to intra-technology scenarios, where resources must be fairly distributed among users utilizing the same technology. Resource allocation can also involve inter-technology scenarios, where the wireless channel is shared among different technologies and protocols, necessitating the distribution of resources across them. For example, the coexistence of LTE and Wi-Fi in unlicensed bands requires fair air time sharing between different protocols (e.g., [
1,
2]). Unlike inter-technology scenarios, where the challenge lies in distributing resources among users operating under different technologies and protocols with varying syntax and semantics, intra-technology scenarios involve users operating under the same protocol and following the same rules.
The need to manage the rapid expansion of services and applications like the Internet of Things (IoT) and machine-to-machine (M2M) communications, combined with the increasing complexity of global mobile data traffic, has pushed the broadband cellular network industry (5G and the upcoming 6G) to embrace machine learning (ML) and Artificial Intelligence (AI). This shift is particularly evident in the vision for 6G, which aims to support AI by design through a cost-efficient, scalable, and flexible architecture that can abstract and distribute complexities. Resource allocation, especially scheduling, is a prime candidate for leveraging these AI-based frameworks. This potential is amplified by the emergence of new use cases, such as private and enterprise 5G networks, which are expected to become even more prevalent with 6G and bring more demanding Quality of Service (QoS) requirements. In light of these developments, there is a growing need for practical, low-complexity schemes that can be distributed and implemented at the network edge. These solutions would be crucial in managing the increasing complexities of resource allocation in future cellular networks.
In cellular networks, scheduling is one of the base station’s primary tasks, determining which users receive resources like channels and power for each uplink and downlink data transmission opportunity. This function has been essential across all generations of cellular technology (e.g., [
3]). Typically, schedulers operate based on predefined performance criteria and goals, which may include maximizing overall data rates, minimizing average latency, or meeting specific deadline constraints (e.g., [
4]). Several traditional scheduling algorithms, such as round-robin (RR) and Weighted Fair Queuing (WFQ), have been adapted for wireless systems [
5]. However, in wireless networks, users are not homogeneous; different users not only have different capabilities but also experience different channel conditions due to factors like fading, multipath propagation, and interference. Thus, schedulers that are oblivious to the users’ channel conditions and avoid the burden (and overhead) of acquiring any such information often compromise performance. Accordingly, modern cellular networks (3G and beyond) have given rise to opportunistic schedulers that rely on channel state acquisition from users and take physical layer information into account [
4]. However, an approach that favors users based on their wireless channel quality and abilities can result in severe unfairness and even starvation of some users. To avoid such significant differences between the throughput attained by various users, a scheduler should adopt a notion of fairness. However, trying to maintain fairness among the users faces two obstacles. First, the notion of fairness is ambiguous and can assume different definitions, e.g., Max-Min fairness criterion, Jain’s fairness index, and Weighted Fair Queuing (WFQ) [
6]. Second, attaining fairness can come at the expense of overall throughput. For example, balancing the throughput attained by all users can result in severe sum throughput degradation, as a user with poor channel conditions can consume most of the resources. Accordingly, a scheduler should try to balance attained throughput and fairness, recognizing the potential tradeoff between throughput and fairness (e.g., [
7]).
One of the most widely adopted criteria for handling this tradeoff is proportional fairness (PF), which balances overall system throughput against fairness among users. The notion behind PF is that the resources allocated to each user are proportional to that user’s channel conditions and past throughput. Accordingly, users with better channel conditions receive more resources, yet not to the extent that it leads to significant imbalances. Formally, the proportional fair distribution is the one that maximizes $\sum_{k=1}^{K} \log(T_k)$, where $T_k$ denotes the throughput that user $k$ has received so far and $K$ denotes the total number of users. Note that the logarithmic function ensures that the utility gained from allocating additional resources to a user decreases as the user’s throughput increases, which helps allocate more resources to users with better channel conditions without starving users with poor channel conditions (more details on PF are given in Section 3.2).
Proportional fairness-based scheduling has been extensively researched, with numerous variations and implementations. Conventional schedulers typically maximize proportional fairness metrics for each scheduling instance, relying on current user channel states and cumulative throughput per user. Specifically, the cellular base station scheduler employs user feedback on current channel conditions and a rolling average of individual user throughput to assign the next resource block (time slot and subchannel). This allocation goes to the user that optimizes the proportional fairness metric for that particular scheduling instance (e.g., [
8,
9]). However, standard QoS requirements do not necessitate optimizing performance metrics at every single time instance. Instead, key performance indicators (KPIs) are usually assessed periodically rather than at each time slot. Consequently, there is potential to enhance proportional fairness performance at evaluation points by considering projected future channel conditions for users. This approach might involve postponing allocation for a user with suboptimal current channel conditions (despite maximizing the current proportional fairness metric) and waiting for improved channel conditions within the KPI cycle.
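To make the conventional per-slot rule concrete, the following minimal Python sketch (the exponential averaging factor and the rate units are illustrative assumptions, not parameters from this paper) selects, in each slot, the user maximizing the ratio of its instantaneous achievable rate to its rolling-average throughput, and then updates the averages:

```python
def pf_select(inst_rates, avg_tput, beta=0.02, eps=1e-9):
    """One greedy PF scheduling decision.

    inst_rates[k]: rate user k could achieve in the current slot (from its CQI report).
    avg_tput[k]:   exponentially weighted moving average of user k's past throughput.
    beta:          assumed averaging window factor (illustrative value).
    Returns the chosen user and the updated averages.
    """
    chosen = max(range(len(inst_rates)),
                 key=lambda k: inst_rates[k] / (avg_tput[k] + eps))
    new_avg = [(1 - beta) * avg_tput[k] + beta * (inst_rates[k] if k == chosen else 0.0)
               for k in range(len(inst_rates))]
    return chosen, new_avg
```

Under this rule the PF metric is maximized slot by slot; the RL-based scheme explored in this paper relaxes exactly this per-slot greediness.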
In this paper, we explore the potential gains of incorporating the reinforcement learning (RL) paradigm into the proportional fair (PF) downlink scheduler used in operational cellular networks. An RL-based scheduler can exploit the fact that performance in typical cellular networks is evaluated at the end of predefined cycles rather than after every individual time slot. It can anticipate future users’ channel conditions and delay the allocation of resources to a user, waiting for more favorable conditions. Accordingly, an RL-based PF scheduler allocates resources not only based on the average rate achieved so far and the current channel conditions of each user, as conventional schedulers do, but also relies on future channel state predictions.
We emphasize that the PF objective scheme proposed here is an illustrative example that can be extended to other scheduling objectives. For instance, the same scheme can be easily adapted to handle traffic with due-date constraints.
In this paper, our focus is on exploring the potential of leveraging RL methodology for scheduling purposes and understanding the mechanisms behind such an approach. Therefore, we rely on Q-learning, which is suited for small-scale setups but still provides valuable insights into the scheme.
We evaluated the proposed RL-PF algorithm through simulations, demonstrating its advantages over both the greedy PF scheduler and a round-robin-based scheduler (which allocates equal time shares to each user) across various network scenarios. Additionally, we implemented the RL-PF algorithm in an operational LTE base station, assessing its performance in scheduling users with different path losses. This served as a proof of concept for employing such a scheduler in a cellular base station.
Specifically, the main contributions of this study are as follows: (i) We highlight the significance of fairness criteria in scheduling by analyzing various channel quality indicators collected from a wide range of operational 4G deployments. (ii) We provide a formal definition of PF scheduling that evaluates performance at predetermined periodic intervals, occurring after a fixed number of time slots. (iii) We develop an RL-based downlink scheduler that forecasts and incorporates near-future user channel conditions. This scheduler allocates each resource block to the user expected to maximize the PF metric at the interval’s conclusion, utilizing Q-learning to understand individual user channel fluctuations and make appropriate user selections for each time slot. (iv) Using a Python-based simulation platform, we demonstrate that our algorithm outperforms the standard PF at the predefined time instances. (v) We implement the RL-PF algorithm in an operational LTE base station, proving its practicality and viability for real-world deployment.
This paper is organized as follows.
Section 2 reviews related work.
Section 3 provides a brief background on relevant topics discussed throughout this paper. In
Section 4, we present various channel quality indicators collected from multiple operational 4G deployments. The model is formalized in
Section 5. The RL-based scheduler is presented in
Section 6. The evaluation parts comprising simulation and implementation results are given in
Section 7 and
Section 8, respectively.
2. Related Work
In this study, we examine the utilization of the RL paradigm for a prescient scheduler that can learn and predict the channel states of its backlogged users and allocate resources based on these predictions. As a representative example, we consider the ubiquitous proportional fair scheduler (PFS). Accordingly, we cover related work on both aspects: we first discuss related work in the scope of ML, and the second part of this section reviews related work in the scope of schedulers. Since our representative example is a fairness-based scheduler, we mostly concentrate on fairness-metric-oriented schedulers, and in particular PF-based schedulers.
Reinforcement Learning Applications—The growing diversity and complexity of mobile network architectures have made monitoring and managing the multitude of network elements increasingly challenging. Consequently, embedding versatile machine learning-based solutions into future mobile networks is drawing unparalleled research interest. Reinforcement Learning (RL), a subset of ML, focuses on determining the actions needed to maximize a well-defined objective within an environment. Next-generation wireless technology will heavily rely on RL as it enables continuous, real-time optimization and decision-making in complex and dynamic network environments. Specifically, RL learns and adapts from interactions with the network, making autonomous decisions that enhance performance and resource allocation. This capability allows for adaptive management of network resources, efficient spectrum utilization, and proactive adjustments by predicting and addressing issues before they affect users. Accordingly, RL can significantly boost the performance of next-generation wireless networks in many aspects.
Over the last few years, numerous studies have incorporated RL into wireless communication protocols, demonstrating its effectiveness in various applications such as dynamic spectrum access, power control, load balancing, and more. This showcases RL’s potential to revolutionize wireless communication. For instance, recent examples include a survey on RL-based medium access control (MAC) protocols and methodologies for applying RL within the context of the MAC reference model [
10]. Additionally, a deep reinforcement learning-based decentralized resource allocation scheme is suggested in [
11] for the vehicle-to-everything (V2X) communication network, where vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) networks share resources. This scheme utilizes a deep Q-network (DQN) to determine the resource blocks (RBs) and transmit power of vehicles in the V2V network.
Current and next-generation wireless communication systems are characterized by high diversity, heterogeneity, and density. This complexity leads to resource scarcity, necessitating sophisticated resource allocation mechanisms. RL has emerged as a fertile field for innovations in addressing these challenges. Ref. [
12] focuses on optimizing system throughput and fairness in an underlay device-to-device (D2D) communication system using device-centric joint spectrum and power allocation schemes. The study utilizes a distributed RL-based iterative resource allocation scheme that overcomes the unknown and dynamic nature of channel parameters by allowing the system to learn about the environment through trial and error, without requiring any prior knowledge. Ref. [
13] studies the problem of energy-saving optimization in 5G base stations (BSs) by implementing multi-sleep mode (SM) levels. The paper investigates the tradeoff between energy saving and QoS and suggests a distributed Q-learning algorithm that can learn the best policy to control the activity of the small base stations (SBSs) in order to minimize energy consumption and data dropping. Ref. [
14] investigates energy-efficient resource allocation and power control techniques in 5G networks. The paper presents a 5G base station (BS) sleep mode management based on reinforcement learning that balances a set of metrics to find the best tradeoff between energy reduction and QoS constraints.
A literature review on deep reinforcement learning (DRL) techniques and applications used for energy optimization and resource allocation in future wireless and mobile networks is provided in [
15]. Ref. [
16] proposes an adaptive DRL-based resource allocation scheme to jointly allocate downlink power and bandwidth, thus optimizing total throughput while considering the impact of power-consumption outage problems. Ref. [
17] investigates user association in multi-unmanned aerial vehicles (UAVs)-assisted cellular networks, allowing users to connect with different UAVs or macro base stations for uplink (UL) and downlink (DL) transmissions. The paper formulates the joint optimization problem for UL-DL association and beamforming as a partially observable Markov decision process (POMDP) and employs multi-agent deep reinforcement learning (MADRL) for distributed optimization. Ref. [
18] formulates the problem of maximizing the sum rate and fairness for non-orthogonal multiple access (NOMA)-based cellular users as a mixed-integer nonlinear programming (MINLP) problem, and a learning framework was employed to address it. Ref. [
19] utilizes the multi-agent reinforcement learning (MARL) paradigm to develop distributed strategies to reduce processing complexity and energy use in large-scale multiple-input multiple-output (XL-MIMO) systems. In their multi-agent cell-free XL-MIMO system, base stations (BSs), user equipment (UE), and antennas are viewed as agents that learn to allocate resources and transmit signals efficiently. Two use cases of the MARL-aided multi-agent cell-free XL-MIMO system were studied: antenna selection and power control.
Additional state-of-the-art reinforcement learning (RL) algorithms related to resource allocation have been explored in various studies. In [
20], researchers investigate RL for downlink inter-cell interference mitigation through power control and rate adaptation. They propose an RL framework to develop an efficient power control policy that optimizes the downlink transmission power of LTE cells, aiming to maximize network-wide utility for users. Their RL state representation includes LTE cell measurements such as cell downlink transmission power, average Reference Signal Received Power (RSRP), and average user interference within each cell. Actions based on this representation adjust cell transmission power, with a focus on maximizing utility functions such as the average sum log-throughput across cells. In [
21], the authors design and implement an RL testbed for Multi-User MIMO (MU-MIMO) user selection in IEEE 802.11ax. Their approach applies Q-learning with the $\epsilon$-greedy policy [
22] to select user groups for scheduling. In [
23], a Q-learning algorithm for allocating RBs to users is devised. The algorithm relies on the scheduler receiving CQI feedback from all users before each RB allocation, assigning the RB to the user that maximizes Jain’s Fairness Index (JFI) [
24].
In [
25], the authors propose a multi-agent Q-learning algorithm for joint power and resource allocation in 5G networks. In this algorithm, each general eNodeB (gNodeB) performs joint power and resource block allocation at every scheduling interval. The states and rewards are formulated to consider two critical factors: inter-cell interference and queuing delay. The authors in [
26] use a deep reinforcement learning (DRL) approach to optimize resource allocation in OFDMA systems. Specifically, they propose a centralized DRL-based scheduling scheme that is traffic- and channel-aware. This scheme relies on real-time network traffic and channel conditions to manage downlink OFDMA time-frequency resources at a wireless access point. In [
27], a DRL-based resource allocation approach is developed for ultra-dense networks (UDNs) in the presence of very limited channel state information (CSI). The multi-objective problem that considers the tradeoff between spectrum efficiency (SE), energy efficiency (EE), and fairness is decomposed into two components. Specifically, the deep neural network is designed to maximize SE, while the EE and fairness considerations are incorporated as rewards to guide the training process. Ref. [
28] explores the use of a DRL framework to maximize user fairness in terms of delay. Ref. [
29] proposes a 5G downlink scheduling mechanism that employs Q-learning techniques to improve wireless transmission quality and efficiently manage the radio resources of a BS. The mechanism aims to enhance various Quality of Service (QoS) parameters, including throughput, delay, and fairness, as measured by Jain’s fairness index. Ref. [
30] presents a novel method to integrate fairness into actor-critic reinforcement learning for network optimization tasks. Their algorithm adapts the reward system by using a weighting factor that accounts for both the structure of the fairness utility function and historical reward statistics. This approach ensures that the rewards are adjusted dynamically based on these two elements.
Scheduling in Wireless Communications— The scheduler in a cellular-based network plays a fundamental role; it is the “brain” of the whole system [
3]. Traditionally, its basic function is to allocate resources to the connected users, taking into account QoS and channel conditions. Different schedulers consider different utility metrics, e.g., maximum throughput and various types of fairness (proportional, equal share, Max-Min, etc.), and therefore network performance is influenced by the chosen objective. Scheduling algorithms commonly adopted in deployed wireless systems disregarded users’ channels for the sake of simplicity, e.g., round-robin (RR), which serves users in a circular manner without any other consideration [
31], and Weighted Fair Queuing (WFQ), which allocates resources with respect to weights that are associated with every user [
32]. In contrast, 4G and 5G cellular networks have given rise to schedulers that take advantage of physical layer information, such as user channel states; such schedulers are termed opportunistic schedulers. The proportional fair scheduler [33] is one of the most popular channel-aware schedulers implemented in today’s cellular systems and, as such, has been the subject of extensive work.
Opportunistic schedulers take into account information such as the users’ channel conditions and QoS, for the purpose of allocating resources for each connected user. Opportunistic scheduling was first coined in [
34], showing that by utilizing multi-user diversity, a significant improvement in capacity can be made. Their approach was to schedule the user (or users) with the best channel state in any given time slot. The work of [
35] was continued and improved in [
34] by considering delay aspects and addressing the issue of starvation. Best Quality Indicator (BCQI) [
36] is a simple LTE scheduling algorithm that aims to maximize the cumulative throughput by assigning the network resources to users that report the best channel quality in the current time slot. Max-Weight scheduler [
37] selects the user with the highest product of queue length and transmission rate. The algorithm was considered throughput-optimal until [
38] proved otherwise under flow level dynamics. However, Max-Weight is still throughput-optimal with fully backlogged queues. We refer the interested reader to [
3,
4] for more details on the different scheduling algorithms. As mentioned, the aforementioned scheduling solutions are throughput-optimized; however, optimizing cell capacity comes at the cost of unfair resource allocation to users that experience poor channel quality.
Since opportunistic schedulers can favor certain users over substantial time intervals, fairness is a major concern in communication systems; e.g., scheduling users according to their channel conditions might lead to starvation of users with poor channel quality, which is not only unfair but can lead users to unsubscribe from the service provider. The notion of fairness is ambiguous (i.e., what constitutes a fair allocation and how it can be measured), and different metrics have been suggested over the years to define fairness, e.g., Jain’s index, Max-Min, proportional fairness, and more.
Jain’s index is a widely known metric which we utilize in our simulations (
Section 7). This is a quantitative measure, first proposed by [
24], which aims to quantify the fairness of resource allocation among traffic flows. Jain’s index values range between one divided by the number of flows and one, where a value of one indicates maximal fairness. Such a metric, however, disregards the heterogeneity of the users and their channel conditions, and tries to balance the rates attained by all users, irrespective of their capabilities. Max-Min fairness is a qualitative measure of fairness that has been studied widely and implemented in various applications [
39]; its main motivation is that all users should attain the same performance (e.g., throughput). Specifically, a system reaches Max-Min fairness if no individual’s share can be increased without decreasing the share of another individual with an equal or smaller one. Accordingly, a Max-Min fair allocation can assign most of the resources to users with very poor capabilities (e.g., poor channel conditions), sacrificing overall performance in order to improve the performance of degraded users.
One of the most diffused opportunistic approaches with fairness constraints is the proportional fair scheduler (PFS), first introduced by Kelly in [
8]. A set of attainable throughput (rate) allocations $\{x_n\}_{n \in N}$ is proportionally fair if it maximizes $\sum_{n \in N} \log(x_n)$, where $N$ is the set of users. Later, [
9] adapted the proportional fairness criteria and suggested a scheduling algorithm that assigns priorities to users based on the ratio of two components: the first component accounts for the rate potentially achievable in the current time slot, while the second component accounts for the historical average of the user’s throughput. Such an algorithm was proven to maximize the logarithmic sum of capacities. The work of [
40] suggested that [
9] does not fully address the problem of starvation because of the constant averaging window size of the UE rate, which determines how long a UE can be starved; for example, when a user experiences sudden drops in channel quality, the PFS may starve that user for a certain time period. Instead of using a constant averaging window, they proposed updating the average UE rate by monitoring the UEs’ starvation periods, giving rise to much more starvation-aware scheduling.
In [
41], four PF scheduling algorithms were introduced that are based on the PF algorithm for handling cases where users are not fully backlogged and a user can be in an idle or active state. The first two algorithms only update the average throughput of active users, instead of updating average throughput for active and idle users. The third algorithm suggests a different approach for a user’s throughput update, according to the status of the user in the previous slot, i.e., active or idle, and the last time the user was scheduled. The fourth algorithm dynamically changes the averaging time window with respect to the backlog of users. Ref. [
42] extended PF scheduling to multi-carrier transmission systems, proposing an assignment of users to each carrier while maximizing the sum of logarithmic average user rates.
PF channel-aware schedulers, such as those proposed by [
43], rely on channel stationarity and achieve fairness by employing a moving average scheme. At each time slot, the channel is allocated to the UE that maximizes the average PF metric at the end of the current slot. These schedulers achieve fairness over long time intervals and can even converge to the optimal fair allocation as time approaches infinity (see [
33]). However, typical operational networks measure performance and report key performance indicators (KPIs) periodically, after finite and predefined time intervals. Consequently, “greedy” schedulers that maximize fairness at every time slot may compromise overall performance. This is because it may sometimes be beneficial to postpone scheduling a user who maximizes the current fairness metric within the KPI time interval, waiting for that user to experience better channel conditions.
A strategy that allows the postponement of an allocation to a deserving UE, waiting for better channel conditions, does not ensure fairness after each allocation. However, such a strategy can boost overall performance at the end of each KPI time interval. To deploy this strategy effectively, a mechanism for predicting users’ channel conditions is required. Analysis of the gains from knowing future channel conditions is proposed in [
44]. Specifically, the authors assume that future channel states for all users are known a priori (i.e., accurate estimates of the users’ attainable data rates for the following $L$ time slots are available to the scheduler at each time slot). They compute the optimal allocations for the finite $L$ time slots and compare these to the PFS greedy allocation.
The authors of [
45] extend the single-channel prediction-based PFS scheme suggested in [
44] to evaluate the effects of considering predicted data rates on the performance of multi-carrier networks. Specifically, ref. [
45] investigated a multiple subchannel model in which the rates each user can attain on each channel are known a priori, and the scheduler determines which user is allocated to each subchannel (a user may be assigned more than one subchannel). The authors show that the resulting optimization problem is a complex combinatorial problem in both the time and frequency dimensions and propose two suboptimal algorithms.
4. Operational 4G Deployments’ Statistics
In this study, we propose an opportunistic RL-based scheduler that considers both current and anticipated future user channel quality. The advantages of the proposed scheme lie in two key aspects: its opportunistic approach to resource allocation and its ability to predict users’ channel states. While our focus is on fairness, we also recognize the applicability of these aspects to a broad range of objectives, such as due date-constrained traffic and QoS guarantees.
To understand the significance of these aspects, particularly the potential benefits of opportunistic schedulers, we analyze channel statistics collected from various operational 4G deployments from two perspectives: intra-cell and inter-cell UEs’ channel quality variability. Intra-cell channel quality variability highlights the differences in channel quality experienced by UEs within the same cell, thereby justifying the need for opportunistic resource allocation schemes that adapt to the varying channel qualities of UEs. Additionally, considering that the channel quality of each UE may fluctuate over time due to factors like user movement or environmental changes (e.g., movement of other UEs in the vicinity), this underscores the predictive nature of the proposed scheme. Inter-cell channel quality variability, on the other hand, underscores the diversity of the cell environments themselves and the variability in UE distribution within cells, emphasizing the importance of a distributed learning-based scheduling scheme conducted independently by each cell.
We utilized two datasets to illustrate the high diversity in users’ channel quality characteristics, which are correlated with the users’ attainable rates. The first dataset comprises data gathered by a 4G vendor from 90 operational 4G deployments across diverse locations in the United States, including both urban and rural environments. The second dataset consists of data collected from various 4G operators in the states of New York and Florida.
We examine three common KPIs: (i) Channel quality indicator (CQI): This KPI maps the users’ channel condition into a quantized discrete value between 0 and 15 (further details on CQI can be found in
Appendix A.1). The same CQI values were used in both the simulation and implementation. (ii) RSRP is the average of the base station’s reference signals measured by the users, indicating the strength of the base station signal to a user. The farther the distance between the user and the base station, the lower the RSRP value received, and vice versa. We also compare RSRP with SNR in rural and urban deployments to provide further insights into how the environment type affects users’ channel conditions. (iii) Average resource utilization: Measured in physical resource blocks (PRBs), this KPI is assessed for both urban and rural deployments (further information on LTE resource allocation can be found in
Appendix A.1).
We note that the schedulers deployed in the base stations that collected the data are PF-based, similar to those used in our simulation and implementation results for comparison with the proposed scheme. Additionally, the users’ channel characteristics obtained in this subsection are utilized in our simulations and implementation (
Section 7 and
Section 8, respectively).
Figure 1 depicts the average number of CQI reports collected for each CQI value during one day, averaged over the inspected cells, i.e., all measurements collected were divided by the total number of cells.
Figure 1 confirms the diversity in channel quality attained by users (recall that the measurements were taken throughout one day, hence the diversity shown encompasses both a variety of users and users whose channel state changed over time). As seen in the figure, even though most measurements reported good channel conditions and only a few reported inferior ones, a substantial number of measurements reported moderate channel conditions. Assuming that this distribution roughly reflects the temporal CQI distribution of backlogged users at a given time of day, Figure 1 strongly motivates the use of a fairness metric that balances maximizing the attained users’ sum rate with providing each user a reasonable share of the resources. The observation that the diversity also stems from users whose channel conditions vary over time motivates the use of an RL-based scheduler. The proportional fair scheduler, which we utilize in this study, is a channel-aware scheduler that exploits users’ channel diversity for a better resource allocation policy and aims to balance the overall attainable rate and fairness: it gives more RBs to users with high CQI values, yet allocates sufficient RBs to low-CQI users so that each user’s share preserves the ratio between the users’ average achievable rates. Note that the high standard deviation shown in the figure, representing the high variability, further emphasizes the need for an opportunistic channel-aware scheduler leveraging a learning-based solution.
Reference Signal Received Power (RSRP) is a key performance indicator (KPI) that assesses the average power of the reference signals received by the UE across a specific bandwidth. It is measured in dBm and typically ranges from about −140 dBm (very weak signal) to −44 dBm (very strong signal). The RSRP reports collected at urban cells and rural cells are depicted in
Figure 2a and
Figure 2b, respectively.
As can be seen in the figures, in both types of deployments the received signal strength is widely dispersed, ranging from very poor to excellent RSRP values. The average RSRP values for the two kinds of deployments differ significantly: −97 dBm for urban cells versus −104 dBm for rural cells. This difference can be explained by the characteristics of the different deployments. In a more urban environment, more base stations are deployed, and therefore users are mostly closer to their associated base stations. In contrast, in a rural area, fewer users are connected and fewer base stations are deployed, so each base station serves users who are farther away. Moreover, the RSRP distribution (especially in the rural deployment) appears right-skewed. This can be explained by the fact that most users in such deployments are in low-to-medium coverage areas relative to their base stations, while a few users are in excellent coverage areas, extending the right tail toward high RSRP values.
Overall, the RSRP measurements depicted in
Figure 2 further support the CQI measurements in demonstrating the high diversity in signal path loss experienced by UEs, which indicates considerable differences in the received signal strength in both types of deployments.
Next, rather than averaging over a large number of cells, which exhibits significant variance between the different cells, we focus on specific zones to illustrate the variability within a zone. We distinguish between urban and rural deployments.
Figure 3 depicts a heat map indicating RSRP and SINR in the urban region of New York and the more rural region of Florida. The heat maps are divided into different zones according to zip codes. Specifically,
Figure 3a and
Figure 3c depict the RSRP of the urban region of New York and the rural region of Florida, respectively. We defined an RSRP threshold (in dBm), taken from lab results, that indicates a user with poor coverage, i.e., a user with a low CQI (a CQI that corresponds to QPSK modulation), and colored each zone based on the percentage of RSRP reports that fell below this threshold. For example, dark green indicates that less than 5% of the RSRP reports in the zone were below the threshold, whereas zones marked in red are zones in which more than 20% of the collected RSRP reports were below the threshold. Similarly, Figure 3b and Figure 3d show the average SINR reports in the urban region of New York and the rural region of Florida, respectively, where the SINR threshold was set to 2 [dB], i.e., each zone is colored according to the percentage of SINR reports in that zone that are below 2 [dB].
The RSRP heat maps depicted in
Figure 3a,c confirm the assertion that the received signal strength in an urban environment is expected to be stronger than in a rural one. In contrast,
Figure 3b,d indicate that in the urban environment there is a large number of zones where users suffer from poor SINR conditions even though the reported RSRP is mostly good, whereas in the rural area of Florida an inverse effect is observed: more zones are associated with poor RSRP, yet most users experience good SINR. This seemingly contradictory observation can be explained by the fact that, despite the higher received signal strength (RSRP) in an urban environment, urban users are more prone to inter-cell interference and therefore have lower SINR, whereas rural users receive weaker signals but experience less interference. Counterintuitively, users in the urban environment may therefore experience worse channel conditions (in terms of SINR) than users in the rural environment.
Last, we examine the daily average system resource utilization per scheduling time unit in both rural and urban deployments. User resource allocations are measured in PRBs, which are the basic time-frequency resource units that can be allocated to a UE. The number of PRBs available in the system per scheduling time unit is determined by the LTE base station’s bandwidth. At each scheduling epoch, the available PRBs are distributed among the selected UEs. The rate, and therefore the number of bytes attained by each scheduled UE per PRB, depends on its channel conditions. The number of PRBs allocated to each scheduled UE is determined by its target number of bytes and its channel condition. A user in poor channel conditions will require more PRBs to support the required number of bits, and vice versa. Therefore, a higher average PRB utilization can be attributed to a higher number of UEs with poor channel conditions and to a higher number of UEs scheduled at each scheduling epoch.
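As a toy illustration of this relation (the bytes-per-PRB mapping below is hypothetical and stands in for the actual MCS/ITBS tables), the same payload translates into very different PRB counts depending on channel quality:

```python
import math

# Hypothetical bytes deliverable per PRB as a function of CQI; the real LTE
# mapping goes through the MCS/ITBS tables, so these values are illustrative only.
BYTES_PER_PRB = {3: 16, 9: 60, 15: 120}

def prbs_needed(target_bytes, cqi):
    """PRBs required to deliver target_bytes to a UE reporting the given CQI."""
    return math.ceil(target_bytes / BYTES_PER_PRB[cqi])

# The same 1200-byte payload costs 75 PRBs at CQI 3, 20 at CQI 9, and 10 at CQI 15.
print(prbs_needed(1200, 3), prbs_needed(1200, 9), prbs_needed(1200, 15))
```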
The measurements show that the average number of PRBs allocated to each UE in an urban deployment is approximately twice that in its rural counterpart. These results can be attributed to the observation that more UEs in an urban environment experience poor channel conditions than in the rural counterpart, as can be seen in Figure 3b,d. They may also be correlated with the typically higher population density in urban environments compared to rural ones.
In this section, we illustrated the high variability in users’ channel quality, which strongly motivates an efficient fairness metric that balances maximizing throughput with giving each user a substantial share of the available resources: one that distributes resources in proportion to each user’s capabilities, yet allows every user to run its applications satisfactorily.
6. Reinforcement Learning Proportional Fairness-Based Scheduler
In this section, we present the RL-PF-based scheduler. The motivation for an RL-based scheduler relies on the understanding that since the performance, and specifically the proportional fairness criterion (
2), is measured only at the end of each cycle, it is sometimes beneficial to delay an allocation for a user, waiting for better channel conditions within the same cycle. We illustrate this with a simple example depicted in
Figure 4.
Assume a BS serving two users ($K = 2$) and that each cycle comprises ten time slots ($L = 10$). The blue and orange curves in
Figure 4a,b depict the attained rates by users UE-1 and UE-2, respectively, over an exemplary cycle. The asterisks in
Figure 4a depict the scheduled user at each time slot according to the greedy PF algorithm described in
Section 3.2. For example, the first slot is allocated to the user that maximizes the current PF criterion, which is UE-1 (since both users have received an equal share so far, zero bytes, the user with the better channel quality is given the allocation); similarly, the second slot is allocated to UE-2, which has received nothing so far in the cycle and hence contributes the most to the PF criterion, and so on.
Figure 4b depicts the allocations that are based on a scheduler that aims at maximizing the proportional fairness criteria at the end of the cycle, hence postpones the allocation for UE-2, waiting for it to attain better channel conditions. The blue and orange curves in
Figure 4c depict the sum log throughput of the two users (Equation (
2)) prior to every time slot according to the greedy PF algorithm and the delay-tolerant PF algorithm, respectively. Time slot $j$ on the x-axis corresponds to the sum log throughput at the end of slot $j-1$ and the beginning of slot $j$, just before the scheduler selects the UE to be allocated slot $j$ according to the two schedulers. As can be seen in the example, allocating slots 2 and 4–8 to UE-2 by the greedy PF scheduler provides a better instantaneous gain in (2) at the end of the first 8 slots of the 10-slot cycle; however, allocating it slots 9–10, when it has better channel conditions, attains a higher PF utility at the end of the cycle. Additional motivation can be found in [
44], which shows that the availability of future channel estimations could help improve the proportional fairness criterion.
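To make this intuition concrete, the following Python sketch compares, for hypothetical two-user rate curves loosely mimicking the Figure 4 scenario, the per-slot greedy PF schedule with the best end-of-cycle schedule found by exhaustive search:

```python
import math
from itertools import product

# Hypothetical per-slot rates (Mb per slot) over one 10-slot cycle: UE-1 is
# steadily good, UE-2 improves late in the cycle (illustrative values only).
rates = [
    [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],  # UE-1
    [2,  2,  2,  2,  2,  2,  2,  2,  9,  9],   # UE-2
]
L, K, EPS = 10, 2, 1e-6  # EPS avoids log(0) for a not-yet-served user

def pf(acc):
    """Sum-log (PF) utility of the accumulated throughputs."""
    return sum(math.log(t + EPS) for t in acc)

def accumulate(schedule):
    """Cumulative throughput per user for a slot -> user assignment."""
    acc = [0.0] * K
    for n, k in enumerate(schedule):
        acc[k] += rates[k][n]
    return acc

# Greedy PF: in every slot pick the user that maximizes the PF metric *now*.
acc, greedy = [0.0] * K, []
for n in range(L):
    k = max(range(K),
            key=lambda u: pf([acc[i] + (rates[i][n] if i == u else 0.0)
                              for i in range(K)]))
    acc[k] += rates[k][n]
    greedy.append(k)

# Delay-tolerant benchmark: exhaustive search for the best *end-of-cycle* PF
# value (feasible only for this toy size of 2^10 candidate schedules).
best = max(product(range(K), repeat=L), key=lambda s: pf(accumulate(s)))

print("greedy:", greedy, "PF =", round(pf(accumulate(greedy)), 2))
print("best  :", list(best), "PF =", round(pf(accumulate(best)), 2))
```

In this toy run the greedy policy finishes the cycle with a lower sum-log value than the delay-tolerant schedule, which reserves UE-2’s allocations for its two good slots.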
Maximization of the cycle-end PF objective, as defined in (5), encounters two main challenges: (i) the scheduler (BS) needs to know all users’ channel conditions for the entire cycle at the beginning of the cycle, which necessitates users’ channel prediction, and (ii) attaining optimality requires an exhaustive search over all users’ scheduling possibilities. To address these challenges, we employ a Reinforcement Learning (RL) scheme characterized by four elements: agent, environment (states), actions, and rewards (see Section 3.1). The agent, in our case the BS or, more specifically, the scheduling algorithm deployed at the BS, determines to which user to allocate each resource block. The environment encompasses the current channel states of all the users. The action space, which determines which user will be allocated the next resource block, comprises all users, and the reward is the sum log throughput attained only at the end of the cycle. The policy, which is the aim of the suggested scheme, defines the agent’s strategy for associating actions with states. In the following, we formalize the main components of the PF scheduling setup:
The state space is represented by a vector with $K$ elements, where $K$ is the number of users. Each entry in the vector consists of a 2-tuple $(T_i^n, q_i^n)$, where $T_i^n$ is the accumulated throughput that user $i$ has achieved up to the start of time slot $n$ in the current cycle, and $q_i^n$ indicates the channel quality of user $i$ at the beginning of time slot $n$. This structure results in a distinct set of states for each time slot within the cycle. Specifically, denoting by $\mathcal{S}_n$ the set of states associated with the beginning of time slot $n$, we have $s_n = \big((T_1^n, q_1^n), (T_2^n, q_2^n), \ldots, (T_K^n, q_K^n)\big) \in \mathcal{S}_n$.
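For concreteness, one possible (purely illustrative) Python encoding of such a state is:

```python
from typing import NamedTuple, Tuple

class UserState(NamedTuple):
    acc_throughput: float  # throughput accumulated so far in the current cycle
    cqi: int               # reported channel quality at the start of the slot

# The per-slot system state is simply one such tuple per user.
State = Tuple[UserState, ...]

# Example: two users at the start of a cycle (values are illustrative).
s1: State = (UserState(0.0, 12), UserState(0.0, 5))
```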
The action space, denoted by $\mathcal{A}$, is the set of users; an action taken at the beginning of time slot $n$ selects the specific user that will be allocated the entire bandwidth in that slot. The key challenge lies in the fact that the action (choosing which user to schedule next) taken at each time slot must be evaluated based on future rewards (delayed rewards). It is important to note that in this study we deal with a deterministic action space, meaning the scheduler must choose a specific user to assign the resource block to. This contrasts with a stochastic action space, where the scheduler can allocate time slot $n$ to different users according to some probability distribution, i.e., the time slot is allocated to each user $i$ with a probability that is not necessarily 0 or 1.
The reward, denoted by $R$, is given only at the end of each scheduling cycle and is computed based on the proportional fairness criterion as formulated in Equation (5), i.e., $R = \sum_{k=1}^{K} \log(T_k)$, where $T_k$ is the throughput accumulated by user $k$ over the cycle. An important characteristic of this reward structure is that intermediate actions do not yield instantaneous rewards. Instead, the reward is determined only once the entire cycle has been completed.
In this paper, we focus on understanding the potential gains of deploying RL for the PF-based scheduler, and we leave the challenge of handling an enormous state space (e.g., large user populations, high number of supported rates, etc.) for future work. Accordingly, we utilize Q-learning, a model-free RL algorithm that learns the value of each action for each state.
We employ the epsilon-greedy approach for our scheduling strategy. In this method, the scheduler follows the learned policy for the current state with probability $1-\epsilon$ or, with probability $\epsilon$, deviates from this policy to explore various actions. Specifically, when exploring, it selects a user randomly and allocates the resource block (RB) to that user (e.g., [52]). After each scheduling decision at time slot $n$, we calculate and store a Q-value in the Q-table. This value, denoted by $Q(s_n, a_n)$, corresponds to the current state–action pair $(s_n, a_n)$. The Q-value is updated based on the previous Q-value and the maximal Q-value of the next state multiplied by the discount factor, scaled by the learning rate. The specific formula for this update is detailed in Section 3.1.
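A minimal tabular sketch of this selection-and-update step (the hyperparameter values and the state encoding are illustrative assumptions, not the ones used in the paper) is:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # assumed learning rate, discount, exploration
Q = defaultdict(float)                    # maps (state, action) -> Q-value, default 0

def select_ue(state, num_users):
    """Epsilon-greedy selection of the UE to receive the next resource block."""
    if random.random() < EPSILON:
        return random.randrange(num_users)                      # explore
    return max(range(num_users), key=lambda a: Q[(state, a)])   # exploit

def update_q(state, action, reward, next_state, num_users, end_of_cycle):
    """Standard Q-learning update; mid-cycle rewards are zero, and the cycle-end
    PF reward propagates backwards through the discounted max term."""
    future = 0.0 if end_of_cycle else max(Q[(next_state, a)] for a in range(num_users))
    Q[(state, action)] += ALPHA * (reward + GAMMA * future - Q[(state, action)])
```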
Algorithm 1 depicts the pseudo-code of the proposed Q-learning-based scheduling algorithm.
Algorithm 1 Q-Learning PF-Scheduler
 1: Initialize Q-table
 2: while Training do
 3:     Update state s_n with the UEs’ CQI reports
 4:     if s_n not in Q-table then
 5:         Add s_n to Q-table
 6:     end if
 7:     With probability ε select a random UE to schedule
 8:     Otherwise select the UE k = argmax_a Q(s_n, a)
 9:     Allocate DL transmissions to UE-k
10:     Update UE-k throughput with the corresponding rate: T_k ← T_k + r_k(n)
11:     if n < L then
12:         Proceed to the next time slot
13:         Receive CQI reports
14:         Set reward R ← 0
15:         Transit to next state s_{n+1}
16:     else if n = L then
17:         Calculate reward R = Σ_{k=1}^{K} log(T_k)
18:         Initialize the next cycle: n ← 1, T_k ← 0 for all k
19:     end if
20:     Update Q(s_n, a_n) ← Q(s_n, a_n) + α [R + γ max_a Q(s_{n+1}, a) − Q(s_n, a_n)]
21: end while
The Q-learning scheduler algorithm we implemented consists of the following key components:
User Equipment (UE) Selection (lines 7–9): At each time slot, the algorithm decides which UE to schedule using the exploration/exploitation policy. It either chooses a random UE (exploration) or selects the UE that maximizes the learned reward (exploitation).
State Update (lines 10–15): After downlink (DL) transmission to the chosen UE, the algorithm updates the next state with the observed throughput and new channel quality indicator (CQI) reports.
Reward Calculation (not explicitly shown): The reward is calculated at the end of the scheduling cycle. There are no mid-cycle reward calculations, so the reward is zero during the cycle.
Next Cycle Initialization (lines 16–18): These lines depict the state initialization for the subsequent cycle.
Q-value Update (line 20): The current Q-value for time slot n is updated using the current reward and the difference between the maximum possible discounted future Q-value and the current Q-value. This adjustment steers the Q-value towards the maximum reward at the cycle’s end.
It is important to note that the reward calculated at the end of the cycle backward-propagates to all states involved in that cycle, influencing their Q-values. This structure allows the algorithm to learn and improve its scheduling decisions over time based on the outcomes of its choices.
The computational complexity of the BS for each update includes action selection, action execution, and the Q-value update. In an exploration iteration, the BS randomly chooses which user to schedule next, with a time complexity of $O(1)$ (line 7 in Algorithm 1). An exploitation iteration involves a lookup in the Q-table at the index of the current state and choosing the action with the highest value. Since the BS can choose between $N$ possible actions ($N$ users to choose from), it needs to perform $N-1$ comparisons to find the greatest Q-value among the $N$ users, resulting in $O(N)$ operations (line 8 in Algorithm 1). The Q-table update for both exploration and exploitation involves updating a single Q-value according to the Q-value equation (line 20 in Algorithm 1). To update the Q-value, the BS first updates the accumulated attained rate of the scheduled user by adding the rate attained in the last slot to the aggregate rate received by the user so far in the current cycle (line 10 in Algorithm 1), which requires a single arithmetic operation. It then retrieves the previous Q-value and the Q-values of the new state $s_{n+1}$, with time complexities of $O(1)$ and $O(N)$, respectively. Note that the new state is determined based on the channel states collected from the users (line 13 in Algorithm 1). The max operation to find $\max_a Q(s_{n+1}, a)$ takes $O(N)$ time. The arithmetic operations, which include the temporal difference calculation and the update of $Q(s_n, a_n)$, are $O(1)$ each. Overall, the time complexity per Q-learning update is $O(N)$ and comprises the table lookups, the comparisons over the $N$ actions, and a handful of arithmetic operations. Note that a reward is given only at the end of the cycle ($n = L$). Accordingly, at the end of each cycle, the BS needs to calculate the reward (line 17 in Algorithm 1), which includes taking the logarithm of $N$ values and summing them, resulting in $O(N)$ arithmetic operations.
The space complexity (memory complexity) of storing the Q-table requires storing the Q-values of all the state–action tuples for the entire cycle. The state comprises both the accumulated rate each user has attained so far in the cycle and its current channel state. Since the throughput a user attains when scheduled is based on the transmission rate, which is directly associated with its channel state, the number of different throughput gains a user can accumulate when scheduled is equal to the number of Index of Transport Block Size (ITBS) values (see Appendix A.1). Accordingly, denoting the number of different ITBS values by $M$, each cycle stage can branch into a number of states that grows multiplicatively with $M$ and the number of users, and since each cycle is composed of $L$ stages (allocations), the number of elements in the table grows exponentially with $L$. Note that if a transmission can fail, the per-stage branching factor should further be multiplied by 2, corresponding to successful and unsuccessful transmissions. Even though the latter leaves the accumulated rates unchanged, it corresponds to a different stage and hence relates to a different state.
Overall, we observe that the computational complexity remains manageable as the system size increases and is comparable to that of other schedulers, as any selective scheduling algorithm must account for all users. However, the space complexity poses the most significant scalability barrier. Specifically, the exponential growth of the state space with respect to the cycle length indicates that this approach does not scale efficiently to large systems or longer cycles. For example, a small-scale deployment serving 50 users with 26 different ITBS values (as defined by the LTE standard) and a cycle of 10 slots would require a Q-table with an astronomically large number of elements, which is clearly infeasible. Moreover, beyond the memory limitations, an extremely long learning phase would be required to explore all these states sufficiently.
8. Implementation
To demonstrate the feasibility of the proposed scheme, we implemented the Q-learning RL-PF algorithm within an operational LTE base station serving three connected users.
The implementation testbed features a single-cell 4G network operating on a 10 MHz bandwidth. It employs a split architecture where the RF component, known as the remote radio head (RRH), is separated from the upper layers (such as PHY and MAC). These upper layers are implemented on a commercial off-the-shelf (COTS) server, referred to as the baseband unit (BBU). In this setup, the RRH is connected to two RF splitters, enabling the 2 × 2 RRH to interface with three RF cages housing the users. To precisely control each user’s path loss to the LTE cell, variable attenuators (V/A) are integrated into each RF connection. The MAC layer hosts the scheduler entity, which is responsible for making scheduling decisions. It is configured to allocate resources to one user per time slot.
Figure 9 provides a visual representation of this testbed architecture.
Our system evaluation used a typical three-user configuration representing diverse user channel state characteristics. Specifically, UE-1 had near-ideal channel conditions (the lowest configured path loss), UE-2 had moderate channel conditions, and UE-3 had challenging channel conditions (the highest configured path loss), with the path losses set via the variable attenuators. To ensure continuous data demand, we generated UDP downlink traffic for each UE, maintaining a fully backlogged state across all users.
In addition to the proposed RL-PF scheduler, we also implemented the PF scheduler as formulated in
Section 3.2. Furthermore, similar to our simulation approach, we tracked the channel states attained by each UE throughout the cycle. At the cycle’s end, we computed the optimal schedule that could have been achieved if all channel states (ITBSs) were known in advance. These implementations enable us to compare the performance of the RL-based scheduler against both the implemented PF scheduler and the theoretical optimal scheduling scheme under realistic network conditions.
The implementation of the RL-PF-based scheduler on the 4G base station required several code adaptations in the MAC layer, where the scheduler and the downlink link adaptation are implemented. Specifically, besides the modules associated with the RL, which were coded from scratch, modules such as queuing, DLLA, UE scheduling, and DL TB assignment needed to be modified to support the RL-PF-based scheduler. A more detailed description of the modifications is given in
Appendix A.2.
The RL-PF-based scheduler described in
Section 6 required a minor adaptation for practical implementation. The original algorithm assumes that the channel quality between the BS and each UE is available prior to every time slot. However, in operational systems, Channel State Information (CSI) reports are typically received from UEs every 80 ms, not every time slot. To maintain accurate channel quality estimation, which is crucial for rate adaptation, operational BSs utilize additional feedback mechanisms. Specifically, they use ACKs/NACKs from each DL packet transmission to update channel quality estimates and adjust transmission rates accordingly. This process, known as downlink link adaptation (DLLA), is detailed in
Appendix A.1. To address the limited granularity of CQI reporting, we modified the RL-PF state representation. Instead of relying directly on CQI, we use the ITBS, which is the output of the link adaptation mechanism. This ITBS incorporates both CQI reports and ACK/NACK feedback, providing a more frequently updated measure of channel conditions. Consequently, the implemented state space is redefined as follows:
$s_n = \big((B_1^n, \mathrm{ITBS}_1^n), (B_2^n, \mathrm{ITBS}_2^n), \ldots, (B_K^n, \mathrm{ITBS}_K^n)\big)$, where $s_n$ denotes the state at the beginning of time slot $n$ and is composed of the accumulated number of bytes $B_k^n$ attained so far in the cycle and the ITBS of each UE-$k$ connected to the base station.
The learning process is conducted offline, following a measurement session in which the UEs’ ITBS and throughputs for each time slot were recorded. ACK/NACK feedback, which is received by the scheduler 4 ms after the downlink transmission as specified by the 4G standard, is taken into account. After collecting all feedback from each UE, the actual throughputs at the end of a scheduling cycle are compiled. These throughputs are then used to calculate the PF criterion, which is utilized in the learning process. The Q-learning parameters used were the same as those used in the simulation results.
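For illustration, the sketch below replays such a recorded trace offline: it encodes the state from the accumulated bytes and per-UE ITBS, selects UEs with an epsilon-greedy rule, and applies the PF reward only at cycle boundaries (the trace format, state encoding, and hyperparameters are assumptions, mirroring the helpers sketched in Section 6):

```python
import math
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, L = 0.1, 0.95, 0.1, 5   # assumed hyperparameters; L = cycle length
Q = defaultdict(float)

def encode_state(acc_bytes, itbs_per_ue):
    """State = accumulated bytes so far in the cycle plus every UE's current ITBS."""
    return tuple(acc_bytes) + tuple(itbs_per_ue)

def train_offline(trace, num_ues):
    """trace[n][k] = (itbs, bytes) that UE k would attain if scheduled in slot n."""
    acc = [0] * num_ues
    for n, slot in enumerate(trace):
        state = encode_state(acc, [itbs for itbs, _ in slot])
        if random.random() < EPSILON:                        # explore
            ue = random.randrange(num_ues)
        else:                                                # exploit
            ue = max(range(num_ues), key=lambda a: Q[(state, a)])
        acc[ue] += slot[ue][1]                               # bytes delivered to the scheduled UE
        end = (n + 1) % L == 0
        reward = sum(math.log(b + 1) for b in acc) if end else 0.0
        nxt_slot = trace[(n + 1) % len(trace)]
        nxt = encode_state([0] * num_ues if end else acc,
                           [itbs for itbs, _ in nxt_slot])
        future = 0.0 if end else max(Q[(nxt, a)] for a in range(num_ues))
        Q[(state, ue)] += ALPHA * (reward + GAMMA * future - Q[(state, ue)])
        if end:
            acc = [0] * num_ues
```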
Implementation Results
The testbed results for the three UEs, using a cycle length of 5 slots (
L = 5), comparing the RL-PF scheduler with the PF scheduler and the optimal schedule, are presented in
Figure 10. Specifically,
Figure 10a shows the average sum log throughputs, while
Figure 10b displays the average cell throughput.
Figure 10a shows that the average sum log throughputs for the PF scheduler, RL-PF scheduler, and optimal schedule are 25.7, 30.4, and 36.3, respectively, demonstrating an improvement of roughly 18% of the RL-PF scheduler over the PF scheduler.
Figure 10b shows average cell throughputs of 19.4, 12.6, and 31.1 Mb/s for the PF scheduler, RL-PF scheduler, and optimal schedule, respectively, indicating a cell throughput loss of roughly 35% for the RL-PF scheduler compared to the PF scheduler. This is consistent with the simulation observations in
Section 7.2.4.
This tradeoff between fairness and performance is well recognized. In our setup, as explained in
Section 7.2.4, it is attributed to the RL-PF scheduler’s strategy of allocating extra RBs to UEs with higher packet drop ratio (PDR) in order to mitigate potential transmission failures. This strategy has a dual effect: not only does the RL-PF-based scheduler assign more slots to UEs with lower success probabilities, increasing the average number of unutilized slots, but these UEs are typically also the ones with poor channel quality (lower data rates). Consequently, the PF scheduler achieves a higher number of utilized slots for UEs with high data rates (greater overall cell throughput) at the cost of reduced fairness (lower sum log throughputs). Note that even though the downlink link adaptation (DLLA) mechanism, explained in
Appendix A.1, is responsible for converging the users to a target block error rate (which is, as per design, the same for all users), some users still suffer from higher loss probability.
Specifically, in our setup, UE-1 has consistently superior channel conditions. Hence, the RL-PF scheduler allocates to it a single time slot. UE-2 and UE-3, with different channel conditions but similar probabilities of successful transmission due to the DLLA mechanism, share the remaining time slots (each is allocated two time slots in our setup). Furthermore, since the RL-based scheduler cannot know a priori which transmissions will fail and must rely on anticipated loss probabilities, in some cycles it can underperform in both metrics. For example, when all five slots are successfully transmitted, the PF scheduling policy outperforms the RL-based policy in both the sum log throughputs and the aggregate cell throughput metrics. On the other hand, since the PF scheduler allocates only a single transmission (time slot) to one of the UEs with inferior channel conditions, which occasionally fails, in many cycles one of these UEs will attain zero throughput. Overall, as can be seen in
Figure 10a, on average the RL-PF-based scheduler attains much higher proportional fairness scores.
9. Conclusions
This research aimed to explore the potential of reinforcement learning in scheduling procedures, investigating the possible gains from learning system parameters such as channel quality, packet loss, and mobility. We focused on a small-scale deployment using Q-learning, which is suitable for such scenarios. Specifically, in this paper we introduced an innovative proportional fair (PF) scheduling algorithm that leverages the estimation and prediction of users’ future channel rates. The proposed scheme relies on Q-learning, a reinforcement learning technique, to optimize the user selected for each time slot.
To evaluate the performance of the suggested scheme, we developed a simulation framework and tested various scenarios, including different numbers of users under varying channel conditions. The simulation results demonstrate that the suggested scheme outperforms other scheduling algorithms, such as round-robin and traditional PF, in terms of the proportional fairness metric. We also examined the tradeoff between throughput and fairness under the suggested scheme.
To provide proof of concept (PoC), we implemented the proposed scheduler in a functional 4G base station, demonstrating the achievable gains with a real base station and users. The testbed results coincide with the simulation results.
Results indicate that if short-term fairness is the objective, an RL-based scheduling framework can offer significant benefits. However, if long-term fairness is considered, the traditional opportunistic PF scheduler can converge to the performance of the RL-based scheduler and is preferable due to its lower complexity.
It is crucial to note that our study focuses on a small-scale deployment, which aligns well with the capabilities of Q-learning. The 4G technology used as our infrastructure already necessitated simplifications in parameters such as CQI granularity, state space representation, number of users, and cycle length. These challenges would be significantly amplified in more complex environments like 5G or 6G networks.
Despite these constraints, we believe our work provides valuable insights and lays a foundation for future research. Subsequent studies could explore more advanced machine learning techniques, such as deep learning or actor–critic methods, to handle the complexities of larger-scale scenarios with more users and longer cycles. Another approach could involve narrowing the state space through state aggregation, which groups several states into meta-states. While this improves scalability, it may come at the expense of optimality (e.g., [
55]).
Overall, this research demonstrates the potential of reinforcement learning-based scheduling for downstream traffic in cellular networks, while acknowledging the need for further advancements to meet the demands of next-generation technologies. We believe this approach can be extended to various other scheduling scenarios, including different setups and objectives, such as scheduling upstream traffic, wireless sensor networks with power constraints, traffic with due date constraints, and downstream scheduling with noise rise constraints (e.g., [
56]).