Article

Exploring Reinforcement Learning for Scheduling in Cellular Networks †

School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in International Symposium on Cyber Security, Cryptology, and Machine Learning; Springer International Publishing: Cham, Switzerland, 2022.
Mathematics 2024, 12(21), 3352; https://doi.org/10.3390/math12213352
Submission received: 24 May 2024 / Revised: 14 October 2024 / Accepted: 16 October 2024 / Published: 25 October 2024
(This article belongs to the Section Probability and Statistics)

Abstract

Cellular network scheduling is crucial for wireless deployments like 4G, 5G, and 6G and is a challenging resource allocation task performed by the scheduler located at the base stations. The scheduler must balance two critical metrics, throughput and fairness, which often conflict, as maximizing throughput favors users with better channel conditions, while ensuring fairness requires allocating resources to those with poorer channel conditions. The proportional fairness metric is a prominent scheduling approach that aims to balance these competing metrics with minimal compromise. The common strategy to attain proportional fairness relies on a greedy approach in which each resource block is allocated to the user who maximizes the proportional fairness criterion. With such a strategy, the scheduler can ensure that the resources allocated to the users at each time instance maximize the proportional fairness metric. However, users can usually tolerate some delay and are willing to accept temporary fairness imbalances if they ultimately improve their performance, provided that the fairness criterion is maintained over time. In this paper, we propose a new scheduler that uses reinforcement learning to enhance proportional fairness. The suggested scheduler considers both current and predicted future channel conditions for each user, aiming to maximize the proportional fairness criterion over a set of predefined periodic time epochs. Specifically, by learning patterns in channel fluctuations, our reinforcement learning-based scheduler allocates each resource block not to the user who maximizes the instantaneous proportional fairness metric, but to the user who maximizes the expected proportional fairness metric at the end of the current time epoch. This approach achieves an improved balance between throughput and fairness across multiple slots. Simulations demonstrate that our approach outperforms standard proportional fairness scheduling. We further implemented the proposed scheme on a live 4G eNodeB station and observed similar gains.

1. Introduction

The rapid and continuous growth in the user population, coupled with the dramatic increase in demand per user, has significantly heightened the pressure on air time, turning the wireless channel into a bottleneck and making resource allocation one of the most critical challenges in wireless communication. Resource allocation in wireless communication typically refers to intra-technology scenarios, where resources must be fairly distributed among users utilizing the same technology. Resource allocation can also involve inter-technology scenarios, where the wireless channel is shared among different technologies and protocols, necessitating the distribution of resources across them. For example, the coexistence of LTE and Wi-Fi in unlicensed bands requires fair air time sharing between different protocols (e.g., [1,2]). Unlike inter-technology scenarios, where the challenge lies in distributing resources among users operating under different technologies and protocols with varying syntax and semantics, intra-technology scenarios involve users operating under the same protocol and following the same rules.
The need to manage the rapid expansion of services and applications like the Internet of Things (IoT) and machine-to-machine (M2M) communications, combined with the increasing complexity of global mobile data traffic, has pushed the broadband cellular network industry (5G and the upcoming 6G) to embrace machine learning (ML) and Artificial Intelligence (AI). This shift is particularly evident in the vision for 6G, which aims to support AI by design through a cost-efficient, scalable, and flexible architecture that can abstract and distribute complexities. Resource allocation, especially scheduling, is a prime candidate for leveraging these AI-based frameworks. This potential is amplified by the emergence of new use cases, such as private and enterprise 5G networks, which are expected to become even more prevalent with 6G and bring more demanding Quality of Service (QoS) requirements. In light of these developments, there is a growing need for practical, low-complexity schemes that can be distributed and implemented at the network edge. These solutions would be crucial in managing the increasing complexities of resource allocation in future cellular networks.
In cellular networks, scheduling is one of the base station’s primary tasks, determining which users receive resources like channels and power for each uplink and downlink data transmission opportunity. This function has been essential across all generations of cellular technology (e.g., [3]). Typically, schedulers operate based on predefined performance criteria and goals, which may include maximizing overall data rates, minimizing average latency, or meeting specific deadline constraints (e.g., [4]). Several traditional scheduling algorithms, such as round-robin (RR) and Weighted Fair Queuing (WFQ), have been adapted for wireless systems [5]. However, in wireless networks, users are not homogeneous; different users not only have different abilities but also experience different channel conditions due to factors like fading, multipath propagation, and interference. Thus, schedulers that are oblivious to the users’ channel conditions and avoid the burden (and overhead) of acquiring any information often compromise performance. Accordingly, modern cellular networks (3G and beyond) have given rise to opportunistic schedulers that rely on channel state requisition from users and take into account physical layer information [4]. However, an approach that favors users based on their wireless channel quality and abilities can result in severe unfairness and even starvation of some users. To avoid such significant differences between the throughput attained by various users, a scheduler should adopt a notion of fairness. However, trying to maintain fairness among the users faces two obstacles. First, the notion of fairness is ambiguous and can assume different definitions, e.g., Max-Min fairness criterion, Jain’s fairness index, and Weighted Fair Queuing (WFQ) [6]. Second, attaining fairness can come at the expense of overall throughput. For example, balancing the throughput attained by all users can result in severe sum throughput degradation, as a user with poor channel conditions can consume most of the resources. Accordingly, a scheduler should try to balance attained throughput and fairness, recognizing the potential tradeoff between throughput and fairness (e.g., [7]).
One of the most accepted fairness metrics that tries to cope with this tradeoff is the proportional fairness (PF) criterion, which balances the tradeoff between overall system throughput and fairness among users. The notion behind PF is that the resources allocated to each user are proportional to their respective channel conditions and past throughput. Accordingly, users with better channel conditions receive more resources, yet not to the extent that it leads to significant imbalances. Formally, the proportional fair distribution is the one that maximizes $\sum_{k=1}^{K} \log T_k$, where $T_k$ denotes the throughput that user $k$ has received so far and $K$ denotes the total number of users. Note that the logarithmic function ensures that the utility gained from allocating additional resources to a user decreases as the user's throughput increases, which helps in allocating more resources to users with better channel conditions without starving users with poor channel conditions (more details on PF are given in Section 3.2).
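As a brief numerical illustration (with hypothetical throughput values), consider two users sharing a total of 10 Mbps. The lopsided split yields a lower PF utility than the more balanced one, even though both carry the same total throughput:
$\log 9 + \log 1 \approx 2.20 \;<\; \log 6 + \log 4 \approx 3.18,$
so the logarithmic objective tolerates unequal allocations but penalizes near-starvation of any user.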
Proportional fairness-based scheduling has been extensively researched, with numerous variations and implementations. Conventional schedulers typically maximize proportional fairness metrics for each scheduling instance, relying on current user channel states and cumulative throughput per user. Specifically, the cellular base station scheduler employs user feedback on current channel conditions and a rolling average of individual user throughput to assign the next resource block (time slot and subchannel). This allocation goes to the user that optimizes the proportional fairness metric for that particular scheduling instance (e.g., [8,9]). However, standard QoS requirements do not necessitate optimizing performance metrics at every single time instance. Instead, key performance indicators (KPIs) are usually assessed periodically rather than at each time slot. Consequently, there is potential to enhance proportional fairness performance at evaluation points by considering projected future channel conditions for users. This approach might involve postponing allocation for a user with suboptimal current channel conditions (despite maximizing the current proportional fairness metric) and waiting for improved channel conditions within the KPI cycle.
In this paper, we explore the potential gains of incorporating the reinforcement learning (RL) paradigm into the proportional fair (PF) downlink scheduler used in operational cellular networks. An RL-based scheduler can exploit the fact that performance in typical cellular networks is evaluated at the end of predefined cycles rather than after every individual time slot. It can anticipate future users’ channel conditions and delay the allocation of resources to a user, waiting for more favorable conditions. Accordingly, an RL-based PF scheduler allocates resources not only based on the average rate achieved so far and the current channel conditions of each user, as conventional schedulers do, but also relies on future channel state predictions.
We emphasize that the PF objective scheme proposed here is an illustrative example that can be extended to other scheduling objectives. For instance, the same scheme can be easily adapted to handle traffic with due-date constraints.
In this paper, our focus is on exploring the potential of leveraging RL methodology for scheduling purposes and understanding the mechanisms behind such an approach. Therefore, we rely on Q-learning, which is suited for small-scale setups but still provides valuable insights into the scheme.
We evaluated the proposed RL-PF algorithm through simulations, demonstrating its advantages over both the greedy PF scheduler and a round-robin-based scheduler (which allocates equal time shares to each user) across various network scenarios. Additionally, we implemented the RL-PF algorithm in an operational LTE base station, assessing its performance in scheduling users with different path losses. This served as a proof of concept for employing such a scheduler in a cellular base station.
Specifically, the main contributions of this study are as follows: (i) We highlight the significance of fairness criteria in scheduling by analyzing various channel quality indicators collected from a wide range of operational 4G deployments. (ii) We provide a formal definition of PF scheduling that evaluates performance at predetermined periodic intervals, occurring after a fixed number of time slots. (iii) We develop an RL-based downlink scheduler that forecasts and incorporates near-future user channel conditions. This scheduler allocates each resource block to the user expected to maximize the PF metric at the interval’s conclusion, utilizing Q-learning to understand individual user channel fluctuations and make appropriate user selections for each time slot. (iv) Using a Python-based simulation platform, we demonstrate that our algorithm outperforms the standard PF at the predefined time instances. (v) We implement the RL-PF algorithm in an operational LTE base station, proving its practicality and viability for real-world deployment.
This paper is organized as follows. Section 2 reviews related work. Section 3 provides a brief background on relevant topics discussed throughout this paper. In Section 4 we present various channel quality indicators collected from multiple operational 4G deployments. The model is formulated in Section 5. The RL-based scheduler is presented in Section 6. The evaluation, comprising simulation and implementation results, is given in Section 7 and Section 8, respectively.

2. Related Work

In this study, we examine the utilization of the RL paradigm for a prescient scheduler that can learn and predict the channel states of the backlogged users associated with it and allocate resources based on these predictions. As a representative example, we consider the ubiquitous proportional fair scheduler (PFS). Accordingly, we cover related work in both areas. We first discuss related work in the scope of ML. The second part of this section reviews related work in the scope of schedulers. Since our representative example is a fairness-based scheduler, we concentrate mostly on fairness-metric-oriented schedulers, and in particular on PF-based schedulers.
Reinforcement Learning Applications—The growing diversity and complexity of mobile network architectures have made monitoring and managing the multitude of network elements increasingly challenging. Consequently, embedding versatile machine learning-based solutions into future mobile networks is drawing unparalleled research interest. Reinforcement Learning (RL), a subset of ML, focuses on determining the actions needed to maximize a well-defined objective within an environment. Next-generation wireless technology will heavily rely on RL as it enables continuous, real-time optimization and decision-making in complex and dynamic network environments. Specifically, RL learns and adapts from interactions with the network, making autonomous decisions that enhance performance and resource allocation. This capability allows for adaptive management of network resources, efficient spectrum utilization, and proactive adjustments by predicting and addressing issues before they affect users. Accordingly, RL can significantly boost the performance of next-generation wireless networks in many aspects.
Over the last few years, numerous studies have incorporated RL into wireless communication protocols, demonstrating its effectiveness in various applications such as dynamic spectrum access, power control, load balancing, and more. This showcases RL’s potential to revolutionize wireless communication. For instance, recent examples include a survey on RL-based medium access control (MAC) protocols and methodologies for applying RL within the context of the MAC reference model [10]. Additionally, a deep reinforcement learning-based decentralized resource allocation scheme is suggested in [11] for the vehicle-to-everything (V2X) communication network, where vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) networks share resources. This scheme utilizes a deep Q-network (DQN) to determine the resource blocks (RBs) and transmit power of vehicles in the V2V network.
Current and next-generation wireless communication systems are characterized by high diversity, heterogeneity, and density. This complexity leads to resource scarcity, necessitating sophisticated resource allocation mechanisms. RL has emerged as a fertile field for innovations in addressing these challenges. Ref. [12] focuses on optimizing system throughput and fairness in an underlay device-to-device (D2D) communication system using device-centric joint spectrum and power allocation schemes. The study utilizes a distributed RL-based iterative resource allocation scheme that overcomes the unknown and dynamic nature of channel parameters by allowing the system to learn about the environment through trial and error, without requiring any prior knowledge. Ref. [13] studies the problem of energy-saving optimization in 5G base stations (BSs) by implementing multi-sleep mode (SM) levels. The paper investigates the tradeoff between energy saving and QoS and suggests a distributed Q-learning algorithm that can learn the best policy to control the activity of the small base stations (SBSs) in order to minimize energy consumption and data dropping. Ref. [14] investigates energy-efficient resource allocation and power control techniques in 5G networks. The paper presents a 5G base station (BS) sleep mode management based on reinforcement learning that balances a set of metrics to find the best tradeoff between energy reduction and QoS constraints.
A literature review on deep reinforcement learning (DRL) techniques and applications used for energy optimization and resource allocation in future wireless and mobile networks is provided in [15]. Ref. [16] proposes an adaptive DRL-based resource allocation scheme to jointly allocate downlink power and bandwidth, thus optimizing total throughput while considering the impact of power-consumption outage problems. Ref. [17] investigates user association in multi-unmanned aerial vehicles (UAVs)-assisted cellular networks, allowing users to connect with different UAVs or macro base stations for uplink (UL) and downlink (DL) transmissions. The paper formulates the joint optimization problem for UL-DL association and beamforming as a partially observable Markov decision process (POMDP) and employs multi-agent deep reinforcement learning (MADRL) for distributed optimization. Ref. [18] formulates the problem of maximizing the sum rate and fairness for non-orthogonal multiple access (NOMA)-based cellular users as a mixed-integer nonlinear programming (MINLP) problem, and a learning framework was employed to address it. Ref. [19] utilizes the multi-agent reinforcement learning (MARL) paradigm to develop distributed strategies to reduce processing complexity and energy use in large-scale multiple-input multiple-output (XL-MIMO) systems. In their multi-agent cell-free XL-MIMO system, base stations (BSs), user equipment (UE), and antennas are viewed as agents that learn to allocate resources and transmit signals efficiently. Two use cases of the MARL-aided multi-agent cell-free XL-MIMO system were studied: antenna selection and power control.
Additional state-of-the-art reinforcement learning (RL) algorithms related to resource allocation have been explored in various studies. In [20], researchers investigate RL for downlink inter-cell interference mitigation through power control and rate adaptation. They propose an RL framework to develop an efficient power control policy that optimizes the downlink transmission power of LTE cells, aiming to maximize network-wide utility for users. Their RL state representation includes LTE cell measurements such as cell downlink transmission power, average Reference Signal Received Power (RSRP), and average user interference within each cell. Actions based on this representation adjust cell transmission power, with a focus on maximizing utility functions such as the average sum of log-throughputs across cells. In [21], the authors design and implement an RL testbed for Multi-User MIMO (MU-MIMO) user selection in IEEE 802.11ax. Their approach applies Q-learning with the $\epsilon$-greedy policy [22] to select user groups for scheduling. In [23], a Q-learning algorithm for allocating RBs to users is devised. The algorithm relies on the scheduler receiving CQI feedback from all users before each RB allocation, assigning the RB to the user that maximizes Jain's Fairness Index (JFI) [24].
In [25], the authors propose a multi-agent Q-learning algorithm for joint power and resource allocation in 5G networks. In this algorithm, each next-generation NodeB (gNodeB) performs joint power and resource block allocation at every scheduling interval. The states and rewards are formulated to consider two critical factors: inter-cell interference and queuing delay. The authors in [26] use a deep reinforcement learning (DRL) approach to optimize resource allocation in OFDMA systems. Specifically, they propose a centralized DRL-based scheduling scheme that is traffic- and channel-aware. This scheme relies on real-time network traffic and channel conditions to manage downlink OFDMA time-frequency resources at a wireless access point. In [27], a DRL-based resource allocation approach is developed for ultra-dense networks (UDNs) in the presence of very limited channel state information (CSI). The multi-objective problem that considers the tradeoff between spectrum efficiency (SE), energy efficiency (EE), and fairness is decomposed into two components. Specifically, the deep neural network is designed to maximize SE, while the EE and fairness considerations are incorporated as rewards to guide the training process. Ref. [28] explores the use of a DRL framework to maximize user fairness in terms of delay. Ref. [29] proposes a 5G downlink scheduling mechanism that employs Q-learning techniques to improve wireless transmission quality and efficiently manage the radio resources of a BS. The mechanism aims to enhance various Quality of Service (QoS) parameters, including throughput, delay, and fairness, as measured by Jain's fairness index. Ref. [30] presents a novel method to integrate fairness into actor-critic reinforcement learning for network optimization tasks. Their algorithm adapts the reward system by using a weighting factor that accounts for both the structure of the fairness utility function and historical reward statistics. This approach ensures that the rewards are adjusted dynamically based on these two elements.
Scheduling in Wireless Communications—The scheduler in a cellular-based network plays a fundamental role; it is the "brain" of the whole system [3]. Traditionally, its basic function is to allocate resources to the connected users, taking into account QoS and channel conditions. Different schedulers consider different utility metrics, e.g., max throughput and various types of fairness (proportional, equal share, Max-Min, etc.), and therefore the network performance is influenced by the chosen objective. Scheduling algorithms that were commonly adopted in implemented wireless systems disregarded users' channels for the sake of simplicity, e.g., round-robin (RR), which serves users in a circular manner without any other consideration [31], and Weighted Fair Queuing (WFQ), which allocates resources with respect to weights associated with every user [32]. In contrast, 4G and 5G cellular networks have given rise to schedulers that take advantage of physical layer information, such as user channel states. Such schedulers are named opportunistic schedulers. The proportional fair scheduler [33] is one of the most popular channel-aware schedulers implemented in today's cellular systems, and as such, much work has been performed on the subject.
Opportunistic schedulers take into account information such as the users' channel conditions and QoS for the purpose of allocating resources to each connected user. Opportunistic scheduling was first coined in [34], showing that by utilizing multi-user diversity, a significant improvement in capacity can be achieved. Their approach was to schedule the user (or users) with the best channel state in any given time slot. The work of [35] was continued and improved in [34] by considering delay aspects, addressing the problem of starvation. The Best Channel Quality Indicator (BCQI) scheduler [36] is a simple LTE scheduling algorithm that aims to maximize the cumulative throughput by assigning the network resources to the users that report the best channel quality in the current time slot. The Max-Weight scheduler [37] selects the user with the highest product of queue length and transmission rate. The algorithm was considered throughput-optimal until [38] proved otherwise under flow-level dynamics. However, Max-Weight is still throughput-optimal with fully backlogged queues. We refer the interested reader to [3,4] for more details on the different scheduling algorithms. As mentioned, the aforementioned scheduling solutions are throughput-optimized; however, optimizing the cell capacity comes at the cost of unfair resource allocation to users that experience bad channel quality.
Since opportunistic schedulers can favor certain users over substantial time intervals, fairness is a major concern in communication systems; e.g., scheduling users according to their channel conditions might lead to starvation of users with poor channel quality, which is not only unfair but can also lead users to unsubscribe from the service provider. The notion of fairness is ambiguous (i.e., what constitutes a fair allocation and how it can be measured), and different metrics have been suggested over the years to define fairness, e.g., Jain's index, Max-Min, proportional fairness, and more.
Jain's index is a widely known metric, which we utilize in our simulations (Section 7). It is a quantitative measure, first proposed in [24], that aims to quantify the fairness of resource allocation among traffic flows. Jain's index values range between one divided by the number of flows and one, where a value of one indicates maximal fairness. Such a metric, however, disregards the heterogeneity of the users and their channel conditions, and tries to balance the rates attainable by all users, irrespective of their capabilities. Max-Min fairness is a qualitative measure of fairness that has been widely studied and implemented in various applications [39]; its main motivation is that all users should attain the same performance (e.g., throughput). Specifically, a system reaches Max-Min fairness if it cannot increase any individual's share without decreasing another's that is already lower. Accordingly, a Max-Min fair allocation can allocate most of the resources to users with very poor capabilities (e.g., poor channel conditions), sacrificing overall performance in order to improve the performance of degraded users.
One of the most widespread opportunistic approaches with fairness constraints is the proportional fair scheduler (PFS), first introduced by Kelly in [8]. A set of attainable throughput (rate) allocations $x_k$ is proportionally fair if it maximizes $\sum_{k \in N} \log x_k$, where $N$ is the set of users. Later, [9] adapted the proportional fairness criterion and suggested a scheduling algorithm that assigns priorities to users based on the ratio of two components: the first component accounts for the rate potentially achievable in the current time slot, while the second component accounts for the historical average of the user's throughput. Such an algorithm was proven to maximize the logarithmic sum of capacities. The work of [40] suggested that [9] does not fully address the problem of starvation due to the constant averaging window size of the UE rate, which relates to the time a UE can be starved; for example, in scenarios where a user experiences sudden drops in channel quality, the PFS is forced to starve the user for a certain time period. Instead of using a constant averaging window size, they proposed updating the average UE rate by monitoring the UEs' starvation time, giving rise to much more starvation-aware scheduling.
In [41], four PF scheduling algorithms were introduced that build on the PF algorithm to handle cases where users are not fully backlogged and a user can be in an idle or active state. The first two algorithms only update the average throughput of active users, instead of updating the average throughput of both active and idle users. The third algorithm suggests a different approach to a user's throughput update, according to the status of the user in the previous slot, i.e., active or idle, and the last time the user was scheduled. The fourth algorithm dynamically changes the averaging time window with respect to the backlog of users. Ref. [42] extended PF scheduling to multi-carrier transmission systems, proposing an assignment of users to each carrier while maximizing the sum of logarithmic average user rates.
PF channel-aware schedulers, such as those proposed by [43], rely on channel stationarity and achieve fairness by employing a moving average scheme. At each time slot, the channel is allocated to the UE that maximizes the average PF metric at the end of the current slot. These schedulers achieve fairness over long time intervals and can even converge to the optimal fair allocation as time approaches infinity (see [33]). However, typical operational networks measure performance and report key performance indicators (KPIs) periodically, after finite and predefined time intervals. Consequently, “greedy” schedulers that maximize fairness at every time slot may compromise overall performance. This is because it may sometimes be beneficial to postpone scheduling a user who maximizes the current fairness metric within the KPI time interval, waiting for that user to experience better channel conditions.
A strategy that allows the postponement of an allocation to a deserving UE, waiting for better channel conditions, does not ensure fairness after each allocation. However, such a strategy can boost overall performance at the end of each KPI time interval. To deploy this strategy effectively, a mechanism for predicting users' channel conditions is required. An analysis of the gains from knowing future channel conditions is provided in [44]. Specifically, the authors assume that future channel states for all users are known a priori (i.e., accurate estimates of the users' attainable data rates for the following $L-1$ time slots are available to the scheduler at each time slot). They compute the optimal allocations for the finite $L$ time slots and compare these to the PFS greedy allocation.
The authors of [45] extend the single-channel prediction-based PFS scheme suggested in [44] to evaluate the effects of considering predicted data rates on the performance of multi-carrier networks. Specifically, ref. [45] investigated a multiple subchannel model in which the rates each user can attain on each channel are known a priori, and the scheduler determines which user is allocated to each subchannel (a user may be assigned more than one subchannel). The authors show that the resulting optimization problem is a complex combinatorial problem in both the time and frequency dimensions and propose two suboptimal algorithms.

3. Preliminaries

This section provides a brief background.

3.1. Q-Learning Background

We provide a concise description of the Q-learning algorithm we utilize in this paper (a comprehensive background can be found in [46] and many other references).
Q-learning [22] is an off-policy reinforcement learning algorithm that seeks to find the optimal policy that maximizes the expected accumulated rewards starting from the current state. It does not require the transition probability distribution or the reward function; hence, it is termed a "model-free" algorithm. The basic idea behind Q-learning is to approximate the values $Q(s,a)$ of state–action pairs through samples obtained during interactions with the environment, enabling the agent to select the optimal action ($\max_a Q(s,a)$) for a given state $s$. Accordingly, Q-learning maintains a two-dimensional matrix termed a Q-table, which stores the estimated Q-value of each state–action pair ($Q(s,a)$), independent of the policy being followed. This Q-table serves as a reference for the agent to select the best action to take in a given state based on the Q-values associated with that state. The table's dimensions are $|S| \times |A|$, such that each state has an entry for each possible action (the table index represents a state and an action). The table entries are initialized to arbitrary values (in the results presented herein, all Q-values were initialized to zero) and are updated after each action based on the result. These updates balance the current estimated value with the instantaneous reward gained by taking a specific action and the predicted Q-value of the new state. Specifically, these updates are defined as follows:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \left[ R_t(s,a) + \gamma \max_{a'} Q_t(s', a') - Q_t(s,a) \right] \qquad (1)$
where $R_t(s,a)$ is the immediate reward obtained as a result of being in state $s$ and taking action $a$, $s'$ is the resulting next state, $\gamma$ is the discount factor that determines the importance of future rewards, and $\alpha$ is the learning rate that determines the impact of the new information on the existing Q-value. The learning rate ($\alpha$) can be chosen to be constant or adjusted dynamically during the learning process. The algorithm can terminate when all Q-values converge or after a certain number of iterations. The process can also restart (or continue indefinitely) in order to accommodate changes in the environment (e.g., channel state distribution changes, different backlogged users).
During the learning stage, the agent explores different actions for each state and performs them multiple times to obtain an accurate Q-value estimation. To balance exploration and exploitation, we used the prevalent epsilon-greedy strategy. At the start, when the agent has little knowledge about Q-values, it mostly explores by selecting actions at random (high epsilon). As the agent gains more confidence in its Q-value estimates, epsilon decreases, and the agent exploits more by selecting the action with the max Q-value for a given state. Eventually, the agent terminates the learning stage and exploits its Q-table to choose the best action for each state.
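As a concrete illustration, the following minimal Python sketch implements the tabular update (1) and the epsilon-greedy selection described above; the state and action encodings are generic placeholders, not the scheduler's actual representation.

import random
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-learning agent (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=1.0):
        self.q = defaultdict(float)        # Q-table: (state, action) -> value, initialized to zero
        self.actions = actions             # list of possible actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += self.alpha * (reward + self.gamma * best_next - self.q[(state, action)])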
Q-learning, being a tabular method for storing the Q-values, is limited by system complexity. As a result, the Q-learning algorithm might not be practical when designing a scheduler for a large cellular network. Other RL techniques, such as neural-network-based function approximation, can be utilized to handle such limitations.

3.2. Proportional Fairness (PF) Background

The proportional fairness criterion was introduced in Kelly's seminal paper [47] as an alternative to the Max-Min fair metric, which can often allocate a significant amount of system resources to marginally improve the performance of the lowest-performing user. In the context of throughput distribution, the proportional fairness criterion can be expressed as $\sum_{k=1}^{K} \log r_k$, where $K$ represents the total number of users sharing the link and $r_k$ denotes the achievable rate for user $k$. Communication networks, particularly cellular networks, operate with indivisible resource units (see Appendix A.1 for details). This necessitates a modification to the fluid model, in which service can be partitioned to allow all users to receive allocations simultaneously. Instead, the proportional fairness metric in these networks considers the average throughput a user receives over a specified time interval. Let $T_k$ represent the average rate received by user $k$. The proportionally fair distribution is then defined as the one that maximizes:
$\sum_{k=1}^{K} \log T_k \qquad (2)$
To achieve the proportional fairness criterion outlined in (2), ref. [33] proposes a user selection algorithm. This algorithm considers two key factors: the users' channel quality in time slot $t$, which is converted to an instantaneous data rate $R_k(t)$, and a moving average of each user's achieved throughput $T_k(t)$. The scheduling algorithm then selects the user $k^*$ such that:
$k^* = \arg\max_{k \in K} \frac{R_k(t)}{T_k(t)} \qquad (3)$
The average throughputs $T_k(t)$ are updated using an exponentially weighted low-pass filter:
$T_k(t+1) = \begin{cases} \left(1-\frac{1}{t_c}\right) T_k(t) + \frac{1}{t_c} R_k(t), & k = k^* \\ \left(1-\frac{1}{t_c}\right) T_k(t), & k \neq k^* \end{cases} \qquad (4)$
Viswanath et al. [33] prove that the proposed proportional fair algorithm maximizes (2) when the time window approaches infinity ($t_c \to \infty$). However, practical networks like LTE and 5G evaluate performance metrics (e.g., KPIs) over finite time scales. Consequently, ensuring fairness over infinite or extended periods is often impractical and insufficient for real-world applications. Moreover, key performance indicators (KPIs) are typically assessed periodically at predetermined intervals rather than on a per-time-slot basis. This periodic assessment eliminates the need to maintain fairness at every individual time slot; instead, fairness needs to be ensured only at the conclusion of these specified evaluation periods.
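For concreteness, a minimal Python sketch of this greedy PF step (Equations (3) and (4)) is given below; the rate inputs and the averaging constant t_c are placeholders chosen for illustration, not values from the paper.

def pf_schedule_slot(inst_rates, avg_thr, t_c=100.0):
    """One greedy PF scheduling step (illustrative sketch).

    inst_rates: list of instantaneous achievable rates R_k(t), one per user.
    avg_thr:    list of exponentially averaged throughputs T_k(t) (must be > 0).
    Returns the selected user index and the updated averages.
    """
    # Equation (3): pick the user maximizing R_k(t) / T_k(t).
    k_star = max(range(len(inst_rates)), key=lambda k: inst_rates[k] / avg_thr[k])

    # Equation (4): exponentially weighted update of the average throughputs.
    new_avg = []
    for k, (r, t) in enumerate(zip(inst_rates, avg_thr)):
        served = r if k == k_star else 0.0
        new_avg.append((1.0 - 1.0 / t_c) * t + served / t_c)
    return k_star, new_avg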
Proportional fairness is the conventional fairness metric utilized by operational BSs, and numerous studies have suggested different proportional fairness-based algorithms; several examples are provided in Section 2.

3.3. From Theory to Practice: Implementing Proportional Fairness in 4G and 5G Networks

Opportunistic scheduling methods, such as proportional fairness, rely on users' channel quality reports to optimize scheduling decisions. Specifically, the algorithm described earlier requires knowledge of all users' instantaneous channel conditions, which are translated into the appropriate transmission rates, allowing the selection of the user $k^* = \arg\max_k R_k(t)/T_k(t)$. In wireless networks, this channel state information is obtained through reports that are sent occasionally by the users to the BS. Practical systems support only a finite set of rates. Accordingly, the channel condition should be mapped to the set of transmission rates supported by the users. Specifically, each user's transmission rate should be chosen from its supported transmission rates that are compatible with its channel conditions, i.e., such that the transmitted packet is most likely to be received successfully under the current channel condition. Link adaptation is the process by which transmission code rates and modulation schemes are chosen based on prevailing channel conditions and recent transmission success rates. This mechanism ensures that the most appropriate data rate is selected for each user given their current channel state.
The downlink link adaptation (DLLA) procedure is a crucial part of wireless communication systems. It can dramatically affect the performance of the network by controlling the data rate that can be reliably transmitted [48]. In Appendix A.1, we briefly explain the concepts and processes of a typical LTE DLLA that we utilized in the evaluation part, both in the simulation and experimental results. Since the description follows the common terminology and accepted acronyms, and is somewhat cumbersome, we have included it in the Appendix A.
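To illustrate the role of link adaptation in the schedulers evaluated later, the sketch below maps a reported CQI index to an attainable rate. The table values and the lookup function are hypothetical placeholders used only for illustration; operational systems use the 3GPP-defined CQI-to-MCS tables summarized in Appendix A.1.

# Illustrative link-adaptation lookup: map a reported CQI index to a spectral
# efficiency (bits/s/Hz). Values are hypothetical placeholders, not 3GPP values.
CQI_TO_EFFICIENCY = {
    0: 0.0,   # out of range, no transmission
    1: 0.15,  3: 0.4,  5: 0.9,  7: 1.5,
    9: 2.4,  11: 3.3, 13: 4.5, 15: 5.5,
}

def attainable_rate(cqi, bandwidth_hz):
    """Estimate the attainable rate for a user given its reported CQI."""
    # Fall back to the nearest lower entry in the sparse illustrative table.
    usable = max(k for k in CQI_TO_EFFICIENCY if k <= cqi)
    return CQI_TO_EFFICIENCY[usable] * bandwidth_hz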

4. Operational 4G Deployments’ Statistics

In this study, we propose an opportunistic RL-based scheduler that considers both current and anticipated future user channel quality. The advantages of the proposed scheme lie in two key aspects: its opportunistic approach to resource allocation and its ability to predict users’ channel states. While our focus is on fairness, we also recognize the applicability of these aspects to a broad range of objectives, such as due date-constrained traffic and QoS guarantees.
To understand the significance of these aspects, particularly the potential benefits of opportunistic schedulers, we analyze channel statistics collected from various operational 4G deployments from two perspectives: intra-cell and inter-cell UEs’ channel quality variability. Intra-cell channel quality variability highlights the differences in channel quality experienced by UEs within the same cell, thereby justifying the need for opportunistic resource allocation schemes that adapt to the varying channel qualities of UEs. Additionally, considering that the channel quality of each UE may fluctuate over time due to factors like user movement or environmental changes (e.g., movement of other UEs in the vicinity), this underscores the predictive nature of the proposed scheme. Inter-cell channel quality variability, on the other hand, underscores the diversity of the cell environments themselves and the variability in UE distribution within cells, emphasizing the importance of a distributed learning-based scheduling scheme conducted independently by each cell.
We utilized two datasets to illustrate the high diversity in users’ channel quality characteristics, which are correlated with the users’ attainable rates. The first dataset comprises data gathered by a 4G vendor from 90 operational 4G deployments across diverse locations in the United States, including both urban and rural environments. The second dataset consists of data collected from various 4G operators in the states of New York and Florida.
We examine three common KPIs: (i) Channel quality indicator (CQI): This KPI maps the users’ channel condition into a quantized discrete value between 0 and 15 (further details on CQI can be found in Appendix A.1). The same CQI values were used in both the simulation and implementation. (ii) RSRP is the average of the base station’s reference signals measured by the users, indicating the strength of the base station signal to a user. The farther the distance between the user and the base station, the lower the RSRP value received, and vice versa. We also compare RSRP with SNR in rural and urban deployments to provide further insights into how the environment type affects users’ channel conditions. (iii) Average resource utilization: Measured in physical resource blocks (PRBs), this KPI is assessed for both urban and rural deployments (further information on LTE resource allocation can be found in Appendix A.1).
We note that the schedulers deployed in the base stations that collected the data are PF-based, similar to those used in our simulation and implementation results for comparison with the proposed scheme. Additionally, the users’ channel characteristics obtained in this subsection are utilized in our simulations and implementation (Section 7 and Section 8, respectively).
Figure 1 depicts the average number of CQI reports collected for each CQI value during one day, averaged over the inspected cells, i.e., all measurements collected were divided by the total number of cells.
Figure 1 confirms the diversity in channel quality attained by users (recall that the measurements are taken throughout one day; hence, the diversity shown encompasses both a variety of users and users whose channel state has changed). As seen in the figure, even though most measurements reported good channel conditions and only a few reported inferior ones, a substantial number of measurements reported moderate channel conditions. Assuming that the distribution over CQI roughly reflects the temporal CQI distribution of backlogged users during a given time of day, Figure 1 strongly motivates the utilization of a fairness metric that balances maximizing the users' attained sum rate with providing each user a reasonable share of the resources. The observation that the diversity also relates to users with varying channel conditions motivates the use of an RL-based scheduler. The proportional fair scheduler, which we utilize in this study, is a channel-aware scheduler that exploits users' channel diversity for a better resource allocation policy and aims to balance the overall attainable rate and fairness, giving more RBs to users with high CQI values, yet allocating sufficient RBs to low-CQI users such that the share each user is given preserves the ratio between the users' average achievable rates. Note that the high standard deviation shown in the figure, representing the high variability, further emphasizes the need for an opportunistic channel-aware scheduler leveraging a learning-based solution.
Reference Signal Received Power (RSRP) is a key performance indicator (KPI) that assesses the average power of the reference signals received by the UE across a specific bandwidth. It is measured in dBm and typically ranges from about −140 dBm (very weak signal) to −44 dBm (very strong signal). The RSRP reports collected at urban cells and rural cells are depicted in Figure 2a and Figure 2b, respectively.
As can be seen in the figures, in both types of deployments, the received signal strength is widely dispersed, ranging from very poor RSRP (−140 dBm to −110 dBm) to excellent RSRP (−80 dBm to −44 dBm). The average RSRP values for the two kinds of deployments differ significantly: for urban cells it is −97 dBm, while for rural cells it is −104 dBm. This difference can be explained by the characteristics of the different deployments. In a more urban environment, more base stations are deployed, and therefore users are mostly closer to their associated base stations. In contrast, in a rural area, fewer users are connected and fewer base stations are deployed, such that each base station serves users who are farther away. Moreover, the RSRP distribution (especially in the rural deployment) appears to be right-skewed. This result can be explained by the fact that most users in such deployments are in low-to-medium coverage areas of the base stations, while few users are in excellent coverage areas, extending the right tail toward high RSRP values.
Overall, the RSRP measurements depicted in Figure 2 further support the CQI measurements in demonstrating the high diversity in signal path loss experienced by UEs, which indicates considerable differences in the received signal strength in both types of deployments.
Next, rather than averaging over a large number of cells, which exhibits significant variance between the different cells, we focus on specific zones and the variability within each zone. We distinguish between urban and rural deployments. Figure 3 depicts heat maps indicating RSRP and SINR in the urban region of New York and the more rural region of Florida. The heat maps are divided into different zones according to zip codes. Specifically, Figure 3a and Figure 3c depict the RSRP of the urban region of New York and the rural region of Florida, respectively. We defined an RSRP threshold of −116 [dBm], which was taken from lab results and indicates a user with poor coverage, i.e., a user with a low CQI (a CQI that corresponds to QPSK modulation), and colored each zone based on the percentage of RSRP reports that were below this threshold (denoted by $\phi$). For example, the dark green color indicates that less than 5% of the RSRP reports in the zone were below −116 [dBm], and zones marked in red are zones in which more than 20% of the collected RSRP reports were below the −116 [dBm] threshold. Similarly, Figure 3b and Figure 3d show the average SINR reported in the urban region of New York and the rural region of Florida, respectively, where the SINR threshold was set to 2 [dB], i.e., $\phi$ denotes the percentage of SINR reports in a zone that are below 2 [dB].
The RSRP heat maps depicted in Figure 3a,c confirm the assertion that the received signal strength in an urban environment is expected to be stronger than in a rural one. In contrast, Figure 3b,d indicate that in the urban environment there is a large number of zones where users suffer from poor SINR conditions, even though the reported RSRP is mostly good, whereas in the rural area of Florida the inverse effect is observed: more zones are associated with poor RSRP, yet most users experience good SINR. This seemingly contradictory observation can be explained by the fact that, despite the higher received signal strength (RSRP) in the urban environment, urban users are more prone to inter-cell interference than rural users and therefore have lower SINR; counterintuitively, users in the urban environment may thus experience worse channel conditions than those in the rural environment.
Last, we examine the daily average system resource utilization per scheduling time unit in both rural and urban deployments. User resource allocations are measured in PRBs, which are the basic time-frequency resource units that can be allocated to a UE. The number of PRBs available in the system per scheduling time unit is determined by the LTE base station’s bandwidth. At each scheduling epoch, the available PRBs are distributed among the selected UEs. The rate, and therefore the number of bytes attained by each scheduled UE per PRB, depends on its channel conditions. The number of PRBs allocated to each scheduled UE is determined by its target number of bytes and its channel condition. A user in poor channel conditions will require more PRBs to support the required number of bits, and vice versa. Therefore, a higher average PRB utilization can be attributed to a higher number of UEs with poor channel conditions and to a higher number of UEs scheduled at each scheduling epoch.
The measurements show that the average number of PRBs allocated to each UE in rural terrain is 5.9, while in urban terrain it is 12.4. This indicates that the average PRB allocation in an urban deployment is approximately twice that of its rural counterpart. These results can be attributed to the observation that more UEs in an urban environment experience poor channel conditions than in the rural counterpart, as can be seen in Figure 3b,d. It may also correlate with the fact that population density is typically higher in urban environments than in rural ones.
In this section, we illustrated the high variability in users' channel quality, which strongly motivates an efficient fairness metric that balances maximizing throughput with giving each user a substantial share of the available resources: one that distributes resources in proportion to each user's capabilities, yet allows every user to operate (run applications) satisfactorily.

5. Model

While cellular base stations support bidirectional data traffic, in this study we focus on the downlink traffic to $K$ users. We generally follow the conventional cellular resource units, in which time is slotted into constant intervals and each slot is divided into RBs. Each RB comprises a bandwidth and a time duration; e.g., in LTE, each RB comprises 12 sub-carriers in the frequency domain and 1 ms in the time domain. The RBs cannot be shared; each RB can be used by a single user. The BS (scheduler) distributes these RBs among the users such that each subset of RBs is allocated to a single user. For simplicity, throughout most of this study we will assume that the bandwidth cannot be partitioned, i.e., a single user has the entire bandwidth at each time slot (in this study, time slot and RB are interchangeable). However, we shall also present results in which the BS selects a single user for a subset of RBs in a time slot. Note that similar results will apply to a setup where the BW can be shared among different users and the BS selects a user for each subset of RBs in a specific time slot. We assume that users' downlink transmission rates, which are based on each user's channel quality and link adaptation module, are already determined by the BS prior to each time slot (see Appendix A.1 for details). We further assume a stochastic wireless channel, such that there is a probability that the channel between each UE and the BS can change. However, we assume that the channel remains static within each time slot, as in [49,50,51]. Prior to each time slot, the scheduler selects a user and transmits to it at a certain rate, which might fail with some probability despite the DLLA mechanism. Note that this loss probability can capture a model in which the channel can change during a time slot.
We assume that time is divided into cycles of $L$ time slots. The rate values received by the users are reset at the beginning of each cycle and are evaluated at the end of the cycle. The key performance indicators (KPIs) are measured at the end of each cycle based on the proportional fairness metric. Specifically, in this study we concentrate on proportional fairness; hence, denoting by $T_k^i(j)$ the throughput user $k$ received in the $j$-th slot of cycle $i$, the accumulated throughput of user $k$ at the end of cycle $i$, denoted by $T_k^i$, is $T_k^i = \sum_{j=1}^{L} T_k^i(j)$. The system's proportional fairness utility metric at the end of the $i$-th cycle is:
$U^i = \sum_{k=1}^{K} \log T_k^i \qquad (5)$
Throughout this study, we primarily focus on scheduling within a single cycle. To simplify our notation, we will omit the cycle index when discussing a general cycle. Consequently, $T_k(j)$ represents the throughput achieved by user $k$ in the $j$-th time slot, and $T_k$ denotes user $k$'s total accumulated throughput at the cycle's conclusion.
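A minimal sketch of how the cycle-end utility in Equation (5) can be computed from per-slot throughputs is given below; the array shapes and example values are hypothetical and serve only to fix the notation.

import numpy as np

def cycle_utility(per_slot_throughput):
    """PF utility at the end of a cycle (Equation (5)).

    per_slot_throughput: array of shape (K, L) holding T_k(j), the throughput
    user k obtained in slot j of the cycle (zero if not scheduled).
    """
    totals = per_slot_throughput.sum(axis=1)   # T_k, accumulated over the cycle
    return float(np.sum(np.log(totals)))       # U = sum_k log T_k

# Example: K = 2 users, L = 4 slots; one user served per slot.
thr = np.array([[3.0, 0.0, 2.0, 0.0],
                [0.0, 1.0, 0.0, 4.0]])
print(cycle_utility(thr))   # log(5) + log(5) ~ 3.22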

6. Reinforcement Learning Proportional Fairness-Based Scheduler

In this section, we present the RL-PF-based scheduler. The motivation for an RL-based scheduler rests on the observation that, since the performance, and specifically the proportional fairness criterion (2), is measured only at the end of each cycle, it is sometimes beneficial to delay an allocation to a user and wait for better channel conditions within the same cycle. We illustrate this with a simple example depicted in Figure 4.
Assume a BS serving two users ($K = 2$) and that each cycle comprises ten time slots ($L = 10$). The blue and orange curves in Figure 4a,b depict the rates attained by users UE-1 and UE-2, respectively, over an exemplary cycle. The asterisks in Figure 4a depict the scheduled user at each time slot according to the greedy PF algorithm described in Section 3.2. For example, the first slot is allocated to the user that maximizes the current PF criterion, which is UE-1 (since both users have received an equal share so far, zero bytes, the user with better channel quality is given the allocation); similarly, the second slot is allocated to UE-2, which has received nothing so far in the cycle and hence contributes the most to the PF criterion, etc. Figure 4b depicts the allocations of a scheduler that aims at maximizing the proportional fairness criterion at the end of the cycle and hence postpones the allocation to UE-2, waiting for it to attain better channel conditions. The blue and orange curves in Figure 4c depict the sum log throughput of the two users (Equation (2)) prior to every time slot according to the greedy PF algorithm and the delay-tolerant PF algorithm, respectively. Time slot $j$ on the x-axis corresponds to the sum log throughput at the end of slot $j-1$ and the beginning of slot $j$, just before the scheduler selects the UE to be allocated slot $j$ according to the two schedulers. As can be seen in the example, allocating slots 2 and 4–8 to UE-2 by the greedy PF scheduler provides a better instantaneous gain in (2) at the end of the first 8 slots of the 10-slot cycle; however, allocating it slots 9–10, when it has better channel conditions, attains a higher PF utility at the end of the cycle. Additional motivation can be found in [44], which shows that the availability of future channel estimations could help improve the proportional fairness criterion.
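The toy Python experiment below reproduces the spirit of this example with hypothetical rate traces (they only loosely echo Figure 4 and are not the measured curves): greedy PF is compared against an exhaustive search over all per-slot user assignments that maximizes the cycle-end utility. Ties in the greedy rule are broken toward the first user, which is an arbitrary illustrative choice.

import math
from itertools import product

rates = [
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],   # UE-1: steady channel (hypothetical)
    [1, 1, 1, 1, 1, 1, 1, 1, 6, 6],   # UE-2: poor channel that improves in slots 9-10
]
K, L = len(rates), len(rates[0])

def pf_utility(totals):
    return sum(math.log(t) for t in totals)

# Greedy PF: in each slot serve the user maximizing R_k / T_k
# (unserved users start with a tiny throughput to avoid division by zero).
totals = [1e-6] * K
for j in range(L):
    k = max(range(K), key=lambda u: rates[u][j] / totals[u])
    totals[k] += rates[k][j]
print("greedy PF, cycle-end utility:", round(pf_utility(totals), 3))   # ~ 5.77

# Exhaustive search over all K^L schedules for the best cycle-end utility.
best = float("-inf")
for schedule in product(range(K), repeat=L):
    t = [1e-6] * K
    for j, k in enumerate(schedule):
        t[k] += rates[k][j]
    best = max(best, pf_utility(t))
print("best possible cycle-end utility:", round(best, 3))              # ~ 6.17

The gap between the two printed values stems precisely from postponing UE-2's allocations to slots 9–10, which is the behavior the RL-based scheduler aims to learn.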
Maximization of $U$, as defined in (5), encounters two main challenges: (i) the scheduler (BS) needs to know all users' channel conditions for the entire cycle at the beginning of the cycle, which necessitates users' channel prediction, and (ii) attaining optimality requires an exhaustive search over all users' scheduling possibilities. To address these challenges, we employ a Reinforcement Learning (RL) scheme characterized by four elements: agent, environment (states), actions, and rewards (see Section 3.1). The agent, in our case the BS, or more specifically the scheduling algorithm deployed at the BS, determines to which user to allocate each resource block. The environment encompasses the current channel states of all the users. The action space, which determines which user will be allocated the next resource block, comprises all users, and the reward is the sum log throughput, which is obtained only at the end of the cycle. The policy, which is the aim of the suggested scheme, defines the agent's strategy for associating actions with states. In the following, we formalize the main components of the PF scheduling setup:
  • The state space is represented by a vector with K elements, where K is the number of users. Each entry in the vector consists of a 2-tuple ( T i ( n ) , C Q I i ( n ) ) , where T i ( n ) is the accumulated throughput that user i has achieved up to the start of time slot n in the current cycle, and  C Q I i ( n ) indicates the channel quality for user i at the beginning of time slot n. This structure results in a distinct set of states for each time slot within the cycle. Specifically, denoting by S n the set of states associated with the beginning of time slot n, we have:
    $\mathcal{S}_n = \left\{ \left(T_1(n), CQI_1(n)\right), \ldots, \left(T_i(n), CQI_i(n)\right), \ldots, \left(T_K(n), CQI_K(n)\right) \right\}, \quad n = 1, 2, \ldots, L$
  • The action space, denoted by A n , is defined by a specific user scheduled at the beginning of time slot n, who will be allocated the entire bandwidth. The key challenge lies in the fact that the action (choosing which user to schedule next) taken at each time slot must be evaluated based on future rewards (delayed rewards). It is important to note that in this study, we deal with a deterministic action space, meaning the scheduler must choose a specific user to assign the resource block to. This contrasts with a stochastic action space, where the scheduler can allocate time slot n to different users according to some probability distribution, i.e., the time slot is allocated to each user i with a probability p i that is not necessarily 0 or 1 ( 0 p i 1 ).
  • The reward, denoted by $R$, is given only at the end of each scheduling cycle and is computed based on the proportional fairness criterion as formulated in Equation (5), i.e., $R = \sum_{k=1}^{K} \log T_k$. An important characteristic of this reward structure is that intermediate actions do not yield instantaneous rewards. Instead, the reward is determined only once the entire cycle has been completed.
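As a minimal illustration (not the paper’s implementation), the state and terminal reward above can be encoded as follows; the helper names are ours.

```python
import math

def make_state(throughputs, cqis):
    """State S_n: one (T_i(n), CQI_i(n)) pair per user, hashable so it can index a Q-table."""
    return tuple(zip(throughputs, cqis))

def terminal_reward(throughputs):
    """Reward R = sum_k log T_k, granted only at the end of a cycle (slot L)."""
    return sum(math.log(t) for t in throughputs if t > 0)

s = make_state(throughputs=[4587, 225], cqis=[15, 1])  # e.g., a possible state in a 2-UE cycle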
In this paper, we focus on understanding the potential gains of deploying RL for the PF-based scheduler, and we leave the challenge of handling an enormous state space (e.g., large user populations, high number of supported rates, etc.) for future work. Accordingly, we utilize Q-learning, a model-free RL algorithm that learns the value of each action for each state.
We employ the epsilon-greedy approach for our scheduling strategy. In this method, the scheduler follows the learned policy for the current state with a probability of 1 ϵ or, with a probability of ϵ , deviates from this policy to explore various actions. Specifically, when exploring, it selects a user randomly and allocates the resource block (RB) to that user (e.g., [52]). After each scheduling decision at time slot n, we calculate and store a Q-value in the Q-table. This value, denoted by Q n ( S n , a n ) , corresponds to the current state–action pair ( S n , a n ) . The Q-value is updated based on the previous Q-value, the maximal Q-value of the next state multiplied by the discount factor, and scaled by the learning rate. The specific formula for this update is detailed in Section 3.1.
Algorithm 1 depicts the pseudo-code of the proposed Q-learning-based scheduling algorithm.
Algorithm 1 Q-Learning PF-Scheduler
  1: Initialize Q-table
  2: while Training do
  3:     Update S_n with UEs’ CQI reports
  4:     if S_n not in Q-table then
  5:         Add S_n to Q-table
  6:     end if
  7:     With probability ϵ select a random UE to schedule
  8:     Otherwise select the UE-k that corresponds to: a_k = argmax_{a ∈ A} Q(S_n, a)
  9:     Allocate DL transmissions to UE-k
 10:     Update UE-k throughput with the corresponding rate: T_k = T_k + r_k
 11:     if n < L then
 12:         Proceed to next time slot n = n + 1
 13:         Receive CQI reports
 14:         R = 0
 15:         Transit to next state Ŝ_n
 16:     else if n = L then
 17:         Calculate reward R = Σ_{k=1}^{K} log T_k
 18:         Initialize T_k, ∀ k ∈ K
 19:     end if
 20:     Update Q(S_n, a_k) = Q(S_n, a_k) + α [ R + γ max Q(Ŝ_n, â) − Q(S_n, a_k) ]
 21: end while
The Q-learning scheduler algorithm we implemented consists of the following key components:
  • User Equipment (UE) Selection (lines 7–9): At each time slot, the algorithm decides which UE to schedule using the exploration/exploitation policy. It either chooses a random UE (exploration) or selects the UE with the highest learned Q-value (exploitation).
  • State Update (lines 10–15): After downlink (DL) transmission to the chosen UE, the algorithm updates the next state with the observed throughput and new channel quality indicator (CQI) reports.
  • Reward Calculation (not explicitly shown): The reward is calculated at the end of the scheduling cycle. There are no mid-cycle reward calculations, so the reward is zero during the cycle.
  • Next Cycle Initialization (lines 16–18): These lines depict the state initialization for the subsequent cycle.
  • Q-value Update (line 20): The current Q-value for time slot n is updated using the current reward and the difference between the maximum possible discounted future Q-value and the current Q-value. This adjustment steers the Q-value towards the maximum reward at the cycle’s end.
It is important to note that the reward calculated at the end of the cycle backward-propagates to all states involved in that cycle, influencing their Q-values. This structure allows the algorithm to learn and improve its scheduling decisions over time based on the outcomes of its choices.
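The following self-contained Python sketch mirrors Algorithm 1 for a toy 2-UE, 5-slot setting. The channel model, rates, and hyperparameters are illustrative stand-ins (a fixed ϵ instead of the decayed schedule, for instance), not the paper’s exact simulation code.

```python
import math
import random
from collections import defaultdict

K, L = 2, 5                      # users, slots per cycle
RATES = {"G": 4587, "B": 225}    # bytes per slot in good/bad channel state
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = defaultdict(lambda: [0.0] * K)           # Q-table: state -> one value per action (user)

def step_channel(state, p, q):
    """Gilbert-Elliott transition: p = Prob(G->G), q = Prob(B->B)."""
    if state == "G":
        return "G" if random.random() < p else "B"
    return "B" if random.random() < q else "G"

def run_cycle(p=(1.0, 0.3), q=(0.0, 0.7)):
    channels = ["G"] * K
    T = [0] * K                              # accumulated throughput per user in this cycle
    for n in range(1, L + 1):
        s = tuple(zip(T, channels))          # state S_n
        if random.random() < EPS:            # exploration (Algorithm 1, line 7)
            a = random.randrange(K)
        else:                                # exploitation (line 8)
            a = max(range(K), key=lambda u: Q[s][u])
        T[a] += RATES[channels[a]]           # allocate slot n to user a (lines 9-10)
        channels = [step_channel(c, p[k], q[k]) for k, c in enumerate(channels)]
        if n < L:                            # mid-cycle: zero reward, bootstrap from next state
            s_next = tuple(zip(T, channels))
            target = 0.0 + GAMMA * max(Q[s_next])
        else:                                # end of cycle: terminal PF reward, no bootstrap
            target = sum(math.log(t) for t in T if t > 0)
        Q[s][a] += ALPHA * (target - Q[s][a])   # line 20

for _ in range(100_000):
    run_cycle()
```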
The computational complexity at the BS for each update includes action selection, action execution, and the Q-value update. In an exploration iteration, the BS randomly chooses which user to schedule next, with a time complexity of O(1) (line 7 in Algorithm 1). An exploitation iteration involves a lookup in the Q-table at the index of the current state and choosing the action with the highest value. Since the BS can choose between N possible actions (N users to choose from), it needs to perform N − 1 comparisons to find the greatest Q-value among the N users, resulting in O(N) operations (line 8 in Algorithm 1). The Q-table update for both exploration and exploitation involves updating a single Q-value according to the Q-value equation (line 20 in Algorithm 1). To update the Q-value, the BS first updates the accumulated rate of the scheduled user by adding the rate attained in the last slot to the aggregate rate received by the user so far in the current cycle (line 10 in Algorithm 1), which requires a single arithmetic operation. It then retrieves the previous Q-value Q(s, a) and the Q-values Q(s′, a′) of the new state s′, with time complexities of O(1) and O(N), respectively. Note that the new state is determined based on the channel states collected from the users (line 13 in Algorithm 1). The max operation to find max_{a′} Q(s′, a′) takes O(N) time. The arithmetic operations, which include the temporal-difference calculation δ = R + γ max_{a′} Q(s′, a′) − Q(s, a) and the update Q(s, a) ← Q(s, a) + αδ, are O(1) each. Overall, the time complexity per Q-learning update is O(N) and includes N + 1 lookups, one comparison, and five arithmetic operations. Note that a reward is given only at the end of the cycle (n = L). Accordingly, at the end of each cycle, the BS needs to calculate the reward (line 17 in Algorithm 1), which involves taking the logarithm of N values and summing them, resulting in 2N − 1 arithmetic operations.
The space complexity (memory complexity) of the Q-table is that required to store the Q-values of all the state–action tuples for the entire cycle. The state comprises both the accumulated rate each user has attained so far in the cycle and its current channel state. Since the throughput a user attains when scheduled is based on the transmission rate, which is directly associated with its channel state, the number of different throughput gains a user can accumulate when scheduled equals the number of Transport Block Size indices (ITBS; see Appendix A.1). Accordingly, denoting the number of different ITBSs by $\eta_{ITBS}$, each cycle stage can branch into $N \cdot \eta_{ITBS}$ different states. Since each cycle is composed of L stages (allocations), the table contains $(N \cdot \eta_{ITBS})^L$ elements. Note that if transmissions can fail, the number of elements in each stage (the term inside the parentheses) should be multiplied by 2, corresponding to successful and unsuccessful transmissions. Even though the latter leaves the accumulated rates unchanged, it corresponds to a different stage and hence relates to a different state.
Overall, we observe that the computational complexity remains manageable as the system size increases and is comparable to that of other schedulers, as any selective scheduling algorithm must account for all users. However, the space complexity poses the most significant scalability barrier. Specifically, the exponential growth of the state space with respect to the cycle length indicates that this approach does not scale efficiently to large systems or longer cycles. For example, a small-scale deployment serving 50 users with 26 different ITBS values (as defined by the LTE standard) and a cycle of 10 slots would require a Q-table containing $(50 \cdot 26)^{10} \approx 10^{31}$ elements, which is clearly infeasible. Moreover, beyond the memory limitations, an extremely long learning phase would be required to explore all these states sufficiently.
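A quick back-of-the-envelope check of the quoted figure, under the $(N \cdot \eta_{ITBS})^L$ branching assumed above:

```python
# Back-of-the-envelope Q-table size under the (N * eta_ITBS)^L branching discussed above.
N, eta_itbs, L = 50, 26, 10
states = (N * eta_itbs) ** L
print(f"{float(states):.2e}")   # ~1.38e+31 entries: far beyond any practical memory budget
```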

7. Simulation

In this section, we evaluate the performance of the proposed RL-based algorithm through simulations under different parameters and configurations. We begin by presenting the simulation setup.

7.1. Simulation Setup

We developed a Python-based simulation platform to explore the proposed Q-learning-based algorithm for downlink scheduling. The simulation model consisted of a single base station (BS) serving K User Equipment (UE) devices. All UEs were assumed to have fully backlogged downlink traffic, ensuring the BS always had data to transmit to each UE. The system operated on a time-slotted basis, with one UE scheduled per time slot. We employed a time-varying block-fading channel model, where each UE’s channel was represented by a two-state Markov chain, following the Gilbert–Elliott model [53,54]. Each UE supported two transmission rates, $r_1$ and $r_2$, with $r_1 > r_2$ corresponding to the traditional notation of ‘good’ (G) and ‘bad’ (B) channel states, respectively. Specifically, $r_1$ and $r_2$ allowed for transmission of 4587 bytes and 225 bytes per resource block (RB), corresponding to CQI values of 15 and 1 in the LTE standard, respectively. The channel state transition probabilities, denoted by $Prob(G \to G) = p_k$ and $Prob(B \to B) = q_k$ (i.e., $Prob(G \to B) = 1 - p_k$ and $Prob(B \to G) = 1 - q_k$), were chosen independently for each UE. The channel state remained constant throughout each time slot. The BS maintained channel state estimations for each UE (detailed in Section 3.3) and transmitted at rates corresponding to these estimations. The simulation was structured in cycles, each comprising L time slots. We ran each configuration for a minimum of 1000 cycles. Performance was evaluated according to the average PF achieved across all simulations. We explored various combinations of $p_k$ and $q_k$ values. Table 1 summarizes the key simulation parameters and notations used in this section.
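For reference, a minimal sketch of the per-UE Gilbert–Elliott channel described above might look as follows (function and variable names are ours, not the simulator’s):

```python
import random

RATE = {"G": 4587, "B": 225}   # bytes per RB for CQI 15 / CQI 1

def channel_trace(p_k, q_k, L, start="G"):
    """One cycle of channel states for a UE with Prob(G->G)=p_k and Prob(B->B)=q_k."""
    states, s = [], start
    for _ in range(L):
        states.append(s)
        stay = p_k if s == "G" else q_k
        if random.random() >= stay:            # leave the current state
            s = "B" if s == "G" else "G"
    return states

trace = channel_trace(p_k=0.3, q_k=0.7, L=10)
bytes_if_scheduled = [RATE[s] for s in trace]  # rate the UE would get in each slot
```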
The Q-learning parameters used in the simulation were as follows: the discount rate $\gamma = 0.9$, and epsilon was decayed exponentially according to $\epsilon(t) = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min}) \cdot e^{-kt}$, where t is the current iteration number, $\epsilon_{max} = 1$, $\epsilon_{min} = 0.1$, and $k = 0.9$ is the decay parameter. The learning rate was $\alpha = 0.1$. The number of iterations used in the simulation was approximately 100,000.
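The exploration schedule above can be written compactly as:

```python
import math

def epsilon(t, eps_min=0.1, eps_max=1.0, k=0.9):
    """Exponential epsilon decay: eps_min + (eps_max - eps_min) * exp(-k * t)."""
    return eps_min + (eps_max - eps_min) * math.exp(-k * t)

print(epsilon(0), epsilon(5))   # 1.0 at the start, ~0.11 after a few iterations
```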
We evaluated our algorithm against three benchmark approaches: (i) The traditional round-robin scheme, which serves as a baseline. This method allocates time slots to UEs in a fixed, cyclical order, disregarding channel conditions. It achieves fairness simply by ensuring each UE receives approximately the same number of RBs. (ii) The widely used opportunistic PF scheme, as described in Section 3.2. (iii) The optimal scheduling, which retrospectively searches at the end of each cycle for the scheduling that would have maximized the PF criterion (5) based on the UEs’ attained channel states. It is important to note that the optimal scheduling is not only computationally intensive and thus not scalable, but more importantly, it is not feasible in practice. This is because it determines which UEs should have been scheduled after the cycle has already finished, using information about channel states that would not have been available at the time of actual scheduling. We utilize this optimal scheduling approach to provide an upper bound on PF performance.
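A minimal sketch of benchmark (iii), the retrospective optimal scheduler, is an exhaustive search over all $K^L$ allocations given the realized per-slot rates (illustrative code with our own names, not the simulator itself):

```python
import math
from itertools import product

def oracle_pf(realized_rates):
    """realized_rates[k][n] = bytes UE-k would have received had it been scheduled in slot n.
    Returns the best end-of-cycle PF value over all possible allocations (exhaustive, not scalable)."""
    K, L = len(realized_rates), len(realized_rates[0])
    best = -math.inf
    for alloc in product(range(K), repeat=L):         # K^L candidate schedules
        T = [0] * K
        for n, k in enumerate(alloc):
            T[k] += realized_rates[k][n]
        pf = sum(math.log(t) for t in T if t > 0)
        best = max(best, pf)
    return best

print(oracle_pf([[4587] * 5, [225, 225, 225, 225, 4587]]))
```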
Since the scheduler’s objective is to maximize the PF metric, $U_i = \sum_{k=1}^{K} \log T_k^i$, the results presented depict the average PF attained per user per cycle, $\mathrm{Avg}_i\, U_i$. Additionally, we present the average accumulated throughput per user per cycle, $\mathrm{Avg}_i \big( \sum_{k=1}^{K} T_k^i \big)$, which provides complementary insights into the system’s performance, particularly in understanding the tradeoff between efficiency and fairness. In some of our results, we also present an additional fairness-related performance metric, Jain’s fairness index, which is a widely recognized measure in network resource allocation, defined as:
$\text{Jain's fairness index} = \frac{\left( \sum_{k=1}^{K} T_k \right)^2}{K \cdot \sum_{k=1}^{K} T_k^2}$
where T k represents the throughput of user k, and K is the total number of users.
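For completeness, a direct implementation of this index:

```python
def jain_index(throughputs):
    """Jain's fairness index: 1 for perfectly equal throughputs, 1/K in the most unequal case."""
    K = len(throughputs)
    total = sum(throughputs)
    return total ** 2 / (K * sum(t ** 2 for t in throughputs))

print(jain_index([10, 10, 10]))   # 1.0
print(jain_index([30, 0, 0]))     # ~0.333
```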

7.2. Simulation Results

Due to the enormous state space involved, we focus on a small number of users, starting with only two and later extending to three and four users.

7.2.1. 2-UE Scenario

To gain some insight, we start with a simple 2-UE setup where one user (UE-1) is constantly in a good channel state (i.e., p 1 = 1 , and q 1 = 0 ), while the second user (UE-2) alternates between good and bad channel states ( p 2 = 0.3 , and q 2 = 0.7 ). Both UEs have backlogged traffic at the BS at all times (fully backlogged model).
Figure 5 compares the average sum log throughput, the average Jain’s fairness index, and the average cell throughputs for a scheduling cycle of L = 5 . The average of all three metrics was calculated over all scheduling cycles.
Figure 5 presents a comparative analysis of the suggested RL-based algorithm against the other three scheduling approaches with respect to three metrics: PF, Jain’s fairness index, and cell throughput. PF metric performance: Figure 5a shows that the RL-based scheduler nearly matches the performance of the optimal scheduler, achieving the optimal schedule in over 95% of cycles. It attained a 3% improvement in the PF metric compared to the standard PF scheduler. Jain’s fairness index: Figure 5b shows that, in terms of fairness, the RL-based scheduler outperforms the PF scheduler not only in the proportional fairness measure but also in terms of Jain’s fairness index. Overall cell throughput: Figure 5c shows that the RL-based scheduler yields a 12% increase in aggregated cell throughput compared to the standard PF scheduler. When compared to the round-robin (RR) scheduler, which does not consider UEs’ channel conditions or attainable rates, the RL approach shows a 36% throughput improvement. These results highlight the effectiveness of our RL-based approach in balancing fairness and throughput optimization, outperforming traditional scheduling methods across multiple performance metrics, even in small-scale environments.
In order to gain some insight into the last result, we compare the decisions made by the two schedulers, PF and RL-PF, during a single cycle. Specifically, in Table 2 we present a representative 5-slot cycle in which UE-1 experiences a good channel condition in all time slots, while UE-2 experiences a good channel state only in the fifth slot. The table depicts the channel state (CS) realization of both UEs in each time slot (G/B channel), the respective attainable transport block (TB), and the accumulated number of bytes sent by each UE so far in the cycle ($T^i_{current} = \sum_{k=1}^{current} T_k^i$). The table also presents the relevant parameters for the two schedulers: the PF values attained by assigning the slot to each of the UEs, computed by the PF scheduler prior to each slot, and the Q-values for the two actions, assigning the slot to UE-1 or to UE-2. The highlighted cells indicate the chosen UE for allocation under the two schedulers, specifically, the UE that provides the highest increase in PF for the PF scheduler and the action with the highest Q-value for the RL-PF scheduler.
Each cycle begins with initializing the aggregate throughput of all UEs to zero. Since the PF metric is a concave function (the logarithm is concave), the lower the aggregate number of bytes sent by a UE, the higher the marginal PF gain for an additional δ bytes. Consequently, the PF scheduler, which maximizes the PF metric at each transmission opportunity, is likely to distribute the first few slots among the UEs that have not yet received any allocations. The allocation order is determined by their instantaneous attainable rates. Indeed, it can be seen that the PF scheduler assigns the first slot to UE-1, which has the higher attainable rate, and the second time slot to UE-2, even though it experiences a poor channel and a much lower attainable throughput, since this provides a better instantaneous PF value (6.01 vs. 3.96).
Note that even if UE-2 had experienced good channel conditions in the first slot, since its attainable rate under good channel conditions is the same as that of UE-1, the slot would have been assigned arbitrarily to one of the UEs, regardless of the likelihood of the UEs experiencing good channel conditions. Additionally, if the rate attainable by UE-2 is only slightly lower than that of UE-1, the slot would be allocated to UE-1, even if the probability of UE-2 achieving such high rates is rare and much lower than the probability of UE-1 achieving this high rate.
In contrast to the greedy PF scheduler, the RL-PF scheduler has learned that it can postpone UE-2’s allocations while waiting for it to attain better channel conditions. Consequently, the first two slots are allocated to UE-1, which has good channel conditions.
Examining the retrospective optimal strategy (oracle scheduler) reveals that if UE-2 had only experienced bad channel conditions throughout the cycle, it should have been allocated two time slots. If UE-2 had obtained a single good channel condition time slot, it should have been allocated that slot and only that slot. If UE-2 had obtained two good channel condition time slots, it should have been allocated those two slots and only those two. If UE-2 had obtained three or more good channel condition time slots, one of the UEs should have received three time slots while the other received only two. Since the good channel condition rates are the same for both UEs, which UE received the three time slot allocation would be arbitrary.
Even though the RL-PF scheduler cannot know a priori the future channel conditions of the UEs, it has learned that UE-2 should be allocated at least one and at most three time slots. Accordingly, since in the presented cycle UE-2 was not allocated any of the first three slots (attained only bad channel conditions), and in the fourth time slot it also attained bad channel conditions, the BS needs to determine whether to allocate this slot to UE-1 or UE-2. Note that, in order to converge to the retrospective optimal strategy, the allocation should be based on UE-2’s next slot channel condition. Specifically, if the channel state in the last slot eventually turns out to be bad, both the fourth and last slots should be allocated to UE-2. However, if the channel state in the last slot eventually turns out to be good, the fourth slot should be allocated to UE-1, and only the last slot should be allocated to UE-2. Since the RL-PF scheduler cannot know a priori what channel conditions UE-2 will experience in the last time slot, it relies on the learned probability that UE-2 is more likely to experience bad channel conditions than good ones, and therefore allocates the slot to UE-2. Consequently, the PF scheduler assigned two time slots to UE-1 and three time slots (one good and two bad) to UE-2, while the RL-PF scheduler assigned three time slots to UE-1 and only two time slots (one good and one bad) to UE-2, resulting in a better PF value at the end of the cycle. Specifically, the PF values attained at the end of the cycles were 25.46 for the PF scheduler and 25.98 for the RL-PF scheduler (see Remark 1 below).
Note that the schedule attained by the RL-PF scheduler for this channel realization is not the optimal allocation. As previously mentioned, the optimal allocation would assign to UE-2 only the last time slot, in which it experienced good channel conditions.
Clearly, the policy attained by the RL-PF scheduler depends on the distribution of rates that each UE can attain under each channel condition and on the ratio between these rates. Different rates will result in a different learned policy. Typically, the learned policy cannot guarantee an optimal allocation for all channel realizations, as no such policy usually exists when the future is not known a priori. Indeed, despite the nonoptimal allocation example discussed above, which stresses that no policy can guarantee the optimal allocation, the optimal schedule was nevertheless reached in over 95% of the scheduling cycles (see Figure 5).
Remark 1.
Note that the Q-values at the end of each cycle, i.e., $Q(s_L, a)$ for all $s_L \in \mathcal{S}_L$ and $a \in \mathcal{A}$, converge to the actual PF values of the associated states.
Specifically, the Q-value is updated every time the system visits a state and chooses an action. Let us count the number of times each state–action tuple is visited, and denote by $Q_m(s_n, a)$ the Q-value of state $s_n$ and action $a$ after the state has been visited and the same action chosen for the m-th time. Let us concentrate on the state–action tuples related to the last (L-th) slot of the cycle, $Q_m(s_L, a)$.
Since the accumulated throughput of each user is reset at the end of each cycle, the Q-value of the next step is zero, i.e., $Q(\hat{s}_{L+1}, \hat{a}) = 0$ for all $\hat{s}_{L+1} \in \mathcal{S}$ and $\hat{a} \in \mathcal{A}$. Consequently, the update at the end of the cycle (Algorithm 1, line 20) is $Q_m(s_L, a) = (1-\alpha) Q_{m-1}(s_L, a) + \alpha R(s_L, a) = (1-\alpha)^2 Q_{m-2}(s_L, a) + \alpha(1-\alpha) R(s_L, a) + \alpha R(s_L, a) = \cdots = (1-\alpha)^m Q_0(s_L, a) + \alpha \sum_{i=0}^{m-1} (1-\alpha)^i R(s_L, a) = (1-\alpha)^m Q_0(s_L, a) + \big(1 - (1-\alpha)^m\big) R(s_L, a)$. When the number of times each state is visited goes to infinity, $(1-\alpha)^m \to 0$. Accordingly, $(1-\alpha)^m Q_0(s_L, a) \to 0$ regardless of the initial Q-value $Q_0(s_L, a)$, and $Q_m(s_L, a) \to R(s_L, a)$, which is the sum log throughput associated with $(s_L, a)$.
Accordingly, the last two Q-values depicted in Table 2 for the RL-PF scheduler (21.98 and 25.98) are the actual values attained by the PF metrics for allocating four time slots to UE-1 and one bad channel slot to UE-2 (21.98) and allocating three time slots to UE-1 and two slots, one bad and one good channel slot to UE-2 (25.98).
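A quick numeric check of this convergence argument, using the PF value 25.98 from Table 2 as the terminal reward and an arbitrary initial Q-value:

```python
# With zero bootstrap at the terminal step, repeated updates Q <- (1 - alpha) * Q + alpha * R
# converge to R regardless of the initial Q-value.
alpha, R = 0.1, 25.98     # R taken from the example in Table 2
q = 100.0                 # arbitrary initialization
for m in range(200):
    q = (1 - alpha) * q + alpha * R
print(round(q, 2))        # -> 25.98
```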

7.2.2. 3-UE Scenario

We repeat the comparison described above for three UEs under two different setups. Specifically, we keep the cycle length at 5 slots ( L = 5 ) and examine the following two scenarios: (i) UE-1 is kept at a constant good state, i.e., $p_1 = 1$, $q_1 = 0$, UE-2 is kept at a constant bad state, i.e., $p_2 = 0$, $q_2 = 1$, and the last UE alternates between good and bad states with probabilities $p_3 = 0.3$ and $q_3 = 0.7$. (ii) UE-1 is kept at a constant good state, $p_1 = 1$, $q_1 = 0$, while the other two UEs alternate between good and bad states, i.e., $p_2 = p_3 = 0.3$ and $q_2 = q_3 = 0.7$.
Figure 6a,b depict the average sum log throughput and the average cell throughput, respectively, for both setups. Since the differences in Jain’s fairness index were negligible (a negligible gain was observed in favor of the RL-based scheduler), it was omitted from the figures. Figure 6a, which depicts the sum log throughput of the four algorithms, shows the benefit of the RL-based scheduler over the standard PF scheduler, with a 4% increase in setup (i) and a 5% increase in setup (ii). Figure 6b shows the average cell throughput of the scheduling algorithms for each case; an 8% increase was observed with the RL scheduler compared with PF for setup (i), while a much higher increase of 21% was observed for setup (ii).
As previously mentioned, providing insight into the learned scheduler policy can be quite challenging. However, the first three-user setup, in which two UEs have fixed channel states, was chosen to allow for a better understanding of the attained policy. Specifically, it is clear that the scheduler will distribute the slots among the UEs, aiming to allocate a slot to a UE when its channel is good. Since, in a cycle of length L = 5 , the slots cannot be equally distributed among the UEs, and since each UE must be allocated at least one time slot, the scheduler (policy) must determine which slots to allocate to which users (how much to postpone UEs with bad channel conditions) and to whom to allocate the “extra” two time slots (preferably to the one experiencing good channel conditions).
The attainable rates are such that the PF metric value for allocating two time slots to a UE with good channel conditions, and two and one slots to UEs with bad channel conditions, is higher than allocating three time slots to a UE with good channel conditions and only one slot to each of the other UEs (with bad channel conditions). Accordingly, the RL-PF scheduler has learned the following:
(i)
Allocate two and only two time slots to UE-1, which always experiences good channel conditions, regardless of the channel states of the other UEs. These two time slots can be allocated at any point in the cycle. Thus, whenever one of the other UEs experiences good channel conditions, the RL-PF scheduler delays the allocation to UE-1, even though it also experiences good channel conditions.
(ii)
Allocate only one time slot to UE-2, which always experiences bad channel conditions. This slot can be allocated at any point in the cycle.
(iii)
Allocate the remaining two time slots to UE-3, preferably when it is in a good channel state.
Thus, similar to the two-user setup, the allocations to UE-3 will be delayed until it is in a good channel state; whenever it experiences good channel conditions, it receives the slot. If it does not receive its two slot allocations earlier, these allocations will be delayed toward the end of the cycle (one or two time slots, depending on how many it has received so far), regardless of whether the channel state is good or bad.
Setup (ii) presents a bit more of a challenge. Similar to the previous setup, UE-1, which always has a good channel state, will receive two slots, while the other two UEs will share the remaining three slots. Since both UE-2 and UE-3 have the same channel characteristics (the same probability of being in each state and the same rate attained in each state), there is no advantage in favoring one over the other. Consequently, if both UEs experience the same channel conditions (good or bad), the RL-PF scheduler’s decision of which UE to allocate a slot to is arbitrary.
The scheduler will allocate a slot to UE-2 or UE-3 whenever one of them is in a good channel state. If no slots were assigned to these two UEs in the first two time slots (both having experienced bad channel conditions), the first two slots are allocated to UE-1. The remaining three slots will then be assigned as follows:
(i)
Better Channel Conditions: If one of the UEs has better channel conditions, it will be assigned the slot.
(ii)
Equal Channel Conditions: If they experience the same channel state, the allocation will be arbitrary, ensuring that each one obtains at least one slot.
(iii)
Final Slot Allocation: If, prior to the last slot, one of the UEs has received no allocations, it will be allocated the last slot regardless of its channel state.
Obviously, for this setup, there is no policy that can ensure an optimal allocation: when UE-2 and UE-3 are tied, experiencing the same channel conditions, the scheduler cannot know which one to favor, since it does not know a priori what channel conditions they will experience in the following slots. To illustrate this assertion, we examine the Q-values obtained for a cycle in which UE-2 experiences bad channel conditions in the first three time slots and good channel conditions in the last two time slots, while UE-3 experiences bad channel conditions in all five slots.
Table 3 shows the Q-table for the last three time slots after the first two slots were assigned to UE-1, with both UE-2 and UE-3 having experienced bad channel conditions.
As can be seen, the RL-PF scheduler makes a “wrong” decision compared to an Oracle scheduler by assigning the third slot to UE-2 instead of UE-3. However, as previously explained, the expected channel states of the UEs cannot be known a priori, and the scheduler must allocate slots based on its accumulated experience. In this specific scenario, since UE-2 experiences good channel conditions in the last two slots, it would have been more beneficial to allocate the third slot to UE-3 and reserve the last two slots for UE-2. This strategy would effectively exchange one bad channel condition slot for one good channel condition slot for UE-2. However, note that in cycles where the channel conditions experienced by all three UEs are the same as in the cycle depicted in Table 3, but with UE-2 and UE-3 switching positions (i.e., UE-2 experiences bad channel conditions in the last two time slots while UE-3 experiences good channel conditions), the scheduler’s decision to allocate the third slot to UE-2 would be the “right” choice.
Interestingly, it can be observed that in the depicted cycle, even though neither UE received allocations in the first two time slots and both experienced bad channel conditions in the third time slot, the Q-values for assigning the third time slot to UE-2 or UE-3 are not the same, with a preference for allocating it to UE-2. This discrepancy in Q-values can be attributed to the finite learning process of the RL-based scheduler. During learning, the scheduler may have gathered slightly different statistics for the channel conditions of the two UEs, leading it to favor UE-2 for the third slot. This suggests that, based on the experience accumulated so far, the scheduler perceives a higher likelihood that UE-3 will encounter better channel conditions in the subsequent slots, which influences the Q-values and, consequently, the allocation decision.

7.2.3. 4-UE Scenario

We further scale the number of users; in this section, we examine a setup similar to the previous sections, in which users experience a good–bad channel model with L = 5, for four UEs. We examine three different setups: (i) UE-1 and UE-2 always experience a good channel state, i.e., $p_1 = p_2 = 1$, UE-3 always experiences a bad channel state, $q_3 = 1$, and the last UE alternates between good and bad states with probabilities $p_4 = 0.3$ and $q_4 = 0.7$. (ii) UE-1 always experiences a good state, $p_1 = 1$, $q_1 = 0$, UE-2 and UE-3 alternate between good and bad channel states, i.e., $p_2 = p_3 = 0.3$ and $q_2 = q_3 = 0.7$, while UE-4 always experiences a bad channel state, i.e., $q_4 = 1$. (iii) UE-1 always experiences a good channel state, i.e., $p_1 = 1$, while UE-2, UE-3, and UE-4 alternate between good and bad channel states with $p_2 = p_3 = p_4 = 0.3$ and $q_2 = q_3 = q_4 = 0.7$. As in previous scenarios, we first present the average sum log throughput and the average cell throughput in Figure 7.
Similar to previous scenarios, in both performance metrics, RL-PF outperforms the common PF scheduler. Specifically, Figure 7a, which depicts the average sum log throughput, shows gains of 3 % , 5 % , and 6 % by using RL scheduling for the three setups, respectively. Figure 7b, which shows the results of the average cell throughput, indicates a 3 % , 8 % , and 16 % gain over the standard PF for the three setups, respectively. As in the 2-UE scenario, the RL-PF scheduler attains almost the same PF metric results as the optimal one.

7.2.4. Unreliable Channel

In this section, we extend our analysis to a more realistic scenario in which downlink (DL) transmissions may fail. We introduce a probability of successful transmission, denoted $P_k^{succ} < 1$, for each UE-k. To further reflect actual operational cellular networks, we account for delayed feedback (ACK/NACK) due to processing and uplink allocation delays. Specifically, since in operational cellular networks the receiver’s feedback is not received after each and every transmission, we assume in our simulations that the feedback is delayed and received only after the cycle has ended. Note that under this setup, the scheduler cannot recover a failed transmission and compensate the UE by allocating to it a subsequent slot, as it is not aware of such failed transmissions until the cycle has already ended. Furthermore, under this framework, a UE can achieve zero throughput at the end of a cycle if all its received allocations result in failed transmissions. To keep the sum log throughput results in scale, we assign a value of 0 (rather than negative infinity) to the $\log T_k$ term when a UE’s throughput $T_k$ is zero. This more realistic model allows us to assess the robustness of our scheduling algorithms under imperfect transmission conditions, providing insights into their performance in practical network environments.
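The zero-throughput convention described above amounts to the following guard (a sketch with our own helper name):

```python
import math

def safe_log(t):
    """Contribute 0 (rather than -inf) to the sum log when a UE ends the cycle with zero throughput."""
    return math.log(t) if t > 0 else 0.0

pf = sum(safe_log(t) for t in [4587 * 3, 0])   # e.g., UE-2 lost all its allocations in this cycle
```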
We revisit the simple 2-UE setup from Section 7.2.1, where UE-1 is near the BS, consistently experiencing good channel conditions and no losses. UE-2 is closer to the cell edge, with its channel state alternating between good and bad (probabilities: p 2 = 0.3 , q 2 = 0.7 ). We examine two transmission failure scenarios for UE-2: (i) P 2 s u c c = 0.3 (most DL allocations for UE-2 fail) and (ii) P 2 s u c c = 0.9 (most UE-2 transmissions succeed). For the optimal scheduler to compute attainable rates, it needs to know in advance the result of all possible scenarios (not only the channel state of each UE at each time slot but also whether a transmission to this UE at this time slot will succeed). Accordingly, we simulate a transmission for both UEs in every time slot, determining success based on each UE’s probability, and allow the optimal scheduler to examine all possible allocations and choose the best among them. It is important to note that since the optimal scheduler allocates slots a posteriori with knowledge of transmission success, it will not assign a slot to a UE whose transmission will fail if the other UE’s transmission succeeds.
Figure 8 illustrates the average sum log throughput and the average cell throughput for both setups. Figure 8a shows that the RL-based algorithm outperforms the PF algorithm in terms of average sum log throughput (PF metric), with a 25 % gain in setup (i) (16.5 for RL-PF vs. 12.3 for PF and 20.1 for optimal scheduling) and a 6 % gain in setup (ii) (24.7 for RL-PF vs. 23.4 for PF and 25.6 for optimal scheduling). Figure 8b reveals that this PF metric improvement comes at the cost of average cell throughput. PF scheduling achieves higher cell throughput than the RL-based scheduler in both setups: 14.9, 8.7, and 20 Mb/s for PF, RL-PF, and optimal, respectively, in setup (i), and 17.4, 17.3, and 20.8 Mb/s for PF, RL-PF, and optimal, respectively, in setup (ii). This performance difference can be attributed to the RL-based scheduler’s learning process. It recognizes that UE-2’s transmissions are not always successful and consequently allocates more resources (time slots) to it. Some of these extra transmissions occur at a low rate, and some fail (especially in setup (i)). In contrast, the PF-based scheduler, which does not take failed transmissions into account, favors UE-1, whose transmissions are always successful and at a high rate. As a result, even though the proportional fairness criterion aims to balance fairness and channel utilization, the RL-based scheduler, which is aware of the loss probability of disadvantaged UEs, allocates a substantial number of resources to these less-advantaged UEs to compensate for potentially unsuccessful transmissions. Notably, both scheduling algorithms fall short of the optimal scheduler’s performance. The optimal scheduler, which can evaluate all 2 5 possible schedules and their outcomes a priori, takes into account both the successful transmissions of each UE and the rates of these transmissions.

8. Implementation

To demonstrate the feasibility of the proposed scheme, we implemented the Q-learning RL-PF algorithm within an operational LTE base station serving three connected users.
The implementation testbed features a single-cell 4G network operating on a 10 MHz bandwidth. It employs a split architecture where the RF component, known as the remote radio head (RRH), is separated from the upper layers (such as PHY and MAC). These upper layers are implemented on a commercial off-the-shelf (COTS) server, referred to as the baseband unit (BBU). In this setup, the RRH is connected to two RF splitters, enabling the 2 × 2 RRH to interface with three RF cages housing the users. To precisely control each user’s path loss to the LTE cell, variable attenuators (V/A) are integrated into each RF connection. The MAC layer hosts the scheduler entity, which is responsible for making scheduling decisions. It is configured to allocate resources to one user per time slot. Figure 9 provides a visual representation of this testbed architecture.
Our system evaluation used a typical three-user configuration, representing diverse user channel state characteristics. Specifically, UE-1 had near-ideal channel conditions ( 98 dB path loss), UE-2 had moderate channel conditions ( 118 dB path loss), and UE-3 had challenging channel conditions ( 138 dB path loss). To ensure continuous data demand, we generated UDP downlink traffic for each UE, maintaining a fully backlogged state across all users.
In addition to the proposed RL-PF scheduler, we also implemented the PF scheduler as formulated in Section 3.2. Furthermore, similar to our simulation approach, we tracked the channel states attained by each UE throughout the cycle. At the cycle’s end, we computed the optimal schedule that could have been achieved if all channel states (ITBSs) were known in advance. These implementations enable us to compare the performance of the RL-based scheduler against both the implemented PF scheduler and the theoretical optimal scheduling scheme under realistic network conditions.
The implementation of the RL-PF-based scheduler on a 4G base station required several code adaptations in the MAC layer, where the scheduler and the downlink link adaptation are implemented. Specifically, besides the modules associated with the RL, which were coded from scratch, several modules, such as queuing, DLLA, UE scheduling, and DL TB assignment, needed to be modified in order to support the RL-PF-based scheduler. A further detailed description of the modifications is given in Appendix A.2.
The RL-PF-based scheduler described in Section 6 required a minor adaptation for practical implementation. The original algorithm assumes that the channel quality between the BS and each UE is available prior to every time slot. However, in operational systems, Channel State Information (CSI) reports are typically received from UEs every 80 ms, not every time slot. To maintain accurate channel quality estimation, which is crucial for rate adaptation, operational BSs utilize additional feedback mechanisms. Specifically, they use ACKs/NACKs from each DL packet transmission to update channel quality estimates and adjust transmission rates accordingly. This process, known as downlink link adaptation (DLLA), is detailed in Appendix A.1. To address the limited granularity of CQI reporting, we modified the RL-PF state representation. Instead of relying directly on CQI, we use the ITBS, which is the output of the link adaptation mechanism. This ITBS incorporates both CQI reports and ACK/NACK feedback, providing a more frequently updated measure of channel conditions. Consequently, the implemented state space is redefined as follows:
$S_n = \left\{ \left(T_1(n), ITBS_1(n)\right), \ldots, \left(T_i(n), ITBS_i(n)\right), \ldots, \left(T_K(n), ITBS_K(n)\right) \right\}, \quad n = 1, 2, \ldots, L$
where S n denotes the state at the beginning of time slot n, and is composed of the accumulated number of bytes attained so far in the cycle and the ITBS of each UE-k connected to the base station.
The learning process is conducted offline following a measurement session where the UEs’ ITBS and throughputs for each time slot were recorded. ACK/NACK feedback, which is received by the scheduler 4 ms after the downlink transmission as specified by the 4G standard, is taken into account. After collecting all feedback from each UE, the actual throughputs at the end of a scheduling cycle are compiled. These throughputs are then used to calculate the PF criteria, which are utilized in the learning process. The Q-learning parameters used were the same as those used in the simulation results.

Implementation Results

The testbed results for the three UEs, using a cycle length of 5 slots (L = 5), comparing the RL-PF scheduler with the PF scheduler and the optimal schedule, are presented in Figure 10. Specifically, Figure 10a shows the average sum log throughputs, while Figure 10b displays the average cell throughput.
Figure 10a shows that the average sum log throughputs for the PF scheduler, RL-PF scheduler, and optimal schedule are 25.7, 30.4, and 36.3, respectively, demonstrating an 18 % improvement of the RL-PF scheduler over the PF scheduler. Figure 10b shows average cell throughputs of 19.4, 12.6, and 31.1 Mb/s for the PF scheduler, RL-PF scheduler, and optimal schedule, respectively, indicating a 35 % cell throughput loss for the RL-PF scheduler compared to the PF scheduler. This is consistent with the simulation observations in Section 7.2.4.
This tradeoff between fairness and performance is well recognized. In our setup, as explained in Section 7.2.4, it is attributed to the RL-PF scheduler’s strategy of allocating extra RBs to UEs with higher packet drop ratio (PDR) in order to mitigate potential transmission failures. This strategy has a dual effect: not only does the RL-PF-based scheduler assign more slots to UEs with lower success probabilities, increasing the average number of unutilized slots, but these UEs are typically also the ones with poor channel quality (lower data rates). Consequently, the PF scheduler achieves a higher number of utilized slots for UEs with high data rates (greater overall cell throughput) at the cost of reduced fairness (lower sum log throughputs). Note that even though the downlink link adaptation (DLLA) mechanism, explained in Appendix A.1, is responsible for converging the users to a target block error rate (which is, as per design, the same for all users), some users still suffer from higher loss probability.
Specifically, in our setup, UE-1 has consistently superior channel conditions. Hence, the RL-PF scheduler allocates to it a single time slot. UE-2 and UE-3, with different channel conditions but similar probabilities of successful transmission due to the DLLA mechanism, share the remaining time slots (each is allocated two time slots in our setup). Furthermore, since the RL-based scheduler cannot know a priori which transmissions will fail and must rely on anticipated loss probabilities, in some cycles it can underperform in both metrics. For example, when all five slots are successfully transmitted, the PF scheduling policy outperforms the RL-based policy in both the sum log throughputs and the aggregate cell throughput metrics. On the other hand, since the PF scheduler allocates only a single transmission (time slot) to one of the UEs with inferior channel conditions, which occasionally fails, in many cycles one of these UEs will attain zero throughput. Overall, as can be seen in Figure 10a, on average the RL-PF-based scheduler attains much higher proportional fairness scores.

9. Conclusions

This research aimed to explore the potential of reinforcement learning in scheduling procedures, investigating possible gains from learning system parameters such as channel quality, packet loss, and mobility. We focused on a small-scale deployment using Q-learning, which is suitable for such scenarios. Specifically, in this paper we introduced an innovative proportional fair (PF) scheduling algorithm that leverages the estimation and prediction of users’ future channel rates. The proposed scheme relies on Q-learning, a reinforcement learning technique, to optimize user selection allocation for each time slot.
To evaluate the performance of the suggested scheme, we developed a simulation framework and tested various scenarios, including different numbers of users under varying channel conditions. The simulation results demonstrate that the suggested scheme outperforms other scheduling algorithms, such as round-robin and traditional PF, in terms of the proportional fairness metric. We also examined the tradeoff between throughput and fairness under the suggested scheme.
To provide proof of concept (PoC), we implemented the proposed scheduler in a functional 4G base station, demonstrating the achievable gains with a real base station and users. The testbed results coincide with the simulation results.
Results indicate that if short-term fairness is the objective, an RL-based scheduling framework can offer significant benefits. However, if long-term fairness is considered, the traditional opportunistic PF scheduler can converge to the performance of the RL-based scheduler and is preferable due to its lower complexity.
It is crucial to note that our study focuses on a small-scale deployment, which aligns well with the capabilities of Q-learning. The 4G technology used as our infrastructure already necessitated simplifications in parameters such as CQI granularity, state space representation, number of users, and cycle length. These challenges would be significantly amplified in more complex environments like 5G or 6G networks.
Despite these constraints, we believe our work provides valuable insights and lays a foundation for future research. Subsequent studies could explore more advanced machine learning techniques, such as deep learning or actor–critic methods, to handle the complexities of larger-scale scenarios with more users and longer cycles. Another approach could involve narrowing the state space through state aggregation, which groups several states into meta-states. While this improves scalability, it may come at the expense of optimality (e.g., [55]).
Overall, this research demonstrates the potential of reinforcement learning-based scheduling for downstream traffic in cellular networks, while acknowledging the need for further advancements to meet the demands of next-generation technologies. We believe this approach can be extended to various other scheduling scenarios, including different setups and objectives, such as scheduling upstream traffic, wireless sensor networks with power constraints, traffic with due date constraints, and downstream scheduling with noise rise constraints (e.g., [56]).

Author Contributions

Conceptualization, O.G., N.G., E.B. and A.C.; validation, N.G.; investigation, O.G., N.G., E.B. and A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Downlink Link Adaptation (DLLA)

Downlink link adaptation (DLLA) refers to the procedure responsible for adapting the transmitted data rate to the channel quality to ensure reliable information transmission (e.g., [48]). The DLLA process is an important feature of modern wireless communication systems and has been integrated as a core feature in cellular standards like LTE. In this section, we briefly describe the concepts and processes of typical LTE DLLA, which we utilize in our evaluation (Simulations, Section 7, and Experimental Results, Section 8). Even though we have adopted the LTE DLLA procedure and follow the terminology and accepted acronyms of the LTE standard, the concepts presented herein apply to various other standards such as 5G.
The BS needs to determine the transport block (TB), which is the physical layer payload determined by the MAC layer (i.e., the packet passed between the MAC and physical layers). Two basic factors determine the TB size for a backlogged UE (assuming the amount of backlogged data does not limit the TB): the modulation and coding scheme (MCS) and the number of resources assigned (scheduled) to the UE. The link adaptation (LA) function in the MAC layer of the BS is responsible for indicating the appropriate MCS for upcoming transmissions, ensuring that the block error rate (BLER) for each UE remains below a target level. The scheduler is responsible for the distribution of resources among the UEs.
To determine the appropriate modulation and coding scheme (MCS), the BS needs to estimate the channel quality between itself and the UE. These estimates are based on channel quality indicators (CQIs) reported by the UE to the BS. The UE sends these CQI reports periodically, typically once per Transmission Time Interval (TTI), and they contain information about the quality of the downlink communication channel.
The UE computes the CQIs based on its evaluation of reference signals sent periodically by the BS. This channel estimation process varies depending on the UE’s chipset manufacturer. A key component in this assessment is the Reference Signal Received Power (RSRP), which the UE measures and uses to compute a Link Quality Metric (LQM). This LQM provides a quantitative measure of the downlink channel quality and forms the basis for determining the CQI. In LTE systems, the predominant LQM used is the Exponential Effective Signal-to-Noise Ratio Mapping (EESM), [57]. Once the link quality is assessed, the system employs a process called inner loop link adaptation (ILLA) to choose the most appropriate modulation and coding scheme (MCS) based on these assessments (e.g., [58]).
The scheduler in LTE systems is responsible for allocating radio resources in both the time and frequency domains. Time in LTE is divided into 1 ms intervals, each corresponding to 14 OFDM (Orthogonal Frequency-Division Multiplexing) symbols. In the frequency domain, the total bandwidth is divided into subchannels of 180 kHz, with each subchannel consisting of twelve consecutive and equally spaced OFDM sub-carriers. The smallest allocatable unit of radio resources is called a physical resource block (PRB), or simply a resource block (RB). Each PRB spans one 1 ms time slot (14 OFDM symbols) in the time domain and twelve consecutive sub-carriers (180 kHz) in the frequency domain.
The number of available RBs varies depending on the system’s bandwidth configuration, as the subchannel size remains fixed at 180 kHz. The scheduler’s role is to distribute these RBs among the UEs in each time slot. This allocation process considers various factors, including Quality of Service (QoS) requirements, system load, traffic demands, fairness criteria, channel quality, and other relevant parameters. By considering these factors, the scheduler aims to optimize the use of available radio resources and meet the diverse needs of all connected UEs.
The BS uses a pre-calculated table that maps the selected MCS and the number of PRBs allocated to the UE to a transport block size index (ITBS). This integer is then used to determine the transport block (TB) size to be transmitted to the UE. In our evaluation, we utilized 26 ITBS values (ranging from 1 to 26) as defined by the LTE standard.
Relying solely on CQI reports to determine the appropriate MCS can result in errors and unsuccessful transmissions due to various factors such as inaccurate CQI measurements, delayed or outdated reports, and deviations from assumed channel conditions caused by factors like multipath environments and UE movement [59]. To address these issues, a complementary process known as outer loop link adaptation (OLLA) is required. OLLA’s correction mechanism is based on hybrid automatic repeat request (HARQ) feedback and works as follows: the mapped ITBS, which relies on the UE’s CQI report and is denoted $ITBS(CQI)$, is adjusted by a factor called $ITBS_{margin}$ based on each received positive/negative acknowledgment (ACK/NACK) from the UE. Specifically, when an ACK is received, $ITBS_{margin}$ is decreased by $\Delta_{down}$, and when a NACK is received, the margin is increased by $\Delta_{up}$. The ratio $\Delta_{down} / \Delta_{up}$ is controlled by the target block error rate ($BLER_T$) that OLLA aims to achieve, given by:
$\frac{\Delta_{down}}{\Delta_{up}} = \frac{BLER_T}{100 - BLER_T}$
When the target B L E R T is set to 10 % , it implies that the system is designed to tolerate up to 10 % of downlink transmissions being unsuccessful. In other words, the expectation is that at least 90 % of downlink transmissions to the UE should be successful. This target BLER serves as a balance between aggressive data transmission rates and transmission reliability. It allows for some transmission errors, which can be corrected through retransmission mechanisms while maintaining an acceptable level of overall performance and efficiency in the downlink communication.
The OLLA updates are formulated as follows:
$ITBS_{margin} \leftarrow \begin{cases} ITBS_{margin} - \Delta_{down} & \text{if ACK} \\ ITBS_{margin} + \Delta_{up} & \text{if NACK} \end{cases}$
$ITBS = ITBS(CQI) - ITBS_{margin}$
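A minimal sketch of the OLLA margin update, assuming $\Delta_{up} = 1$ and a 10% target BLER (values chosen for illustration only):

```python
def olla_step(itbs_margin, ack, bler_target=10.0, delta_up=1.0):
    """One HARQ-feedback update of the ITBS margin; Delta_down/Delta_up = BLER_T / (100 - BLER_T)."""
    delta_down = delta_up * bler_target / (100.0 - bler_target)
    return itbs_margin - delta_down if ack else itbs_margin + delta_up

def effective_itbs(itbs_from_cqi, itbs_margin):
    return itbs_from_cqi - itbs_margin

margin = 0.0
for ack in [True] * 9 + [False]:      # a feedback stream running at roughly 10% BLER
    margin = olla_step(margin, ack)
print(round(margin, 3))               # close to 0: the margin is stable at the target BLER
```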
A schematic description of the DLLA process is illustrated in Figure A1.
Figure A1. DLLA block diagram.

Appendix A.2. Implementation Code Design and Model Modifications

The implementation of the RL-PF scheduler involved modifications to the 4G base station code, specifically in the MAC layer where the scheduler and downlink link adaptation are implemented. The main modules that were altered for the implementation of RL-PF are illustrated in Figure A2.
Figure A2. The different modules and algorithm flow that were implemented.
The following is the module description and algorithm flow of the implemented RL-PF:
  • Queueing
    After an indication of the DL UDP data for each connected UE, a queueing module is initiated, inserting the UEs with the DL traffic into an RL-PF scheduling queue for the purpose of scheduling one of them later on in the time slot. The queuing module also notifies the RL state presentation module of the waiting UEs to be scheduled.
  • RL State Presentation
    The RL state presentation module is initiated upon a UE’s insertion into the scheduling queue and is responsible for maintaining each UE’s RL-PF state, as formulated in Section 8. The module is notified of an ITBS change by the DLLA and of a throughput change by the DL TB assignment (a downlink transmission). If a UE was scheduled for transmission, the corresponding UE throughput is updated by the module, regardless of whether the transmission was successful. When the queueing module signals the RL state presentation module, scheduling needs to take place; the Q-table is therefore accessed and the UE chosen for scheduling is retrieved.
  • Q-Table
    The Q-table holds the Q-values that indicate which UE should be scheduled in the time slot for each corresponding RL-PF state.
  • DLLA
    The DLLA module updates the UEs’ accurate channel conditions based on the UEs’ CQI reports and HARQ feedback to achieve a target BLER—in our case, 10%. The module determines the ITBS used for transmission and notifies the RL state module of any ITBS updates. Typically, ITBS values range from 1 to 26, as defined by the LTE standard. However, as previously noted, Q-learning is most effective for small-scale learning. Therefore, we limited the ITBS range to three distinct values for each UE, depending on its path loss. For example, if UE-1 has a strong link and experiences a path loss of 98 dB, it is allocated downlink resources with $ITBS_1(n) \in \{24, 25, 26\}$ in each time slot n.
  • UE Scheduling
    The UE scheduling module is responsible for the processing needed to check whether the UE can be scheduled by the cell in the current time slot. As mentioned in the testbed description, the scheduler is configured to schedule one user in each time slot. Therefore, since certain services in the system are prioritized for scheduling, e.g., re-transmissions and signaling, a check must be made to ensure that the chosen UE can be scheduled for a new downlink transmission in the corresponding time slot. Once verified, the DL transport block (TB) assignment module is called.
  • DL TB Assignment
    The DL TB assignment module calculates, based on the ITBS, the resources and transport block size (TBS) required for transmitting the DL service. The calculated TBS is used to update the UE’s throughput metric in the RL state presentation module. A simplified sketch of this scheduling flow is given below.
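The following is a minimal, hypothetical Python sketch of how these modules interact within a single time slot. The state encoding, the ITBS-to-TBS mapping, and the Q-table values are illustrative placeholders only; the actual implementation follows the state formulation of Section 8 and runs inside the eNodeB MAC code.
```python
from collections import defaultdict

# Illustrative ITBS -> TBS (bytes) mapping; placeholder values, not the standard LTE tables.
TBS_BYTES = {24: 4200, 25: 4400, 26: 4587}


class RLPFScheduler:
    def __init__(self):
        self.q_table = defaultdict(lambda: defaultdict(float))  # state -> {UE: Q-value}
        self.throughput = defaultdict(int)                      # cumulative bytes per UE
        self.itbs = {}                                          # latest ITBS per UE (set by DLLA)
        self.queue = []                                         # UEs with pending DL traffic

    def enqueue(self, ue):
        """Queueing module: insert a UE with pending DL traffic."""
        if ue not in self.queue:
            self.queue.append(ue)

    def on_itbs_update(self, ue, itbs):
        """DLLA notification: record the UE's updated ITBS."""
        self.itbs[ue] = itbs

    def state(self, slot):
        """RL state presentation (placeholder encoding of slot, ITBS, and throughput)."""
        return (slot, tuple(sorted((ue, self.itbs.get(ue, 24), self.throughput[ue])
                                   for ue in self.queue)))

    def schedule(self, slot):
        """UE scheduling + DL TB assignment: pick the queued UE with the highest Q-value."""
        if not self.queue:
            return None
        s = self.state(slot)
        ue = max(self.queue, key=lambda u: self.q_table[s][u])
        self.queue.remove(ue)
        tbs = TBS_BYTES.get(self.itbs.get(ue, 24), 225)  # fall back to a small TB
        self.throughput[ue] += tbs                       # updated regardless of the HARQ outcome
        return ue


# Example usage:
sched = RLPFScheduler()
for ue in ("UE-1", "UE-2", "UE-3"):
    sched.enqueue(ue)
    sched.on_itbs_update(ue, 26 if ue == "UE-1" else 24)
print(sched.schedule(slot=0))  # with an all-zero Q-table, ties resolve to the first queued UE
```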

References

  1. Patriciello, N.; Lagen, S.; Bojović, B.; Giupponi, L. NR-U and IEEE 802.11 technologies coexistence in unlicensed mmWave spectrum: Models and evaluation. IEEE Access 2020, 8, 71254–71271. [Google Scholar] [CrossRef]
  2. Song, H.; Cui, Q.; Gu, Y.; Stüber, G.L.; Li, Y.; Fei, Z.; Guo, C. Cooperative LBT design and effective capacity analysis for 5G NR ultra dense networks in unlicensed spectrum. IEEE Access 2019, 7, 50265–50279. [Google Scholar] [CrossRef]
  3. Capozzi, F.; Piro, G.; Grieco, L.A.; Boggia, G.; Camarda, P. Downlink packet scheduling in LTE cellular networks: Key design issues and a survey. IEEE Commun. Surv. Tutor. 2012, 15, 678–700. [Google Scholar] [CrossRef]
  4. Asadi, A.; Mancuso, V. A survey on opportunistic scheduling in wireless communications. IEEE Commun. Surv. Tutor. 2013, 15, 1671–1688. [Google Scholar] [CrossRef]
  5. Tsai, T.Y.; Chung, Y.L.; Tsai, Z. Introduction to packet scheduling algorithms for communication networks. In Communications and Networking; IntechOpen: London, UK, 2010; p. 434. [Google Scholar]
  6. Huaizhou, S.; Prasad, R.V.; Onur, E.; Niemegeers, I. Fairness in wireless networks: Issues, measures and challenges. IEEE Commun. Surv. Tutor. 2013, 16, 5–24. [Google Scholar] [CrossRef]
  7. Kabaou, M.O.; Zoghlami, N.; Hamouda, H.; Baabou, F. Performance evaluation of opportunistic schedulers based on fairness and throughput in new-generation mobile networks. J. Supercomput. 2023, 79, 18053–18088. [Google Scholar] [CrossRef]
  8. Kelly, F.P.; Maulloo, A.K.; Tan, D.K. Rate control for communication networks: Shadow prices, proportional fairness and stability. J. Oper. Res. Soc. 1998, 49, 237–252. [Google Scholar] [CrossRef]
  9. Tse, D. Multiuser diversity in wireless networks. In Wireless Communications Seminar; Stanford University: Stanford, CA, USA, 2001. [Google Scholar]
  10. Zheng, Z.; Jiang, S.; Feng, R.; Ge, L.; Gu, C. Survey of reinforcement-learning-based mac protocols for wireless ad hoc networks with a mac reference model. Entropy 2023, 25, 101. [Google Scholar] [CrossRef]
  11. Han, D.; So, J. Energy-efficient resource allocation based on deep Q-network in V2V communications. Sensors 2023, 23, 1295. [Google Scholar] [CrossRef]
  12. Jayakumar, S.; Nandakumar, S. Reinforcement learning based distributed resource allocation technique in device-to-device (D2D) communication. Wirel. Netw. 2023, 29, 1843–1858. [Google Scholar] [CrossRef]
  13. El Amine, A.; Chaiban, J.P.; Hassan, H.A.H.; Dini, P.; Nuaymi, L.; Achkar, R. Energy optimization with multi-sleeping control in 5G heterogeneous networks using reinforcement learning. IEEE Trans. Netw. Serv. Manag. 2022, 19, 4310–4322. [Google Scholar] [CrossRef]
  14. Malta, S.; Pinto, P.; Fernández-Veiga, M. Using reinforcement learning to reduce energy consumption of ultra-dense networks with 5G use cases requirements. IEEE Access 2023, 11, 5417–5428. [Google Scholar] [CrossRef]
  15. Archi, A.; Saadi, H.A.; Mekaoui, S. Applications of Deep Reinforcement Learning in Wireless Networks-A Recent Review. In Proceedings of the 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM), Medea, Algeria, 28–29 November 2023; Volume 1, pp. 1–8. [Google Scholar]
  16. Luo, J.; Chen, Q.; Tang, L.; Zhang, Z.; Li, Y. Adaptive resource allocation considering power-consumption outage: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2023, 72, 8111–8116. [Google Scholar] [CrossRef]
  17. Ji, J.; Cai, L.; Zhu, K.; Niyato, D. Decoupled association with rate splitting multiple access in UAV-assisted cellular networks using multi-agent deep reinforcement learning. IEEE Trans. Mob. Comput. 2023, 23, 2186–2201. [Google Scholar] [CrossRef]
  18. Vishnoi, V.; Budhiraja, I.; Gupta, S.; Kumar, N. A deep reinforcement learning scheme for sum rate and fairness maximization among d2d pairs underlaying cellular network with noma. IEEE Trans. Veh. Technol. 2023, 72, 13506–13522. [Google Scholar] [CrossRef]
  19. Liu, Z.; Zhang, J.; Liu, Z.; Du, H.; Wang, Z.; Niyato, D.; Guizani, M.; Ai, B. Cell-free XL-MIMO meets multi-agent reinforcement learning: Architectures, challenges, and future directions. IEEE Wireless Commun. 2024, 31, 155–162. [Google Scholar] [CrossRef]
  20. Ghadimi, E.; Calabrese, F.D.; Peters, G.; Soldati, P. A reinforcement learning approach to power control and rate adaptation in cellular networks. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–7. [Google Scholar]
  21. Karmakar, R.; Chattopadhyay, S.; Chakraborty, S. Intelligent mu-mimo user selection with dynamic link adaptation in IEEE 802.11 ax. IEEE Trans. Wirel. Commun. 2019, 18, 1155–1165. [Google Scholar] [CrossRef]
  22. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, Cambridge, UK, 1989. [Google Scholar]
  23. Feki, S.; Zarai, F.; Belghith, A. A Q-learning-based Scheduler Technique for LTE and LTE-Advanced Network. In Proceedings of the WINSYS, Madrid, Spain, 24–26 July 2017; pp. 27–35. [Google Scholar]
  24. Jain, R.K.; Chiu, D.M.W.; Hawe, W.R. A Quantitative Measure of Fairness and Discrimination; Eastern Research Laboratory, Digital Equipment Corporation: Hudson, MA, USA, 1984. [Google Scholar]
  25. Elsayed, M.; Erol-Kantarci, M. Reinforcement learning-based joint power and resource allocation for URLLC in 5G. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar]
  26. Balakrishnan, R.; Sankhe, K.; Somayazulu, V.S.; Vannithamby, R.; Sydir, J. Deep reinforcement learning based traffic-and channel-aware OFDMA resource allocation. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar]
  27. Liao, X.; Shi, J.; Li, Z.; Zhang, L.; Xia, B. A model-driven deep reinforcement learning heuristic algorithm for resource allocation in ultra-dense cellular networks. IEEE Trans. Veh. Technol. 2019, 69, 983–997. [Google Scholar] [CrossRef]
  28. López-Sánchez, M.; Villena-Rodríguez, A.; Gómez, G.; Martín-Vega, F.J.; Aguayo-Torres, M.C. Latency fairness optimization on wireless networks through deep reinforcement learning. IEEE Trans. Veh. Technol. 2022, 72, 5407–5412. [Google Scholar] [CrossRef]
  29. Liu, J.C.; Susanto, H.; Huang, C.J.; Tsai, K.L.; Leu, F.Y.; Hong, Z.Q. A Q-learning-based downlink scheduling in 5G systems. Wirel. Netw. 2023, 1–22. [Google Scholar] [CrossRef]
  30. Chen, J.; Wang, Y.; Lan, T. Bringing fairness to actor-critic reinforcement learning for network utility optimization. In Proceedings of the IEEE INFOCOM 2021-IEEE Conference on Computer Communications, Vancouver, BC, Canada, 10–13 May 2021; pp. 1–10. [Google Scholar]
  31. Miao, G.; Zander, J.; Sung, K.W.; Slimane, S.B. Fundamentals of Mobile Data Networks; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
  32. Pinedo, M.; Hadavi, K. Scheduling: Theory, algorithms and systems development. In Operations Research Proceedings 1991; Springer: Berlin/Heidelberg, Germany, 1992; pp. 35–42. [Google Scholar]
  33. Viswanath, P.; Tse, D.N.C.; Laroia, R. Opportunistic beamforming using dumb antennas. IEEE Trans. Inf. Theory 2002, 48, 1277–1294. [Google Scholar] [CrossRef]
  34. Knopp, R.; Humblet, P.A. Information capacity and power control in single-cell multiuser communications. In Proceedings of the IEEE International Conference on Communications ICC ’95, Seattle, WA, USA, 18–22 June 1995; Volume 1, pp. 331–335. [Google Scholar]
  35. Bettesh, I.; Shamai, S. A low delay algorithm for the multiple access channel with Rayleigh fading. In Proceedings of the Ninth IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (Cat. No. 98TH8361), Boston, MA, USA, 8–11 September 1998; Volume 3, pp. 1367–1372. [Google Scholar]
  36. Aramide, S.O.; Barakat, B.; Wang, Y.; Keates, S.; Arshad, K. Generalized proportional fair (GPF) scheduler for LTE-A. In Proceedings of the 2017 9th Computer Science and Electronic Engineering (CEEC), Colchester, UK, 27–29 September 2017; pp. 128–132. [Google Scholar]
  37. Andrews, M.; Kumaran, K.; Ramanan, K.; Stolyar, A.; Vijayakumar, R.; Whiting, P. Scheduling in a queuing system with asynchronously varying service rates. Probab. Eng. Informational Sci. 2004, 18, 191–217. [Google Scholar] [CrossRef]
  38. van de Ven, P.; Borst, S.; Shneer, S. Instability of maxweight scheduling algorithms. In Proceedings of the IEEE INFOCOM 2009, Rio de Janeiro, Brazil, 19–25 April 2009; pp. 1701–1709. [Google Scholar]
  39. Radunovic, B.; Le Boudec, J.Y. A unified framework for max-min and min-max fairness with applications. IEEE/ACM Trans. Netw. 2007, 15, 1073–1083. [Google Scholar] [CrossRef]
  40. Yang, J.; Yifan, Z.; Ying, W.; Ping, Z. Average rate updating mechanism in proportional fair scheduler for HDR. In Proceedings of the IEEE Global Telecommunications Conference, 2004. GLOBECOM’04, Dallas, TX, USA, 29 November–3 December 2004; Volume 6, pp. 3464–3466. [Google Scholar]
  41. Tsai, J.T. State-dependent proportional fair scheduling algorithms for wireless forward link data services. In Proceedings of the IEEE INFOCOM 2008-The 27th Conference on Computer Communications, Phoenix, AZ, USA, 13–18 April 2008; pp. 2414–2422. [Google Scholar]
  42. Kim, H.; Han, Y. A proportional fair scheduling for multicarrier transmission systems. IEEE Commun. Lett. 2005, 9, 210–212. [Google Scholar]
  43. Borst, S. User-level performance of channel-aware scheduling algorithms in wireless data networks. In Proceedings of the IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), San Francisco, CA, USA, 30 March–3 April 2003; Volume 1, pp. 321–331. [Google Scholar]
  44. Bang, H.J.; Ekman, T.; Gesbert, D. Channel predictive proportional fair scheduling. IEEE Trans. Wirel. Commun. 2008, 7, 482–487. [Google Scholar] [CrossRef]
  45. Hajipour, J.; Leung, V.C. Proportional fair scheduling in multi-carrier networks using channel predictions. In Proceedings of the 2010 IEEE International Conference on Communications, Cape Town, South Africa, 23–27 May 2010; pp. 1–5. [Google Scholar]
  46. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998; Volume 2. [Google Scholar]
  47. Kelly, F. Charging and rate control for elastic traffic. Eur. Trans. Telecommun. 1997, 8, 33–37. [Google Scholar] [CrossRef]
  48. Chung, S.T.; Goldsmith, A.J. Degrees of freedom in adaptive modulation: A unified view. IEEE Trans. Commun. 2001, 49, 1561–1571. [Google Scholar] [CrossRef]
  49. Ouyang, W.; Eryilmaz, A.; Shroff, N.B. Downlink scheduling over Markovian fading channels. IEEE/ACM Trans. Netw. 2015, 24, 1801–1812. [Google Scholar] [CrossRef]
  50. Shmuel, O.; Cohen, A.; Gurewitz, O. Performance analysis of opportunistic distributed scheduling in multi-user systems. IEEE Trans. Commun. 2018, 66, 4637–4652. [Google Scholar] [CrossRef]
  51. Piazza, D.; Milstein, L.B. Multiuser diversity-mobility tradeoff: Modeling and performance analysis of a proportional fair scheduling. In Proceedings of the Global Telecommunications Conference, 2002. GLOBECOM’02, Taipei, Taiwan, 17–21 November 2002; Volume 1, pp. 906–910. [Google Scholar]
  52. Tokic, M.; Palm, G. Value-difference based exploration: Adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2011; pp. 335–346. [Google Scholar]
  53. Gilbert, E.N. Capacity of a burst-noise channel. Bell Syst. Tech. J. 1960, 39, 1253–1265. [Google Scholar] [CrossRef]
  54. Elliott, E.O. Estimates of error rates for codes on burst-noise channels. Bell Syst. Tech. J. 1963, 42, 1977–1997. [Google Scholar] [CrossRef]
  55. Shifrin, M.; Cohen, A.; Weisman, O.; Gurewitz, O. Coded retransmission in wireless networks via abstract MDPs: Theory and algorithms. IEEE Trans. Wirel. Commun. 2016, 15, 4292–4306. [Google Scholar] [CrossRef]
  56. Biton, E.; Cohen, A.; Reina, G.; Gurewitz, O. Distributed inter-cell interference mitigation via joint scheduling and power control under noise rise constraints. IEEE Trans. Wirel. Commun. 2014, 13, 3464–3477. [Google Scholar] [CrossRef]
  57. Donthi, S.N.; Mehta, N.B. An accurate model for EESM and its application to analysis of CQI feedback schemes and scheduling in LTE. IEEE Trans. Wirel. Commun. 2011, 10, 3436–3448. [Google Scholar] [CrossRef]
  58. Duran, A.; Toril, M.; Ruiz, F.; Mendo, A. Self-optimization algorithm for outer loop link adaptation in LTE. IEEE Commun. Lett. 2015, 19, 2005–2008. [Google Scholar] [CrossRef]
  59. Morales-Jiménez, D.; Sánchez, J.J.; Gómez, G.; Aguayo-Torres, M.C.; Entrambasaguas, J.T. Imperfect adaptation in next generation OFDMA cellular systems. J. Internet Eng. 2009, 3, 202–209. [Google Scholar]
Figure 1. CQI average reporting KPI.
Figure 2. RSRP average reporting KPIs (a) in urban deployments and (b) rural deployments.
Figure 3. RSRP and SINR heat map. (a) Average RSRP reported in urban deployment that is lower than −116 [dBm]; (b) average SINR reported in urban deployment that is lower than 2 [dB]; (c) average RSRP reported in rural deployment that is lower than −116 [dBm]; (d) average SINR reported in rural deployment that is lower than 2 [dB].
Figure 4. The supported rates for two users with L = 10. A superimposed cross corresponds to an allocated slot. (a) Schedule according to the PF algorithm. (b) Schedule according to the predicted PF algorithm. (c) Sum log throughputs of PF and predicted PF at the end of each time slot.
Figure 5. Performance comparison of 2 UEs with a cycle length of 5 slots (L = 5). (a) Sum log throughput. (b) Jain’s fairness index. (c) Average cell throughput.
Figure 6. Performance comparison of 3 UEs across two scenarios with a cycle length of 5 slots (L = 5). (a) Sum log throughput. (b) Average cell throughput.
Figure 7. Performance comparison of 4 UEs across three scenarios with a cycle length of 5 slots (L = 5). (a) Average sum log throughput. (b) Average cell throughput.
Figure 8. Performance comparison in a 2-UE scenario with transmission failures, for L = 5 and two different unsuccessful transmission setups (Setup I: P_1^succ = 1, P_2^succ = 0.3; Setup II: P_1^succ = 1, P_2^succ = 0.9). (a) Average sum log throughput. (b) Average cell throughput.
Figure 9. Testbed architecture illustration.
Figure 10. Implementation results for the 3-UE scenario with L = 5. (a) Average sum log throughputs. (b) Average cell throughput.
Table 1. Python-based simulation parameters and notations.
Parameter | Value
Good (G) channel state rate r_1 | 4587 Bytes per RB (CQI = 15)
Bad (B) channel state rate r_2 | 225 Bytes per RB (CQI = 1)
Channel transition probability G → B | Prob_{G → B} = 1 − p_k
Channel transition probability B → G | Prob_{B → G} = 1 − q_k
Time slots per cycle | L time slots
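For concreteness, the two-state channel implied by Table 1 can be sketched as follows. Reading p_k and q_k as the probabilities of remaining in the Good and Bad states, respectively (so that the G → B and B → G transition probabilities are 1 − p_k and 1 − q_k), is our interpretation of the table, and the numerical values of p_k and q_k below are placeholders.
```python
import random

RATE_BYTES_PER_RB = {"G": 4587, "B": 225}  # per Table 1: CQI 15 (Good) and CQI 1 (Bad)


def next_state(state: str, p_k: float, q_k: float) -> str:
    """One-step transition: stay in G with probability p_k, stay in B with probability q_k."""
    if state == "G":
        return "G" if random.random() < p_k else "B"
    return "B" if random.random() < q_k else "G"


def simulate_cycle(L: int, p_k: float = 0.8, q_k: float = 0.5, start: str = "G", seed: int = 0):
    """Supported per-slot rates of a single UE over one L-slot cycle."""
    random.seed(seed)
    state, rates = start, []
    for _ in range(L):
        rates.append(RATE_BYTES_PER_RB[state])
        state = next_state(state, p_k, q_k)
    return rates


print(simulate_cycle(L=5))  # a list of five per-slot rates drawn from {4587, 225}
```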
Table 2. Parameters and values attained by the PF and the RL-PF schedulers for an exemplifying cycle.
The four PF(a → ·)/ΣT columns belong to the PF scheduler; the four Q(a → ·)/ΣT columns belong to the RL-PF scheduler.
Slot | UE-1 CS-TB * | UE-2 CS-TB * | PF(a → UE-1) | PF(a → UE-2) | ΣT_UE-1 | ΣT_UE-2 | Q(a → UE-1) | Q(a → UE-2) | ΣT_UE-1 | ΣT_UE-2
1 | G-4587 | B-225 | 12.16334 | 7.81378 | 0 | 0 | 16.72 | 16.51 | 0 | 0
2 | G-4587 | B-225 | 13.16334 | 19.97712 | 4587 | 0 | 18.24 | 17.99 | 4587 | 0
3 | G-4587 | B-225 | 20.97712 | 20.97712 | 4587 | 225 | 19.71 | 19.63 | 9174 | 0
4 | G-4587 | B-225 | 21.56208 | 21.97712 | 9174 | 225 | 20.91 | 21.06 | 13761 | 0
5 | G-4587 | G-4587 | 22.56208 | 25.46168 | 9174 | 450 | 21.98 | 25.98 | 13761 | 225
End of cycle | | | | | 9174 | 5037 | | | 13761 | 4812
* CS: Channel State, TB: Transport Block.
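The PF(a → UE-i) entries in Table 2 are consistent with evaluating the sum of base-2 logarithms of the cumulative throughputs that would result from granting the slot to UE-i (users with zero throughput excluded from the sum). The short check below reflects this reading of the table; it is our own reconstruction, not code from the paper.
```python
from math import log2


def pf_if_scheduled(cum_tput: dict, ue: str, tb_bytes: int) -> float:
    """Sum of log2 cumulative throughputs after hypothetically granting this slot's TB to `ue`."""
    t = dict(cum_tput)
    t[ue] = t.get(ue, 0) + tb_bytes
    return sum(log2(x) for x in t.values() if x > 0)


# Slot 2 of Table 2: before the slot, UE-1 has accumulated 4587 bytes and UE-2 has 0.
before = {"UE-1": 4587, "UE-2": 0}
print(round(pf_if_scheduled(before, "UE-1", 4587), 5))  # 13.16334, matches PF(a -> UE-1)
print(round(pf_if_scheduled(before, "UE-2", 225), 5))   # 19.97712, matches PF(a -> UE-2)
```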
Table 3. Q-table for the last three time slots following the initial allocations to UE-1. The marked cells indicate the UE selected based on the highest Q-value.
Slot | UE-1 CS-TB * | UE-2 CS-TB * | UE-3 CS-TB * | Q_UE-1 | Q_UE-2 | Q_UE-3 | ΣT_UE-1 | ΣT_UE-2 | ΣT_UE-3
3 | G-4587 | B-225 | B-225 | 26.51 | 26.85 | 26.72 | 9174 | 0 | 0
4 | G-4587 | G-4587 | B-225 | 27.66 | 31.07 | 28.48 | 9174 | 225 | 0
5 | G-4587 | G-4587 | B-225 | 25.98 | 26.36 | 33.20 | 9174 | 4812 | 0
End of cycle | | | | | | | 9174 | 4812 | 225
* CS: Channel State, TB: Transport Block.