1. Introduction
The rapid advancement and expansion in the number of IoT devices result in an exponential growth in data, posing two key challenges to traditional centralized learning approaches [1,2]. Firstly, centralized methods relying on cloud-based architectures no longer fit the 4G/5G era due to the high storage requirements and communication costs involved in collecting data from millions or even billions of IoT devices (such as large amounts of time-sensitive and high-frequency data from drones or in-vehicle radar sensors). Secondly, user data privacy has become a sensitive topic, and many countries have developed and enacted regulations to protect user privacy [3]. Traditional centralized methods require uploading local data, which undoubtedly exposes user data privacy and puts users at risk of information leakage. Therefore, to improve efficiency and protect data privacy, providing distributed models and training methods for data-driven learning under the premise of privacy preservation has become a hot topic.
To meet this urgent need, Federated Learning (FL) has been proposed as a distributed machine learning paradigm [4]. It enables multiple devices or organizations to collaboratively learn a shared model without compromising the privacy of local data. Specifically, each device trains a local model using its computational capacity and datasets, while a central server is responsible for maintaining global model updates. Through communication between the server and clients, the model accuracy converges to an optimal result. Without the need for centralized uploading of local data from clients, the distributed and collaborative characteristics of FL provide an effective mechanism for privacy preservation and model training.
Although progress has been made in improving the learning efficiency and privacy protection of FL, distributed training in heterogeneous edge networks still faces three key challenges. (1) Device mobility: Edge devices often move out of the communication area of the current base station during model training, which renders the global model of the previous round stale [5]. Aggregation strategies should consider both heterogeneity and mobility, making optimal decisions based on the local model training results and device mobility. (2) Non-Independent and Identically Distributed (Non-IID) datasets: In real scenarios, the data distribution on most edge devices is diverse and personalized, resulting in variations in local training results and subsequently impacting the global model. Designing appropriate aggregation strategies can mitigate the effects of Non-IID data to the maximum extent. (3) Long deadlines: The difference in device computing power forces FL to wait for the slowest device to complete the training task, which not only incurs high computational and communication costs but also results in huge latency.
To deal with the abovementioned challenges, an effective strategy is to select appropriate local models to join the global aggregation in each global round. In general, the design of the aggregation strategy needs to consider the real situation of the training devices. By solving a resource scheduling problem to decide the aggregation strategy, ref. [6] succeeded in minimizing energy consumption. Also, a fair scheduling mechanism regarding the number of local iterations and resource allocation was introduced into the aggregation strategy [7]. However, most of the existing works are based on the assumption of static scenes; device mobility is a significant aspect that cannot be overlooked in real scenarios. In a heterogeneous environment, we establish a Markov Decision Process (MDP) framework, which incorporates client heterogeneity and device mobility status to derive the optimal client participation decisions. Client heterogeneity usually manifests as differences in data distribution information, and mobility often causes a client to go offline from the current cluster server. The global aggregation takes a considerable amount of time when some devices are offline. The main contributions of this paper can be summarized as follows:
This article investigates the impact of device heterogeneity and mobility on the learning efficiency of FL in an edge environment. We propose the Optimal Global Aggregation strategy (OGAs) to maximize the learning efficiency of FL. In addition, this article analyzes multiple possibilities for the location changes of mobile devices when they participate in the global aggregation.
The OGAs use the MDP model to describe the interaction between clients and the server. In order to obtain the optimal strategy, an improved $\epsilon$-value iteration algorithm is employed to address the MDP problem. In addition, to overcome the curse of dimensionality caused by the high-dimensional state space in the MDP model, dimensionality reduction is performed on the original features using the Principal Component Analysis (PCA) scheme, which utilizes the reduced feature space to represent the state set. This approach is simple and efficient, because it improves efficiency without sacrificing learning accuracy.
Extensive simulations were conducted to evaluate the performance of the proposed strategy and investigate the impact of different environment settings on learning efficiency. The simulation results demonstrate that, compared to several other client participation strategies, the proposed strategy can significantly improve the learning efficiency of FL.
The remainder of this paper is organized as follows. The related work on FL is introduced in Section 2, and the motivating experiment is described in Section 3. Section 4 presents the system model, and Section 5 provides a description of the OGA strategy. Section 6 discusses the simulation results. Finally, Section 7 concludes the paper.
2. Related Work
Recently, extensive studies of FL have been conducted in the literature. These works can be primarily categorized into the following three areas: (1) communication efficiency optimization; (2) computation efficiency improvement; (3) client selection/scheduling.
In FL, frequent communication between clients and the server leads to significant communication costs [8]. To improve communication efficiency, ref. [9] proposed a bandwidth allocation and scheduling strategy that adapts to channel conditions and device computational capabilities. Ref. [10] introduced an asynchronous FL framework that considers the differences in communication link quality, computational capabilities, and data distribution; this framework helps alleviate the issue of model staleness caused by asynchronous aggregation. Similarly, ref. [11] proposed a two-stage algorithm to mitigate model staleness and conducted a convergence analysis. Furthermore, ref. [12] investigated the problem of minimizing FL communication latency, theoretically proved that the total delay is a convex function of learning accuracy, and utilized the bisection method to obtain the optimal solution.
In improving computation efficiency, existing works tend to focus on designing adaptive solutions. Ref. [13] proposed an adaptive control algorithm that dynamically adjusts the FL global aggregation frequency based on convergence bounds. To achieve better learning performance on Non-IID data, ref. [14] introduced an experience-driven algorithm based on deep reinforcement learning (DRL) to adaptively determine hyperparameters during model training. Ref. [15] extended the FedAvg algorithm and proposed an adaptive weighting strategy to mitigate the negative impact of Non-IID data on FL performance. Ref. [16] designed an adaptive optimizer for sparse general gradients. Ref. [17] developed a weighted aggregation heuristic algorithm that utilizes lossy compression techniques to reduce communication costs without compromising model accuracy.
Moreover, the client selection problem is a hot topic in improving FL efficiency. Joint client selection and resource allocation optimization were studied in Refs. [6,18,19]. Ref. [20] introduced a deadline-aware aggregation approach that sets deadlines at specific stages of FL to aggregate as many clients’ local models as possible, making the overall training process more efficient. Ref. [21] employed multi-armed bandit (MAB) techniques to adaptively determine the clients for global aggregation and demonstrated its effectiveness. Ref. [22] transformed a trade-off problem in FL client selection into an optimization problem and formulated it as a total communication cost minimization problem. Ref. [7] utilized resource-aware techniques and client availability to select clients.
Indeed, users are highly likely to be mobile in edge environments. However, most prior works have overlooked the impact of user mobility. To the best of our knowledge, there is limited research on the effects of device mobility on FL efficiency; only Refs. [23,24] considered user mobility in the study of FL. The work of Ref. [24] primarily focused on proving the convergence of FL in the presence of user mobility, while the work of Ref. [23], when selecting mobile vehicles for the aggregation process, overlooked the degradation of the global model due to the lack of effective training data caused by Non-IID data.
3. Motivating Experiment
Previous studies have shown that Non-IID data can slow down the convergence speed of FL due to the divergence in the distribution of local data samples [25,26]. However, these studies are based on static scenario settings and do not consider the possibility of device mobility. In this section, we design an experiment to investigate the impact of device mobility on FL model training. Specifically, we deployed 100 clients in a simulated environment. In each experiment, the clients migrated randomly in different proportions within the environment, and the probability of client migration can be modeled as a Markov chain [27]. Each mobile client possessed 600 image data samples belonging to five different classes. The FL server selected mobile devices to ensure that a certain number of clients participated in each round of training, thus ensuring model convergence.
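To make this setup concrete, the following sketch (our own illustration; the random seed and helper names are assumptions, not taken from the paper) partitions a labeled dataset so that each client holds 600 samples drawn from five classes:

```python
import numpy as np

def non_iid_partition(labels, n_clients=100, classes_per_client=5,
                      samples_per_client=600, seed=0):
    """Give each client samples from a random subset of classes (Non-IID).
    Hypothetical helper; assumes each class has enough samples."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    by_class = {c: np.flatnonzero(labels == c) for c in classes}
    per_class = samples_per_client // classes_per_client
    partition = []
    for _ in range(n_clients):
        chosen = rng.choice(classes, classes_per_client, replace=False)
        idx = np.concatenate([rng.choice(by_class[c], per_class, replace=False)
                              for c in chosen])
        partition.append(idx)
    return partition
```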
Considering device mobility, during the aggregation of the current global round some clients may disconnect while unfamiliar clients join, which undoubtedly poses challenges to model training. On the one hand, user mobility can reduce the number of participants in each global aggregation round, which may result in missing training data and underfitting of the training model. On the other hand, user mobility can also reshape the data, helping to reduce the differences between clusters and construct representative samples at each stochastic gradient descent (SGD) step of the global model.
Figure 1 and Figure 2 illustrate the accuracy of FL model training with different mobility and Non-IID ratios of clients. In these results, as the ratios of mobile devices and Non-IID data increase, the training results become less ideal. Moreover, it can be observed from the figures that the impact of client mobility on FL efficiency is far greater than that of Non-IID data. Therefore, this motivates us to design rational and efficient approaches to enhance FL efficiency.
4. System Model
We consider an edge system for FL consisting of $N$ mobile clients and $M$ edge servers, where the $N$ clients are randomly divided into $M$ clusters. Each client is indexed as $i$, $i \in \{1, \ldots, N\}$, and each edge server is indexed as $j$, $j \in \{1, \ldots, M\}$. A mobile client $i$ connects to the nearest edge server based on its real-time location to participate in the current round of global model aggregation. To describe mobility, we use $\mathcal{N}_j$ to represent the set of mobile clients in cluster $j$, noting that the size of $\mathcal{N}_j$ may vary with the number of mobile users. For convenience, we assume that multiple clients participate in the model training in each cluster (i.e., $|\mathcal{N}_j| > 1$) to ensure that each cluster can independently complete the learning task (Figure 3).
4.1. The FL Process
In general, FL is built upon a machine learning paradigm, where a typical machine learning task consists of a dataset $D$ and a model parameter vector $w$ that is trained based on the data samples. To reach the desired objective, each client $i$ trains a local model based on its local data samples $D_i$. The cluster server $j$ aggregates the local models from all nodes and maintains the updates of the global model.
For each data sample $(x_k, y_k)$ from client $i$, we define a loss function $f(w; x_k, y_k)$, which captures the error between the predicted value $\hat{y}_k$ and the actual value $y_k$. For instance, in a linear model, the squared loss function $f(w; x_k, y_k) = \frac{1}{2}\left(y_k - w^{\top}x_k\right)^2$ is commonly used. For all data samples $D_i$ on client $i$, the local loss function is defined as follows:
$$F_i(w) = \frac{1}{|D_i|}\sum_{(x_k, y_k) \in D_i} f(w; x_k, y_k).$$
The global loss function on the distributed dataset from all devices can be defined as follows:
$$F(w) = \sum_{i=1}^{N} \frac{|D_i|}{|D|}\, F_i(w).$$
To minimize $F(w)$, the SGD technique is commonly used to search for the optimal $w^{*} = \arg\min_{w} F(w)$. At each time slot $t$, local updates based on gradient descent are performed according to the following rule:
$$w_i(t) = w_i(t-1) - \eta\, \nabla F_i\big(w_i(t-1)\big),$$
where $\eta$ is the learning rate, and $\nabla F_i(\cdot)$ is the local model gradient.
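For illustration, the following minimal NumPy sketch (our own example under the linear-model assumption, not the paper's implementation) performs one local SGD step on the squared loss and the size-weighted global aggregation described above:

```python
import numpy as np

def local_sgd_step(w, X, y, eta=0.01):
    """One SGD step on the local squared loss F_i(w)."""
    grad = X.T @ (X @ w - y) / len(y)  # local model gradient
    return w - eta * grad

def global_aggregate(local_models, local_sizes):
    """Global model as the |D_i|/|D| weighted average of local models."""
    total = sum(local_sizes)
    return sum((n / total) * w for w, n in zip(local_models, local_sizes))
```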
4.2. The Mobility Model
Due to the mobility of the devices, the set of clients participating in the current round of global aggregation in each cluster is not fixed. We assume that all mobile clients are uniformly distributed across the entire network and that a mobile client can randomly move to a neighboring cluster with a certain probability. We use a Markov chain model to describe the migration traces of different mobile clients, as in the scenario of Ref. [24]. Specifically, we use an indicator factor $c_{i,j}^{r} \in \{0, 1\}$ to capture the connection status between client $i$ and cluster $j$ at round $r$: $c_{i,j}^{r} = 1$ indicates that client $i$ is connected to the current cluster; otherwise, it connects to a neighboring cluster.
Figure 4 describes the transition probability of mobile devices between different clusters. For the cluster topology of the entire network, we define a directed connection graph $G = (V, E)$, where $V$ represents the set of nodes and $E$ represents the set of edges. The adjacency matrix of the graph $G$ is defined as $A = [a_{jk}]_{M \times M}$, and its elements can be defined as follows:
$$a_{jk} = \begin{cases} 1, & (j, k) \in E, \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathcal{E}_j$ represents the set of adjacent clusters of cluster $j$, and the size of $\mathcal{E}_j$ can be calculated as $|\mathcal{E}_j| = \sum_{k} a_{jk}$.
Before the global update, the mobile client $i$ in any cluster has two possible transition states: staying in the current cluster with probability $p$, or moving to a neighboring cluster with probability $1 - p$, as shown in Figure 4. To simplify the model, we assume that a mobile client migrates to each of its adjacent clusters with the same fixed probability, and all mobile clients can randomly roam over the whole cluster space. The transition probability of a mobile device moving from cluster $j$ to cluster $k$ at the $r$-th round is defined as $P_{jk}^{r}$, which can be computed as
$$P_{jk}^{r} = \begin{cases} p, & k = j, \\ \dfrac{1-p}{|\mathcal{E}_j|}, & k \in \mathcal{E}_j, \\ 0, & \text{otherwise}. \end{cases}$$
When $p = 1$, it indicates that the system is in a stationary state, which is equivalent to a conventional setup of FL [4]. For the sake of generality, we assume that a client may always move about before the global round of aggregation, which facilitates the synchronous training of the global model.
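The cluster-level mobility chain can be sketched as follows (an illustration under the stated assumptions; variable names are our own). Each row follows the reconstruction above: stay with probability $p$, otherwise move uniformly to an adjacent cluster.

```python
import numpy as np

def transition_matrix(adj, p_stay):
    """Markov transition matrix over clusters from the adjacency matrix."""
    M = adj.shape[0]
    P = np.zeros((M, M))
    for j in range(M):
        neighbors = np.flatnonzero(adj[j])
        if len(neighbors) == 0:
            P[j, j] = 1.0  # isolated cluster: the client cannot leave
        else:
            P[j, j] = p_stay
            P[j, neighbors] = (1.0 - p_stay) / len(neighbors)
    return P

# Example: three clusters on a line (1-2-3), 80% chance of staying put
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(transition_matrix(adj, 0.8))
```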
4.3. The MDP Model
After all clients have completed uploading their local training results for the current round, the cluster server needs to immediately aggregate and broadcast the new global model. However, due to the heterogeneity of local data distributions and the uncertainty of client mobility, it is challenging to dynamically select a set of local models that will yield desirable results for the global aggregation in the current round. Moreover, in a distributed training environment, frequent and meaningless interactions between clients and the server can only increase communication costs and degrade model training results.
In this part, we employ the MDP to model the interaction between the cluster server and mobile clients, perceive the mobility and the knowledge of data distribution of the mobile clients, and optimize the client selection strategy to maximize the learning efficiency of FL.
In the MDP, the system remains in a certain state $s \in S$. The agent selects an action $a \in A$ to take in the current state. After the action is performed, the agent receives an immediate reward $R$, and the system transitions to the next state $s'$ according to the transition probability $P(s' \mid s, a)$. In the FL system, the decision of which clients participate in the global aggregation for the current round is based on mobility and can be described as a standard MDP. The details are described as follows.
4.3.1. State Set
The state set $S$ includes all possible states in the FL system, including the clients’ computational capabilities, data distributions, and the moving trails of the devices. $S$ can be defined as
$$S = K \times D \times L,$$
where $K$ and $D$ represent the local computing resources and local dataset information of the client $i$, respectively, and $L$ represents the geographical location (i.e., the possible cluster) of the client $i$. Additionally, $\times$ represents the Cartesian product. These can be further described as follows:
$$K = \{k_1, k_2, \ldots\}, \quad D = \{d_1, d_2, \ldots\}, \quad L = \{1, 2, \ldots, M\}.$$
Specifically, $K$ represents the local computational resources of the mobile client, which depend on the device’s computational capabilities, and $k_1 < k_2 < \cdots$ is a gradually increasing order that represents the computational resources of different clients. The dataset $D$ is derived from the client’s local data distribution. Since the training results on Non-IID data will ultimately be reflected in the model’s learning accuracy, the system’s decision-making process in each round $r$ aims to minimize the impact of Non-IID data. Lastly, the value of $L$ reflects the real-time location of the mobile client $i$. For example, in round $r$, the system state can be represented as $s_i^r = (K_i^r, D_i^r, L_i^r)$, $s_i^r \in S$.
4.3.2. Action Set
The action set $A$ includes all possible actions that the system agent can take. Therefore, $A$ can be described as
$$A = \{a_1, a_2, \ldots, a_N\},$$
where $a_i \in \{0, 1\}$ and $i \in \{1, \ldots, N\}$. When $a_i = 0$, it indicates that the client $i$ does not participate in the current global round of model updates; otherwise, it participates. Based on the current state $s$, the mobile client $i$ will make a decision by selecting a specific action $a$.
4.3.3. Transition Probability
The transition probability $P(s' \mid s, a)$ represents the probability of moving from the current state $s$ to the next state $s'$ by taking action $a$. Assuming that the state transitions of each mobile client $i$ are independent of each other, the following equation holds:
$$P(s' \mid s, a) = \prod_{i=1}^{N} P(s_i' \mid s_i, a_i).$$
The conditional probability $P(s_i' \mid s_i, a_i)$ represents the probability of transitioning from the current state to the next state, where $K$, $D$, and $L$ represent the local computing resources, data distribution information, and moving trail of the mobile device in the current state, and $K'$, $D'$, and $L'$ represent the corresponding device information in the next state. In this article, we assume that the data samples and computing resources of a mobile device do not change until the current global round finishes, so that $P(K' \mid K)$ and $P(D' \mid D)$ can be defined in a statistical pattern:
$$P(K' \mid K) = \mathbb{1}\{K' = K\}, \qquad P(D' \mid D) = \mathbb{1}\{D' = D\}.$$
The conditional probability $P(L' \mid L)$ represents the probability that the mobile device moves from cluster $L$ to the next cluster $L'$, and it can be derived by
$$P(L' \mid L) = P_{L L'}^{r},$$
where $p$ indicates the probability of the mobile device staying in the same cluster in two sequential decision rounds, and the exact value can be computed by Equation (5). Specifically, the transition probability depends only on the action performed in the current state and has an effect on the expected reward of each client. The design of the expected reward is introduced in the next part.
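A per-client factor of this transition probability can be sketched as below (hypothetical names; `P_cluster` is the cluster transition matrix of the mobility model). Resources and data are frozen within a round, so only the location component is stochastic:

```python
def state_transition_prob(s, s_next, P_cluster):
    """P(s_i' | s_i) for one client, with states (K, D, L)."""
    k, d, l = s
    k2, d2, l2 = s_next
    if k2 != k or d2 != d:   # resources and data are unchanged in a round
        return 0.0
    return P_cluster[l, l2]  # only the cluster follows the Markov chain
```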
4.3.4. Reward Function
A reward function $R(s, a)$ describes the immediate reward for the mobile device when it selects action $a$ in the current state $s$. The completion time of the various stages in FL describes the computational efficiency of the chosen action, the aim being to achieve efficient training results on a shorter time scale. In the scenario of this paper, we assume that the migration of a device occurs during its model training and ignore the migration time between two consecutive decision epochs. Here, the reward function is defined as follows:
$$R(s, a) = -\left(T_i^{\mathrm{comm}} + T_i^{\mathrm{cmp}} + T_i^{\mathrm{mig}}\right),$$
where $T_i^{\mathrm{comm}}$ represents the time cost for model communication (including upload and download). By deploying the cluster server on the base station, the mobile client can communicate with the server over a wireless network (such as a cellular network). Consequently, the model communication time can be computed as $T_i^{\mathrm{comm}} = s_i / B_i$, where $s_i$ represents the data size of the local model and $B_i$ is the wireless network transmission rate, which depends on the wireless channel conditions. Furthermore, when the CPU frequency $f_i$ of the mobile client $i$ is given, the model training time $T_i^{\mathrm{cmp}}$ can be calculated as $T_i^{\mathrm{cmp}} = C_i / f_i$, where $C_i$ represents the total CPU cycles required by client $i$ for training the learning model. Additionally, the migration time $T_i^{\mathrm{mig}}$ of the mobile client can be obtained from the ratio of distance to velocity. In our simulation experiments, all clients were assumed to move at a speed of 1 km/h.
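Under the reconstruction above (reward as the negative total completion time; the exact shaping in the original may differ, and the parameter names here are our own), the reward of one client can be computed as:

```python
def reward(model_bits, bandwidth_bps, cpu_cycles, cpu_freq_hz,
           distance_km, speed_kmh=1.0):
    """Negative completion time: communication + training + migration."""
    t_comm = model_bits / bandwidth_bps        # T_comm = s / B
    t_cmp = cpu_cycles / cpu_freq_hz           # T_cmp = C / f
    t_mig = distance_km / speed_kmh * 3600.0   # distance/velocity, in seconds
    return -(t_comm + t_cmp + t_mig)
```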
By the MDP, the mobile client will choose an action within a specific period to connect to the cluster server in the current state and upload the local model. Based on it, the server can obtain the transition probability related to the current state and action of the client, and then give the action reward in the period. Similarly, the cluster server is responsible for maximizing the average reward, considering the uncertainty of client mobility and the indeterminacy of task completion time in each stage. Therefore, within the specific period, it is critical for the server to select a group of the optimal clients to join the model training and update in the current round. By analyzing the MDP over an infinite time horizon, we designed the optimal client selection strategy to solve this problem.
5. The Optimal Global Aggregation Strategy
In this section, we consider the problem of optimal client selection in each round of FL, which is formulated as an MDP over an infinite time horizon. By jointly considering the mobility of clients and knowledge of their data distribution, the optimal decision can be derived for each round to achieve efficient learning results. Generally, based on the current system state, the MDP generates a policy $\pi$, which reflects the mapping between states and actions. The MDP ultimately finds the optimal policy $\pi^{*}$ in a steady state, maximizing the expected discounted total reward. By introducing the value function $V^{\pi}(s)$ to describe the expected reward of policy $\pi$, the formula is defined as follows:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right],$$
where $\mathbb{E}_{\pi}[\cdot]$ represents the expected function of the desired return for policy $\pi$, and $\gamma$ is the discount factor with $0 \le \gamma < 1$. Our goal is to solve the MDP over an infinite time horizon to obtain the OGA decision for each global round in FL. Therefore, the problem can be formulated as follows:
$$\pi^{*} = \arg\max_{\pi} \sum_{s \in S} \mu_{\pi}(s)\, V^{\pi}(s),$$
where $\mu_{\pi}(s)$ represents the stationary probability of the system in the steady state. In general, the MDP is a mathematical model of reinforcement learning that learns the mapping of states to actions to maximize the rewards.
The problem of solving the optimality equation of the MDP can be characterized as a dynamic programming problem, which is typically solved by using value iteration or policy iteration to seek the optimal solution in polynomial time. Policy iteration can be time-consuming when solving multi-objective optimization problems, while plain value iteration without proper control may have difficulty converging. Therefore, in our iterative process of solving the MDP, we set a stopping criterion $\epsilon$, where the iteration stops immediately when the following condition holds true:
$$\left\| V_{t+1} - V_{t} \right\| < \epsilon,$$
where $\| \cdot \|$ represents the Euclidean norm, and $\epsilon$ is a positive real number that ensures the convergence of the algorithm. Given a threshold value $\epsilon$ for stopping the iteration, the $\epsilon$-value iteration algorithm can be executed by the cluster servers or the central cloud. Then, the mobile clients can compute the optimal policy based on their geographical locations and data distributions, including each state in $S$ and the corresponding action in $A$.
We propose the $\epsilon$-value iteration algorithm to solve the MDP and find the optimal decision that maximizes the efficiency of FL, as shown in Algorithm 1. The outer loop iterates based on the expected system performance, including the moving trails and data distributions, to derive the optimal decision and ensure the selection of a certain number of mobile clients for aggregation in each global round. The inner loop is used to update the value function and consists of two steps. (1) Policy improvement: the policy is improved based on the previous round’s value function to obtain the greedy policy for the current round. (2) Policy evaluation: the greedy action in state $s$ is selected based on the greedy policy to update the corresponding value function.
Firstly, the algorithm initializes $A$ to an empty set and creates a probability transition matrix for all mobile clients. Then, the algorithm establishes the migration model for mobile clients and selects the corresponding client data distribution in each cluster. It computes the individual polling time for mobile clients based on $T_i^{\mathrm{comm}}$, $T_i^{\mathrm{cmp}}$, and $T_i^{\mathrm{mig}}$. At this point, since the selected clients are unknown, the algorithm estimates a corresponding decision based on the real-time geographic locations of the mobile clients and their local data distributions. After that, the iterative process begins. During each policy evaluation, the initial values of the value function are the value function from the previous iteration, obtained by solving the following Bellman’s optimality equation [28]:
$$V_{t+1}(s) = \max_{a \in A}\left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_{t}(s') \right].$$
In practice, it is possible to obtain the optimal policy through value iteration before the algorithm converges. Therefore, it is necessary to change the termination condition of the value iteration algorithm. Specifically, when two successive estimates of the value function satisfy Equation (18), the iteration is terminated to speed up the program execution time.
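The core of the $\epsilon$-value iteration can be illustrated with the following tabular sketch (our own code, not the authors’ implementation; it assumes the MDP is given as dense transition and reward arrays):

```python
import numpy as np

def epsilon_value_iteration(P, R, gamma=0.9, eps=1e-4):
    """Value iteration with an epsilon stopping criterion, cf. Eq. (18).
    P: transitions of shape (|S|, |A|, |S|); R: rewards of shape (|S|, |A|)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Policy improvement: one-step lookahead Q-values (Bellman update)
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        # Stop once two successive value estimates are epsilon-close
        if np.linalg.norm(V_new - V) < eps:
            return V_new, Q.argmax(axis=1)  # values and greedy policy
        V = V_new
```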
Algorithm 1 Optimal Global Aggregation (OGA)
Require: $N$, $M$, $D$.
For each client:
  ES initializes the global model $w^{0}$;
  for each client $i$ in parallel do
    Maintain the transition probability matrix $P$;
    if $a_i = 1$ then
      Update the local model with SGD according to (3);
      Send the local model to the cluster server;
    else
      Download the global model;
      Send the local model to the cluster server;
    end if
  end for
For the cluster server:
  Initialize the set ClientList[];
  Iterate the MDP and export the optimal policy $\pi^{*}$;
  for each client $i$ do
    Calculate the expected reward according to (15);
  end for
  Add the selected mobile clients to ClientList[];
To ensure the efficiency of FL model training, it is preferable to include as many mobile clients as possible in the aggregation process. To prevent the training from deteriorating, clients in A that have the shortest migration path will be selected. Assuming that the local data distribution of all mobile clients does not change with client migration, the global round can be updated by adding and removing decisions from the decision set. This process is repeated until the model converges. Additionally, after a client migration, clients should update their local models based on the correlation between the local data distribution and the global data distribution. Therefore, the MDP decision can be made by jointly considering client migration and data distribution.
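As a complement to Algorithm 1, the server-side selection step might look like the sketch below (a hypothetical helper; `expected_reward` maps a client to its expected MDP reward, which penalizes long migration paths):

```python
def select_clients(clients, expected_reward, budget):
    """Keep the `budget` clients with the highest expected reward,
    i.e., the shortest expected completion/migration time."""
    ranked = sorted(clients, key=expected_reward, reverse=True)
    return ranked[:budget]  # ClientList[] for the current global round
```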
Dimensionality Reduction
Since the state space combines the local computing resources, data distributions, and geographical locations of mobile clients, it can become huge. Moreover, the state space continues to grow over time, making it challenging to solve the MDP. To tackle the dimensionality challenge, the Principal Component Analysis (PCA) scheme is employed to perform dimensionality reduction on the original features and use the reduced information to represent the state set.
Since there is some correlation among the multi-dimensional variables in the state space, such as the dependence of the data distribution information on the geographical location of the client, it is possible to capture the information among the original variables using a smaller set of comprehensive variables known as principal components. Principal components are linearly independent variables obtained by an orthogonal transformation of the observed data represented by linearly correlated variables. The principal components are mutually uncorrelated, meaning that the information they represent does not overlap.
PCA can be used to map high-dimensional data into a low-dimensional space via a linear projection, with the expectation that the data are most informative (i.e., have the largest variance) in the projected dimensions, thus using fewer data dimensions while retaining more of the characteristics of the original data. The specific process of PCA can be described as follows. In the initialization phase, it is necessary to input the initial state space set and calculate its covariance matrix. After obtaining the input sample set, the data are normalized by de-centralization (such as Z-score normalization) in order to put the center of the data samples at the zero position. Subsequently, the projection matrix is constructed according to the descending order of the eigenvalues, and the data are transformed into the new space constructed by the projection matrix. According to the maximum variance theory, the projection plane with the larger variance is selected to project the original features, which ensures that the eigenvalue corresponding to each eigenvector is the variance of the original features projected onto this plane, thereby achieving dimensionality reduction.
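A compact NumPy sketch of this PCA pipeline (illustrative only; the sample data and component count are placeholders):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project state features onto the top principal components."""
    X_centered = X - X.mean(axis=0)           # de-centralization
    cov = np.cov(X_centered, rowvar=False)    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centered @ top                   # reduced representation

# Example: compress 64-dimensional state vectors to 4 dimensions
states = np.random.randn(100, 64)
print(pca_reduce(states, 4).shape)  # (100, 4)
```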
A simple experiment was designed to demonstrate the effectiveness of using PCA to reduce the dimension of the state space. On the basis of the experiment in Section 3, a CNN model with 26,050 parameters (simulating a high-dimensional state space) was trained on 100 edge devices, with the Non-IID data sample settings the same as in the experiment in Section 3. After 50 local SGD training iterations, the normalized parameter weights were projected onto the new 2D space and the projection matrix continued to be constructed. Figure 5 shows the feature distribution of the parameters trained on devices with different class labels. For example, each red “·” indicates a compressed version of the parameter weights from the device with label “2” during training, i.e., reduced to four dimensions from the original 26,050. In this case, PCA proves effective and able to speed up the MDP solution.