1. Introduction
With the rapid development of the big data era, the networked devices in modern distributed networks generate abundant data every day. Research related to deep learning has also experienced explosive growth due to the massive amount of high-quality data samples. Making full use of these highly expressive data can help to construct more complex and accurate machine learning models. However, practical applications raise the issue of data privacy protection. In centralized learning, data need to be uploaded to cloud servers or data centers, which may result in unauthorized access, theft, and leakage of data. In addition, some organizations and individuals with data security requirements, such as governments and hospitals, may not be able to accept uploading data to a shared platform such as a public cloud, which restricts their ability to benefit from such data.
Spawned by the aforementioned issues, researchers have started to gradually shift their focus from data aggregation to model aggregation. Storing data locally and pushing network computation to the edge are becoming increasingly attractive. Federated learning (FL) is a distributed artificial intelligence framework that enables multiple edge devices (such as mobile phones and wearables) to collaboratively train a shared model. In the FL algorithm, edge devices complete the model training process by coordinating with a central server [1,2]. Federated learning also provides valuable insights and potential solutions to data privacy and security challenges in the rapidly evolving field of smart UAV delivery systems [3]. Federated learning addresses the concern of transmitting private information, enables multiple parties to participate in training while protecting data privacy, and solves the problem of data silos.
However, in a federated learning setup, the data are distributed unevenly across the edge devices. In particular, the local dataset used for training by each edge device not only differs in size but may also contain non-IID data samples. This data heterogeneity problem can reduce the accuracy of the model obtained from federated learning training, or even prevent the training from converging [4].
In the IID case, it is observed that the difference between the local scatter weight, which indicates each participant's contribution to the global model aggregation, and the average scatter weight of the central server is minimal. In the non-IID case, however, the gap between a client's local scatter weight and the central server's average scatter weight widens with the number of iterations due to differences in data distribution.
To address the above problems, we propose identifying a shared parameter space for every client model. By imposing restrictions on the gradient during the client model training and complementing it with this shared parameter space, we can effectively minimize the mutual interference among client models. In this work, our contributions are summarized as follows:
To address the impact of data heterogeneity on model performance in federated learning, we propose a personalized federated learning method called PerFreezeClip. PerFreezeClip employs freezing and gradient clipping methods to parallelize training and local adaptation on the client side, effectively resolving mutual interference among client models.
We investigate the use of freezing methods to control knowledge transfer between different devices in client-side training and show that by freezing the parameters of certain sub-networks, it is possible to limit the sphere of influence of specific sub-networks, preventing over-dependence on information from other devices, and thus enabling more accurate knowledge sharing and maintaining localized features (i.e., personalization).
We investigate restricting gradient updates in the face of data heterogeneity and show that by limiting the update range of the gradient, we can balance the weighting of global and local information, control model bias, maintain the consistency of the globally shared parameters, and improve the generalization ability of the model.
Simulation experiments and performance evaluations of the PerFreezeClip method were conducted on multiple datasets. The experimental results show that PerFreezeClip outperforms personalized methods like FedRep on CIFAR10 and CIFAR100 datasets.
3. The Principle of PerFreezeClip
This paper introduces PerFreezeClip, a novel method aimed at mitigating the impact of non-IID (non-independent and identically distributed) data on federated learning (FL) models. Specifically, PerFreezeClip incorporates adaptive clipping (adap_clip) to address this challenge. The workflow of PerFreezeClip is illustrated in Figure 1. This study employs a freezing method and a gradient clipping strategy to achieve its objectives. PerFreezeClip adopts a gradient clipping technique, although it may incur additional communication overhead [21,22]. However, the introduction of a freezing method for freezing the trunk and head can compensate for the computational overhead incurred in this part. Related work [23,24] has shown that freezing methods can reduce the computational and communication resources required to train learning models in FL. Therefore, this work does not approach the problem from the perspective of communication overhead but focuses on performance on heterogeneous data. Initially, the global model is distributed from the server to all clients for local training. During local training, the global model is divided into a backbone part and a personalized head. The model hierarchy is typically divided such that the backbone contains the bottom and middle layers of the network, which extract generic and universally applicable feature representations, whereas the head is located at the top of the network, or in the last layers for a specific task, and is used to adaptively tune the model to individual needs. Each client first freezes the backbone and updates the personalized head; once convergence is achieved, it freezes the head and updates the backbone. Throughout these client-side updates, gradient clipping is applied to limit the gradient magnitudes. At the end of local training, the client model is passed to the server, which aggregates the backbone parts.
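To make the backbone/head division concrete, the following is a minimal PyTorch-style sketch. The layer sizes and the names SplitModel and set_trainable are illustrative assumptions for this sketch and do not reproduce the exact architecture used in this work.

```python
# A minimal sketch of a model split into a shared backbone and a personalized head.
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Backbone (bottom/middle layers): extracts generic feature representations.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Personalized head (top layers): adapted to each client's local data.
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a sub-network."""
    for p in module.parameters():
        p.requires_grad_(trainable)

# Example: freeze the backbone while the head adapts to local data.
model = SplitModel()
set_trainable(model.backbone, False)
set_trainable(model.head, True)
```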
The general form of federated learning is expressed as

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{m} \sum_{i \in \mathcal{S}} f_i(x),$$

where $f_i(x)$ represents the loss function of the $i$-th client, $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i}\left[ f_i(x; \xi) \right]$, and $\mathcal{D}_i$ denotes the data distribution of the $i$-th client. Here, $i \in \mathcal{S}$, where $\mathcal{S}$ represents the set of clients participating in the training process, with $|\mathcal{S}| = m$. For each $i$ and $x$, it is assumed that there is access to an unbiased stochastic gradient $\nabla f_i(x; \xi)$ of the client's true gradient $\nabla f_i(x)$.
In the context of heterogeneous data, where $i \neq j$, the data distributions $\mathcal{D}_i$ and $\mathcal{D}_j$ may exhibit notable disparities. Unlike conventional federated learning, our goal is to freeze some of the parameters during the model training and aggregation phases, to improve the stability of the federated learning process, and to obtain a model that is better suited to customization for each individual device. Meanwhile, gradient clipping in our algorithm aims at mitigating the effects caused by differences in data distribution and quality across devices, adjusting the behavior of the gradient during the training process, and making our algorithm more suitable for each device's characteristics, resulting in distinct models $x_i$ for $i \in \mathcal{S}$. Drawing upon FedRep [20], we decompose each learning model $x_i$ into two components: a global representation model $\phi \in \mathbb{R}^{d_\phi}$, which maps data points to a lower-dimensional space of size $k$, and a device-specific head model $h_i \in \mathbb{R}^{d_h}$, where $x_i = (\phi, h_i)$. Here, $d_\phi$ and $d_h$ denote the dimensions of the global representation and local head models, respectively. To facilitate personalized learning on each device, we employ a strategy in which $\phi$ and $h_i$ are frozen at different stages of the training process; once frozen, their parameters cannot be updated.
3.1. The Freezing Procedure
Algorithm 1 illustrates the entire freezing procedure. Suppose that during the $t$-th round of training, $x_i^t$ represents the model on the $i$-th client, where $i \in \mathcal{S}$. Initially, $x_i^t$ is received from the global model $x^t$. The decoupling process is represented as $x_i^t = (\phi_i^t, h_i^t)$, where $\phi_i^t$ denotes the decoupled global representation model and $h_i^t$ denotes the personalized head model. The freezing of $\phi_i^t$ and $h_i^t$ is managed at different stages of the training process based on the freezing scale parameter $p$, as indicated by the following equation:

where $\tilde{x}_i^t$ denotes the model parameters after the freeze is completed and $x_i^t$ represents the model of the $i$-th client in the $t$-th round of training. The parameter $p$ denotes the freezing scale parameter set for the global representation $\phi$, serving as a hyperparameter. Meanwhile, $L_\phi$ and $L_h$ signify the respective numbers of layers before the freezing of $\phi$ (global representation) and $h$ (personalized head). The freezing of $\phi$ and $h$ is regulated across the local epochs in accordance with $p$. In lines 1–4 of Algorithm 1, at the beginning of a local training epoch, it is first determined whether to freeze $\phi$. If the last epoch of freezing $\phi$ has been reached, $\phi$ is unfrozen, as indicated in lines 11–13. At this point, the transition is made to freezing the head $h$, as indicated in lines 7–10.
Algorithm 1

Input:
Output:   // Freeze completed
1: if :   // local_epoch denotes the number of rounds of local training
2:   for in do:
3:     final   // denotes an immutable set of parameters
4:     set
5:   end for
6: else
7:   for in do:
8:     final   // denotes an immutable set of parameters
9:     set
10:   end for
11:   if :
12:     set
13:     // Freeze completed
14: end
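As a rough illustration of Algorithm 1, the sketch below switches between freezing the backbone and freezing the head based on the current local epoch. The function name apply_freeze and the freeze_ratio argument are illustrative stand-ins for the freezing scale parameter $p$, assuming backbone and head parameter groups such as those in the earlier sketch.

```python
# A hedged sketch of Algorithm 1's per-epoch freezing decision. `freeze_ratio` is an
# illustrative stand-in for the freezing-scale hyperparameter; it fixes for how many
# local epochs the backbone stays frozen while the head adapts.
def apply_freeze(backbone_params, head_params, epoch, local_epochs, freeze_ratio):
    backbone_frozen_epochs = int(freeze_ratio * local_epochs)
    freeze_backbone = epoch < backbone_frozen_epochs
    # First phase (cf. lines 1-4): backbone immutable, only the personalized head updates.
    # Later phase (cf. lines 7-13): backbone unfrozen, head frozen instead.
    for p in backbone_params:
        p.requires_grad_(not freeze_backbone)
    for p in head_params:
        p.requires_grad_(freeze_backbone)

# Usage with the SplitModel from the earlier sketch (illustrative):
# apply_freeze(model.backbone.parameters(), model.head.parameters(), epoch, E, 0.5)
```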
The goal of freezing $\phi$ during the local training process is to maintain the static attributes of these layer parameters. This strategy preserves the feature-extraction capabilities acquired from prior training, while exclusively updating the weights of the head layer (the personalized parameters). As a result, this freezing method enhances the exploration of the optimization space for the remaining layer parameters, that is, the head-layer parameters, during subsequent training, thereby fostering personalized learning.

The process of freezing $h$ during global aggregation serves to facilitate knowledge exchange among clients independent of the personalized parameters:

In this context, the frozen $\phi$ and $h$ represent fixed sets of parameters. This signifies the immutable nature of the parameters, indicating that they are frozen and not subject to updates. Note that a relatively high value is commonly assigned to this parameter. Specifically, in cases involving heterogeneous data with distinct data distributions, this configuration leads to a reduced percentage of freezing for $\phi$, allowing for increased local adaptive updates within the head.
3.2. Detailed Procedure of Adap_Clip
After completing the freezing process and starting local training, gradient clipping is applied before the parameters are updated. In contrast to traditional approaches that employ fixed-value gradient clipping, we propose an adaptive gradient clipping algorithm referred to as adap_clip, which is inspired by the methodology outlined in [25]. adap_clip dynamically calculates the clipping threshold based on recent gradient trends, offering improved adaptability to diverse tasks, data, and models, and enhancing the stability of personalized head parameter updates. Distinguishing it from the method proposed in [25], our adap_clip incorporates a hyperparameter that imposes a global hard limit, enabling adjustments to the threshold based on the distribution of gradient trends. This adaptive mechanism ensures that the clipping operation aligns with the current gradient conditions. The adap_clip principle is underpinned by the following observations: (1) The distribution of gradient norms commonly exhibits a long-tailed pattern, with the majority of gradients being small and only a small fraction being large. (2) Parameters associated with smaller gradients predominantly contribute to stabilizing model training, while a subset of parameters with larger gradients may introduce instability, oscillation, or divergence during training.
The adap_clip algorithm predicts the variation in gradient in the current iteration by utilizing historical gradient trends to establish a reasonable gradient boundary. The process is outlined as follows:
Compute the L2 norm of the gradient at each iteration and designate the $T$-th percentile of the historical gradient L2 norms as the clipping threshold for the current iteration.
Here, $\|g_t\|_2$ represents the L2 norm of the gradient in the $t$-th iteration, while $C_t$ denotes the gradient clipping threshold in the $t$-th iteration. $\mathcal{B}_t$ signifies the batch of data selected during the $t$-th iteration, $B$ is the batch size chosen for each iteration, and $T$ corresponds to the clipping-threshold selection percentage; the first $T\%$ of the gradient norms is chosen as the clipping threshold $C_t$.
Following the dynamic computation of the clipping threshold $C_t$, a hyperparameter specifying a maximal norm value is introduced as a fixed threshold to restrict the maximum value of the gradient norms. Any gradient norm that exceeds this threshold is scaled down to the specified norm. The final clipping threshold is determined by the combination of these two thresholds: while the maximal-norm hyperparameter imposes a global hard limit, $C_t$ is adjusted based on the distribution of the gradient norms, making the clipping operation more aligned with the current gradient scenario.
This approach ensures that the clipping threshold aligns with the gradient norm, thereby effectively managing the gradient magnitude. When confronted with a large gradient norm, the dynamically calculated clipping threshold is used for clipping. Conversely, in scenarios characterized by small gradient norms, the gradient norms themselves are used as the clipping threshold, providing enhanced flexibility to accommodate diverse gradient clipping needs.
The gradient of each data point within every training batch will be subject to clipping. The averaged clipped values of all gradients in that batch will represent the gradient for that iteration.
By leveraging the adap_clip algorithm to flexibly determine the clipping threshold, our PerFreezeClip algorithm can dynamically execute gradient clipping for various models and tasks, thereby enhancing model stability and generalization. This adaptive approach offers greater flexibility in addressing diverse gradient distribution scenarios, as opposed to relying solely on a fixed threshold.
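The following is a minimal sketch of how such a combined threshold could be computed, assuming per-sample gradients are available as flattened tensors. The function name adaptive_clip, the use of min() to combine the two thresholds, and the inclusion of the current norm in the history are illustrative choices rather than the exact implementation.

```python
import torch

def adaptive_clip(per_sample_grads, norm_history, T: float, max_norm: float):
    """Hedged sketch of adap_clip: clip each per-sample gradient to a threshold taken
    from the T-th percentile of recent gradient norms, hard-capped by `max_norm`,
    then average the clipped gradients to obtain the batch gradient."""
    # Record the current batch gradient norm for the running history of norms.
    batch_grad = torch.stack(per_sample_grads).mean(dim=0)
    norm_history.append(batch_grad.norm(p=2).item())
    # Dynamic threshold: T-th percentile of the historical L2 norms, capped globally.
    dynamic = torch.quantile(torch.tensor(norm_history), q=T / 100.0).item()
    threshold = min(dynamic, max_norm)
    clipped = []
    for g in per_sample_grads:
        g_norm = g.norm(p=2).item()
        scale = min(1.0, threshold / (g_norm + 1e-12))  # shrink only large gradients
        clipped.append(g * scale)
    # Averaged clipped gradients represent the gradient for this iteration.
    return torch.stack(clipped).mean(dim=0)
```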
3.3. Pseudocode for PerFreezeClip
Initially, PerFreezeClip selects a set of clients $\mathcal{S}_t$ from $\mathcal{S}$, where $\mathcal{S}_t \subseteq \mathcal{S}$ and the sampling is performed without repetition. Meanwhile, the server holds the model $x^t$. PerFreezeClip, as shown in Algorithm 2, consists of two primary steps: local training and server aggregation. It incorporates the freezing and clipping methods during the local training phase. Lines 5–17 of Algorithm 2 describe the process of local training for each client. In line 6, each local model decides whether to freeze the global representation $\phi$ or the personalized head $h$ at the current epoch, based on the epoch it is currently in and the freezing scale parameter $p$. The detailed freezing process is described in Algorithm 1. The process of updating and clipping is described in lines 9–14, where adaptive gradient clipping is imposed in each round of local training. Line 19 describes the process of aggregating $\phi$. The update operation for local training is illustrated by Equation (9):
The optimizer, representing a gradient-based optimizer (e.g., SGD) with a learning rate of $\eta$, describes the update process for lines 13 and 14 of Algorithm 2. Here, $\tilde{g}$ signifies the clipped gradient. Furthermore, apart from the final round, in which all parameters are updated, the specific parameter updates are contingent upon the frozen state during the various epochs of local training. In each round of local training, the freezing procedure (Algorithm 1) is entered first. At the beginning of training, the backbone is frozen according to the freezing ratio parameter $p$; this hyperparameter determines for how many epochs the backbone remains frozen. During this phase, the weights of the head layer are updated, allowing the head to adapt quickly to the local data. At the same time, gradient clipping is applied to the head-layer updates to limit the gradients and ensure the stability of the overall gradient. In the state where $\phi$ is frozen, the parameter updates adhere to Equation (10):

Conversely, in the state where $h$ is frozen, the updates are governed by Equation (11):
Global aggregation involves amalgamating all the updated models after local training is completed, following FedAvg's standard aggregation procedure. At this juncture, $x^{t+1} = \frac{1}{|\mathcal{S}_t|} \sum_{i \in \mathcal{S}_t} x_i^t$, wherein $x_i^t$ signifies the locally trained model of each client $i$, and $x^{t+1}$ denotes the global model for the next iteration. The process is summarized in Algorithms 1 and 2.
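A minimal sketch of this server-side step is given below, assuming each client returns the state dict of its backbone. The helper name aggregate_backbones is illustrative; only the global representation is averaged, while the personalized heads remain on the devices.

```python
import torch

def aggregate_backbones(client_backbone_states):
    """Hedged sketch of global aggregation: average the clients' backbone (global
    representation) parameters, FedAvg-style; personalized heads stay local."""
    avg_state = {}
    for name in client_backbone_states[0]:
        avg_state[name] = torch.stack(
            [state[name].float() for state in client_backbone_states]
        ).mean(dim=0)
    return avg_state  # load into the global backbone for the next round

# Usage (illustrative):
# global_model.backbone.load_state_dict(
#     aggregate_backbones([m.backbone.state_dict() for m in client_models]))
```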
Algorithm 2

Input: Initialize model parameters, learning rate, client set S, number of clients involved in model training, number of communication rounds, number of local iterations, size of local training batch B, percentile of gradient clipping threshold selection T, freezing ratio for the global representation, maximal value of norms
1: for do
2:   server: send to all clients
3:   for each client in parallel do
4:     // initialization
5:     for do   // start training
6:       // denote the number of and layers before freezing
7:       for do   // is a randomized non-repeating batch of data from
8:         for do
9:           // calculate the gradient
10:          // adaptive clipping threshold selection
11:
12:
13:          // update
14:          // update
15:        end for
16:      end for
17:    end for
18:  end for
19:   // freezing state (aggregation of the global representation)
20:  Randomly select a subset of clients without repetition and send the updated model to all clients in this subset
21: end for
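Putting the pieces together, the sketch below mirrors the client-side portion of Algorithm 2 (lines 5–17) at a high level, reusing the illustrative apply_freeze helper from the earlier sketch. The optimizer, loss function, and the batch-level (rather than per-sample) clipping are simplifying assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def local_training(model, loader, local_epochs, freeze_ratio, lr, T, max_norm):
    """Hedged sketch of PerFreezeClip's local phase, assuming a SplitModel-style
    model with `backbone` and `head` submodules."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    norm_history = []
    for epoch in range(local_epochs):
        # Decide which part (backbone or head) is frozen at this epoch (Algorithm 1).
        apply_freeze(model.backbone.parameters(), model.head.parameters(),
                     epoch, local_epochs, freeze_ratio)
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            # Adaptive clipping (adap_clip), applied here to the full batch gradient
            # rather than per-sample gradients, for brevity.
            params = [p for p in model.parameters() if p.grad is not None]
            flat = torch.cat([p.grad.reshape(-1) for p in params])
            norm_history.append(flat.norm(p=2).item())
            threshold = min(
                torch.quantile(torch.tensor(norm_history), q=T / 100.0).item(),
                max_norm)
            torch.nn.utils.clip_grad_norm_(params, threshold)
            opt.step()
    return model.backbone.state_dict()  # only the backbone is sent for aggregation
```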
4. Theoretical Analysis of Gradient Clipping
In heterogeneous data environments, we contend that during each communication round involving personalized learning on devices, the local model is updated towards the local optimum, which may diverge significantly from the global optimum. As a result, the average model may also deviate from the global optimum, especially in scenarios with significant local updates. These updates actually move along the gradient direction. Thus, a straightforward method to regulate these updates is by restricting the gradient, thereby constraining the gradient range within the personalized context. In addressing this issue, we counteract overfitting in local updates by applying a gradient clipping technique during parameter updates to confine the gradient range. The subsequent analysis is detailed below:
During the aggregation step, $h$ remains in a frozen state, indicated by marking the head parameters as final (an immutable set of parameters); notably, $h$ does not participate in the aggregation update. We can reframe the FedAvg update as follows:

Since $h$ in $x_i^t$ consistently remains uninvolved in the update, the aggregation can be interpreted as acting on the global representation $\phi$ alone. Another reformulation of Equation (13) is presented below:

where we define $\Delta_i^t$ to represent the update of the $i$-th client. Subsequently, $\Delta^t$ is introduced to denote the global update obtained by averaging the updates of the global representations of all clients (i.e., the difference between the updated global model and the global model from the previous round). Given that $h$ is in a frozen state and does not participate in the update, only the global representation is aggregated, and the parameter update can be denoted as $\Delta^t = \frac{1}{|\mathcal{S}_t|} \sum_{i \in \mathcal{S}_t} \Delta_i^t$ and $\phi^{t+1} = \phi^t + \Delta^t$.

Based on the aforementioned definitions, it is evident that $\Delta^t$ functions as the pseudo-gradient of the applied stochastic gradient descent (SGD) algorithm, guiding the server's updates to the global representation $\phi$. This definition clarifies that, after using SGD on the client, additional operations can be conducted to manage the pseudo-gradient $\Delta^t$, including gradient clipping. As a result, we adaptively scale the gradients of the parameters before updating them. Furthermore, we posit the following assumptions:
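To make the pseudo-gradient view concrete, the following sketch forms the averaged client update of the global representation and optionally clips it before the server applies it. The function name server_update, the per-tensor clipping rule, and the unit server learning rate are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def server_update(global_backbone, client_backbones, clip_threshold=None):
    """Hedged sketch: treat the averaged client update of the global representation
    as a pseudo-gradient, optionally clip its norm, and apply it on the server."""
    new_state = {}
    for name, global_param in global_backbone.items():
        # Delta_i = client backbone minus the backbone it started from; Delta = mean_i Delta_i.
        delta = torch.stack(
            [cb[name].float() - global_param.float() for cb in client_backbones]
        ).mean(dim=0)
        if clip_threshold is not None:
            norm = delta.norm(p=2).item()
            delta = delta * min(1.0, clip_threshold / (norm + 1e-12))
        new_state[name] = global_param + delta  # one pseudo-SGD step, unit server lr
    return new_state
```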
Assumption 1 (Lipschitz Gradient) [26]. The function $f_i$ is $L$-smooth for all $i \in \mathcal{S}$, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
Assumption 1 places a restriction on the gradient of the loss function of each client, referred to as the Lipschitz Gradient condition. Consider a set of clients, each associated with a loss function $f_i$, where $i$ denotes the client index and $m$ represents the total number of clients. The parameter $L$ in this assumption is a positive constant denoting the Lipschitz constant of each client's gradient. It is postulated that the gradient of each client's loss function complies with the following property: for all $x, y \in \mathbb{R}^d$ (where $d$ signifies the dimension of the real space), the disparity between the gradients, $\|\nabla f_i(x) - \nabla f_i(y)\|$, does not surpass $L$ times the distance $\|x - y\|$ between the variables $x$ and $y$. Here, $\|\cdot\|$ denotes the L2 norm (Euclidean norm).
We argue that the gradient of the loss function does not change too drastically for each client, and that the rate of change in the gradient is globally limited to a constant L. This is a smoothing requirement that ensures the gradient of the loss function does not vary too much in the local region, which is beneficial for the stability and convergence of the optimization algorithm. Therefore, the Lipschitz Gradient assumption ensures that PerFreezeClip can find an L (threshold) for limiting the range of gradient updates among several local adaptation steps. It indirectly satisfies the condition of Lipschitz continuity by setting the threshold for gradient clipping. The Lipschitz Gradient (Gradient clipping) ensures that the rate of change in the gradient is somewhat limited, which helps to enhance the convergence speed and stability of PerFreezeClip.
Hence, to apply gradient clipping correctly and satisfy Lipschitz continuity, it is important to choose the gradient clipping threshold wisely. Setting the threshold too small may cause the gradient to be over-clipped, affecting the convergence performance of the model, while setting it too large may fail to satisfy the Lipschitz continuity condition. A wise choice of the clipping threshold can combine the following four aspects: (1) an initial estimate based on the model and the task, where the typical range of the gradient is estimated through preliminary experiments or previous experience with similar tasks; (2) statistical analysis of the gradient, where the statistical properties of the gradient, including the mean and standard deviation, are monitored during training (a simple heuristic along these lines is sketched below); (3) the model architecture and optimizer, which may have different sensitivities to gradient clipping; and (4) experimental validation, where a series of experiments is conducted to validate the model's performance after the threshold is set. Therefore, we introduce a hyperparameter that acts as a global hard limit, which is analyzed through the following assumptions:
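For aspect (2) above, a simple illustrative heuristic is sketched below. The mean-plus-k-standard-deviations rule and the function name suggest_clip_threshold are assumptions for demonstration, not the thresholding rule prescribed by this work.

```python
import statistics

def suggest_clip_threshold(grad_norm_history, k: float = 2.0):
    """Hedged sketch of aspect (2): monitor gradient-norm statistics during training
    and derive a candidate clipping threshold from them."""
    mean = statistics.mean(grad_norm_history)
    std = statistics.pstdev(grad_norm_history)
    return mean + k * std  # clip only gradients that are unusually large

# Example: threshold = suggest_clip_threshold([0.8, 1.1, 0.9, 5.0, 1.0])
```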
Assumption 2 (Bounded Gradients) [27]. The function $f_i$ has $G$-bounded gradients; for any $i \in \mathcal{S}$, $x \in \mathbb{R}^d$, and $\xi \sim \mathcal{D}_i$, we have $\left| \left[ \nabla f_i(x; \xi) \right]_j \right| \le G$ for all $j \in [d]$.
Assumption 2 imposes a restriction on the gradient of the loss function of each client in all dimensions and is known as the Bounded Gradients condition. Specifically, assume that for each client $i$ (where $i$ denotes the index of the client), the absolute value of the gradient of the loss function is bounded in all dimensions for any $x$ and $\xi$ (where $\xi$ denotes a random variable drawn from the client's data distribution). That is, $\left| \left[ \nabla f_i(x; \xi) \right]_j \right| \le G$ for every dimension $j$, where $G$ is a positive constant representing the upper bound of the gradient of each client's loss function in all dimensions.
We argue that the gradient of each client’s loss function does not exceed G in each dimension. This assumption ensures that the gradient of each client’s loss function is not too large in each dimension, constrained by a global upper bound G. The gradient boundedness condition in Assumption 2 can be seen as a formal representation of gradient clipping. The hyperparameter we introduce is the upper bound G on this global restriction. This can help us to better limit the update of the gradient.