1. Introduction
The Internet of Things (IoT) has permeated many aspects of our daily lives through the proliferation of smart services and applications empowered by artificial intelligence (AI) [1]. IHS Markit estimates that by 2030 the number of IoT devices will reach 125 billion [2]. These devices generate an enormous amount of data that industry aggregates in the cloud, and machine learning (ML) has evolved over the last few years into an effective methodology for handling such complex data [3]. However, IoT-generated data often contain sensitive information about users or organizations, such as facial images, location information, health information, or fingerprints, and uploading this information could result in information leakage. The management and governance of data resources is therefore the focus of regulatory efforts in many countries. Countries and regions have introduced laws such as the General Data Protection Regulation (GDPR) [4] in the EU, the Personal Data Protection Act (PDPA) [5] in Singapore, and the California Consumer Privacy Act (CCPA) [6] and the Consumer Privacy Bill of Rights (CPBR) [7] in the US to protect data privacy and security. This makes traditional centralized machine learning, in which raw data generated on different devices or in different organizations are collected into a single server or a cluster with shared data storage, infeasible.
Federated Learning (FL) is a distributed machine learning framework that allows user devices to benefit from models trained on rich data without those data ever being collected centrally, thereby safeguarding user privacy and security. An FL system comprises two parts: the edge mobile devices (also known as clients) and the central server. The learning task is accomplished by a decentralized federation of participating clients coordinated by the central server. Each client's dataset is used only for local training and is never uploaded to the server; instead, each client updates the global model downloaded from the server and sends the updated model back. The basic FL process is described in [8] and shown in Figure 1: the model is first trained locally, the model parameters are then sent to the server, and the cloud server finally returns the aggregated model to the local devices. These steps are repeated throughout federated learning until convergence. FedAvg simply averages the locally uploaded models on the server [8]. In practice, FedAvg does not perform well when confronted with the complicated diversity of devices and data found in the real world. Li et al. [9] proposed FedProx, a federated optimization method designed to handle heterogeneity in FL; it improves stability by adding a proximal term to the local training subproblem on every device. Unfortunately, this method does not meet the needs of all applications and does not improve convergence speed. FedMA is a layer-wise FL algorithm for modern CNNs and LSTMs that appeals to Bayesian nonparametric methods to adapt to data heterogeneity [10]. However, such statistics-oriented aggregation methods are susceptible to model poisoning attacks.
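To make the server-side aggregation step concrete, the following is a minimal sketch of FedAvg-style weighted averaging of client models in PyTorch. The function name and the assumption that clients report their local sample counts are ours rather than details taken from [8].

```python
import torch
from typing import List, Dict

def fedavg_aggregate(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of client model parameters (FedAvg-style).

    client_states: state_dicts returned by the participating clients.
    client_sizes:  number of local training samples per client, used as weights.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # Accumulate each parameter tensor, weighted by the client's data share.
        global_state[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```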
The primal-dual optimization method is a basic technique extensively employed in distributed learning. The alternating direction method of multipliers (ADMM) is a straightforward yet powerful approach that is particularly suitable for distributed convex optimization problems, especially in applied statistics and machine learning [11]. ADMM takes the form of a decomposition-coordination technique, in which solutions to small local subproblems are coordinated to find the solution to a large global problem. ADMM relies on a user-selected penalty parameter (step size). In theory, ADMM converges for any constant penalty parameter [12,13,14]; in practice, however, its efficiency depends heavily on the choice of this parameter [15] and can be enhanced by adaptive penalty selection methods [16]. Given the effectiveness of ADMM in distributed optimization, we adopt it in the context of federated learning (FL).
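For reference, the global-variable consensus ADMM iterations commonly take the following form; the notation here ($w_i$ for local models, $z$ for the global consensus variable, $\lambda_i$ for dual variables, $\rho$ for the penalty parameter) is ours and is only meant to illustrate the standard scheme.

```latex
% Global-variable consensus ADMM with penalty parameter \rho (standard form):
\begin{aligned}
w_i^{k+1} &= \arg\min_{w_i}\; f_i(w_i) + \langle \lambda_i^{k},\, w_i - z^{k}\rangle
             + \tfrac{\rho}{2}\,\lVert w_i - z^{k}\rVert^{2}, \\
z^{k+1}   &= \frac{1}{m}\sum_{i=1}^{m}\Bigl(w_i^{k+1} + \tfrac{1}{\rho}\,\lambda_i^{k}\Bigr), \\
\lambda_i^{k+1} &= \lambda_i^{k} + \rho\,\bigl(w_i^{k+1} - z^{k+1}\bigr).
\end{aligned}
```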
Several challenges distinguish optimization problems in federated learning, also known as federated optimization, from traditional distributed optimization: user privacy, costly communication, and system and statistical heterogeneity [9]. The privacy challenge arises because, although federated learning protects on-device data by sharing only model updates, communicating these updates throughout training can still expose sensitive information to third parties or the central server [17]. Costly communication refers to the need for communication-efficient methods in vast IoT networks, where small messages or model updates must be sent iteratively during training, rather than the entire dataset, to fit the model to the data generated by devices in the federated network [18]. System heterogeneity refers to differences in the storage, computational, and communication capabilities of the participating IoT devices, while statistical heterogeneity refers to differences in how devices collect and generate data and in the amount of data each device holds. Clients participating in federated learning are typically highly heterogeneous in both local dataset size and computational speed [19]. Such challenges are common in real-world IoT scenarios: many edge devices with computing power (wearable sensors, mobile phones, watches, smart appliances, and so on) generate diverse data and are widely distributed across the network; they may wish to participate in federated training, but they do not want their privacy to be compromised.
Expensive communication is a significant bottleneck in federated learning, where a large number of widely distributed clients must retain their data locally. In actual IoT communication networks, communication can be several orders of magnitude slower than local computation [20,21], which makes the development of communication-efficient methods urgent. Such methods address two key aspects: (i) reducing the total number of communication rounds and (ii) decreasing the size of the information transmitted per round.
In this work, we propose a new federated optimization algorithm built on adaptive consensus ADMM. The method copes well with heterogeneity and converges in fewer communication rounds.
The main contributions of this paper are summarized as follows.
We propose and demonstrate a communication-efficient federated optimization algorithm that performs well in heterogeneous settings. The method uses adaptive consensus ADMM to select local penalty parameters: instead of estimating a single global penalty parameter for all clients, a separate local penalty parameter is generated for each client based on the local gradient of its subproblem (an illustrative sketch follows this list).
We propose and adopt a new update of the client objective function in the spirit of dual ascent and the method of multipliers. This update aligns the gradients of the server and of the individual clients.
By comparing against baselines on multiple popular datasets, we demonstrate that our federated optimization algorithm improves communication efficiency and is robust to client heterogeneity, especially in highly heterogeneous settings.
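As a concrete but generic illustration of per-client penalty adaptation, the sketch below applies the classical residual-balancing heuristic to each client's penalty separately. This is a standard adaptive-ADMM rule used for illustration only; it is not necessarily the gradient-based rule adopted by FedACADMM.

```python
def update_penalty(rho: float, primal_res: float, dual_res: float,
                   mu: float = 10.0, tau: float = 2.0) -> float:
    """Classical residual-balancing update for one client's penalty parameter.

    Increase rho when the primal residual dominates, decrease it when the
    dual residual dominates, and leave it unchanged otherwise.
    """
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho
```

Each client would call such a rule with its own residuals before its local update, which yields the per-client penalties mentioned in the first contribution.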
The rest of the paper is organized as follows. In Section 2, we provide an overview of related work. In Section 3, we review the core algorithmic idea of FL and describe the proposed framework, FedACADMM. The convergence guarantee for FedACADMM is derived in Section 4, and the evaluation results are presented and analyzed in Section 5. Finally, Section 6 concludes the paper.
2. Related Work
Since FL was put forward [8], it has been aimed at solving the security problems caused by the diversity and complexity of communication networks and the difficulty of obtaining private data [22]. In recent years, research and discussion on federated optimization have been very active. Smith et al. [23] presented MOCHA, a federated multi-task learning approach for heterogeneity that takes communication cost, stragglers, and fault tolerance into account; however, it is not suitable for non-convex models. The objective function of a non-convex model is not a convex function, i.e., a function for which the line segment connecting any two points on its graph lies on or above the graph. Non-convex models can therefore have multiple local minima and maxima, and their optimization problems are more complex. In non-convex settings, FedAvg is shown to be effective, but it does not perform well under the device heterogeneity encountered in the real world [8]. FedProx modifies FedAvg by adding a weighted proximal term to the local objective function to account for heterogeneity [9]; choosing this proximal term can be challenging for complicated objectives, and subsequent research found it unclear whether FedProx actually improves the convergence speed [24]. q-FFL, a communication-efficient method inspired by techniques for fair resource allocation in wireless networks, alters the FedAvg objective by modifying the weights of different devices [25], and it performs well in both convex and non-convex models.
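For concreteness, the FedProx local subproblem adds a proximal term to each client's objective; here $w^{t}$ denotes the current global model and $\mu$ the proximal weight.

```latex
% FedProx local objective on client i at round t:
\min_{w}\; h_i(w; w^{t}) \;=\; f_i(w) \;+\; \frac{\mu}{2}\,\lVert w - w^{t}\rVert^{2}.
```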
Numerous studies have discussed the role of ADMM in distributed machine learning [11]. In [26], R-ADMM was derived as an application of the Peaceman–Rachford splitting; using results from operator theory, the authors proved almost sure convergence of the proposed algorithm under broad assumptions about the distribution of communication losses and node activation events. The authors of [27] derived a perturbation strategy for ADMM in which the perturbed term is tied to the penalty parameters, and they showed that this simultaneously increases utility and privacy.
27] derived a perturbation strategy for ADMM, where the perturbed term is linked to the penalty parameters, which increases the utility and privacy at the same time.
Primal-dual methods are also widely used in FL. Based on the Peaceman–Rachford splitting approach, [28] developed FedSplit to solve the distributed convex problem and proved that FedSplit solves FL problems under convexity alone, without additional assumptions on the homogeneity of the system or the data. FedSplit, however, is only applicable to deterministic algorithms and excludes the stochastic approximate gradient updates commonly used in practice. FedPD is a primal-dual-based algorithm that uses the dual variable to capture the difference between local and server learning [29]; we note that all clients in FedPD participate in training continuously. Ref. [30] considered both synchronous and asynchronous settings and applied the non-convex Douglas–Rachford splitting method, a randomized block-coordinate strategy, and asynchronous updating to solve the non-convex composite optimization problem in federated learning. The resulting FedDR and asyncFedDR, unlike FedSplit and FedPD, update only a subset of clients in each round and can run synchronously or asynchronously.
Our research was inspired by [31,32]. To increase the robustness of federated optimization, FedADMM [31] augments the local objective function with dual and augmented-Lagrangian terms; this method works very effectively on large-scale networks with diverse data. Ref. [32] concentrated on a practical modification of the decomposition method of [16], allowing the penalty parameter to change automatically according to specific adaptive principles; unlike the original method, it varies the penalty parameters automatically within an acceptable range and constructs appropriate adaptive rules. Combining the above methods, we propose the following: (i) select only a subset of clients to participate in each round of local updates, (ii) adaptively update the penalty parameters before the local update, and (iii) formulate a suitable self-adaptive rule.
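An illustrative skeleton of one communication round under these three design choices is sketched below. The data structures, the placeholder penalty rule, and the local SGD solver are our assumptions for illustration; they are not the exact FedACADMM procedure.

```python
import random
import torch

def run_round(z, clients, frac=0.1, lr=0.1, local_steps=5):
    """One illustrative ADMM-style federated round: sample a client subset,
    adapt each client's penalty, solve the local augmented-Lagrangian
    subproblem by SGD, and average the results on the server.

    z       : global model parameters (1-D tensor, no grad required)
    clients : list of dicts with keys "x", "y" (local data), "lam" (dual
              variable, same shape as z), and "rho" (local penalty)
    """
    selected = random.sample(clients, max(1, int(frac * len(clients))))
    local_models = []
    for c in selected:
        # (ii) adapt this client's penalty before its local update
        #      (placeholder rule; the paper uses a gradient-based rule).
        c["rho"] = min(c["rho"] * 1.1, 10.0)
        w = z.clone().requires_grad_(True)
        for _ in range(local_steps):
            # local loss + dual term + quadratic penalty (consensus ADMM form)
            loss = (torch.nn.functional.mse_loss(c["x"] @ w, c["y"])
                    + (c["lam"] * (w - z)).sum()
                    + 0.5 * c["rho"] * ((w - z) ** 2).sum())
            loss.backward()
            with torch.no_grad():
                w -= lr * w.grad
            w.grad = None
        # local dual ascent step, then report w + lam / rho to the server
        c["lam"] = c["lam"] + c["rho"] * (w.detach() - z)
        local_models.append(w.detach() + c["lam"] / c["rho"])
    # (i) the server aggregates only the sampled subset
    return torch.stack(local_models).mean(dim=0)
```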
4. Analysis
The goal of this section is to establish the global convergence for FedACADMM. We initially assert that the chosen stopping criterion is reasonable.
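For orientation, the global-variable consensus form on which ADMM-style FL analyses are typically built, together with its stationarity conditions, can be stated as follows; the notation here is ours and may differ from the paper's own formulation.

```latex
% Consensus reformulation of the FL objective (illustrative notation):
\min_{w_1,\dots,w_m,\;z}\;\; \sum_{i=1}^{m} f_i(w_i)
\quad \text{s.t.} \quad w_i = z,\;\; i = 1,\dots,m.

% A point (w^\star, z^\star, \lambda^\star) is stationary when
\nabla f_i(w_i^\star) + \lambda_i^\star = 0,\qquad
\sum_{i=1}^{m} \lambda_i^\star = 0,\qquad
w_i^\star = z^\star,\;\; i = 1,\dots,m.
```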
Let be the sequence created by our approach in the following analysis. We try to prove that it converges to a solution point . We rewrite the task of FL as finding the optimal parameter that minimizes the overall loss; we also denote , where . The solution point is a stationary point of (12) if it satisfies
It is not difficult to prove that any locally optimal solution to problem (12) must satisfy (13). If is a convex function, then there is one and only one point that is a globally optimal solution when (13) is satisfied. Moreover, a solution point of (12) indicates

So, is also a solution point of

Let us consider the differences between and . Combining with the third equality of (7) and performing some manipulations, we get:

We provide some lemmas to help prove the main convergence theory; the following lemma provides a desirable property of the last term of (16).
Lemma 1. For , we have

Proof of Lemma 1. For , we have

In the previous local update:

Similarly, we can get:

Adding (18) and (23) and using the monotonicity of the operator f, we obtain:

Similarly, adding (19) and (24) gives:

Combining (25) and (26) together with (20) yields the statement of this lemma. □

So, we can obtain the inequality:
Lemma 2. We define , and therefore, . Assume , and we can obtain:

Theorem 1. Let denote a solution to the variational inequality problem. Then, there exists a constant such that, if is satisfied:

Proof of Theorem 1. Combining (28), (29), and the Cauchy–Schwarz inequality, we obtain:

Assuming that and hold, after a series of transformations we obtain Theorem 1. □
Theorem 2. Let be the sequence produced by our method; then we obtain:

Theorem 3. Let be the sequence produced by our method; then we obtain:

Proof of Theorem 3. From , and , are bounded. Suppose

It follows from Theorem 1 that, for ,

As a result, there exists a constant such that

Combining (32) and (37), we obtain

and then

This means that the sequence is bounded above and bounded away from zero, so we have

and from (30), Theorem 3 is proved. □
If the sequence is generated by (8), then it satisfies the following two conditions:

Condition C1. and , where

Condition C2. , where

In this case, we can obtain

So, it can be inferred that

Thus,

Therefore, the proposed method with the parameter-adjusting strategy in (8) is convergent. Two sequences converge to the optimal function value of (15), namely,

Any accumulation point of the sequence is an optimal solution to (12), where is an optimal solution to (15). Our algorithm has been shown to achieve both global and linear convergence, under the assumption that the Lagrange multipliers satisfy specific initialization conditions and provided that the local gradients are Lipschitz continuous.
5. Experiments
This section presents multiple numerical experiments evaluating the performance of FedACADMM. We first describe our experimental setup, including datasets, models, and data distributions, in Section 5.1. We then examine the effect of modifying the penalty parameters and compare our proposed method with several baseline methods in Section 5.2 and Section 5.3.
5.1. Experimental Setup
Datasets and Models. In our study, we adopted the standard federated configuration in which every client owns its local data and communicates with the central server to transmit and receive information. Our experiments involved handwriting recognition and image classification using three well-known benchmark datasets: MNIST [35], Fashion-MNIST (FMNIST) [36], and CIFAR-10 [37]. MNIST consists of handwritten digits from zero to nine; FMNIST includes images of 10 categories of clothing, such as T-shirts, pants, and sneakers; and CIFAR-10 is composed of 10 classes of objects, including aircraft, cars, and dogs. For these experiments, we utilized two CNN models, each comprising two convolutional layers, two fully connected layers, and a linear transformation layer. The detailed architecture of the models is provided in Table 2.
Data Distribution. To investigate federated optimization and conduct comparative experiments, we adopted the data distribution settings used in the FedAvg approach. Specifically, we used two data division methods: IID (homogeneous) and non-IID (heterogeneous). Under the IID method, clients receive an equal share of the data. Under the non-IID method, we sorted the training data by label, divided them evenly into shards, and randomly assigned two shards to each client.
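A minimal sketch of this shard-based non-IID split is shown below; the helper name and the fixed random seed are ours, while the two-shards-per-client setting follows the FedAvg setup described above.

```python
import numpy as np

def noniid_shard_split(labels: np.ndarray, num_clients: int,
                       shards_per_client: int = 2, seed: int = 0):
    """Sort samples by label, cut them into (roughly) equal shards, and give
    each client `shards_per_client` randomly chosen shards (FedAvg-style)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")      # group sample indices by label
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)     # roughly equal-sized shards
    shard_ids = rng.permutation(num_shards)
    client_idx = {}
    for c in range(num_clients):
        own = shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        client_idx[c] = np.concatenate([shards[s] for s in own])
    return client_idx
```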
Validation metrics. To assess the generalization performance of the approaches, we evaluated on the entire test set of each well-known federated dataset. Because convergence speed and final performance are both important indicators in federated learning, we measured the performance attained after a specified number of rounds and the number of communication rounds required to reach the test target accuracy. The ADMM approach typically converges to reasonable accuracy within a few tens of iterations, so after preliminary estimation we set one hundred experimental rounds for all datasets. The target accuracies used for evaluation are listed in Table 3. For techniques that did not achieve the target accuracy within the maximum number of communication rounds ( ), we append a plus sign (+) to the communication round. Additionally, we report the communication-round savings of FedACADMM compared to the classical FedAvg algorithm.
Implementation Details. We implemented FedACADMM in PyTorch, simulating a federated learning network with one server and m clients. To enable comparisons with other baselines, we reused the implementations of FedAvg and FedProx. All numerical experiments were conducted on a cloud computing platform using a virtual machine with 4 vCPUs, 16 GB of memory, and an NVIDIA RTX A4000 GPU. The operating system used for training was Ubuntu 18.04 LTS, and we employed PyTorch 1.8.1 as our deep learning framework.
We set the global learning rate to 1 for all methods and used a client subset equal to 10% of the total number of clients. For all algorithms, we used the SGD optimizer with a learning rate of 0.1 for local updates.
5.2. Choosing the Initial Penalty Parameter
When designing the evaluation of FedACADMM, the first consideration was determining an appropriate initial self-adaptive penalty parameter, denoted , which is crucial for (8). To balance the initial primal residual and dual residual , we employed an empirical rule-based heuristic for this hyperparameter: if either of the two terms gradually approaches zero, the algorithm converges slowly. A further consideration behind the rule is to avoid additional computation. In practice, one approach is to run FedACADMM with various values of in parallel to obtain multiple final global models and then select among them based on performance (e.g., accuracy) on validation data. In this paper, we adopted this empirical rule-based heuristic to narrow the search space and select from a limited set of candidates, namely , following the experimental setups of [9,32].
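A sketch of the parallel selection procedure described above might look like the following; the candidate values shown are illustrative placeholders rather than the paper's actual candidate set, and the two callables are assumed interfaces.

```python
def select_initial_rho(train_fn, eval_fn,
                       candidates=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Run the algorithm once per candidate initial penalty and keep the value
    whose final global model scores best on validation data.

    train_fn(rho0) -> trained global model
    eval_fn(model) -> validation accuracy
    """
    best_rho, best_acc = None, -1.0
    for rho0 in candidates:
        model = train_fn(rho0)
        acc = eval_fn(model)
        if acc > best_acc:
            best_rho, best_acc = rho0, acc
    return best_rho, best_acc
```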
Figure 2 compares test accuracy curves and training loss curves across various initial penalty values, denoted . For clarity, we omit the curves associated with poor experimental results when . Our experiments demonstrate that FedACADMM is a robust approach for both homogeneous and heterogeneous data distributions. Furthermore, FedACADMM performs well within a limited number of communication rounds on different datasets and is not highly sensitive to the choice of the initial parameter . FedACADMM performs better when the initial value is . Although it is not as good as in most cases, its performance is very stable, without large deviations, and it maintains robust convergence. We therefore chose as the initial penalty parameter for the comparison experiments with the baselines. With this initial penalty parameter, the global model can effectively integrate local data and achieve robust and fast updates, maintaining a good balance between and . Figure 2d shows that stalls, which is why we did not choose it as the initial penalty parameter. In practice, the chosen penalty parameter affects how quickly ADMM converges.
When an appropriate initial penalty parameter is chosen, the proposed algorithm is robust and achieves good results in both homogeneous and heterogeneous settings. In practical applications, the convergence speed of FedACADMM depends on the chosen penalty parameter, and the actual convergence is particularly sensitive to this selection. For instance, when the initial penalty parameters are set to , convergence is usually very slow. Nevertheless, FedACADMM is robust to the choice of the initial penalty parameter . When appropriate initial values are selected, FedACADMM demonstrates strong robustness and achieves excellent model training results and fast convergence in both IID and non-IID data distributions.
5.3. Comparison with Other Baselines
Comparing the proposed algorithm with existing mainstream federated optimization methods is crucial for assessing model performance. However, given the constraints of current communication channels, reducing communication overhead remains a challenge for FL. It is therefore reasonable to expect the proposed algorithm to minimize the number of communication rounds while coping well with heterogeneity.
In the following comparative experiments, we contrast the performance of FedACADMM with that of FedAvg and FedProx. Specifically, we investigate their efficacy across three large empirical datasets, various data distribution schemes, and diverse client numbers. Our experiments produced promising results.
Table 4 shows the communication rounds required to reach the test target accuracy under various conditions: for each of the three algorithms, we report the number of rounds needed to attain the target accuracy on three distinct datasets, with varying data distributions and a range of client population sizes. Additionally, we provide a graphical representation of the results in Figure 3, which facilitates a more insightful interpretation. The communication reductions achieved by FedACADMM over FedAvg are shown in Table 5. In most circumstances, the proposed FedACADMM requires fewer rounds than the baselines to attain the test target accuracy. To communicate our experimental findings visually, we present a selection of representative results in Figure 4.
In Figure 4a–d, the results of each algorithm on the FMNIST dataset with the IID data distribution are shown for different numbers of clients; all three algorithms perform quite well under homogeneous conditions, and the test accuracy of FedACADMM is slightly better than that of the other two. In Figure 4i–l, the results of each algorithm on the CIFAR-10 dataset with the non-IID data distribution are shown for different numbers of clients; when the number of clients is small, the test accuracy of FedACADMM is almost the same as that of the other two algorithms, but its advantage becomes obvious as the number of clients increases, which suggests that it is suited to large-scale communication. Figure 4e–h shows the results of each algorithm on the FMNIST dataset with the non-IID data distribution for different numbers of clients; we do not discuss these further because they are similar to the CIFAR-10 results. Figure 5 shows the training loss corresponding to the test accuracy in Figure 4: the convergence performance of FedACADMM matches the other two algorithms in most heterogeneous cases and exceeds them in homogeneous cases. In Figure 6, we selectively display the test losses for some of the experiments shown in Figure 4. Figure 7 displays the confusion matrix, which indicates the performance on the predicted classes; as shown, the proposed model achieves high accuracy in classifying each dataset.
Through the detailed analysis above, we can see some advantages of FedACADMM:
Communication efficiency. Under the homogeneous data distribution, almost all of the methods quickly reach the test target accuracy, and the advantages of our proposed method become more visible as the number of clients increases. A homogeneous experimental setup is, however, not very informative for real-life situations. In a heterogeneous experimental environment, the proposed method becomes increasingly robust as the environment grows more complex. Our method successfully handles the client-drift problem in diverse environments and is optimized for the local training problem.
Data Heterogeneity. Across the range of values tried, the proposed method reaches the test target accuracy faster than the baselines in almost all homogeneous and heterogeneous settings. As the number of clients increases, the performance gap between FedACADMM and the baselines gradually widens. Moreover, client sampling has a relatively small impact on the proposed method compared to the baseline experiments. Without the need to manually tune the number of local epochs, our method addresses statistical heterogeneity by constraining local updates to remain close to the global model.
5.4. Global Collecting Step Size
As noted in Section 5.2, although is desirable in most cases, it is not satisfactory for FMNIST in the 1000-client, non-IID setting. In the following experiments, we therefore study the convergence of the model when the global collecting (server) step size is tuned under the 1000-client, non-IID data distribution. As shown in (11), the global collecting step size can reduce the effects of data heterogeneity and of random algorithmic turbulence on the experimental results.
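For illustration, a server aggregation step with a global collecting step size (call it eta_g; the symbol is our assumption) can be sketched as a damped move from the current global model toward the average of the received client models.

```python
import torch

def server_update(z: torch.Tensor, client_models: torch.Tensor,
                  eta_g: float = 1.0) -> torch.Tensor:
    """Move the global model z toward the mean of the stacked client models,
    scaled by the global collecting step size eta_g (eta_g = 1 recovers plain
    averaging; smaller values damp the update)."""
    return z + eta_g * (client_models.mean(dim=0) - z)
```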
We therefore fix and vary . As shown in Figure 8a, gives better test accuracy and robust convergence with for FMNIST in the 1000-client, non-IID setting. When the initial parameter is small, reducing synchronously to a rational range promotes better convergence. To further validate this assumption, we conduct experiments with , which performed excellently in the previous experiment on varying . Figure 8b shows that is the best choice at : the test accuracy quickly reaches its optimal value and the convergence is robust and stable.
Optimizing the global collecting step size is critical for mitigating the data heterogeneity and random algorithmic turbulence that could impede experimental outcomes. An appropriately selected global collecting step size ensures efficient information exchange between the clients and the server, leading to improved aggregation at the server. An effective combination of and makes the algorithm more effective.
6. Conclusions
In this paper, we discuss a practical federated learning scenario characterized by a large number of clients with heterogeneous data and limited participation constraints, which makes satisfactory convergence and model performance challenging to achieve. To address this problem, we propose FedACADMM, a new federated optimization algorithm inspired by the adaptive consensus ADMM strategy from distributed optimization. The proposed approach minimizes the communication cost per round at the expense of storing additional variables at each client, leading to a significant speed improvement over state-of-the-art methods while reducing communication costs. We conducted extensive experiments on realistic datasets, and the results demonstrate that the algorithm is well suited to federated learning scenarios involving a large number of clients with heterogeneous data and limited participation constraints. Furthermore, the proposed algorithm is robust and resilient and reaches the target accuracy faster than the original methods while improving communication efficiency.
Federated learning has gained significant attention due to its privacy-preserving effect, achieved by uploading only the model without the raw data, and it is a rapidly growing research area. Recently, there has been increasing interest in characterizing the convergence rate of both centralized and distributed implementations of ADMM. In optimizing communication efficiency, the trade-off between reduced communication and accuracy is a critical focus of federated learning. The two most commonly studied solutions in federated optimization are bulk synchronous approaches and asynchronous approaches, where the delay is assumed to be bounded. In this paper, we only considered the bulk synchronous approach. However, in the future, we plan to consider more realistic device-centric communication schemes where each device can decide to communicate with the central server at its convenience, i.e., the asynchronous update approach.