Article

Communication-Efficient Federated Learning with Adaptive Consensus ADMM

School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5270; https://doi.org/10.3390/app13095270
Submission received: 4 March 2023 / Revised: 11 April 2023 / Accepted: 20 April 2023 / Published: 23 April 2023

Abstract: Federated learning (FL), a distributed learning paradigm, makes it possible to process the large, decentralized, and heterogeneous edge data generated by Internet of Things (IoT) devices. However, heterogeneity and high communication costs are two primary challenges that hinder the efficacy of federated learning. To overcome these challenges, we have designed an algorithm, FedACADMM, which applies the adaptive consensus alternating direction method of multipliers (ACADMM) to federated learning clients (i.e., the edge mobile devices) to tackle the heterogeneity problem in federated networks. Importantly, the cost per round of communication for FedACADMM remains consistent with FedAvg and FedProx without adding any extra workload. Furthermore, our experimental results demonstrate that FedACADMM outperforms baseline methods on a realistic set of federated datasets, displaying enhanced convergence robustness. Notably, in highly heterogeneous scenarios, FedACADMM exhibits significant stability and rapid convergence, requiring at least 75% fewer communication rounds to achieve the target accuracy.

1. Introduction

The Internet of Things (IoT) has permeated many aspects of our daily lives through the proliferation of smart services and applications empowered by artificial intelligence (AI) [1]. IHS Markit estimates that by 2030, the number of IoT devices will reach 125 billion [2]. This massive number of IoT devices generates a huge amount of data that industry aggregates in the cloud. Machine learning (ML) has evolved over the last few years into an effective methodology for handling complex data [3]. However, these IoT-generated data contain sensitive information about users or organizations, such as facial images, location information, health information, or fingerprints, and uploading this information could result in information leakage. The management and governance of data resources is the focus of many national and governmental regulatory efforts. Countries and regions have introduced laws such as the General Data Protection Regulation (GDPR) [4] in the EU, the Personal Data Protection Act (PDPA) [5] in Singapore, and the California Consumer Privacy Act (CCPA) [6] and the Consumer Privacy Bill of Rights (CPBR) [7] in the US to protect data privacy and security. This makes traditional centralized machine learning approaches, in which raw data generated on different devices or in different organizations are collected into a single server or a cluster with shared data storage, infeasible.
Federated Learning (FL) is a novel distributed machine learning framework that allows user devices to benefit from models trained on rich data without the need to centrally collect these data. This approach safeguards the privacy and security of users. Federated learning comprises two parts: the edge mobile device (also known as the client) and the central server. In federated learning, the learning task is accomplished by a decentralized federation of clients that participate in federated learning, and these devices are coordinated by a central server. Each dataset of a client is locally trained and is not uploaded to the central server. Simultaneously, each client computationally updates the global model downloaded from the central server and sends the updated model back to the central server. The basic process of FL is described in [8] and shown in Figure 1.
First, the model is trained locally, and then the parameters of the model are sent to the server. Finally, the cloud server returns the aggregated model to the local device. These steps are repeated throughout the federated learning process until convergence. FedAvg is a method that simply averages the locally uploaded models on the server [8]. In practice, FedAvg does not perform well when dealing with the complicated diversity of devices and data in the real world. Li et al. [9] proposed a federated optimization method, FedProx, to address heterogeneity in FL. FedProx improves stability by adding a proximal term to the local training subproblem on every device. Unfortunately, this method does not meet the needs of various applications and does not improve convergence speed. FedMA is a layer-wise FL algorithm for modern CNNs and LSTMs that appeals to Bayesian nonparametric methods to adapt to heterogeneity in the data [10]. However, this statistics-oriented aggregation method is susceptible to model poisoning attacks.
The primal-dual optimization method is a basic algorithm extensively employed in distributed learning. The alternating direction method of multipliers (ADMM) is a straightforward yet powerful approach that is particularly suitable for solving distributed convex optimization problems, especially in applied statistics and machine learning [11]. ADMM takes the form of a decomposition-coordination technique, in which solutions to small local subproblems are coordinated to find solutions to a large global problem. ADMM depends on a user-selected penalty parameter (step size). Theoretically, ADMM converges for any constant penalty parameter [12,13,14]; in practice, however, the efficiency of ADMM is highly dependent on the selection of this parameter [15] and can be enhanced using adaptive penalty selection methods [16]. In summary, given the effectiveness of ADMM in distributed optimization, we have chosen to apply it in the context of federated learning (FL).
Several challenges differentiate optimization problems in federated learning, also known as federated optimization, from traditional distributed optimization. These challenges include user privacy, costly communication, and system and statistical heterogeneity [9]. The user privacy challenge arises because, although federated learning protects data on devices by sharing model updates, communicating these updates throughout the training process can expose sensitive information to third parties and the central server [17]. The high cost of communication refers to the need for communication-efficient methods in the vast IoT network, where small messages or model updates must be sent iteratively as part of the training process, instead of sending the entire dataset over the network, to adapt the model to the data generated by devices in the federated network [18]. System heterogeneity refers to differences in the storage, computational, and communication capabilities of the IoT devices participating in federated learning. Statistical heterogeneity refers to differences in the way devices collect and generate data, as well as differences in the amount of data across devices. Clients participating in federated learning are typically highly heterogeneous in terms of both the size of their local datasets and computational speed [19]. Such challenges are common in real-world IoT application scenarios. For example, the IoT contains many edge devices with computing capability, such as wearable sensors, mobile phones, watches, and smart appliances; the data they generate are equally diverse, and the devices are widely distributed across the network. Although these devices want to participate in federated learning training, they do not want their privacy to be compromised.
Expensive communication represents a significant bottleneck in federated learning, where a considerable number of clients are widely distributed, necessitating local data retention. In actual IoT communication networks, communication can be several orders of magnitude slower than local computation [20,21], thereby highlighting the urgent need for the development of communication-efficient methods. This approach considers two key aspects: (i) reducing the total number of communication rounds and (ii) decreasing the size of the transmitted information per round.
In this work, we propose a new federated optimization algorithm that combines adaptive consensus ADMM. This method successfully overcomes heterogeneity and converges in fewer communication rounds.
The main contributions of this paper are summarized as follows.
  • We propose and demonstrate a communication-efficient federated optimization algorithm, which performs well in heterogeneous settings. The proposed method uses an adaptive consensus ADMM method to select local parameters. Instead of estimating a single global penalty parameter for all clients, different local penalty parameters are generated for each client based on the local gradient of the subproblems.
  • We propose and adopt a new method, similar to dual ascent and the method of multipliers, to update the client's objective function. By using this method, the gradients of the server and the individual clients are aligned.
  • By comparing with baselines in multiple popular datasets, we demonstrate that our federated optimization algorithm improves communication efficiency and is robust to client heterogeneity, especially in highly heterogeneous situations.
The rest of the paper is organized as follows. In Section 2, we provide an overview of related work. In Section 3, we review the core algorithmic idea of FL, and give the description of the proposed framework, FedACADMM. The convergence guarantee for FedACADMM is derived in Section 4 and the evaluation results are presented and analyzed in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

Since FL was first proposed [8], it has been aimed at solving the security problems caused by the diversity and complexity of communication networks and the difficulty of obtaining private data [22]. In recent years, research and discussion on federated optimization have been very active. Smith et al. [23] presented a federated multi-task learning approach for heterogeneity, MOCHA, which takes into account communication cost, stragglers, and fault tolerance. This method is not suitable for non-convex models. The objective function of a non-convex model is not convex; a convex function is one for which the line segment connecting any two points on its graph lies on or above the graph. Non-convex models can therefore have multiple local minima and maxima, and their optimization problems are more complex. In non-convex settings, FedAvg is shown to be effective but does not perform well in real-world situations with device heterogeneity [8]. FedProx modifies FedAvg by adding a weighted proximal term to the local objective function to account for heterogeneity [9]. It can be challenging to choose the weighted proximal term in FedProx for complicated objectives. In addition, subsequent research found that whether FedProx improves the speed of convergence remains unclear [24]. A communication-efficient method, q-FFL, is inspired by techniques for fair resource allocation in wireless networks; it alters the objective function of FedAvg by modifying the weights of different devices [25]. It performs well in both convex and non-convex models.
Numerous studies have discussed the role of ADMM in distributed machine learning [11]. The work in [26] derived R-ADMM as an application of the Peaceman–Rachford splitting and, utilizing results from operator theory, demonstrated almost sure convergence of the proposed algorithm under broad assumptions about the distribution of communication loss and node activation events. The authors of [27] derived a perturbation strategy for ADMM in which the perturbed term is connected with the penalty parameters; it was shown that this simultaneously increases utility and privacy.
Primal-dual methods are also widely used in FL. Based on the Peaceman–Rachford splitting approach, [28] developed FedSplit to solve the distributed convex problem and proved that FedSplit can solve FL problems under convexity alone, without any additional assumptions about the homogeneity of the system or data. FedSplit is only applicable to deterministic algorithms, excluding the stochastic approximate gradient update methods commonly utilized in practice. FedPD is a primal-dual-based meta-learning algorithm that uses the dual variable to capture the difference between local and server learning [29]. We note that all clients in FedPD participate in training continuously. Ref. [30] considered both synchronous and asynchronous settings and applied the non-convex Douglas–Rachford splitting method, a random block coordinate strategy, and asynchronous updating to solve the non-convex composite optimization problem in federated learning. FedDR and asyncFedDR, unlike FedSplit and FedPD, update only a subset of clients in every round and can update synchronously or asynchronously.
Our research was inspired by [31,32]. To increase the robustness of federated optimization, FedADMM [31] augments the local objective function with dual and augmented Lagrangian terms. It should be noted that this method works extremely effectively on large-scale networks with diverse data. Ref. [32] concentrated on a useful modification of the decomposition method of [16], allowing the penalty parameter to change automatically based on specific adaptive rules. Unlike the original method, this method automatically varies the penalty parameters within an acceptable range and constructs appropriate adaptive rules. Combining the above methods, we propose the following solutions: (i) select only a subset of clients to participate in each iteration of the local update, (ii) adaptively update the penalty parameters before the local update, and (iii) formulate a suitable self-adaptive rule.

3. FedACADMM: Adaptive Consensus ADMM for Federated Learning

In this section, we introduce the key components of federated learning and ADMM, and then we formally present our proposed framework, FedACADMM. All the basic symbols throughout this article are listed in Table 1.

3.1. Federated Optimization Problem

The classical federated learning problem involves constructing a single, global statistical model from numerous clients. The primary objective of federated learning is to enable the model to be learned using client-generated data while staying within the constraints of local storage and processing. During periodic updates, clients communicate only with the central server. The ultimate aim of federated learning is typically to minimize the following objective function:
$\min_{w} F(w), \quad \text{where } F(w) := \sum_{k=1}^{m} p_k F_k(w)$
Here, $m$ represents the total number of devices, $p_k \ge 0$, and $\sum_k p_k = 1$. Additionally, $F_k$ represents the local objective function for the $k$-th device, commonly defined as the empirical risk computed over the local data. The local objective function is used to optimize the model parameters by minimizing the empirical risk over the local data; the optimization is performed locally, and the results are then aggregated to obtain a global model that can be used for inference or further training. A typical local objective function used in federated learning is $F_k(w) = \frac{1}{n_k} \sum_{j_k=1}^{n_k} f_{j_k}(w; x_{j_k}, y_{j_k})$, where $n_k$ is the number of samples available locally. To adjust the weight of each device, a user-defined term $p_k$ is introduced, where $p_k$ is the weight assigned to device $k$ and can be set to $\frac{1}{n}$ or $\frac{n_k}{n}$ depending on the desired effect. The total number of samples across all devices is denoted as $n = \sum_k n_k$, which determines the scale of the learning task.
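For concreteness, the weighted objective $F(w) = \sum_k p_k F_k(w)$ with $p_k = n_k / n$ can be sketched in a few lines of Python; the helper name and its inputs are illustrative, not from the paper:

```python
# Hypothetical sketch of F(w) = sum_k p_k F_k(w) with p_k = n_k / n,
# i.e., each client's empirical risk weighted by its share of the samples.

def global_objective(local_losses, sample_counts):
    """Weighted average of local empirical risks.

    local_losses[k]  -- F_k(w), the empirical risk of client k at model w
    sample_counts[k] -- n_k, the number of samples held locally by client k
    """
    n = sum(sample_counts)  # total sample count n = sum_k n_k
    return sum((n_k / n) * F_k for F_k, n_k in zip(local_losses, sample_counts))

# Example: three clients with local risks 0.2, 0.4, 0.1 on 10, 30, 60 samples
F = global_objective([0.2, 0.4, 0.1], [10, 30, 60])  # → 0.2
```

With these example numbers, the weighted global objective evaluates to $0.1 \cdot 0.2 + 0.3 \cdot 0.4 + 0.6 \cdot 0.1 = 0.2$.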
A convex objective refers to an optimization problem in which the objective function is convex. Convex functions have the property that for any two points in their domain, the line segment connecting them lies on or above the graph of the function. Convex objectives are important in optimization because they have desirable mathematical properties that make them easier to optimize. For convex objectives, the distributed local-update primal-dual method exploits the duality structure to effectively decompose the global objective into subproblems that can be solved in parallel in each round of communication. Hence, in this article, we decompose the objective function of federated learning into subproblems using this method,
$\min_{x \in \mathbb{R}^d} F(x) := f(x) + g(x) = \frac{1}{n} \sum_{i=1}^{m} f_i(x) + g(x)$
where $m$ is the number of devices, $f_i$ is the weighted local training loss, which is assumed to be $L$-smooth and nonconvex, and $g$ is a proper, closed, and convex regularizer. The use of a regularizer $g$ has been emphasized in various works, including [33]. The goal of federated learning is to solve (2) for the model parameters using data distributed across the clients.
Let $\mathrm{dom}(F) := \{x \in \mathbb{R}^d : F(x) < +\infty\}$ be the domain of $F$ and $\partial g$ be the subdifferential of $g$ [34]. Because (2) is nonconvex, we only anticipate discovering a stationary point, which meets the following optimality condition.
Definition 1. 
If $0 \in \nabla f(x^*) + \partial g(x^*)$, then $x^*$ is called a first-order stationary point of (2).
Assumption 1. 
$\mathrm{dom}(F) \neq \emptyset$ and $F^* := \inf_{x \in \mathbb{R}^d} F(x) > -\infty$.
Assumption 2. 
Each local loss function $f_i(\cdot)$ is $L$-Lipschitz smooth, i.e., for all $x, y \in \mathbb{R}^d$, the following inequality holds:
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathrm{dom}(f_i)$
Federated optimization employs a local objective function based on individual device data as a proxy for the global objective function, thereby reducing communication.

3.2. Classical ADMM

ADMM is a powerful algorithm for solving convex optimization problems in statistics and machine learning. It decomposes the original problem into smaller subproblems that can be solved efficiently using dual ascent methods, and then combines the obtained solutions using a multiplier method to achieve a high convergence speed. The algorithm solves the equality-constrained convex optimization problem,
$\min_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} f(x) + g(z), \quad \text{s.t. } Ax + Bz = c$
where $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^m \to \mathbb{R}$ are convex functions, $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$. The augmented Lagrangian for problem (4) is
$L_\sigma(x, z, y) = f(x) + g(z) + \langle Ax + Bz - c,\, y \rangle + \frac{\sigma}{2} \|Ax + Bz - c\|^2,$
where σ > 0 is called the penalty parameter. Augmented Lagrangian methods can bring robustness to the dual ascent method, especially in the absence of assumptions such as strict convexity or finiteness of f.
The iteration of ADMM can be written in the form
$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, z^k, y^k), \qquad z^{k+1} := \operatorname*{argmin}_z L_\rho(x^{k+1}, z, y^k), \qquad y^{k+1} := y^k + \rho (A x^{k+1} + B z^{k+1} - c)$
Here, $\rho > 0$. Choosing an appropriate value of $\rho$ balances the rate of convergence against the feasibility of the iterates. The method, which consists of an $x$-minimization step, a $z$-minimization step, and a dual variable update, is very similar to dual ascent and the method of multipliers. The use of $\rho$ as the step size in the dual update ensures that the iterates $(x^{k+1}, z^{k+1})$ remain dual feasible and leads to optimality as $\|A x^{k+1} + B z^{k+1} - c\|$ gradually converges to zero.
ADMM combines the decomposability of dual ascent with the superior convergence properties of multiplier methods, making it an efficient and widely used optimization tool.
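As a minimal illustration of the iteration (6), consider the toy consensus problem $f(x) = (x - a)^2 / 2$, $g(z) = (z - b)^2 / 2$ with the constraint $x - z = 0$ (so $A = 1$, $B = -1$, $c = 0$), for which both minimization steps have closed forms. This is an illustrative sketch only, not code from the paper:

```python
# Toy ADMM on min (x-a)^2/2 + (z-b)^2/2 subject to x = z.
# The optimum of the combined objective is x = z = (a + b) / 2.

def toy_admm(a, b, rho=1.0, iters=100):
    x = z = y = 0.0
    for _ in range(iters):
        x = (a - y + rho * z) / (1 + rho)   # x-minimization of L_rho
        z = (b + y + rho * x) / (1 + rho)   # z-minimization of L_rho
        y = y + rho * (x - z)               # dual update with step size rho
    return x, z

x, z = toy_admm(2.0, 4.0)   # both converge to the midpoint 3.0
```

With $\rho = 1$ the residual shrinks geometrically, so after 100 iterations $x$ and $z$ agree with the midpoint to machine precision.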

3.3. FedACADMM Algorithm

We now turn to the description of a novel ADMM algorithm for federated optimization, and outline the FedACADMM method in Algorithm 1. Then, some discussions are offered to help throw light on the proposed method. The following straightforward yet crucial changes to FedACADMM lead to notable empirical advancements and also enable us to ensure convergence of the approach.
Algorithm 1 FedACADMM
Input: $T$, $m$, $\rho^t$, $S_t$, $\theta^t$, $\theta^{t-1}$, $E$, $\eta$
Output: $\theta^{t+1}$
1: for $t = 0, 1, \ldots, T-1$ do
2:   server selects a subset $S_t$ of the $m$ clients at random
3:   server broadcasts $\theta^t$ and $\theta^{t-1}$ to all chosen clients
4:   for each client $i \in S_t$ do
5:     check the convergence criterion and update the penalty parameter $\rho^{t+1}$ as in (8)
6:     update the client model $w_i^{t+1}$ for $E$ epochs as in (9)
7:     compute the difference $\Delta_i^t$ as in (10)
8:     send $\Delta_i^t$ back to the server
9:   end for
10:  server computes the averaged parameter $\theta^{t+1}$ as in (11)
11: end for
We begin by describing the iterative process in detail. At the start of the $t$-th round, a subset of clients (denoted $S_t$) is chosen, and each chosen client downloads the current server model $\theta^t$ and the previous server model $\theta^{t-1}$. Each client checks whether the convergence criterion $\|w_i - \theta^t\|^2 + \|\theta^t - \theta^{t-1}\|^2 < \varepsilon$ is satisfied, and the loop stops if it is. We define the generalized residuals:
$r^t := w_i^t - \theta^t, \qquad d^t := \theta^t - \theta^{t-1}$
Here, $r^t$ is the primal residual at iteration $t$, and $d^t$ is the dual residual at iteration $t$. We allow $\rho^{t+1}$ to fluctuate within an acceptable range from iteration to iteration:
$\rho^{t+1} = \begin{cases} \rho^t / (1 + \tau_i), & \text{if } \|r^t\| < \mu \|d^t\| \\ (1 + \tau_i)\, \rho^t, & \text{if } \mu \|r^t\| \ge \|d^t\| \\ \rho^t, & \text{otherwise} \end{cases}$
with $\mu \in (0, 1)$. Adapting the local penalty parameter for each client, rather than relying on a single global parameter for all clients, eliminates the need to estimate a global penalty parameter, speeds up convergence, and improves performance.
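The residual-balancing rule (8) can be sketched in Python as follows; we assume the growth case fires when $\mu \|r^t\| \ge \|d^t\|$, and the names update_penalty, tau_i, and mu, as well as the numeric defaults, are illustrative rather than the paper's values:

```python
# Sketch of the adaptive penalty update: shrink rho when the primal
# residual r_t is small relative to the dual residual d_t, grow it when
# it dominates, and leave it unchanged in between.

def update_penalty(rho, r_t, d_t, tau_i=0.1, mu=0.5):
    if r_t < mu * d_t:            # primal residual small: decrease rho
        return rho / (1 + tau_i)
    if mu * r_t >= d_t:           # primal residual dominates: increase rho
        return (1 + tau_i) * rho
    return rho                    # residuals balanced: keep rho unchanged
```

Note that for $\mu < 1$ the two conditions cannot hold simultaneously, so the rule is well defined; the `otherwise` branch covers the band where the residuals are comparable.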
To update the client model locally, each client minimizes the augmented Lagrangian:
$L_i(w_i^t, y_i^t, \theta^t) = f_i(w_i^t) + (y_i^t)^\top (w_i^t - \theta^t) + \frac{\rho^{t+1}}{2} \|w_i^t - \theta^t\|^2$
In practice, (9) does not need to be solved exactly; we can calculate the local model $w_i^t$ inexactly according to $\|\nabla_{w_i} L_i(w_i^t, y_i^t, \theta^t)\|^2 \le \varepsilon_i$. Here, $\varepsilon_i \ge 0$ represents the local accuracy. Using the augmented Lagrangian $L_i$ to train the local model parameters $w_i^t$ has proven effective, because a suitable value of the quadratic proximal coefficient $\rho$ can be determined that allows the client accuracy level $\varepsilon_i$ to be achieved easily.
The parameter difference uploaded to the server is:
$\Delta_i^t = \left( w_i^{t+1} + \frac{1}{\rho^{t+1}} y_i^{t+1} \right) - \left( w_i^{t} + \frac{1}{\rho^{t+1}} y_i^{t} \right)$
In our scenario, clients perform different workloads based on their local environment to accommodate system heterogeneity. In FedACADMM, each client holds a tuple $(w_i^t, y_i^t)$ but transmits only the augmented model difference $\Delta_i^t$. This means that the communication cost per round is the same as in FedAvg and FedProx.
After collecting update messages from chosen clients, the server refreshes the global model as follows:
$\theta^{t+1} = \theta^t + \frac{\eta}{|S_t|} \sum_{i \in S_t} \Delta_i^t$
Here, $\eta > 0$ is the global aggregation step size. The inclusion of $\eta$ mitigates the effects of data heterogeneity and random algorithmic turbulence on the experimental outcomes. The central server updates the current global model using the selected subset $S_t$ of the $m$ clients, which allows it to efficiently incorporate past information.
This completes one round of iterative updates.
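A complete round can be sketched on scalar models as follows. Local training is replaced by a few gradient steps on the augmented Lagrangian (9), the penalty is held fixed for brevity (FedACADMM adapts it per client as in (8)), and all names and constants are illustrative stand-ins rather than the paper's PyTorch implementation:

```python
import random

def local_update(w, y, theta, rho, grad_f, E=10, lr=0.1):
    """Approximately minimize the local augmented Lagrangian (9) by E gradient steps."""
    for _ in range(E):
        g = grad_f(w) + y + rho * (w - theta)  # gradient of L_i w.r.t. w
        w -= lr * g
    y = y + rho * (w - theta)                  # dual variable update
    return w, y

def server_round(theta, clients, rho=1.0, eta=1.0, frac=0.5):
    """One round: sample clients, run local updates, average the differences (11)."""
    chosen = random.sample(clients, max(1, int(frac * len(clients))))
    deltas = []
    for c in chosen:
        w_new, y_new = local_update(c["w"], c["y"], theta, rho, c["grad"])
        # augmented model difference as in (10): (w + y/rho) minus old value
        deltas.append((w_new + y_new / rho) - (c["w"] + c["y"] / rho))
        c["w"], c["y"] = w_new, y_new
    return theta + eta * sum(deltas) / len(deltas)
```

For instance, with two clients whose local losses are $(w - 2)^2 / 2$ and $(w - 4)^2 / 2$, repeated rounds drive the global model toward the consensus optimum $3$.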

4. Analysis

The goal of this section is to establish the global convergence for FedACADMM. We initially assert that the chosen stopping criterion is reasonable.
Let $\{(w_i^t, y_i^t, \theta^t)\}$ be the sequence created by our approach in the following analysis. We aim to prove that $\{(w_i^t, y_i^t, \theta^t)\}$ converges to a solution point $\{(w_i^*, y_i^*, \theta^*)\}$. We rewrite the task of FL as finding the optimal parameters $\{(w_i^*, y_i^*, \theta^*)\}$ that minimize the overall loss,
$\min_{w_i, \theta} \sum_{i=1}^{m} \alpha_i f_i(w_i), \quad \text{s.t. } w_i = \theta, \ i \in [m].$
We also denote $F(W) := \sum_{i=1}^{m} \alpha_i f_i(w_i)$, where $F(w, w, \ldots, w) = f(w)$. The solution point $\{(w_i^*, y_i^*, \theta^*)\}$ is a stationary point of (12) if it satisfies
$\alpha_i \nabla f_i(w_i^*) + y_i^* = 0, \ i \in [m]; \qquad w_i^* - \theta^* = 0, \ i \in [m]; \qquad \sum_{i=1}^{m} y_i^* = 0$
It is not difficult to prove that any locally optimal solution to problem (12) must satisfy (13). If each $f_i$ is a convex function, then the point satisfying (13) is the unique globally optimal solution. Moreover, a solution point $\{(w_i^*, y_i^*, \theta^*)\}$ of (12) implies
$\nabla f(w^*) = \sum_{i=1}^{m} \alpha_i \nabla f_i(w^*) = -\sum_{i=1}^{m} y_i^* = 0$
So, $w_i^*$ is also a solution point of
$w_i^* := \operatorname*{argmin}_{w \in \mathbb{R}^n} f(w)$
Let us consider the difference between $\|y^t - y^*\|^2$ and $\|y^{t+1} - y^*\|^2$. Combining $\|y^t - y^*\|^2$ with the third equality of (7) and performing some manipulations, we get:
$\|y^t - y^*\|^2 = \|y^{t+1} - y^*\|^2 - \gamma^2 (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 + 2 \gamma \rho^{t+1} (y^t - y^*)^\top (w_i^{t+1} - \theta^{t+1})$
We provide some lemmas to help prove the main convergence theory and the following lemma provides a desirable property of the last term of (16).
Lemma 1. 
For $\{(w_i^*, y_i^*, \theta^*)\}$, we have
$(y^t - y^*)^\top (w_i^{t+1} - \theta^{t+1}) \ge \rho^{t+1} \|w_i^{t+1} - \theta^{t+1}\|^2 + \rho^{t+1} (w_i^{t+1} - w_i^*)^\top (\theta^{t+1} - \theta^t)$
Proof of Lemma 1. 
For $\{(w_i^*, y_i^*, \theta^*)\}$, we have
$(w_i^{t+1} - w_i^*)^\top \big( \nabla f(w_i^*) - y_i^* \big) \ge 0$
$(\theta^{t+1} - \theta^*)^\top \big( \nabla g(\theta^*) - y_i^* \big) \ge 0$
$w_i^{t+1} - \theta^{t+1} = 0$
In the previous local update:
$(w_i - w_i^{t+1})^\top \big( \nabla f(w_i^{t+1}) - y_i^t - \rho^{t+1} (w_i^{t+1} - \theta^t) \big) \ge 0$
$(\theta - \theta^{t+1})^\top \big( \nabla g(\theta^{t+1}) - y_i^t - \rho^{t+1} (w_i^{t+1} - \theta^{t+1}) \big) \ge 0$
Similarly, we can get:
$(w_i^* - w_i^{t+1})^\top \big( \nabla f(w_i^{t+1}) - y_i^t - \rho^{t+1} (w_i^{t+1} - \theta^t) \big) \ge 0$
$(\theta^* - \theta^{t+1})^\top \big( \nabla g(\theta^{t+1}) - y_i^t - \rho^{t+1} (w_i^{t+1} - \theta^{t+1}) \big) \ge 0$
Adding (18) and (23) and using the monotonicity of the operator $\nabla f$, we obtain:
$(w_i^{t+1} - w_i^*)^\top \big( (y_i^t - y_i^*) - \rho^{t+1} (w_i^{t+1} - \theta^t) \big) \le 0$
The same can be obtained by adding (19) and (24):
$(\theta^{t+1} - \theta^*)^\top \big( (y_i^t - y_i^*) - \rho^{t+1} (w_i^{t+1} - \theta^{t+1}) \big) \le 0$
Combining (25) and (26) with (20), we have the statement of this lemma. □
So, we can obtain the inequality:
$\|y^t - y^*\|^2 - \|y^{t+1} - y^*\|^2 \ge \gamma (2 - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 + 2 \gamma \rho^{t+1} (w_i^{t+1} - w_i^*)^\top (w_i^{t+1} - \theta^{t+1})$
$\|y^t - y^*\|^2 + \gamma (\rho^t)^2 \|\theta^t - \theta^*\|^2 \ge \|y^{t+1} - y^*\|^2 + \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^*\|^2 + \gamma (2 - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 + \gamma (\rho^{t+1})^2 \|\theta^t - \theta^{t+1}\|^2 + 2 \gamma \rho^{t+1} (w_i^{t+1} - \theta^{t+1})^\top (\theta^{t+1} - \theta^t)$
Lemma 2. 
For $t \ge 2$, we have
$\rho^{t+1} (w_i^{t+1} - \theta^{t+1})^\top (\theta^{t+1} - \theta^t) \ge (1 - \gamma) \rho^t (w_i^t - \theta^t)^\top (\theta^{t+1} - \theta^t)$
We define $\gamma \in (0, (1 + \sqrt{5})/2)$, and therefore $1 + \gamma - \gamma^2 > 0$. Assume $T = 2 - \frac{1}{3}(1 + \gamma - \gamma^2)$; then we can obtain:
$2 - T = \frac{1}{3}(1 + \gamma - \gamma^2)$
$T - \gamma = \frac{1}{3}(\gamma^2 - 4\gamma + 5)$
Theorem 1. 
Let $(w_i^*, y^*, \theta^*)$ denote a solution to the variational inequality problem. Then there exists a constant $t_0 \ge 2$ such that for all $t \ge t_0$:
$\|y^{t+1} - y^*\|^2 + \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 \le (1 + \tau_i)^2 \big( \|y^t - y^*\|^2 + \gamma (\rho^t)^2 \|\theta^t - \theta^*\|^2 + \gamma (T - \gamma) (\rho^t)^2 \|w_i^t - \theta^t\|^2 \big) - \gamma (2 - T) (\rho^{t+1})^2 \big( \|w_i^{t+1} - \theta^{t+1}\|^2 + \|\theta^{t+1} - \theta^t\|^2 \big)$
Proof of Theorem 1. 
Combining (28), (29), and the Cauchy–Schwarz inequality, we obtain:
$\|y^{t+1} - y^*\|^2 + \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 \le \|y^t - y^*\|^2 + \gamma (\rho^t)^2 \|\theta^t - \theta^*\|^2 + \gamma (T - \gamma) (\rho^t)^2 \|w_i^t - \theta^t\|^2 - \gamma (2 - T) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 - \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^t\|^2 + \frac{\rho^t}{\rho^{t+1}} \cdot \frac{\gamma (1 - \gamma)^2}{T - \gamma} \|\theta^t - \theta^{t+1}\|^2$
Assuming
$\frac{1}{1 + \tau_i} \rho^t \le \rho^{t+1} \le (1 + \tau_i) \rho^t$
and $\lim_{k \to \infty} \tau_k = 0$, after a series of transformations we obtain Theorem 1. □
Theorem 3. 
Let $(w_i^t, y^t, \theta^t)$ be the sequence produced by our method. Then we obtain:
$\lim_{t \to \infty} \big( \|w_i^t - \theta^t\|^2 + \|\theta^t - \theta^{t-1}\|^2 \big) = 0$
Proof of Theorem 3. 
Since $\tau_i \ge 0$ and $\sum_{i=0}^{\infty} \tau_i < \infty$, both $\prod_{i=1}^{\infty} (1 + \tau_i)$ and $\sum_{i=1}^{\infty} \tau_i^2$ are bounded. Suppose
$C_s = \sum_{i=1}^{\infty} (2 \tau_i + \tau_i^2), \qquad C_p = \prod_{i=1}^{\infty} (1 + \tau_i)$
It follows from Theorem 1 that, for $t \ge t_0$,
$\|y^{t+1} - y^*\|^2 + \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 \le \Big( \prod_{s=t_0}^{t} (1 + \tau_s) \Big)^2 \big( \|y^{t_0} - y^*\|^2 + \gamma (\rho^{t_0})^2 \|\theta^{t_0} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t_0})^2 \|w_i^{t_0} - \theta^{t_0}\|^2 \big) \le C_p^2 \big( \|y^{t_0} - y^*\|^2 + \gamma (\rho^{t_0})^2 \|\theta^{t_0} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t_0})^2 \|w_i^{t_0} - \theta^{t_0}\|^2 \big)$
As a result, there exists a constant $C > 0$ such that
$\|y^{t+1} - y^*\|^2 + \gamma (\rho^{t+1})^2 \|\theta^{t+1} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t+1})^2 \|w_i^{t+1} - \theta^{t+1}\|^2 \le C, \quad \forall t \ge 0$
Combining (32) and (37), we obtain
$\sum_{t=t_0}^{\infty} \gamma (2 - T) (\rho^{t+1})^2 \big( \|w_i^{t+1} - \theta^{t+1}\|^2 + \|\theta^{t+1} - \theta^t\|^2 \big) \le \|y^{t_0} - y^*\|^2 + \gamma (\rho^{t_0})^2 \|\theta^{t_0} - \theta^*\|^2 + \gamma (T - \gamma) (\rho^{t_0})^2 \|w_i^{t_0} - \theta^{t_0}\|^2 + \sum_{t=t_0}^{\infty} (2 \tau_t + \tau_t^2) \big( \|y^t - y^*\|^2 + \gamma (\rho^t)^2 \|\theta^t - \theta^*\|^2 + \gamma (T - \gamma) (\rho^t)^2 \|w_i^t - \theta^t\|^2 \big) \le (1 + C_s) C$
and then
$\lim_{t \to \infty} \gamma (2 - T) (\rho^{t+1})^2 \big( \|w_i^{t+1} - \theta^{t+1}\|^2 + \|\theta^{t+1} - \theta^t\|^2 \big) = 0$
This means that the sequence $\{\rho^t\}$ is bounded above and bounded away from zero; that is,
$\inf \{\rho^t\}_{t \ge 1} := B_L > 0, \qquad \sup \{\rho^t\}_{t \ge 1} := B_U < \infty,$
and from (30), Theorem 3 is proved. □
If the sequence $\{\rho^t\}$ is generated by (8), then it satisfies the following two conditions:
Condition C1. $\inf \{\rho^t\}_{t \ge 1} = B_L > 0$ and $\sum_{t=1}^{\infty} \eta_t^2 < +\infty$, where
$\eta_t^2 = \begin{cases} (\rho^{t+1} / \rho^t)^2 - 1, & \text{if } t \in T_I \\ 0, & \text{otherwise} \end{cases}, \qquad T_I := \{ t \mid \rho^{t+1} > \rho^t \}.$
Condition C2. $\sum_{t=1}^{\infty} \theta_t^2 < +\infty$, where
$\theta_t^2 = \begin{cases} (\rho^t / \rho^{t+1})^2 - 1, & \text{if } t \in T_D \\ 0, & \text{otherwise} \end{cases}, \qquad T_D := \{ t \mid \rho^{t+1} < \rho^t \}.$
In this case, from (8) we can infer that
$\eta_t^2 = 2 \tau_t + \tau_t^2, \qquad \theta_t^2 = 2 \tau_t + \tau_t^2.$
Thus,
$\sum_{t=1}^{\infty} \eta_t^2 < \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \theta_t^2 < \infty.$
Therefore, the proposed method using the parameter-adjusting strategy in (8) is convergent. The two sequences converge to the optimal function value of (15), namely,
$\lim_{t \to \infty} F(W^t) = \lim_{t \to \infty} f(w^t) = f^*$
Any accumulation point $(w_i, y, \theta)$ of the sequence $\{(w_i^t, y^t, \theta^t)\}$ is an optimal solution $(w_i^*, y^*, \theta^*)$ to (12), where $w$ is an optimal solution to (15). Our algorithm has been shown to achieve both global and linear convergence, under the assumption that the Lagrange multipliers satisfy specific initialization conditions and provided that the local gradients are Lipschitz continuous.

5. Experiments

This section presents multiple numerical experiments to evaluate the performance of FedACADMM. We first describe our experimental setup, including datasets, models, and data distribution, in Section 5.1. We then demonstrate the effect of modifying the penalty parameters and compare our proposed method with several baseline methods in Section 5.2 and Section 5.3.

5.1. Experimental Setup

Datasets and Models. In our study, we adopted the standard federated configuration where every client owns their local data and communicates with the central server to transmit and receive information. Our experiments involved handwriting recognition and image classification using three well-known benchmark datasets: MNIST [35], Fashion MNIST (FMNIST) [36], and CIFAR-10 [37]. MNIST consists of handwritten digits from zero to nine. FMNIST includes images of 10 different categories of clothing, such as T-shirts, pants, and sneakers. CIFAR-10 is a dataset composed of 10 different types of objects, including aircraft, cars, dogs, and so on. To conduct these experiments, we utilized two CNN models, each comprising two convolutional layers, two fully connected layers, and a linear transformation layer. The detailed architecture of the models is provided in Table 2.
Data Distribution. To investigate federated optimization and conduct comparative experiments, we adopted the data distribution settings used in the FedAvg approach. Specifically, we conducted experiments using two different data division methods: IID (homogeneous) and non-IID (heterogeneous). Under the IID method, clients receive an equal distribution of data. Under the non-IID method, we arranged the training data by labels and then distributed them evenly among the shards. Each client was randomly and evenly assigned two shards.
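The shard-based non-IID split described above can be sketched as follows (a minimal illustration of the FedAvg-style partition; the function and parameter names are ours, not the paper's):

```python
import random

def partition_noniid(labels, num_clients, shards_per_client=2):
    """FedAvg-style non-IID split: sort sample indices by label, cut them
    into equal shards, and assign each client `shards_per_client` shards
    at random."""
    num_shards = num_clients * shards_per_client
    shard_size = len(labels) // num_shards
    # Stable sort by label so each shard covers only a few classes.
    sorted_idx = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [sorted_idx[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    random.shuffle(shards)
    return {c: [i for shard in shards[c * shards_per_client:
                                      (c + 1) * shards_per_client]
                for i in shard]
            for c in range(num_clients)}
```

With, say, 200 samples over 10 labels and 10 clients, each client receives two 10-sample shards and therefore sees at most two classes, which is what makes this partition heterogeneous.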
Validation metrics. To assess the generalization performance of our approaches, we evaluated on the entire test set of each benchmark federated dataset. Because convergence speed and final performance are both important indicators in federated learning, we computed the performance attained after a specified number of rounds and the number of communication rounds required to reach the target test accuracy. Typically, the ADMM approach converges to reasonable accuracy within a few tens of iterations, so after a preliminary estimate we set the experiment length to one hundred rounds for all datasets. The evaluation target accuracies are displayed in Table 3. For techniques that did not achieve the target accuracy within the maximum number of communication rounds (t > 100), we append a plus sign (+) to the communication round. Additionally, we report the communication-round savings of FedACADMM compared to the classical FedAvg algorithm.
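The rounds-to-target metric, including the "+" convention for runs that never reach the target, can be sketched as follows (helper name is ours):

```python
def rounds_to_target(accuracies, target, max_rounds=100):
    """First (1-indexed) communication round whose test accuracy reaches
    `target`; runs that never reach it within `max_rounds` are reported
    with a trailing '+', as in Table 4."""
    for t, acc in enumerate(accuracies[:max_rounds], start=1):
        if acc >= target:
            return str(t)
    return f"{max_rounds}+"
```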
Implementation Details. We implemented FedACADMM in PyTorch, simulating a federated learning network with one server and m clients. To enable comparisons with other baselines, we reused the implementations of FedAvg and FedProx. All numerical experiments were conducted on a cloud computing platform, using a virtual machine with 4 vCPUs, 16 GB of memory, and an NVIDIA RTX A4000 GPU, running Ubuntu 18.04 LTS with PyTorch 1.8.1 as the deep learning framework.
We set the global learning rate to 1 for all methods and used a client subset size equal to 10% of the total number of clients ( C = 0.1 ). For all algorithms, we used the SGD optimizer with a learning rate of 0.1 for local updates.
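Per-round client sampling with C = 0.1 can be sketched as follows (a generic illustration; names are ours):

```python
import random

def sample_clients(num_clients, frac=0.1, rng=None):
    """Uniformly sample a fraction C of clients without replacement
    for one communication round."""
    rng = rng or random.Random()
    k = max(1, int(frac * num_clients))
    return rng.sample(range(num_clients), k)
```

With 1000 clients and C = 0.1, each round thus involves 100 distinct clients.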

5.2. Choosing the Initial Penalty Parameter

When designing the evaluation method for FedACADMM, the first consideration was determining an appropriate initial self-adaptive penalty parameter, denoted ρ_0, which is crucial for (8). To balance the initial residual ‖w_i^t − θ^t‖ and the pairwise residual ‖θ^t − θ^{t−1}‖, we employed an empirical rule-based heuristic for this hyperparameter, because if either of the two terms approaches zero too quickly, the algorithm converges slowly. A second consideration in proposing the rule was to avoid additional computation. In practice, one could run FedACADMM with various values of ρ_0 in parallel to obtain multiple final global models and then select among them based on performance (e.g., accuracy) on validation data. Instead, we narrowed the search space and selected ρ_0 from a limited set of candidates, namely {10^{−3}, 10^{−2}, 0.1, 1, 10}, following the experimental setups of [9,32].
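Rule (8) itself is not reproduced in this excerpt; adaptive-penalty ADMM schemes of this kind build on the classical residual-balancing heuristic [11,16], which can be sketched as follows (the constants mu and tau are illustrative defaults, not the values used by (8)):

```python
def update_penalty(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual-balancing heuristic: grow rho when the primal residual
    dominates, shrink it when the dual residual dominates, otherwise
    leave it unchanged."""
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho
```

Keeping the two residuals of comparable magnitude is exactly the balance between ‖w_i^t − θ^t‖ and ‖θ^t − θ^{t−1}‖ discussed above.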
Figure 2 displays test accuracy and training loss curves for various initial penalty values ρ_0. For clarity, we omit the curves associated with the poor results at ρ_0 = 10. Our experiments demonstrate that FedACADMM is robust under both homogeneous and heterogeneous data distributions. Furthermore, FedACADMM performs well within a limited number of communication rounds across datasets and is not highly sensitive to the choice of the initial parameter ρ_0. FedACADMM performs best when the initial value is ρ_0 = 10^{−2}.
Although ρ_0 = 10^{−2} is not as good as ρ_0 = 10^{−3} in most cases, its performance is very stable, without large deviations, and it maintains robust convergence. We therefore chose ρ_0 = 10^{−2} as the initial penalty parameter for the comparison with the baselines. With ρ_0 = 10^{−2}, the global model effectively integrates local data and achieves robust, fast updates, maintaining a good balance between ‖w_i^t − θ^t‖ and ‖θ^t − θ^{t−1}‖. Figure 2d shows that ρ_0 = 10^{−3} stalls, which is why we did not choose it as the initial value. The chosen penalty parameter affects how quickly ADMM converges in practice.
With an appropriate initial penalty parameter, the proposed algorithm is robust and achieves good results in both homogeneous and heterogeneous situations. In practical applications, the convergence speed of FedACADMM depends on this parameter: for instance, with ρ_0 = 10^3, 10^2, or 10, convergence is usually very slow. Nevertheless, FedACADMM is robust to the choice of ρ_0: with appropriate values, it achieves excellent model training results and fast convergence in both IID and non-IID data distribution cases.

5.3. Comparison with Other Baselines

Comparing the proposed algorithm with existing mainstream federated optimization methods is crucial for assessing model performance. However, given the constraints of current communication channels, reducing communication overhead remains a challenge for FL. Therefore, the proposed algorithm should minimize the number of communication rounds while remaining robust to heterogeneity.
In the following comparative experiments, we contrast the performance of FedACADMM against FedAvg and FedProx across three benchmark datasets, two data distribution methods, and a range of client population sizes. Table 4 reports the number of communication rounds each of the three algorithms required to attain the target accuracy under these conditions, and Figure 3 provides a graphical representation that facilitates interpretation of the outcomes. The statistics of communication reduction by FedACADMM over FedAvg are shown in Table 5. In most circumstances, FedACADMM requires fewer rounds than the baselines to attain the target test accuracy. To communicate our findings visually, we present a selection of representative results in Figure 4. Figure 4a–d shows the results for each algorithm on the FMNIST dataset with the IID data distribution for different numbers of clients: all three algorithms perform well under homogeneous conditions, with the test accuracy of FedACADMM slightly better than the other two. Figure 4i–l shows the results for each algorithm on the CIFAR-10 dataset with the non-IID data distribution for different numbers of clients.
When the number of clients is small, the test accuracy of FedACADMM is almost the same as that of the other two algorithms, but its advantage becomes obvious as the number of clients increases, suggesting that it is well suited to large-scale settings. Figure 4e–h shows the results for each algorithm on the FMNIST dataset with the non-IID data distribution; we do not discuss them further because they are similar to those for CIFAR-10. Figure 5 shows the training loss corresponding to the test accuracy in Figure 4: the convergence performance of FedACADMM matches the other two algorithms in most heterogeneous cases and exceeds them in homogeneous cases. Figure 6 selectively displays the test losses for some of the experiments shown in Figure 4. Figure 7 displays the confusion matrix, which indicates per-class prediction performance; as shown, the proposed model achieves high accuracy on each dataset.
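The communication-round savings reported in Table 5 are the relative reduction against FedAvg, which can be computed as follows (helper name is ours):

```python
def comm_savings(fedavg_rounds, ours_rounds):
    """Percentage reduction in communication rounds relative to FedAvg,
    as reported in Table 5; a negative value means more rounds than
    FedAvg."""
    return 100.0 * (1.0 - ours_rounds / fedavg_rounds)
```

For example, for MNIST with the non-IID distribution and 1000 clients, Table 4 reports 65 rounds for FedAvg versus 10 for FedACADMM, giving the 84.6% entry in Table 5; at 100 clients, 34 rounds versus FedAvg's 32 gives the −6.3% entry.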
Through the detailed analysis above, we can see some advantages of FedACADMM:
Communication efficiency. Under the homogeneous data distribution, almost all methods quickly reach the target test accuracy, and increasing the number of clients highlights the advantages of our proposed method. However, a homogeneous experimental setup is not representative of real-life situations. In a heterogeneous environment, the proposed method exhibits increasing robustness as the environment grows more complex: it successfully handles the client-drift problem in diverse environments and is optimized for the local training problem.
Data Heterogeneity. Across the range of values tried, the proposed method reaches the target test accuracy faster than the baselines in almost all homogeneous and heterogeneous settings. As the number of clients increases, the performance gap between FedACADMM and the baselines gradually widens. Moreover, client sampling has a relatively small impact on the proposed method compared to the baselines. Without the need to manually adjust the number of local epochs, our method addresses statistical heterogeneity by constraining local updates to remain closer to the global model.

5.4. Global Collecting Step Size

As noted in Section 5.2, although ρ_0 = 10^{−3} is desirable in most cases, it is not satisfactory for FMNIST in the 1000-client, non-IID data distribution setting. In the following experiments, we study model convergence when tuning the server's global collecting step size under this setting. As shown in (11), the global collecting step size η can mitigate the effects of data heterogeneity and random algorithmic fluctuations on experimental results.
Thus, we fix ρ_0 = 10^{−3} and vary η ∈ {0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 1}. As shown in Figure 8a, η ∈ (0.4, 0.6) gives better test accuracy and robust convergence with ρ_0 = 10^{−3} for FMNIST in the 1000-client, non-IID setting. When the initial parameter ρ_0 is small, synchronously reducing η to a rational range promotes better convergence. To further validate this assumption, we conducted experiments with ρ_0 = 10^{−2}, which performed well in the earlier experiments varying ρ_0. Figure 8b shows that η = 1 is the best choice at ρ_0 = 10^{−2}: the test accuracy quickly reaches its optimum and the convergence is robust and stable.
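The server-side role of η can be sketched generically as follows (the exact form of (11) is in the full paper; this illustrates how the global collecting step size scales the aggregated client updates, with names of our choosing):

```python
def server_aggregate(theta, deltas, eta=1.0):
    """Move the global model theta by the average of the clients' local
    updates delta_i, scaled by the global collecting step size eta."""
    n = len(deltas)
    avg = [sum(d[j] for d in deltas) / n for j in range(len(theta))]
    return [theta[j] + eta * avg[j] for j in range(len(theta))]
```

Smaller η damps the influence of any single round's (possibly heterogeneous) updates, which is why reducing it can stabilize training when ρ_0 is small.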
Optimizing the global collecting step size η is critical for mitigating data heterogeneity and stochastic fluctuations that could distort experimental outcomes. An appropriately selected η ensures efficient information exchange between clients and the server, leading to improved aggregation at the server. An effective combination of η and ρ_0 makes the algorithm more effective in practice.

6. Conclusions

In this paper, we address a practical federated learning scenario characterized by a large number of clients with heterogeneous data and limited participation constraints, which makes satisfactory convergence and model performance difficult to achieve. To address this problem, we propose FedACADMM, a new federated objective optimization algorithm inspired by the adaptive consensus ADMM strategy in distributed optimization. Our approach keeps the communication cost per round to a minimum, requiring only that each client store a small number of additional local variables (such as its dual variable). This leads to a significant speed improvement compared to state-of-the-art methods while reducing communication costs. We conducted extensive experiments on realistic datasets, and the results demonstrate that the algorithm is suitable for federated learning scenarios involving many clients with heterogeneous data and limited participation constraints. Furthermore, the proposed algorithm exhibits good robustness and resilience and reaches the target accuracy faster than the original methods while improving communication efficiency.
Federated learning has gained significant attention due to its privacy-preserving effect, achieved by uploading only the model without the raw data, and it is a rapidly growing research area. Recently, there has been increasing interest in characterizing the convergence rate of both centralized and distributed implementations of ADMM. In optimizing communication efficiency, the trade-off between reduced communication and accuracy is a critical focus of federated learning. The two most commonly studied solutions in federated optimization are bulk synchronous approaches and asynchronous approaches, where the delay is assumed to be bounded. In this paper, we only considered the bulk synchronous approach. However, in the future, we plan to consider more realistic device-centric communication schemes where each device can decide to communicate with the central server at its convenience, i.e., the asynchronous update approach.

Author Contributions

Conceptualization and writing—original draft preparation, S.H.; supervision, J.Z.; writing—review and editing, M.F. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 61761004.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in references [36,37].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Khan, L.U.; Saad, W.; Han, Z.; Hossain, E.; Hong, C.S. Federated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Commun. Surv. Tutor. 2021, 23, 1759–1799. [Google Scholar] [CrossRef]
  2. Number of Connected IoT Devices Will Surge to 125 Billion by 2030, IHS Markit Says. Available online: https://sst.semiconductor-digest.com/2017/10/number-of-connected-iot-devices-will-surge-to-125-billion-by-2030/ (accessed on 24 October 2017).
  3. Abdalzaher, M.S.; Elwekeil, M.; Wang, T.; Zhang, S. A deep autoencoder trust model for mitigating jamming attack in IoT assisted by cognitive radio. IEEE Syst. J. 2021, 16, 3635–3645. [Google Scholar] [CrossRef]
  4. Yuan, J.; Yu, S. Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Trans. Parallel Distrib. Syst. 2013, 25, 212–221. [Google Scholar] [CrossRef]
  5. Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Privacy aware learning. J. ACM (JACM) 2014, 61, 1–57. [Google Scholar] [CrossRef]
  6. Charles, Z.; Papailiopoulos, D. Gradient coding using the stochastic block model. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1998–2002. [Google Scholar]
  7. He, L.; Bian, A.; Jaggi, M. Cola: Decentralized linear learning. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  8. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics (PMLR 2017), Sydney, Australia, 6–11 August 2017; pp. 1273–1282. [Google Scholar]
  9. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  10. Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; Khazaeni, Y. Federated Learning with Matched Averaging. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  11. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  12. He, B.; Yuan, X. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 2012, 50, 700–709. [Google Scholar] [CrossRef]
  13. Ouyang, H.; He, N.; Tran, L.; Gray, A. Stochastic Alternating Direction Method of Multipliers. In Proceedings of the International Conference on Machine Learning (PMLR), Atlanta, GA, USA, 16–21 June 2013; pp. 80–88. [Google Scholar]
  14. Xu, Z.; Taylor, G.; Li, H.; Figueiredo, M.A.; Yuan, X.; Goldstein, T. Adaptive consensus ADMM for distributed optimization. In Proceedings of the International Conference on Machine Learning (PMLR 2017), Sydney, Australia, 6–11 August 2017; pp. 3841–3850. [Google Scholar]
  15. Song, C.; Yoon, S.; Pavlovic, V. Fast ADMM algorithm for distributed optimization with adaptive penalty. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  16. He, B.; Yang, H.; Wang, S. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. J. Optim. Theory Appl. 2000, 106, 337–356. [Google Scholar]
  17. McMahan, H.B.; Ramage, D.; Talwar, K.; Zhang, L. Learning differentially private recurrent language models. arXiv 2017, arXiv:1710.06963. [Google Scholar]
  18. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  19. Dennis, D.K.; Li, T.; Smith, V. Heterogeneity for the win: One-shot federated clustering. In Proceedings of the International Conference on Machine Learning (PMLR 2021), Virtual, 18–24 July 2021; pp. 2611–2620. [Google Scholar]
  20. Huang, J.; Qian, F.; Guo, Y.; Zhou, Y.; Xu, Q.; Mao, Z.M.; Sen, S.; Spatscheck, O. An in-depth study of LTE: Effect of network protocol and application behavior on performance. ACM SIGCOMM Comput. Commun. Rev. 2013, 43, 363–374. [Google Scholar] [CrossRef]
  21. Van Berkel, C. Multi-core for mobile phones. In Proceedings of the 2009 Design, Automation & Test in Europe Conference & Exhibition, Nice, France, 20–24 April 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1260–1265. [Google Scholar]
  22. Wahab, O.A.; Mourad, A.; Otrok, H.; Taleb, T. Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems. IEEE Commun. Surv. Tutor. 2021, 23, 1342–1397. [Google Scholar] [CrossRef]
  23. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  24. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  25. Li, T.; Sanjabi, M.; Beirami, A.; Smith, V. Fair resource allocation in federated learning. arXiv 2019, arXiv:1905.10497. [Google Scholar]
  26. Bastianello, N.; Carli, R.; Schenato, L.; Todescato, M. Asynchronous distributed optimization over lossy networks via relaxed ADMM: Stability and linear convergence. IEEE Trans. Autom. Control 2020, 66, 2620–2635. [Google Scholar] [CrossRef]
  27. Zhang, X.; Khalili, M.M.; Liu, M. Improving the privacy and accuracy of ADMM-based distributed algorithms. In Proceedings of the International Conference on Machine Learning (PMLR 2018), Vienna, Austria, 25–31 July 2018; pp. 5796–5805. [Google Scholar]
  28. Pathak, R.; Wainwright, M.J. FedSplit: An algorithmic framework for fast federated optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7057–7066. [Google Scholar]
  29. Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; Liu, Y. Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv 2020, arXiv:2005.11418. [Google Scholar]
  30. Tran Dinh, Q.; Pham, N.H.; Phan, D.; Nguyen, L. FedDR–randomized Douglas-Rachford splitting algorithms for nonconvex federated composite optimization. Adv. Neural Inf. Process. Syst. 2021, 34, 30326–30338. [Google Scholar]
  31. Gong, Y.; Li, Y.; Freris, N.M. FedADMM: A Robust Federated Deep Learning Framework with Adaptivity to System Heterogeneity. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2575–2587. [Google Scholar]
  32. Wang, S.; Liao, L. Decomposition method with a variable parameter for a class of monotone variational inequality problems. J. Optim. Theory Appl. 2001, 109, 415–429. [Google Scholar] [CrossRef]
  33. Yuan, H.; Zaheer, M.; Reddi, S. Federated composite optimization. In Proceedings of the International Conference on Machine Learning (PMLR 2021), Virtual, 18–24 July 2021; pp. 12253–12266. [Google Scholar]
  34. Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: Berlin/Heidelberg, Germany, 2011; Volume 408. [Google Scholar]
  35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  36. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
  37. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 3 March 2023).
Figure 1. A framework for FL. The IoT devices, which act as the clients in federation learning, send the trained model to the central server, rather than the raw data. In this setup, remote devices communicate with the central server periodically to learn the global model. The steps performed in each round of communication are illustrated in the figure.
Figure 2. Choice of the initial variable parameter ρ 0 .
Figure 3. The number of communication rounds required by the three algorithms to achieve the set target accuracy under different experimental settings.
Figure 4. Comparison of the performance of the proposed FedACADMM algorithm with FedAvg and FedProx on three datasets with different data distributions between different clients.
Figure 5. Training loss of Figure 4.
Figure 6. The test losses for each dataset with 1000 clients, corresponding to the experiments shown in Figure 4.
Figure 7. Confusion matrix using FedACADMM.
Figure 8. (a) Change in the global step-size η at the initial variable parameter ρ_0 = 10^{−3}. To better show convergence, some training loss curves with poor performance are not shown. Additionally, we use the performance of ρ_0 = 10^{−2}, η = 1 as a reference line. All experiments were performed in the 1000-client, non-IID data distribution setting on FMNIST. (b) Change in the global step-size η at the initial variable parameter ρ_0 = 10^{−2}. All experiments were performed in the 1000-client, non-IID data distribution setting on FMNIST.
Table 1. Symbols and Definitions.
Symbol        Definition
θ^t           The global model at the t-th round
w_i           The local model computed by client i
S^t           The subset of clients chosen at the t-th round
ε             The stopping criterion
ρ_t           The self-adaptive penalty parameter at the t-th round
y_i           The local dual variable kept by client i
δ_i^t         The local update at the t-th round by client i
η             The global step-size
t             The t-th round of global iteration
τ_i           Hyperparameter regulating the range of ρ_{t+1} variations
Table 2. Datasets and models.
Datasets     Task                       # Parameters    Model
MNIST        Handwritten recognition    1,663,370       2-layer CNN + FC
CIFAR-10     Image classification       1,105,098       2-layer CNN + FC
FMNIST       Image classification       1,663,370       2-layer CNN + FC
Table 3. Target Accuracy. For better representation of figures and tables, we set different target accuracy for different data distributions in the same dataset.
             IID     Non-IID
MNIST        98%     96%
CIFAR-10     50%     45%
FMNIST       85%     80%
Table 4. Comparison of the performance of the proposed FedACADMM algorithm with FedAvg and FedProx on three datasets with different data distributions between different clients. In the header of this table, I represents IID and N represents non-IID. The best performance in each column is denoted in bold.
                 Clients:   100    200    500    1000
MNIST (I)      FedACADMM      9     13     19      15
               FedAvg        18     38     85    100+
               FedProx       27     55   100+    100+
CIFAR-10 (I)   FedACADMM      4      9     15      21
               FedAvg        13     29     80    100+
               FedProx       19     39     94    100+
FMNIST (I)     FedACADMM      3      5      7      16
               FedAvg         8     16     46      97
               FedProx       16     22     52    100+
MNIST (N)      FedACADMM     34     22     14      10
               FedAvg        32     43     49      65
               FedProx       38     51     64      79
CIFAR-10 (N)   FedACADMM     12      9     12      13
               FedAvg        28     23     49      79
               FedProx       29     23     53      84
FMNIST (N)     FedACADMM     16     15     15      13
               FedAvg        20     32     38      51
               FedProx       29     34     49      61
Table 5. Reduced communication rounds for FedACADMM compared to FedAvg in a heterogeneous experimental setup.
             100      200      500      1000
MNIST        −6.3%    48.8%    71.4%    84.6%
CIFAR-10     57.1%    60.9%    75.5%    83.5%
FMNIST       20%      53.1%    60.5%    74.5%

Share and Cite

He, S.; Zheng, J.; Feng, M.; Chen, Y. Communication-Efficient Federated Learning with Adaptive Consensus ADMM. Appl. Sci. 2023, 13, 5270. https://doi.org/10.3390/app13095270
