ADMM-Based Differential Privacy Learning for Penalized Quantile Regression on Distributed Functional Data

Zhou, Xingcai; Xiang, Yu

doi:10.3390/math10162954

Open AccessArticle

ADMM-Based Differential Privacy Learning for Penalized Quantile Regression on Distributed Functional Data

by

Xingcai Zhou

^*

and

Yu Xiang

School of Statistics and Data Science, Nanjing Audit University, Nanjing 211085, China

^*

Author to whom correspondence should be addressed.

Mathematics 2022, 10(16), 2954; https://doi.org/10.3390/math10162954

Submission received: 25 July 2022 / Revised: 10 August 2022 / Accepted: 12 August 2022 / Published: 16 August 2022

(This article belongs to the Special Issue Statistical Modeling for Analyzing Data with Complex Structures)

Download

Browse Figures

Versions Notes

Abstract

:

Alternating Direction Method of Multipliers (ADMM) is a widely used machine learning tool in distributed environments. In the paper, we propose an ADMM-based differential privacy learning algorithm (FDP-ADMM) on penalized quantile regression for distributed functional data. The FDP-ADMM algorithm can resist adversary attacks to avoid the possible privacy leakage in distributed networks, which is designed by functional principal analysis, an approximate augmented Lagrange function, ADMM algorithm, and privacy policy via Gaussian mechanism with time-varying variance. It is also a noise-resilient, convergent, and computationally effective distributed learning algorithm, even if for high privacy protection. The theoretical analysis on privacy and convergence guarantees is derived and offers a privacy–utility trade-off: a weaker privacy guarantee would result in better utility. The evaluations on simulation-distributed functional datasets have demonstrated the effectiveness of the FDP-ADMM algorithm even if under high privacy guarantee.

Keywords:

distributed machine learning; ADMM; quantile regression; functional principal component analysis; differential privacy

MSC:

62G05; 62G08; 62G20

1. Introduction

Machine learning is becoming more and more common in statistical modeling and data analysis, along with the increasing concerns about the privacy disclosure of data. Therefore, we urgently need to develop algorithms which can provide privacy protection for personal data. In turn, the demand for data privacy protection has stimulated the establishment of formal standards on data privacy and the development of privacy framework. Among them, differential privacy (DP) [1,2] is the most widely discussed and developed technique in theory [3,4,5,6], and the feasibility of adopting these theories is shown among others by [7,8,9]. The framework of DP makes it convenient for us to construct privacy protection algorithms. However, these privacy protection algorithms may also need to pay a price of sacrificing the rate of convergence in statistical accuracy. Therefore, we need to develop differential privacy distributed learning algorithms that do not sacrifice statistical accuracy as much as possible for large-scale distributed data.

Distributed machine learning can disassemble the original huge training task into multiple sub-tasks, that is, transforming large-scale learning that one machine can not afford during collaborative learning with multiple machines. Recently, ref. [10] gave a sparse distributed learning solution for high-dimensional problems; ref. [11] proposed a distributed learning algorithm which segments features in a high-dimensional sparse additive model and proved the consistency of the sparse patterns for each additive component; ref. [12] provided a more flexible framework using the communication-efficient surrogate likelihood (CSL) procedure, which can solve different settings such as M-estimation for low- and high-dimensional problems and Bayesian inference. Ref. [13] extended the CSL method to distributed quantile regression and then established some statistical properties under quantile loss, which does not satisfy the smoothness condition of CSL method. In distributed learning, each sub-task is executed separately on an independent machine. It allows each machine to complete a collective learning objective, which is usually a standardized empirical risk minimization problem. The individual data that do not need to be disclosed will be calculated in a local iterative algorithm, and the parameters will be transferred between the central machine and each local machine. At present, the widely used algorithms for the decentralized distributed learning problems mainly include subgradient-based algorithms [14,15], alternating direction method of multipliers(ADMM) [16,17,18,19], and the combination algorithms of these methods [20]. Ref. [21] proved that ADMM-based algorithms converge at the rate of

O (1 / L)

, while subgradient-based algorithms usually converge at the rate of

O (1 / \sqrt{L})

, where L is the number of iterations. Therefore, in this paper, we adopt an ADMM-based distributed learning algorithm against privacy disclosure and keep a statistical guarantee.

We know that sensitive individual information may be leaked in optimization algorithms as a result of sharing information such as parameters and/or gradients of the model between machines, as presented in [22,23]. The same problem exists in our ADMM-based distributed algorithm: how to avoid privacy leakage. So, we need to protect privacy via a DP mechanism while maintaining statistical accuracy in our distributed learning. Ref. [24] studied a class of regularized empirical risk minimization machine learning problems via ADMM and proposed the dual variable perturbation and the primal variable perturbation methods for dynamic differential privacy. Ref. [25] proposed a privacy-preserving cooperative learning scheme, where users are allowed to train independently using their own data and only share some updated model parameters. They used an asynchronous ADMM approach to accelerate learning. In addition, their algorithm integrates secure computing and distributed noise generation to ensure the confidentiality of shared parameters during the asynchronous ADMM algorithm process. Ref. [26] applied a new privacy-preserving distributed machine learning (PS-ADMM) algorithm based on stochastic ADMM, which provides a privacy guarantee by perturbing the gradient and has a low computational cost. In the paper, we focus on a functional linear regression model for functional data analysis via ADMM-based distributing learning to keep DP and statistical efficiency.

Functional data are natural generalizations of multivariate data from finite dimensional to infinite dimensional which are obtained by observing a number of subjects over time, space, and other continua. In practice, functional data are frequently recorded by an instrument, which involves a large number of repeated measurements per subject. They can be curves, surfaces, images, or other complex objects; see some real data sets in the monographs [27,28,29]. All in all, a functional datum is not a single observation but rather a set of measurements along a continuum; taken together, they are regarded as a single entity, curve, or image. In recent decades, functional data analysis has drawn considerable attention because advanced technology makes functional data easier to collect in applied fields such as medical studies, speech recognition, biological growth, climatology, online auctions, and so on. Time series data are treated as multivariate data because they are given as a finite discrete time series. In addition, longitudinal data, which are often observed in biomedical follow-up studies, are strongly linked with functional data; however, their use often involves several (few) measurements per subject taken intermittently at different time points for different subjects. Therefore, functional data and longitudinal data are also intrinsically different. In addition, some classic multivariate data analysis tools, applied to time series and longitudinal data analysis, cannot be directly applied to functional data analysis because they ignore the fact that the underlying object of the measurements of a subject is a function such as curve or surface. We know that functional data are intrinsically infinite dimensional, and our analysis methods cannot be based on the assumption that the values observed at different times for a single subject are independent because of the intra-observation dependence. The high intrinsic dimensionality and the intra-observation dependence of functional data pose challenges both for theory and computation.

Recently, various approaches and statistical models for the analysis of functional data have been developed. For an introduction and summary, see [27,28,29,30]. Ref. [31] firstly proposed a linear regression model and analyzed the effects of functional independent variables on the scalar response variables through the inner product of functional independent variables and unknown nonparametric coefficient function. Ref. [32] gave a functional linear semiparametric quantile regression model, which has been used to analyze ADHD-200 patients data. Ref. [33] studied the estimation problem of a functional partial quantile regression model, and proved the asymptotic normality of the finite dimensional parameter estimation. The conventional method of functional data analysis is principal component analysis (PCA), such as [34,35]. Ref. [35] gave the optimal convergence rates of PCA. We will consider functional principal components analysis (FPCA) for our functional linear regression model and investigate the distributed learning with privacy.

In this paper, we propose a new ADMM-based distributed learning algorithm with differential privacy to handle large amounts of functional data. We call it the FDP-ADMM algorithm. Our proposed FDP-ADMM algorithm has good properties such as a faster rate of convergence, lower communication and computation costs, and better utility–privacy tradeoffs. In the FDP-ADMM algorithm, we consider a more robust quantile loss function, combine an approximate augmented Lagrange function, and integrate time-varying Gaussian noise into local learning on each machine. These techniques allow the FDP-ADMM algorithm to be adversarial while protecting privacy.

The main contributions of this paper are summarized as follows:

We propose a distributed learning algorithm (FDP-ADMM) that can process large-scale distributed functional data and protect privacy. For the large-scale functional data, we adopt functional principal component analysis to reduce the dimensions of the data, improve the quality of data information, and promote the efficiency of functional data analysis using distributed learning.
We introduce a quantile loss function for functional linear model such that our models are adaptive to heavy-tail data or outliers. Thus, our ADMM-based distributed learning algorithm is more robust compared with ordinary least square procedure.
The privacy and theoretical convergence guarantees of the FDP-ADMM algorithm are derived, and a privacy–utility trade-off is demonstrated: a weaker privacy guarantee would result in better utility.
We conduct numerical experiments to illustrate the effectiveness of FDP-ADMM in the framework of distributed learning. The results of experiments are consistent with our theoretical analysis.

The rest of this paper, is organized as follows. In Section 2, we state our problem formulation by introducing the functional linear regression model, the penalized quantile regression, the ADMM algorithm, and DP. In Section 3, we propose an ADMM-based distributed learning algorithm with privacy protection for distributed functional data analysis. In Section 4, we present the utility analysis of our algorithm, FDP-ADMM, including the convergence and privacy guarantee. In Section 5, we give some numerical experiments to verify our theoretical results. Some conclusions are given in Section 6. The proofs of the main results are collected in Appendix A.

Notations

For any positive integer n, we define

[n] : = {1, 2, \dots, n}

.

∥ \cdot ∥

,

{∥ \cdot ∥}_{2}

and

{∥ \cdot ∥}_{\infty}

are denoted as the Euclidean norm,

ℓ_{2}

-norm, and

ℓ_{\infty}

-norm, respectively.

ρ_{τ} (u) = u (τ - I {u \leq 0})

is the quantile loss function for a scalar

u \in R

. and

I {\cdot}

denotes the indicator function. For a vector

u = {(u_{1}, \dots, u_{n})}^{T}

, we define

ρ_{τ} (u) = \sum_{i = 1}^{n} ρ_{τ} (u_{i})

. Throughout this paper, the constant C denotes positive constant whose value may change from line to line. For any function f and a positive function

ϕ

, f

≍ ϕ

means

a ϕ < f < b ϕ

for some positive constants a and b.

2. Problem Formulation

In this section, we present the functional data model, quantile regression, ADMM algorithm, and difference privacy mechanisms to be studied.

2.1. Functional Data Analysis

Functional data consist of functions that are basically smooth but are usually corrupted with noise, such as curves, images, and so on. For simplicity, we assume the functional predictor

X (t)

on the finite time interval

I = [0, T] .

Let

{\{x_{i} (t) : t \in I\}}_{1 \leq i \leq n}

be observation variables; that is, for each

t \in I

, there exists an observed value

x_{i} (t) \in R

. A typical functional data set is:

\{x_{i} (t_{j, i}) \in R : t_{j, i} \in I, 1 \leq i \leq n, 1 \leq j \leq J_{i}\},

where

J_{i}

is the observation number, and n is the number of individuals. If

J_{i}

is small, then the data are called sparse; otherwise they are called dense. FDA pays attention to the shape of the potential function or curve of the data via some statistical models and estimation procedures.

Functional linear regression is a standard method in functional data analysis for incorporating functional predictors, which focuses on modeling the relationship between a functional or continuous response Y and a functional predictor

X (t)

, in which t varies in a compact set I. It usually has the form:

Y = β_{0} + \int_{I} β (t) X (t) d t + ϵ,

where

β_{0}

is a intercept term,

ϵ

is the random noise independent of

X (t)

, and

β (t)

is an unknown function of interest. Without loss of generality, we assume

E (Y) = 0

and

E (X (t)) = 0

. Based on data

{(X_{i} (t), Y_{i}), i = 1, \dots, n}

, the model becomes:

Y_{i} = \int_{I} β (t) X_{i} (t) d t + ϵ .

(1)

FPCA is commonly used for analyzing such models (1) with the purpose of dimension reduction. The main idea is to summarize the data variation and information via some dimensional loadings. Dimension reduction in FPCA is performed through an expansion of basis, which consists of the eigenfunctions formed by the covariance operator

Σ (\cdot, \cdot)

of the process

X (t) : t \in I

. By Mercer’s theorem, the spectral decomposition is

Σ (s, t) = C o v (X (s), X (t)) = \sum_{k = 1}^{\infty} λ_{k} ϕ_{k} (s) ϕ_{k} (t),

where

s, t \in I

,

λ_{k}

are the ordered eigenvalues such that

λ_{1} \geq λ_{2} \geq \dots

, and the function

ϕ_{k}

forms the orthogonal basis corresponding to

λ_{k}

,

k = 1, 2, \dots

.

By Karhunen and Loève [36], in the classical functional principal component analysis, the ith random curve

X (t)

and functional coefficient

β (t)

can be expressed as:

X_{i} (t) = \sum_{k = 1}^{\infty} A_{i k} ϕ_{k} (t), β (t) = \sum_{k = 1}^{\infty} w_{k} ϕ_{k} (t),

where the coefficients

A_{i k} = \int_{I} X_{i} (t) ϕ_{k} (t) d t

and

w_{k} = \int_{I} β (t) ϕ_{k} (t) d t

are the functional principal components. In addition,

E (A_{i k}) = 0

and var

(A_{i k}) = λ_{k}

for

k = 1, 2, \dots

. We have the top K of

λ_{k}

and taking their corresponding

ϕ_{k} (t)

, we have an approximation to

X_{i} (t)

truncated as:

X_{i}^{K} (t) = \sum_{k = 1}^{K} A_{i k} ϕ_{k} (t) .

Then we consider the functional principal component regression into the functional linear model:

Y_{i} = \int_{I} X_{i} (t) β (t) d t + ϵ_{i} \approx \sum_{k = 1}^{K} A_{i k} w_{k} + ϵ_{,} f o r i = 1, \dots, n,

(2)

We regard

w_{k}

(k = 1, \dots, K)

as unknown parameters. We can select a proper K FPCA basis to represent the functional data

X_{i} (t)

; that is, the most important information of data can be refined by FPCA.

2.2. Quantile Regression with Penalties

We have i.i.d observations

(A_{i}, y_{i}), i = 1, 2, \dots, n

, with

A_{i} = (A_{i 1}, \dots, A_{i K})

. Let

Q_{y_{i} ∣ A_{i}} (τ) = A_{i} w_{τ}

be the conditional quantile of

y_{i}

on

A_{i} \in R^{K}

, for a give the quantile level

τ

th,

τ

∈(0,1). Let

w_{τ} = {(w_{1, τ}, \dots, w_{K, τ})}^{T}

. The quantile regression estimate of

w_{τ}

is defined as:

{\hat{w}}_{τ} = \underset{w \in R^{K}}{arg min} ρ_{τ} (y - A w),

(3)

where

A = {[A_{1}, \dots A_{n}]}^{T} \in R^{n \times K}

is a matrix,

ρ_{τ} (u) = u (τ - I {u \leq 0})

is the quantile loss function for a scalar

u \in R

and

I {\cdot}

denotes the indicator function.

Penalized quantile regression (PQR) is formulated as

min_{w} ρ_{τ} (y - A w) + P_{λ} (w),

(4)

where

P_{λ} (\cdot)

is a penalty, such as lasso penalty [37],

P_{λ} (w) = λ {∥w∥}_{1};

(5)

or elastic net [38],

P_{λ} (w) = λ (λ_{2} {∥w∥}_{2}^{2} + λ_{1} {∥w∥}_{1}), λ_{1}, λ_{2} \geq 0 .

(6)

Penalized QR, such as (5) and (6), leads to biased estimators. To obtain an unbiased estimator, refs. [39,40] proposed a non-convex penalty, for instance, the MCP penalty [39], or the SCAD penalty [40].

Our learning empirical loss is:

\hat{L} (w) : = \frac{1}{n} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - A_{i} w)

(7)

based on all data coming from all machines. As the data are distributed on local machines, it is difficult to collect all data into one machine, and additionally, there are privacy issues. Therefore, we apply the technique of distributed learning via the ADMM algorithm for the following distributed empirical loss (11). ADMM is a computational framework to solving optimization problems.

2.3. ADMM Algorithm

ADMM algorithm was first proposed by [41,42] in 1975 and 1976, respectively. Then, ADMM was reviewed and proven to be suitable for large-scale distributed optimization by [16]. In this section, we give the basic ADMM formulation.

Assume that our optimization problem is expressed as:

min_{x, z} {f (x) + g (z)} s . t . B x + C z = d

(8)

where

x \in R^{n}, z \in R^{s}

, matrices

B \in R^{m \times n}

and

C \in R^{m \times s}

, vector

d \in R^{m}

, and functions

f : R^{n} \to R

and

g : R^{s} \to R

. x and z are the variables needing to be optimized. The optimization problem (8) consists of two parts:

f (x)

related to variable x and

g (z)

related to variable z. This structure can easily be dealt with via ADMM as follows: First, we have the augmented Lagrangian function:

L_{ρ} ((x, z), u) : = f (x) + g (z) + u^{T} (B x + C z - d) + \frac{ρ}{2} {∥ B x + C z - d ∥}_{2}^{2},

(9)

where u is the dual variable (or called the Lagrange multiplier), and

ρ > 0

is a penalty parameter. The name ‘augmented’ in the

L_{ρ}

refers to the quadratic penalty term

\frac{ρ}{2} {∥ B x + C z - d ∥}_{2}^{2}

, which is added for better convergence properties of algorithm. Then, the ADMM iterative solution of the optimization problem (9) is:

\begin{matrix} x^{l + 1} & : = arg min_{x} L_{ρ} (x, z^{l}, u^{l}), \\ z^{l + 1} & : = arg min_{z} L_{ρ} (x^{l + 1}, z, u^{l}), \\ u^{l + 1} & : = u^{l} + ρ (B x^{l + 1} + C z^{l + 1} - d) . \end{matrix}

(10)

The ‘multiplier method’ in ADMM refers to a dual ascent using augmented Lagrange function (with quadratic penalty term), and the ‘alternating direction’ refers to variables

x

and

z

be updated alternately. For more theories and applications about ADMM, refer to [16].

2.4. Differential Privacy

Differential privacy technology was originally designed to confront differential attacks problem. The traditional protecting method is to anonymize or encrypt to the datasets. However, some individual information can still be recovered from these anonymous data, based on certain algorithms, such as the recommendation algorithm. Therefore, Ref. [3] proposed the mechanism of differential privacy to protect privacy, which adds a designed noise in the algorithm so that attackers can not recover data information. Moreover, it has been proven that as long as the noise satisfies the differential privacy mechanism, no matter how much prior information the attacker has, the anonymous data cannot be reconstructed.

This paper mainly studies differential privacy for ADMM against adversarial attacks. Intuitively speaking, if an adversary can not tell whether a individual datum x belongs to the special data set X or not, when we output results from algorithm

M (X)

. We call that DP. Now, we give the definition of the

(ϵ, δ)

-differential privacy from Dwork’s work [2].

Definition 1

(

(ϵ, δ)

-Differential Privacy). A randomized algorithm

M : X^{n} \to R

is

(ϵ, δ)

-differential private if for any two adjacent datasets (differing in only one tuple) X,

X^{'} \in X^{n}

, and for any measurable output subset

S \subseteq range (M)

:

P [M (X) \in S] \leq e^{ϵ} \cdot P [M (X^{'}) \in S] + δ,

where probability measure

P

, which is bounded, only depends on the randomness of algorithm

M

.

In Definition 1,

δ

and

ϵ

measure the protection strength of the privacy. It implies that a smaller

δ

or a smaller

ϵ

gives better privacy protection. The Laplace and Gaussian mechanisms are two typical methods that are widely used in

(ϵ, δ)

-differential privacy. They offer calibrated noise sampled from Laplace or Gaussian distribution, and add this noise into the algorithm. Now we consider a class of deferentially private algorithms via compositions, termed ‘k-fold adaptive composition’ in the literature. The advanced composition stated below, where the auxiliary inputs of the k-th algorithm are the outputs of all previous algorithms, shows how privacy parameters degrade as private algorithms are composited.

Lemma 1

(Theorem 4 in [43] (Advanced Composition)). Let

ϵ, δ \geq 0

. The class of

(ϵ, δ)

-differentially private algorithms satisfies

(ϵ^{'}, δ)

-differential privacy under k-fold adaptive composition, where

ϵ^{'} = c_{0} \sqrt{k} ϵ

for some constant

c_{0}

.

3. Distributed Learning with DP for Functional Data via ADMM

In this section, we propose the ADMM-based distributed learning algorithm with DP for functional data, which is called FDP-ADMM. We will transfer the functional regression model (1) into a linear regression model (2) by using FPCA. Because quantile regression has better performance in estimation and prediction for non-Gaussian distribution error, such as heavy-tailed distribution or outliers, we consider quantile regression for the models (1) and (2). First, based on data

{(X_{i} (t), y_{i}), i = 1, \dots, n}

, the functional quantile linear regression model we consider is

y_{i} = \int_{I} β_{τ} (t) X_{i} (t) d t + ϵ_{τ, i},

where

τ \in (0, 1)

is a given quantile level, and

ϵ_{τ, i}

are random errors. Without loss of generality, we assume that the

τ

th quantile of

ϵ_{τ, i}

is equal to zero. The model can be written as

Q_{y_{i} | X_{i}} (τ) = \int_{I} β_{τ} (t) X_{i} (t) d t

. Our goal is to learn the functional coefficient

β_{τ} (t)

by

min_{β_{τ} (t)} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - \int_{I} β_{τ} (t) X_{i} (t) d t) .

The problem is difficult because of the term

\int_{I} β_{τ} (t) X_{i} (t) d t

. By the FPCA introduced in Section 2.1, we have the model (2):

y_{i} = \int_{I} X_{i} (t) β_{τ} (t) d t + ϵ_{τ, i} \approx \sum_{k = 1}^{K} A_{i k} w_{τ, k} + ϵ_{τ, i}, f o r i = 1, \dots, n .

That is, functional quantile linear regression is transformed as an ordinary quantile linear regression. We suppress the dependency of

w_{τ, k}

on

τ

for simplicity. Then, for the quantile linear regression, we propose penalized quantile regression learning, which is formulated as

min_{w} ρ_{τ} (y - A w) + P_{λ} (w) .

It has been introduced in Section 2.2.

However, our dataset

{(X_{i} (t), y_{i}), i = 1, \dots, n}

cannot be collected on one machine, but distributed over M machines. That is, we have the distributed data

{(X_{i, j} (t), y_{i, j}), i = 1, \dots, M, j = 1, \dots, m_{i}}

, where M is the number of worker machines, and

m_{i}

is the size of sample on the ith machine. Thus, based on FPCA,

{(X_{i, j} (t), y_{i, j}), i = 1, \dots, M, j = 1, \dots, m_{i}}

is transformed as

{(y_{i j}, A_{i, j}), i = 1, \dots, M, j = 1, \dots, m_{i}}

, where

A_{i, j} = (A_{i, j 1}, \dots, A_{i, j K})

is the score of the jth sample on the ith worker machine. So, based on the distributed data

{(y_{i j}, A_{i, j}), i = 1, \dots, M, j = 1, \dots, m_{i}}

and the model (2), we have the following QR estimation in the distributed framework:

\begin{matrix} {\hat{w}}_{τ} = argmin \sum_{i = 1}^{M} (\sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} w)), \end{matrix}

(11)

where

ρ_{τ} (\cdot)

is the loss function of quantile regression, and w is the unknown coefficient. Furthermore, we modify QR as penalized QR estimator for achieving faster shrinking, that is:

\begin{matrix} {\hat{w}}_{τ} = argmin \sum_{i = 1}^{M} (\sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} w)) + P_{λ} (w) . \end{matrix}

(12)

Note that in (12), there exist two type of tuning parameters, K and

λ

, where K controls the number of scores to characterize the decomposition level of FPCA and

λ

controls the fitness of the model.

When facing big data, that is, when n is very large, it is hard for one machine to learn w in (12). So, it is necessary to distributed storage and learning. Next, we will demonstrate the ADMM-based distributed learning for penalized QR. We provide a sketch of our FDP-ADMM algorithm based on the distributed data.

3.1. ADMM-Based Distributed Learning Algorithm

Assume we have M machines, and the ith machine has

m_{i}

local data samples. Applying the ADMM algorithm, we re-formulate the problem (12) as:

\begin{matrix} min_{{\{w_{i}\}}_{i \in [M]}} & \sum_{i = 1}^{M} (\sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} w_{i}) + \frac{λ}{M} P (w_{i})), \\ s . t . & w_{i} = w, i = 1, \dots, M, \end{matrix}

(13)

where

w_{i} \in R^{K}

is the local model parameters, and

w \in R^{K}

is the global ones. Then, the augmented Lagrangian function for the ith machine is:

\begin{matrix} L_{ρ, i} (w_{i}, w, γ_{i}) = \sum_{j = 1}^{m_{i}} & \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} w_{i}) + \frac{λ}{M} P (w_{i}) - 〈γ_{i}, w_{i} - w〉 + \frac{ρ}{2} {∥w_{i} - w∥}^{2} \end{matrix}

(14)

\begin{matrix} s . t . w_{i} = w, i = 1, 2, \dots, M . \end{matrix}

(15)

The objective (14) is decoupled and each worker only needs to minimize the sub-problem based on its local data set. Constraints (15) enforce all the local models to consensus. It results in the following iteration:

\begin{matrix} w_{i}^{l} = \underset{w_{i}}{argmin} {\hat{L}}_{ρ, i} (w_{i}, w^{l - 1}, γ_{i}^{l - 1}), \end{matrix}

(16)

\begin{matrix} w^{l} = \frac{1}{M} \sum_{i = 1}^{M} w_{i}^{l} - \frac{1}{M} \sum_{i = 1}^{M} γ_{i}^{l - 1} / ρ, \end{matrix}

(17)

\begin{matrix} γ_{i}^{l} & = γ_{i}^{l - 1} - ρ (w_{i}^{l} - w^{l}) . \end{matrix}

(18)

Note that each machine transfers its

(w_{i}^{l}, γ_{i}^{l})

to a central machine. The central machine gathers them to update

w^{l}

and then broadcasts it to each machine. Details for the algorithm are present in Algorithm 1. Based on output

w^{L}

, we obtain:

{\hat{β}}_{N - D P} (t) = \sum_{k = 1}^{K} w_{k}^{L} ϕ_{k} (t) .

Algorithm 1:ADMM for PQR of Functional Data (F-ADMM)

3.2. ADMM-Based Distributed Learning with DP

For achieving faster optimization, we make use of the first-order approximation to the penalized objective function. Then,

L_{ρ, i} (w_{i}, w, γ_{i})

in (14) becomes:

\begin{matrix} {\hat{L}}_{ρ, i} (w_{i}, {\tilde{w}}_{i}^{l - 1}, w, γ_{i}) & = & \sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} {\tilde{w}}_{i}^{l - 1}) + \frac{λ}{M} P_{} ({\tilde{w}}_{i}^{l - 1}) \\ + 〈\sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ}^{'} (y_{i j} - A_{i, j} {\tilde{w}}_{i}^{l - 1}) + \frac{λ}{M} P_{}^{'} ({\tilde{w}}_{i}^{l - 1}), w_{i} - {\tilde{w}}_{i}^{l - 1}〉 \\ - 〈γ_{i}, w_{i} - w〉 + \frac{ρ}{2} {∥w_{i} - w∥}^{2} + \frac{{∥w_{i} - {\tilde{w}}_{i}^{l - 1}∥}^{2}}{2 η_{i}^{l}}, \end{matrix}

(19)

where

η_{i}^{l} \in R

is the time-varying step size which decreases as the iteration l grows,

ρ_{τ}^{'}

is the subgradient of the quantile loss function, and

P_{}^{'}

is the subgradient of the penalty. So, we have the following optimization problem:

\begin{matrix} min_{w_{i}} \sum_{i = 1}^{M} {\hat{L}}_{ρ, i} (w_{i}, {\tilde{w}}_{i}^{l - 1}, w, γ_{i}) s . t . w_{i} = w, i = 1, 2, \dots, M . \end{matrix}

(20)

Here, we give the ADMM-based distributed learning algorithm with DF (FDP-ADMM) as follows:

\begin{matrix} w_{i}^{l} & = \underset{w_{i}}{argmin} {\hat{L}}_{ρ, i} (w_{i}, {\tilde{w}}_{i}^{l - 1}, w^{l - 1}, γ_{i}^{l - 1}), \end{matrix}

(21)

\begin{matrix} {\tilde{w}}_{i}^{l} & = w_{i}^{l} + ξ_{i}^{l}, \end{matrix}

(22)

\begin{matrix} w^{l} & = \frac{1}{M} \sum_{i = 1}^{M} {\tilde{w}}_{i}^{l} - \frac{1}{M} \sum_{i = 1}^{M} γ_{i}^{l - 1} / ρ, \end{matrix}

(23)

\begin{matrix} γ_{i}^{l} & = γ_{i}^{l - 1} - ρ ({\tilde{w}}_{i}^{l} - w^{l}), \end{matrix}

(24)

where

ξ_{i}^{l}

in (22) are sampled from

N (0, σ_{i, l}^{2} I_{K})

, and

w^{l}

in (23) is computed on the central machine. The rest are processed at each local machine. Details on FDP-ADMM algorithm are presented in Algorithm 2. Based on output

w^{L}

of Algorithm 2, we obtain:

{\hat{β}}_{D P} (t) = \sum_{k = 1}^{K} w_{k}^{L} ϕ_{k} (t) .

Note that the central machine initializes the global

w^{0}

, while each worker machine initializes their own variables: the noisy primal variables

\{{\tilde{w}}_{i}^{0}\}

and the dual variables

\{γ_{i}^{0}\}

for

i \in [M]

.

ξ_{i}^{l}

is Gaussian noise with zero-mean and variance

σ_{i, l}^{2}

, where

σ_{i, l}^{2}

is obtained based on the Gaussian mechanism of DP, which is given in Theorem 1. Each worker machine updates its noisy primal variable

{\tilde{w}}_{i}^{l}

based on (22). Then, the central machine receives all noisy primal variables

{\{{\tilde{w}}_{i}^{l}\}}_{i \in [M]}

and the dual variables

{\{γ_{i}^{l}\}}_{i \in [M]}

from the worker machine, and updates a global variable

w^{l}

. In addition,

w^{l}

on central machine broadcasts to every worker machine to update the final dual variables

{\{γ_{i}^{l}\}}_{i \in [M]}

using (24). It is an iterative cycle.

We set the variance

σ_{i, l} = \frac{2 c_{1} \sqrt{2 ln (1.25 / δ)}}{m_{i} ϵ (ρ + 1 / η_{i}^{l})}

for obtaining the

(δ, ϵ)

-DP of the FDP-ADMM algorithm, which is set based on the Gaussian mechanism of DP.

σ_{i, l}^{2}

is time-varying; that is, it decreases as iteration l increases. The motivation of using time-varying variance in the Gaussian mechanism is to reduce the negative impact of noise and ensure the convergence of the algorithm. We find that the negative impact will be mitigated by the method of decreasing noise and can achieve a stable solution.

For the communication and computation costs of our algorithms 1 and 2, here are some remarks. We know that it is unrealistic to send the estimator of

β (t)

,

t \in [0, 1]

on a worker machine to the central machine because it is infinite dimensional. In Algorithms 1 and 2, we only transmit the K-dimensional w in each round of communication, so the communication complexity is only

O (K)

. In practice of functional data analysis, K is usually a small number such as 5, 10, etc. Therefore, our algorithms are communication-efficient because of the low communication costs. In addition, in each round of learning, each worker machine learns its own low-dimensional parameter w based on its local data, then the central machine is responsible for summarizing these parameters from worker machines. This working mechanism greatly reduces the computational costs.

Algorithm 2:ADMM-based distributed learning with DP for PQR of Functional Data (FDP-ADMM)

4. Utility Analysis

4.1. Privacy Guarantee

In the section, we will analyze the privacy guarantee of the proposed FDP-ADMM algorithm. During traditional parameter transmission, the shared information

{\{w_{i}^{l}\}}_{l \in [L]}

can divulge the sensitive messages of original data. So, it is necessary to show outputs

{\{{\tilde{w}}_{i}^{l}\}}_{l \in [L]}

with differential privacy.

Denote the two neighboring datasets

A_{i}

and

A_{i}^{'}

. So, the

w_{i, A_{i}}^{l}

and

w_{i, A_{i}^{'}}^{l}

are the primal variables obtained from every local worker machine. From the FDP-ADMM algorithm, we add noise to

w_{i}^{l}

by Gaussian mechanism. A fundamental tool used in DP is sensitivity. We use

l_{2}

-norm sensitivity. Due to the application of first-order approximation in the augmented Lagrange function, the proposed algorithm does not require the smoothness and strong convexity assumptions to the objective function for proving the sensitivity.

First, we give a lemma, which gives an

l_{2}

-norm sensitivity of

w_{i}^{l}

under the sub-gradient

ℓ^{'}

of loss function ℓ, bounded.

Lemma 2.

Assume that

{∥ℓ^{'} (\cdot)∥}_{2} \leq c_{1}

. The

l_{2}

-norm sensitivity of the local primal variable

w_{i}^{l}

update function is given by:

max_{A_{1}, A_{1}^{'}} ∥w_{i, A_{i}}^{l} - w_{i, A_{i}^{'}}^{l}∥ = \frac{2 c_{1}}{m_{i}^{2} (ρ + 1 / η_{i}^{l})} .

Its proof is given in Appendix A. Lemma 2 shows that the

l_{2}

sensitivity of

w_{i}^{l}

is affected by the time-varying

η_{i}^{l}

. We set

η_{i}^{l}

as a decreasing function of l, so the

l_{2}

sensitivity decreases with increasing l. That is, if

ϵ

and

δ

is fixed, the added noise in the proposed algorithm will become smaller as the l increases. Therefore, the algorithm will be stably convergent in spite of adding the noise. Then, we show that Algorithm 2 guarantees

(ϵ, δ)

-differential privacy.

Theorem 1.

Assume that

∥ {>A}_{i j} ∥_{2} \leq c_{1}

,

i = 1, \dots, M

and

j = 1, \dots, m_{i}

in the model (11). Let

ϵ \in (0, 1]

be arbitrary and

ξ_{i}^{k}

be the noise sampled from Gaussian mechanism with variance

σ_{i, k}^{2}

, where

σ_{i, k} = \frac{2 c_{1} \sqrt{2 ln (1.25 / δ)}}{m_{i} ϵ (ρ + 1 / η_{i}^{k})} .

The FDP-ADMM guarantees

(ϵ, δ)

-differential privacy. Specifically, for any neighboring datasets

A_{i}

and

A_{i}^{'}

, for any output

{\tilde{w}}_{i}^{k}

, the following inequality always holds:

Pr [{\tilde{w}}_{i}^{k} ∣ A_{i}] \leq e^{ϵ} \cdot Pr [{\tilde{w}}_{i}^{k} ∣ A_{i}^{'}] + δ .

Its proof is given in Appendix A.

4.2. Convergence of the FDP-ADMM Algorithm

The convergence of ADMM for convex problems has been widely studied in recent years. Under the requirement of high precision, the convergence of ADMM goes very slowly. However, under the requirement of medium precision, the convergence speed of ADMM is acceptable, and the global solution can be achieved by dozens of iterations. Furthermore, Ref. [44] showed that ADMM could attain a global linear convergence on strict convexity and Lipschitz gradient, especially when matrix B and C in (8) are full column rank. The ADMM framework is suitable for large-scale statistical learning problems. More convergence analysis of ADMM under convexity were studied by [45,46,47,48,49,50,51,52], and so on. Ref. [53] proposed an approximate ADMM algorithm to make it converge to the stable point with a large enough penalty parameter. Ref. [54] gave the convergence of quantile regression using ADMM for convex and non-convex penalties. We refer to [54] for the convergence of our FDP-ADMM algorithm.

We define

w^{*}

as the optimal solution of (13) and

c_{w} = {∥w^{*}∥}_{2}

. The convergence of the algorithm is based on the fact that the quantile loss function is convex and non-smooth. For simplicity of analysis, we define some notations as follows:

\begin{matrix} f_{i} (w_{i}) = \sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i, j} w_{i}) + \frac{λ}{M} P (w_{i}), \\ {\bar{w}}^{L} = \frac{1}{L} \sum_{l = 1}^{L} w^{l}, {\bar{γ}}_{i}^{L} = \frac{1}{L} \sum_{l = 1}^{L} γ_{i}^{l}, {\bar{w}}_{i}^{l} = \frac{1}{L} \sum_{l = 0}^{L - 1} {\tilde{w}}_{i}^{l}, \\ u_{i}^{l} = [\begin{matrix} {\tilde{w}}_{i}^{l} \\ w^{l} \\ γ_{i}^{l} \end{matrix}], u_{i} = [\begin{matrix} w_{i} \\ w \\ γ_{i} \end{matrix}], F (u_{i}^{l}) = [\begin{matrix} - γ_{i}^{l} \\ γ_{i}^{l} \\ {\tilde{w}}_{i}^{l} - w^{l} \end{matrix}] . \end{matrix}

We analyze the convergence of our proposed algorithm in terms of both the objective value and the constraint violation as [55]:

\sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w^{*}) + g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥),

where

\sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w^{*}))

is used to measure the distance between the current objective value and the optimal value, and

\sum_{i = 1}^{M} g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥

depicts the difference between the local model and the global one. If the training result of our FDP-ADMM algorithm achieves optimal and local models, it obtains consensus.

Lemma 3

([55], lemma 2). Assume

ρ_{τ} (\cdot)

and

P (\cdot)

are convex. For any

l \geq 1

, we have:

\begin{matrix} \sum_{i = 1}^{M} (f_{i} ({\tilde{w}}_{i}^{l - 1}) - f_{i} (w_{i}) + {(u_{i}^{l} - u_{i})}^{⊤} F (u_{i}^{l})) \\ \leq \sum_{i = 1}^{M} & (\frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - \frac{ρ}{2} {∥w_{i} - w^{l}∥}^{2} + \frac{ρ}{2} {∥w_{i} - w^{l - 1}∥}^{2} \\ - (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i} - {\tilde{w}}_{i}^{l - 1}〉 + \frac{1}{2 η_{i}^{l}} {∥w_{i} - {\tilde{w}}_{i}^{l - 1}∥}^{2} - \frac{1}{2 η_{i}^{l}} {∥w_{i} - {\tilde{w}}_{i}^{l}∥}^{2} \\ + \frac{1}{2 ρ} {∥γ_{i} - γ_{i}^{l - 1}∥}^{2} - \frac{1}{2 ρ} {∥γ_{i} - γ_{i}^{l}∥}^{2}) . \end{matrix}

Based on Lemma 3, we have:

Theorem 2.

Assume that

∥ A_{i j} ∥_{2} \leq c_{1}

,

i = 1, \dots, M

, and

j = 1, \dots, m_{i}

in the model (11);

P (\cdot)

are convex; and

∥P^{'} (\cdot)∥ \leq c_{2}

. The domain of the dual variable is bounded, namely,

∥γ_{i}∥ \leq g

. We set the learning rate as:

η_{i}^{l} = \frac{c_{w}}{\sqrt{2 l}} {({(c_{1} + λ c_{2} / M)}^{2} + \frac{8 K c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}})}^{- \frac{1}{2}} .

For any

L \geq 1

and g, we have:

\begin{matrix} E [\sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}^{*}) + g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥ \\ \leq & \sum_{i = 1}^{M} \frac{c_{w}}{\sqrt{L}} \sqrt{2 {(c_{1} + λ c_{2} / M)}^{2} + \frac{16 K c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}}} + \frac{M (ρ c_{w}^{2} + g^{2} / ρ)}{2 L} . \end{matrix}

Its proof is given in Appendix A. Theorem 2 shows our approach achieves the rate of convergence at

O (1 / \sqrt{L})

, and gives an explicit utility-privacy trade-off of our FDP-ADMM algorithm. For the larger

ϵ

or

δ

, our algorithm has better utility. Note that the larger

ϵ

or

δ

means the weaker privacy-preserving ability.

5. Simulation Study

In this section, we illustrate the performance of the proposed privacy-protection FDP-ADMM algorithm using a simulated study.

The simulation design is described as follows:

y_{i} = \int_{I} β (t) X_{i} (t) d t + ϵ_{τ i}, i = 1, \dots, n,

where n is the sample size on all worker machines;

β (t) = \sum_{k = 1}^{50} w_{k} ϕ_{k} (t),

with

w_{1} = 0.3

,

w_{k} = 4 {(- 1)}^{k + 1} k^{- 2}

for

k \geq 2

,

ϕ_{1} (t) \equiv 1

and

ϕ_{k} (t) = 2^{1 / 2} cos ((k - 1) π t)

for

k \geq 2

;

X_{i} (t) = \sum_{k = 1}^{50} A_{i k} ϕ_{k} (t),

where

A_{i k}

’s are independent and normal

N (0, k^{- 2})

; and the errors

ϵ_{τ i} = ϵ_{i} - F_{ϵ}^{- 1} (τ),

where

F_{ϵ}

is the distribution function of

ϵ_{i}

, take

ϵ \sim t (3)

. Note that

F_{ϵ}^{- 1}

is subtracted from

ϵ_{i}

to make the

τ

th quantile of

ϵ_{τ i}

zero for identifiability. The datasets

{(y_{i}, X_{i} (t))}_{i = 1}^{n}

are distributed on M worker machines, and each machine has the same sample size m. So,

n = M m

.

In the simulation, we set

n = 100,000

samples, and randomly split the dataset into

M = 10

, 20, and 50 groups to simulate the distributed learning condition. We take the penalty parameter

ρ = 0.1

, and the regularized parameter

λ = 0.05

. We set the level of quantile

τ = {0.1, 0.25, 0.5, 0.75, 0.9}

, and set privacy budget per iteration

ϵ = {0.01, 0.05, 0.1, 0.2}

and

δ = \{10^{- 3}, 10^{- 4}, 10^{- 5}, 10^{- 6}\}

. For each scenario, we run the algorithm 100 times. We consider our FDP-ADMM algorithm with typical

l_{1}

-norm and

l_{2}

-norm penalties and then assess it in terms of convergence and accuracy.

First, we report the mean integrated squared error (MISE) of the estimator

{\hat{β}}_{D P} (t)

computed on a grid of 100 equally spaced points on

I = [0, 1]

, that is:

MISE = E (\int_{0}^{1} {({\hat{β}}_{D P} (t) - β (t))}^{T} ({\hat{β}}_{D P} (t) - β (t)) d t) .

Second, based on

{\bar{w}}_{i}^{l}

and

{\bar{w}}^{l}

, we evaluate the convergence properties of the FDP-ADMM algorithm with respect to the augmented objective value, which measures the loss as well as the constraint penalty and is defined as:

\sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{l}) + ρ ∥{\bar{w}}_{i}^{l} - {\bar{w}}^{l}∥) .

Third, we evaluate the accuracy by empirical loss:

\frac{1}{M} \sum_{i = 1}^{M} \sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ρ_{τ} (y_{i j} - A_{i j} {\tilde{w}}_{i}^{k}) .

5.1. L1-Regularized Quantile Regression

We obtain the FDP-ADMM steps for the

l_{1}

-norm quantile regression by:

\begin{matrix} w_{i}^{l} & = (\frac{1}{m_{i}} \sum_{j = 1}^{m_{i}} A_{i, j}^{T} (τ - I_{{y_{i, j} - A_{i, j} {\tilde{w}}_{i}^{l - 1} \leq 0}}) - \frac{λ}{M} sgn ({\tilde{w}}_{i}^{l - 1}) \\ + γ_{i}^{l - 1} + ρ w^{l - 1} + {\tilde{w}}_{i}^{l - 1} / η_{i}^{l}) {(ρ + 1 / η_{i}^{l})}^{- 1}, \\ {\tilde{w}}_{i}^{l} & = w_{i}^{l} + N (0, σ_{i, l}^{2} I_{K}), \\ w^{l} & = \frac{1}{M} \sum_{i = 1}^{M} {\tilde{w}}_{i}^{l} - \frac{1}{M} \sum_{i = 1}^{M} γ_{i}^{l - 1} / ρ, \\ γ_{i}^{l} & = γ_{i}^{l - 1} - ρ ({\tilde{w}}_{i}^{l} - w^{l}), \end{matrix}

where

sgn (\cdot)

is the sign function. Since the objective function is convex but non-smooth, we use Theorem 2 to set

η_{i}^{l}

and apply Theorem 1 to set

σ_{i, l}

.

First, we list MISEs for the number of local machines,

M = 10

, 20, 50, in Table 1, Table 2 and Table 3, respectively. From Table 1, Table 2 and Table 3, we observe that (i) Our approach with larger

ϵ

and larger

δ

has better convergence for all quantile levels because their MISEs are smaller, which also implies weaker privacy protection. When

ϵ = 0.8

and

δ = 0.001

, the MISEs of our FDP-ADMM algorithm with privacy policy are comparable to the ones of non-DP algorithm (

δ = \infty

), that is, our FDP-ADMM algorithm does not sacrifice the estimation accuracy under weak privacy protection. (ii) For strong privacy protection, such as

ϵ = 0.1

, the accuracy of our training model decreases as the number of machines increases. Because the size of the local dataset is smaller for a larger number of working machines, more noise should be added into the FDP-ADMM algorithm to obtain a higher privacy guarantee. For the large number of machines, a high estimation accuracy can be achieved by reducing privacy protection, for example,

M = 50

. (iii) Our FDP-ADMM algorithm has a trade off between privacy and accuracy, i.e., the stronger the privacy protection, the lower the estimation accuracy. (iv) When

τ = 0.5

, this FDP-ADMM is a robust distributed learning algorithm for all parameters of privacy and number of machines we set. Because

τ

is farther from 0.5, its MISE is worse.

Second, we study the training performance (empirical loss) v.s. different number of distributed data sources under different levels of privacy protection when

τ = 0.5

. See Figure 1. Figure 1 shows that the accuracy of our training model will be reduced if more local machines are used. Since the number of agents is larger the smaller the size of the local dataset is, more noise should be added to guarantee the same level of differential privacy. Thus, it results in reducing the performance of the trained model. This is consistent with Theorem 1 that the standard deviation of noises is scaled by

1 / m i

. From another perspective, when more local machines participate, a weaker privacy protection can obtain a higher estimation accuracy.

Third, we illustrate the convergence of the FDP-ADMM algorithm by demonstrating how the augmented objective value converges for different values of

ϵ

and

δ

. See Figure 2. Figure 2 shows our algorithm with larger

ϵ

and

δ

(which implies the weaker privacy protection) has better convergence. This result is consistent with Theorem 2.

Finally, we evaluate the performance of FDP-ADMM by empirical loss for different levels of privacy protection. See Figure 3. Figure 3 shows our approach has fast convergence property for all privacy policies. In addition, all results we obtained show the privacy–utility trade-off of our FDP-ADMM: better utility is achieved when privacy leakage increases.

5.2. L2-Regularized Quantile Regression

We obtain the FDP-ADMM steps for

l_{2}

-norm quantile regression as follows:

\begin{matrix} w_{i}^{l} & = (\frac{1}{m_{i}} \sum_{j = 1}^{m_{i}} A_{i, j}^{T} (τ - I_{{y_{i, j} - A_{i, j} {\tilde{w}}_{i}^{l - 1} \leq 0}}) - \frac{λ}{M} {\tilde{w}}_{i}^{l - 1} + \\ γ_{i}^{l - 1} + ρ w^{l - 1} + {\tilde{w}}_{i}^{l - 1} / η_{i}^{l}) {(ρ + 1 / η_{i}^{l})}^{- 1}, \\ {\tilde{w}}_{i}^{l} & = w_{i}^{l} + N (0, σ_{i, l}^{2} I_{K}), \\ w^{l} & = \frac{1}{M} \sum_{i = 1}^{M} {\tilde{w}}_{i}^{l} - \frac{1}{M} \sum_{i = 1}^{M} γ_{i}^{l - 1} / ρ, \\ γ_{i}^{l} & = γ_{i}^{l - 1} - ρ ({\tilde{w}}_{i}^{l} - w^{l}) . \end{matrix}

Similar to the setting of Section 5.1, we present results in Table 4, Table 5 and Table 6 and Figure 4, Figure 5 and Figure 6. We also obtain the same conclusion as in Section 5.1.

6. Conclusions

In the paper, we proposed an ADMM-based differential privacy learning algorithm on penalized quantile regression for functional data: FDP-ADMM. We first transform functional quantile regression into an ordinary linear regression model by functional principal analysis, and then design the FDP-ADMM algorithm by an approximate augmented Lagrange function, ADMM algorithm, and Gaussian mechanism with time-varying variance. The FDP-ADMM is a noise-resilient, convergent, and computationally effective distributed learning algorithm, even if for high privacy guarantee. Lastly, we obtain the estimation of coefficient function with privacy protection for functional quantile regression distributed model by the Karhunen and Loève expression. We also derived the privacy guarantee and theoretical convergence by the objective value and the constraint violation. The evaluations on simulation datasets have demonstrated the effectiveness of the FDP-ADMM algorithm, even if under high privacy protection, and have shown its privacy–utility trade-off: larger

ϵ

and larger

δ

, indicating weaker privacy guarantee, results in better utility.

Author Contributions

Conceptualization, X.Z.; methodology, Y.X.; validation, X.Z. and Y.X.; investigation, X.Z.; writing—original draft preparation, Y.X.; writing—review and editing, X.Z. and Y.X.; supervision, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Chinese National Social Science Fund (19BTJ034), the National Natural Science Foundation of China (12171242, 11971235, 12071220), and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX21_1940).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In the Appendix, we give the proofs of Lemma 2 and Theorems 1 and 2.

Appendix A.1. Proof of Lemma 2

Proof.

First, we have that

{\hat{L}}_{ρ, i} (w_{i}, {\tilde{w}}_{i}^{l - 1}, w^{l - 1}, γ_{i}^{l - 1})

is convex because it is a quadratic function of

w_{i}

. Thus, we have a closed-form solution:

\begin{matrix} w_{i, A_{i}}^{l} & = {(ρ + 1 / η_{i}^{l})}^{- 1} (- \sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ℓ^{'} ({\tilde{w}}_{i}^{l - 1}, A_{i}) - \frac{λ}{M} P^{'} ({\tilde{w}}_{i}^{l - 1}) + γ_{i}^{l - 1} + ρ w^{l - 1} + \frac{{\tilde{w}}_{i}^{l - 1}}{η_{i}^{l}}), \\ w_{i, A_{i}^{'}}^{l} & = {(ρ + 1 / η_{i}^{l})}^{- 1} (- \sum_{j = 1}^{m_{i}} \frac{1}{m_{i}} ℓ_{i}^{'} ({\tilde{w}}_{i}^{l - 1}, A_{i}^{'}) - \frac{λ}{M} P^{'} ({\tilde{w}}_{i}^{l - 1}) + γ_{i}^{l - 1} + ρ w^{l - 1} + \frac{{\tilde{w}}_{i}^{l - 1}}{η_{i}^{l}}) . \end{matrix}

Then, the

l_{2}

-norm sensitivity of primal variable

w_{i}^{l}

is:

\begin{matrix} max_{A_{i}, A_{i}^{'}} {∥w_{i, A_{i}}^{l} - w_{i, A_{i}^{'}}^{l}∥}_{2} = max_{A_{i}, A_{i}^{'}} \frac{∥ℓ^{'} ({\tilde{w}}_{i}^{l - 1}, A_{i}) - ℓ^{'} ({\tilde{w}}_{i}^{l - 1}, A_{i}^{'})∥}{m_{i} (ρ + 1 / η_{i}^{l})} \\ \leq \frac{2 {∥ℓ^{'} (\cdot)∥}_{2}}{m_{i} (ρ + 1 / η_{i}^{l})} . \end{matrix}

So, Lemma 2 holds. ☐

Appendix A.2. Proof of Theorem 1

Proof.

In our quantile loss, we have a subgradient of

ρ_{τ} (u)

,

ρ_{τ}^{'} (u) = τ - I_{{u \leq 0}}

, which is bounded. Based on

∥ A_{i j} ∥_{2}

, uniformly bounded, we have the subgradient of

ρ_{τ} (y_{i j} - A_{i j} w)

with regard to w, bounded. The privacy loss from

{\tilde{w}}_{i}^{l}

is calculated as:

|ln \frac{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}]}{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}^{'}]}| = |ln \frac{Pr [{\tilde{w}}_{i}^{l^{(h)}} ∣ A_{i}]}{Pr [{\tilde{w}}_{i}^{l^{(h)}} ∣ A_{i}^{'}]}| = |ln \frac{Pr [ξ_{i}^{l^{(h)}}]}{Pr [ξ_{i}^{l,^{' (h)}]}]}|,

where

ξ_{i}^{l^{(h)}}

and

ξ_{i}^{l^{' (h)}}

are the h-entry of

ξ_{i}^{l}

and

ξ_{i}^{l,^{'}}

and are sampled from

N (0, σ_{i, l}^{2})

. The numerator in the ratio above describes the probability of seeing

{\tilde{w}}_{i}^{l}

when the database is

A_{i}

, the denominator corresponds the probability of seeing this same value when the database is

A_{i}^{'}

. This leads to:

\begin{matrix} |ln \frac{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}]}{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}^{'}]}| = |\frac{1}{2 σ_{i, l}^{2}} ({∥ξ_{i}^{l^{(h)}}∥}^{2} - {∥ξ_{i}^{l,^{' (h)}}∥}^{2})| \\ = & |\frac{1}{2 σ_{i, l}^{2}} ({∥ξ_{i}^{l^{(h)}}∥}^{2} - {∥ξ_{i}^{l^{(h)}} + (w_{i, A_{i}}^{l^{(h)}} - w_{i, A_{i}^{'}}^{l^{(h)}})∥}^{2})| \\ = & |\frac{1}{2 σ_{i, l}^{2}} (2 ξ_{i}^{l^{(h)}} ∥w_{i, A_{i}}^{l^{(h)}} - w_{i, A_{i}^{'}}^{l^{(h)}}∥ + {∥w_{i, A_{i}}^{l^{(h)}} - w_{i, A_{i}^{'}}^{l^{(h)}}∥}^{2})| . \end{matrix}

Since

∥ℓ^{'} (\cdot)∥ \leq c_{1}

, according to Lemma 2, we have:

\begin{matrix} ∥ w_{i, A_{i}}^{l^{(h)}} - w_{i, A_{i}^{'}}^{l^{(h)}} ∥ < ∥ w_{i, A_{i}}^{l} - w_{i, A_{i}^{'}}^{l} ∥ \leq 2 c_{1} / (m_{i} (ρ + 1 / η_{i}^{l})) . \end{matrix}

Thus, by letting

σ_{i, l} = 2 c_{1} \sqrt{2 ln (1.25 / δ)} / (m_{i} ϵ (ρ + 1 / η_{i}^{l}))

, we have:

|ln \frac{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}]}{Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}^{'}]}| \leq |\frac{ξ_{i}^{l^{(h)}} m_{i} (ρ + 1 / η_{i}^{l}) + c_{1}}{4 ln (1.25 / δ) c_{1} / ϵ^{2}}| .

When

|ξ_{i}^{l^{(h)}}| \leq (4 ln (1.25 / δ) c_{1} / ϵ - c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))

,

|ln (Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}] / Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}^{'}])|

is bounded by

ϵ

. Next, we need to prove that

\begin{matrix} Pr [|ξ_{i}^{l^{(h)}}| > (4 ln (1.25 / δ) c_{1} / ϵ - c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))] \leq δ, \end{matrix}

which requires

Pr [ξ_{i}^{l^{(h)}} >

(4 ln (1.25 / δ) c_{1} / ϵ - c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))] \leq δ / 2

. According to the tail bound of normal distribution

N (0, σ_{i, l}^{2})

, we have

Pr [ξ_{i}^{l^{(h)}} > r] \leq \frac{σ_{i, l}}{r \sqrt{2 π}} e^{- r^{2} / 2 σ_{i, l}^{2}} .

By letting

r = (4 ln (1.25 / δ) c_{1} / ϵ - c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))

in the above inequality, we have:

\begin{matrix} Pr [ξ_{i}^{l^{(h)}} > \frac{4 ln (1.25 / δ) c_{1} / ϵ - c_{1}}{m_{i} (ρ + 1 / η_{i}^{l})}] \\ \leq & \frac{2 \sqrt{2 ln (1.25 / δ)}}{(4 ln (1.25 / δ) - ϵ) \sqrt{2 π}} exp (- \frac{{(4 ln (1.25 / δ) - ϵ)}^{2}}{8 ln (1.25 / δ)}) . \end{matrix}

When

δ

is small

(\leq 0.01)

and

ϵ \leq 1

, we have:

\frac{2 \sqrt{2 ln (1.25 / δ)}}{(4 ln (1.25 / δ) - ϵ) \sqrt{2 π}} < \frac{1}{\sqrt{2 π}}

and

- \frac{{(4 ln (1.25 / δ) - ϵ)}^{2}}{8 ln (1.25 / δ)} < ln (\sqrt{2 π} \frac{δ}{2}) .

As a result, we have:

Pr [ξ_{i}^{l^{(h)}} > \frac{4 ln (1.25 / δ) c_{1} / ϵ - c_{1}}{m_{i} (ρ + 1 / η_{i}^{l})}] < \frac{δ}{2} .

So far, we have proved that

Pr [ξ_{i}^{l^{(h)}} > (4 ln (1.25 / δ) c_{1} / ϵ -

c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))] \leq δ / 2

; thus, we can prove that

Pr [|ξ_{i}^{l^{(h)}}| > (4 ln (1.25 / δ) c_{1} / ϵ - c_{1}) / (ϵ m_{i} (ρ + 1 / η_{i}^{l}))] \leq δ

. Define:

\begin{matrix} D_{1} = \{ξ_{i}^{l^{(h)}} : |ξ_{i}^{l^{(h)}}| \leq \frac{4 ln (1.25 / δ) c_{1} / ϵ - c_{1}}{m_{i} (ρ + 1 / η_{i}^{l})}\}, \\ D_{2} = \{ξ_{i}^{l^{(h)}} : |ξ_{i}^{l^{(h)}}| > \frac{4 ln (1.25 / δ) c_{1} / ϵ - c_{1}}{m_{i} (ρ + 1 / η_{i}^{l})}\} . \end{matrix}

Therefore, we obtain:

\begin{matrix} Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}] = & Pr [w_{i, A_{i}}^{l^{(h)}} + ξ_{i}^{l^{(h)}} : ξ_{i}^{l^{(h)}} \in D_{1}] \\ + Pr [w_{i, A_{i}}^{l^{(h)}} + ξ_{i}^{l^{(h)}} : ξ_{i}^{l^{(h)}} \in D_{2}] \\ < & e^{ϵ} \cdot Pr [{\tilde{w}}_{i}^{l} ∣ A_{i}^{'}] + δ, \end{matrix}

which proves that each iteration of DP-ADMM guarantees

(ϵ, δ)

-differential privacy. ☐

Appendix A.3. Proof of Theorem 2

Proof.

According to the convexity of

f_{i} (\cdot)

, the monotonicity of the operator

F (\cdot)

, and applying Lemma 3, we have:

\begin{matrix} \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) + {({\bar{u}}_{i}^{L} - u_{i})}^{⊤} F ({\bar{u}}_{i}^{L})) \\ = & \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) + 〈- {\bar{γ}}_{i}^{L}, {\bar{w}}_{i}^{L} - w_{i}〉 + 〈{\bar{γ}}_{i}^{L}, {\bar{w}}^{L} - w〉 + 〈{\bar{γ}}_{i}^{L} - γ_{i}, {\bar{w}}_{i}^{L} - {\bar{w}}^{L}〉) \\ \leq & \frac{1}{L} \sum_{l = 1}^{L} \sum_{i = 1}^{M} (f_{i} ({\tilde{w}}_{i}^{l - 1}) - f_{i} (w_{i}) + {(u_{i}^{l} - u_{i})}^{⊤} F (u_{i}^{l})) \\ = & \frac{1}{L} \sum_{l = 1}^{L} \sum_{i = 1}^{M} (f_{i} ({\tilde{w}}_{i}^{l - 1}) - f_{i} (w_{i}) + 〈- γ_{i}^{l}, {\tilde{w}}_{i}^{l} - w_{i}〉 + 〈γ_{i}^{l}, w^{l} - w〉 + 〈γ_{i}^{l} - γ_{i}, {\tilde{w}}_{i}^{l} - w^{l}〉) \\ \leq & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} (\frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i} - {\tilde{w}}_{i}^{l - 1}〉) \\ + \frac{1}{L} \sum_{i = 1}^{M} (\frac{1}{2 η_{i}^{L}} {∥w_{i} - {\tilde{w}}_{i}^{0}∥}^{2} + \frac{ρ}{2} {∥w_{i} - w^{0}∥}^{2} + \frac{1}{2 ρ} {∥γ_{i} - γ_{i}^{0}∥}^{2}) . \end{matrix}

Let

(w_{i}, w)

be the optimal solution

(w_{i}^{*}, w^{*})

in the above inequality. We obtain:

\begin{matrix} \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}^{*}) + 〈- {\bar{γ}}_{i}^{L}, {\bar{w}}_{i}^{L} - w_{i}^{*}〉 + 〈{\bar{γ}}_{i}^{L}, {\bar{w}}^{L} - w^{*}〉 + 〈{\bar{γ}}_{i}^{L} - γ_{i}, {\bar{w}}_{i}^{L} - {\bar{w}}^{L}〉) \\ \leq & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉 \\ + \frac{1}{L} \sum_{i = 1}^{M} \frac{c_{w}^{2}}{2 η_{i}^{L}} + \frac{M}{L} \frac{ρ}{2} c_{w}^{2} + \frac{1}{L} \sum_{i = 1}^{M} \frac{1}{2 ρ} {∥γ_{i} - γ_{i}^{0}∥}^{2} . \end{matrix}

The above inequality holds for all

γ_{i}

; thus, it also holds for

γ_{i} \in \{γ_{i} : ∥γ_{i}∥ \leq g\}

. By letting

γ_{i}

be the optimal solution, we have the maximum of the left side of the above inequality:

\begin{matrix} max_{\{γ_{i} : ∥γ_{i}∥ \leq g\}} \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}^{*}) + 〈- {\bar{γ}}_{i}^{L}, {\bar{w}}_{i}^{L} - w_{i}^{*}〉 + 〈{\bar{γ}}_{i}^{L}, {\bar{w}}^{L} - w^{*}〉 + 〈{\bar{γ}}_{i}^{L} - γ_{i}, {\bar{w}}_{i}^{L} - {\bar{w}}^{L}〉) \\ = & max_{\{γ_{i} : ∥γ_{i}∥ \leq g\}} \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) - γ_{i} ({\bar{w}}_{i}^{L} - {\bar{w}}^{L})) \\ = & \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) - max_{\{γ_{i} : ∥γ_{i}∥ \leq g\}} γ_{i} ({\bar{w}}_{i}^{L} - {\bar{w}}^{L})) \\ = & \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) + g (∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥)); \end{matrix}

In addition, we also obtain the maximum of the right side as:

\begin{matrix} \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - \sum_{i = 1}^{l} \frac{1}{L} \sum_{l = 1}^{L} (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉 \\ + \frac{1}{L} \sum_{i = 1}^{M} \frac{c_{w}^{2}}{2 η_{i}^{L}} + \frac{ρ M}{2 L} c_{w}^{2} + max_{\{γ_{i} : ∥γ_{i}∥ \leq g\}} \frac{1}{L} \sum_{i = 1}^{M} \frac{1}{2 ρ} {∥γ_{i} - γ_{i}^{0}∥}^{2} \\ = & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉 \\ + \frac{1}{L} \sum_{i = 1}^{M} \frac{c_{w}^{2}}{2 η_{i}^{L}} + \frac{ρ M}{2 L} c_{w}^{2} + \frac{M}{L} \frac{g^{2}}{2 ρ} . \end{matrix}

Thus, we obtain:

\begin{matrix} \sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}) + g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥) \\ \leq & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2} - \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} (ρ + 1 / η_{i}^{l}) 〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉 \\ + \frac{1}{L} \sum_{i = 1}^{M} \frac{c_{w}^{2}}{2 η_{i}^{L}} + \frac{ρ M}{2 L} c_{w}^{2} + \frac{M}{L} \frac{g^{2}}{2 ρ} . \end{matrix}

(A1)

Since

∥ℓ^{'} (\cdot)∥ \leq c_{1}

and

∥R^{'} (\cdot)∥ \leq c_{2}

, we have:

\begin{matrix} E [{∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2}] = {(c_{1} + λ c_{2} / M)}^{2} + 8 p c_{1}^{2} ln (1.25 / δ) / (m_{i}^{2} ϵ^{2}) . \end{matrix}

With

E [〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉] = 0

and

η_{i}^{l} = c_{w} {(2 l {(c_{1} + λ c_{2} / M)}^{2} + 16 l p c_{1}^{2} ln (1.25 / δ) / (m_{i}^{2} ϵ^{2}))}^{- \frac{1}{2}}

, by taking expectation of the previous inequality (A1), we obtain:

\begin{matrix} E [\sum_{i = 1}^{M} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}^{*}) + g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥)] \\ \leq & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} E [\frac{η_{i}^{l}}{2} {∥f_{i}^{'} ({\tilde{w}}_{i}^{l - 1}) - (ρ + 1 / η_{i}^{l}) ξ_{i}^{l}∥}^{2}] - \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} (ρ + 1 / η_{i}^{l}) E [〈ξ_{i}^{l}, w_{i}^{*} - {\tilde{w}}_{i}^{l - 1}〉] \\ + \frac{1}{L} \sum_{i = 1}^{M} \frac{c_{w}^{2}}{2 η_{i}^{L}} + \frac{ρ M}{2 L} c_{w}^{2} + \frac{M}{L} \frac{g^{2}}{2 ρ}, \end{matrix}

which leads to

\begin{matrix} E [\sum_{i = 1}^{n} (f_{i} ({\bar{w}}_{i}^{L}) - f_{i} (w_{i}^{*}) + g ∥{\bar{w}}_{i}^{L} - {\bar{w}}^{L}∥)] \\ = & \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{c_{w}}{2 \sqrt{2 l}} \sqrt{{(c_{1} + λ c_{2} / M)}^{2} + \frac{8 p c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}}} \\ + \sum_{i = 1}^{M} \frac{1}{L} \sum_{l = 1}^{L} \frac{c_{w} \sqrt{2 L}}{2} \sqrt{{(c_{1} + λ c_{2} / M)}^{2} + \frac{8 p c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}}} + \frac{M ρ}{2 t} c_{w}^{2} + \frac{M g^{2}}{2 ρ T} \\ = & \sum_{i = 1}^{M} \frac{c_{w}}{2 \sqrt{2} L} \sqrt{{(c_{1} + λ c_{2} / M)}^{2} + \frac{8 p c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}}} (\sum_{l = 1}^{L} \frac{1}{\sqrt{l}} + 2 \sqrt{L}) + \frac{M ρ}{2 L} c_{w}^{2} + \frac{M g^{2}}{2 ρ L} \\ \leq & \sum_{i = 1}^{M} \frac{\sqrt{2} c_{w}}{\sqrt{L}} \sqrt{{(c_{1} + λ c_{2} / M)}^{2} + \frac{8 p c_{1}^{2} ln (1.25 / δ)}{m_{i}^{2} ϵ^{2}}} + \frac{M (ρ c_{w}^{2} + g^{2} / ρ)}{2 L} . \end{matrix}

Thus, we complete the proof of Theorem 2. ☐

References

Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; Naor, M. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In Proceedings of the Advances in Cryptology—EUROCRYPT 2006, St. Petersburg, Russia, 28 May–1 June 2006; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4004, pp. 486–503. [Google Scholar]
Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3876, pp. 265–284. [Google Scholar]
Dwork, C.; Rothblum, G.N.; Vadhan, S. Boosting and Differential Privacy; IEEE Computer Society: Los Alamitos, CA, USA, 2010. [Google Scholar]
Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
Dwork, C.; Smith, A.; Steinke, T.; Ullman, J.; Vadhan, S. Robust Traceability from Trace Amounts. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 17–20 October 2015; pp. 650–669. [Google Scholar]
Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 308–318. [Google Scholar]
Differential Privacy Team. Learning with Privacy at Scale. 2017. Available online: https://machinelearning.apple.com/research/learning-with-privacy-at-scale (accessed on 1 February 2022).
Ding, B.; Kulkarni, J.; Yekhanin, S. Collecting Telemetry Data Privately. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 3571–3580. [Google Scholar]
Erlingsson, U.; Korolova, A.; Pihur, V. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 1054–1067. [Google Scholar]
Wang, J.; Kolar, M.; Srebro, N.; Zhang, T. Efficient Distributed Learning with Sparsity. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 3636–3645. [Google Scholar]
He, Y.; Zhou, Y.; Feng, Y. Distributed Feature Selection for High-dimensional Additive Models. arXiv 2022, arXiv:2205.07932. [Google Scholar]
Jordan, M.I.; Lee, J.D.; Yang, Y. Communication-Efficient Distributed Statistical Inference. J. Am. Stat. Assoc. 2019, 114, 668–681. [Google Scholar] [CrossRef]
Hu, A.; Jiao, Y.; Liu, Y.; Shi, Y.; Wu, Y. Distributed quantile regression for massive heterogeneous data. Neurocomputing 2021, 448, 249–262. [Google Scholar] [CrossRef]
Nedic, A.; Olshevsky, A.O.A.; Tsitsiklis, J.N. Distributed subgradient methods and quantization effects. In Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 9–11 December 2008; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2008; pp. 4177–4184. [Google Scholar]
Nedic, A.; Ozdaglar, A. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Trans. Autom. Control. 2009, 54, 48–61. [Google Scholar] [CrossRef]
Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends® Mach. Learn. 2011, 3, 1–122. [Google Scholar]
Ling, Q.; Ribeiro, A. Decentralized linearized alternating direction method of multipliers. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 5447–5451. [Google Scholar]
Shi, W.; Ling, Q.; Yuan, K.; Wu, G.; Yin, W. On the Linear Convergence of the ADMM in Decentralized Consensus Optimization. IEEE Trans. Signal Process. 2014, 62, 1750–1761. [Google Scholar] [CrossRef]
Zhang, R.; Kwok, J. Asynchronous distributed ADMM for consensus optimization. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 5, pp. 3689–3697. [Google Scholar]
Bianchi, P.; Hachem, W.; Iutzeler, F. A stochastic primal-dual algorithm for distributed asynchronous composite optimization. In Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Atlanta, GA, USA, 3–5 December 2014; pp. 732–736. [Google Scholar]
Wei, E.; Ozdaglar, A. Distributed Alternating Direction Method of Multipliers. In Proceedings of the 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), Maui, HI, USA, 10–13 December 2012; pp. 5445–5450. [Google Scholar]
Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks Against Machine Learning Models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–24 May 2017; pp. 3–18. [Google Scholar]
Fredrikson, M.; Jha, S.; Ristenpart, T. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In Proceedings of the 22nd ACM Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1322–1333. [Google Scholar]
Zhang, T.; Zhu, Q. Dynamic Differential Privacy for ADMM-Based Distributed Classification Learning. IEEE Trans. Inf. Forensics Secur. 2017, 12, 172–187. [Google Scholar] [CrossRef]
Guo, Y.; Gong, Y. Practical Collaborative Learning for Crowdsensing in the Internet of Things with Differential Privacy. In Proceedings of the 2018 IEEE Conference on Communications and Network Security (CNS), Beijing, China, 30 May–1 June 2018; pp. 1–9. [Google Scholar]
Ding, J.; Errapotu, S.M.; Zhang, H.; Gong, Y.; Pan, M.; Han, Z. Stochastic ADMM Based Distributed Machine Learning with Differential Privacy. Secur. Priv. Commun. Netw. 2019, 304, 257–277. [Google Scholar]
Ferraty, F.; Vieu, P. Nonparametric Functional Data Analysis; Springer: New York, NY, USA, 2006. [Google Scholar]
Ramsay, J.O.; Silverman, B.W. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005. [Google Scholar]
Zhang, J.T. Analysis of Variance for Functional Data; Chapman and Hall/CRC: New York, NY, USA, 2013. [Google Scholar]
Ramsay, J.O.; Silverman, B.W. Applied Functional Data Analysis: Methods and Case Studies; Springer: New York, NY, USA, 2002. [Google Scholar]
Ramsay, J.O.; Dalzell, C.J. Some Tools for Functional Data Analysis. J. R. Stat. Soc. Ser. Methodol. 1991, 53, 539–561. [Google Scholar] [CrossRef]
Tang, Q.; Kong, L. Quantile regression in functional linear semiparametric model. Statistics 2017, 51, 1342–1358. [Google Scholar]
Lu, Y.; Du, J.; Sun, Z. Functional partially linear quantile regression model. Metrika 2014, 77, 317–332. [Google Scholar] [CrossRef]
Auton, T. Applied Functional Data Analysis: Methods and Case Studies. J. R. Stat. Soc. Ser. Stat. Soc. 2004, 167, 378–379. [Google Scholar] [CrossRef]
Hall, P.; Horowitz, J.L. Methodology and convergence rates for functional linear regression. Ann. Stat. 2007, 35, 70–91. [Google Scholar] [CrossRef]
Karhunen, K. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci. Fenn. 1946, 1, 7. [Google Scholar]
Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. Stat. Methodol. 2011, 73, 273–282. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
Zhang, C. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38. [Google Scholar] [CrossRef]
Fan, J.; Li, R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
Glowinski, R.; Marroco, A. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. ESAIM Math. Model. Numer. Anal. 1975, 9, 41–76. [Google Scholar] [CrossRef]
Gabay, D.; Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 1976, 2, 17–40. [Google Scholar] [CrossRef]
Huang, Z.; Gong, Y. Differentially Private ADMM for Convex Distributed Learning: Improved Accuracy via Multi-Step Approximation. arXiv 2020, arXiv:2005.07890. [Google Scholar]
Deng, W.; Yin, W. On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers. J. Sci. Comput. 2016, 66, 889–916. [Google Scholar] [CrossRef]
Wang, H.; Banerjee, A. Bregman Alternating Direction Method of Multipliers. Adv. Neural Inf. Process. Syst. 2014, 4, 2816–2824. [Google Scholar]
Hong, M.; Luo, Z. On the Linear Convergence of the Alternating Direction Method of Multipliers. Math. Program. 2017, 162, 165–199. [Google Scholar] [CrossRef]
Eckstein, J.; Bertsekas, D.P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 1992, 55, 293–318. [Google Scholar] [CrossRef]
He, B.; Yuan, X. On the O(1/n) Convergence Rate of the Douglas-Rachford Alternating Direction Method. SIAM J. Numer. Anal. 2012, 50, 700–709. [Google Scholar] [CrossRef]
Sun, J.; Zhang, S. A modified alternating direction method for convex quadratically constrained quadratic semidefinite programs. Eur. J. Oper. Res. 2010, 207, 1210–1220. [Google Scholar] [CrossRef]
Zhang, S.; Ang, J.; Sun, J. An alternating direction method for solving convex nonlinear semidefinite programming problems. Optimization 2013, 62, 527–543. [Google Scholar] [CrossRef]
Goldstein, T.; O’Donoghue, B.; Setzer, S.; Baraniuk, R. Fast Alternating Direction Optimization Methods. SIAM J. Imaging Sci. 2014, 7, 1588–1623. [Google Scholar] [CrossRef]
Gabay, D. Chapter IX Applications of the Method of Multipliers to Variational Inequalities. Augment. Lagrangian Methods Appl. Solut. -Bound.-Value Probl. 1983, 15, 299–331. [Google Scholar]
Li, G.; Pong, T. Global Convergence of Splitting Methods for Nonconvex Composite Optimization. SIAM J. Optim. 2015, 25, 2434–2460. [Google Scholar] [CrossRef]
Yu, L.; Lin, N. ADMM for Penalized Quantile Regression in Big Data. Int. Stat. Rev. 2017, 85, 494–518. [Google Scholar] [CrossRef]
Huang, Z.; Hu, R.; Guo, Y.; Chan-Tin, E.; Gong, Y. DP-ADMM: ADMM-Based Distributed Learning With Differential Privacy. IEEE Trans. Inf. Forensics Secur. 2020, 15, 1002–1012. [Google Scholar] [CrossRef]

Figure 1. Impact of the number of distributed data sources on FDP-ADMM for

l_{1}

-regularized quantile regression.

Figure 1. Impact of the number of distributed data sources on FDP-ADMM for

l_{1}

-regularized quantile regression.

Figure 2. Convergence properties of FDP-ADMM via augmented objective value for

l_{1}

-regularized quantile regression with

τ = 0.5

.

Figure 2. Convergence properties of FDP-ADMM via augmented objective value for

l_{1}

-regularized quantile regression with

τ = 0.5

.

Figure 3. Convergence properties of FDP-ADMM via empirical loss for

l_{1}

-regularized quantile regression

τ = 0.5

.

Figure 3. Convergence properties of FDP-ADMM via empirical loss for

l_{1}

-regularized quantile regression

τ = 0.5

.

Figure 4. Impact of the number of distributed data sources on FDP-ADMM for

l_{2}

-regularized quantile regression.

Figure 4. Impact of the number of distributed data sources on FDP-ADMM for

l_{2}

-regularized quantile regression.

Figure 5. Convergence properties of FDP-ADMM via augmented objective value for

l_{2}

-regularized quantile regression with

τ = 0.5

.

Figure 5. Convergence properties of FDP-ADMM via augmented objective value for

l_{2}

-regularized quantile regression with

τ = 0.5

.

Figure 6. Convergence properties of FDP-ADMM via empirical loss for

l_{2}

-regularized quantile regression

τ = 0.5

.

Figure 6. Convergence properties of FDP-ADMM via empirical loss for

l_{2}

-regularized quantile regression

τ = 0.5

.