1. Introduction
Consider a digital bank interested in building a prediction model for credit scoring based on data features of given individuals, such as savings information and spending habits, that are distributed across other banks, fintech companies, and online retail shops (see Figure 1). Data labels indicating loan approval or rejection reside at a trusted third-party credit bureau, which keeps track of the approved loans [1]. This setting exemplifies vertical federated learning (FL), in which data features are scattered across different participating agents, with data barriers between them preventing a direct exchange of information.
Unlike conventional horizontal FL, in which agents have independent data points, in vertical FL settings, inter-agent collaboration can be beneficial not only during the learning phase but also during the inference phase [2,3]. It is therefore important to understand at a fundamental theoretical level whether decentralization, wherein agents use only local data for learning and/or inference, entails a significant performance loss as compared to collaborative learning and/or inference. This is the subject of this paper.
As a first attempt in this direction, Chen et al. [3] address this problem by studying a binary classification problem in which each class corresponds to a bivariate Gaussian distribution over two input features, which are vertically distributed between two agents. The authors identify four collaboration settings, depending on whether collaboration occurs during the learning and/or inference phases: collaborative learning–collaborative inference (CL/CI), collaborative learning–decentralized inference (CL/DI), decentralized learning–collaborative inference (DL/CI), and decentralized learning–decentralized inference (DL/DI). By taking a frequentist approach, the authors compare the classification error rates achieved under these four settings.
In this work, inspired by [3], we develop a novel information-theoretic approach to quantify the cost of decentralization for general supervised learning problems with any number of agents and under privacy constraints. Specifically, we consider a supervised learning problem defined by an arbitrary joint distribution involving the feature vector $X = (X_1, \ldots, X_K)$ and label $Y$, with the feature vector vertically partitioned among any number $K$ of local agents. A trusted central server, also called a data scientist or aggregator [4], holds the labels, which it shares with the agents upon request (see Figure 1). The agents collaborate through the aggregator during learning and/or inference. To limit the information leakage from the shared feature to an adversarial eavesdropper, unlike [3], privacy constraints are imposed on the aggregation mapping. By adopting a Bayesian framework, we characterize the average predictive performance of the four settings (CL/CI, CL/DI, DL/CI, and DL/DI) under privacy constraints via information-theoretic metrics. Finally, we illustrate the relations between the four collaboration settings, with and without privacy constraints, on two numerical examples.
In line with the recent works [5,6], this work relates information-theoretic measures to learning-centric performance metrics with the goal of providing theoretical insights. Specifically, we leverage information-theoretic tools to gain insights into the performance degradation resulting from decentralized learning and/or inference for general supervised learning problems. The main contribution is hence of a theoretical nature, as it provides a connection between information-theoretic metrics and practically relevant measures of generalization in decentralized Bayesian learning and inference.
2. Problem Formulation
Setting: We study a vertical federated learning (FL) setting with $K$ agents that can cooperate during the learning and/or inference phases of operation of the system. Our main goal is to quantify, using information-theoretic metrics, the benefits of cooperation for learning and/or inference. We focus on a supervised learning problem, in which each data point corresponds to a tuple $(X, Y)$ encompassing the $K$-dimensional feature vector $X = (X_1, \ldots, X_K)$ and the scalar output label $Y$. As illustrated in Figure 1, each $k$th feature $X_k$ in vector $X$ is observed only by the $k$th agent. A trusted central server, referred to as the aggregator, holds the output label $Y$, which it shares with the agents on request [4,7]. Features and labels can take values in arbitrary alphabets. The unknown data distribution is assumed to belong to a model class of joint distributions $\{P_{X,Y|W}\}$ that are identified by a model parameter vector $W$ taking values in some space $\mathcal{W}$. Adopting a Bayesian approach, we endow the model parameter vector with a prior distribution $P_W$.
As illustrated in Figure 1, let $\mathcal{D} = (X_{\mathcal{D}}, Y_{\mathcal{D}})$ denote a training data set of $N$ labelled samples, which, when conditioned on model parameter $W$, are assumed to be generated i.i.d. according to distribution $P_{X,Y|W}$. The $N \times K$ matrix $X_{\mathcal{D}}$ collects the $K$-dimensional feature vectors $\{X(n)\}_{n=1}^{N}$ by rows. We denote as $X_k(n)$ the $(n,k)$th element of matrix $X_{\mathcal{D}}$, for $n = 1, \ldots, N$ and $k = 1, \ldots, K$; and as $X_{\mathcal{D},k} = (X_k(1), \ldots, X_k(N))^{\top}$ (where $(\cdot)^{\top}$ is the transpose operation) the $k$th column of the data matrix, which corresponds to the observations of agent $k$. The goal of the system is to use the training data set $\mathcal{D}$ to infer the model parameter $W$, which enables the agents to predict the label $Y$ of a new, previously unseen, test feature input $X = (X_1, \ldots, X_K)$. The joint distribution of model parameter $W$, training data $\mathcal{D}$, and test data $(X, Y)$ can be written as follows ([8], Chapter 3.3):
$$P_{W, \mathcal{D}, X, Y} = P_W \otimes \Big( \bigotimes_{n=1}^{N} P_{X(n), Y(n) \mid W} \Big) \otimes P_{X, Y \mid W}, \tag{1}$$
with $\otimes$ representing the product of distributions, and the conditional distribution $P_{X(n), Y(n)|W}$ being equal to $P_{X,Y|W}$ for $n = 1, \ldots, N$.
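To make the generative model (1) concrete, the following minimal Python sketch simulates one hypothetical instance of the setting; the Gaussian prior, Gaussian features, and logistic label likelihood are illustrative assumptions, not choices prescribed by the formulation above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 100  # number of agents / training samples

# Prior P_W: model parameter W drawn from a standard Gaussian (illustrative choice).
W = rng.standard_normal(K)

def sample_pair(n_samples):
    """Draw (X, Y) i.i.d. from an assumed P_{X,Y|W}: Gaussian features and a
    Bernoulli label with logistic link (a stand-in for the generic model class)."""
    X = rng.standard_normal((n_samples, K))
    p = 1.0 / (1.0 + np.exp(-X @ W))
    Y = rng.binomial(1, p)
    return X, Y

# Training set D = (X_D, Y_D): N rows, one K-dimensional feature vector per row.
X_D, Y_D = sample_pair(N)
# Test pair (X, Y), drawn from the same conditional distribution given W.
X_test, Y_test = sample_pair(1)

# Vertical partition: agent k observes only the k-th column X_{D,k}.
local_features = [X_D[:, k] for k in range(K)]
print([lf.shape for lf in local_features], Y_D.shape)
```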
Collaborative/decentralized learning/inference: In the learning phase, the training data $\mathcal{D}$ is used to infer the model parameter $W$, enabling the agents in the inference phase to make predictions about the test label $Y$ given the test feature vector $X$ based on the model $P_{X,Y|W}$. Either or both the learning and inference phases can be carried out collaboratively by the agents or in a decentralized fashion (i.e., separately by each agent). When collaborating for learning or inference, the $K$ agents share their locally observed feature data via the aggregator. The operation of the aggregator is modelled as a stochastic aggregation mapping $T_{Z|X}$ from the $K$ input local features $X = (X_1, \ldots, X_K)$ to an output shared feature $Z$, to be used by each of the $K$ local agents. As detailed next, for learning, the mapping is applied independently to each data point. Furthermore, as we also detail later in this section, we impose privacy constraints on the aggregation mapping so that the shared feature $Z$ does not reveal too much information about the local agents' features. The data available to agent $k$ in each setting is summarized in the sketch following the list below.
We specifically distinguish the following four settings:
Collaborative learning–collaborative inference (CL/CI): Agents collaborate during both learning and inference phases by sharing information about their respective features. Accordingly, during learning, each agent has access to the shared training data features $Z_{\mathcal{D}} = (Z(1), \ldots, Z(N))$, where each $n$th component $Z(n)$ is generated independently by the aggregator in response to the observed feature vector $X(n)$, in addition to its own observed local feature data $X_{\mathcal{D},k}$. Furthermore, during inference, agent $k$ can use the shared test feature $Z$, obtained by aggregating the test feature vector $X$, in addition to its own observation $X_k$, in order to predict the test label $Y$.
Collaborative learning–decentralized inference (CL/DI): Agents collaborate only during learning by sharing information about their respective features as explained above, while inference is decentralized. Accordingly, during inference, each $k$th agent uses only the $k$th feature $X_k$ of the test feature vector $X$ in order to predict the test label $Y$.
Decentralized learning–collaborative inference (DL/CI): Agents collaborate for inference, while each $k$th agent is allowed to use only its observed training data $X_{\mathcal{D},k}$, along with the labels $Y_{\mathcal{D}}$ shared by the aggregator, during learning.
Decentralized learning–decentralized inference (DL/DI): Agents operate independently, with no cooperation in either the learning or the inference phase.
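The four settings can be summarized compactly by the data each agent is allowed to use. The following sketch records the training data $\mathcal{D}_k$ and test input $V_k$ of agent $k$ per setting, with mnemonic strings standing in for the corresponding random quantities defined above:

```python
# Data available to agent k in the four settings (notation from the text):
#   X_Dk : agent k's local training feature column X_{D,k}
#   Y_D  : training labels shared by the aggregator
#   Z_D  : per-sample aggregated training features Z(1), ..., Z(N)
#   X_k  : agent k's local test feature; Z : aggregated test feature
available_data = {
    #            training data D_k        test input V_k
    "CL/CI": (("X_Dk", "Z_D", "Y_D"),     ("X_k", "Z")),
    "CL/DI": (("X_Dk", "Z_D", "Y_D"),     ("X_k",)),
    "DL/CI": (("X_Dk", "Y_D"),            ("X_k", "Z")),
    "DL/DI": (("X_Dk", "Y_D"),            ("X_k",)),
}
for setting, (D_k, V_k) in available_data.items():
    print(f"{setting}: learn from {D_k}, predict from {V_k}")
```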
Privacy constraints: The aggregation mapping $T_{Z|X}$ shares the output feature $Z$ with each of the $K$ local agents during collaborative learning and/or inference. To account for privacy constraints concerning agents' data, we limit the amount of information that a "curious" eavesdropper may be able to obtain about the local features' data from observing $Z$. To this end, we impose the following privacy constraint on the aggregation mapping so that the shared feature $Z$ does not leak too much information about the local features $X_1, \ldots, X_K$ of all agents.
The aggregation mapping $T_{Z|X}$ is said to be $\epsilon$-individually private if
$$\max_{k = 1, \ldots, K} I(X_k; Z \mid X_{-k}) \leq \epsilon, \tag{2}$$
where $X_{-k} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_K)$ and $I(X_k; Z \mid X_{-k})$ is the conditional mutual information under the joint distribution $P_X \otimes T_{Z|X}$, with $P_X$ being the marginal of the joint distribution (1). The constraint (2) measures privacy against a strong eavesdropper that knows all features except the $k$th feature $X_k$. Specifically, the conditional mutual information $I(X_k; Z \mid X_{-k})$ quantifies the additional information about $X_k$ gained by the eavesdropper upon observing the shared feature $Z$. As such, the metric is also relevant as a privacy measure against "curious" agents.
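To illustrate how the leakage in (2) behaves for a concrete mechanism, consider a hypothetical aggregator, not taken from the text, that releases the noisy sum $Z = \sum_{k=1}^{K} X_k + V$, where the features $X_k \sim \mathcal{N}(0, \sigma_k^2)$ are mutually independent and $V \sim \mathcal{N}(0, \sigma^2)$ is independent noise. In this jointly Gaussian case, observing $Z$ given $X_{-k}$ is equivalent to observing $X_k + V$, so $I(X_k; Z \mid X_{-k}) = \frac{1}{2}\log(1 + \sigma_k^2/\sigma^2)$ nats, and the noise variance directly sets the privacy level $\epsilon$:

```python
import numpy as np

def individual_privacy_leakage(sigma_x, sigma_noise):
    """I(X_k; Z | X_{-k}) in nats for Z = sum_k X_k + V, with independent
    Gaussian features (stds sigma_x[k]) and Gaussian noise V (std sigma_noise).
    Given X_{-k}, observing Z is equivalent to observing X_k + V, yielding
    the closed form 0.5 * log(1 + sigma_x[k]^2 / sigma_noise^2)."""
    sigma_x = np.asarray(sigma_x, dtype=float)
    return 0.5 * np.log1p(sigma_x**2 / sigma_noise**2)

sigma_x = [1.0, 2.0, 0.5]  # per-agent feature stds (illustrative values)
for sigma_noise in (0.5, 1.0, 4.0):
    leak = individual_privacy_leakage(sigma_x, sigma_noise)
    # The mapping is eps-individually private iff max_k leakage <= eps.
    print(f"noise std {sigma_noise}: per-agent leakage {np.round(leak, 3)}, "
          f"eps = {leak.max():.3f}")
```

Increasing the noise standard deviation drives $\epsilon$ toward zero, while letting it vanish recovers unconstrained sharing ($\epsilon \to \infty$).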
We note that although the privacy constraint in (2) bears a resemblance to the MI-differential privacy (MI-DP) constraint introduced in [9], the condition (2) does not have the same operational meaning. In fact, the MI-DP constraint in [9,10] or the $f$-divergence-based DP constraint in [11] ensure differential privacy for individual i.i.d. data samples of a training data set, and they rely on a mechanism that applies to the entire data set during learning. In contrast, the constraint (2) accounts for the privacy of correlated local features via a per-sample masking mechanism, and it applies to both learning and inference phases.
Predictive loss under privacy constraints: In all four settings described above, each agent $k$ uses the available training data $\mathcal{D}_k$, with $\mathcal{D}_k$ being equal to $(X_{\mathcal{D},k}, Y_{\mathcal{D}})$ for decentralized learning and to $(X_{\mathcal{D},k}, Z_{\mathcal{D}}, Y_{\mathcal{D}})$ for collaborative learning, in order to infer the model parameter $W$. The inferred model is then used to predict the label $Y$ given the test feature input $V_k$, with $V_k$ being equal to $X_k$ for decentralized inference and to $(X_k, Z)$ for collaborative inference. We impose that the aggregation mapping $T_{Z|X}$ must satisfy the privacy constraint in (2).
The joint operation of learning and inference at agent $k$ can accordingly be described via a stochastic predictive distribution $Q(\cdot \mid \mathcal{D}_k, V_k)$ on the test label $Y$ given the training data $\mathcal{D}_k$ and test feature input $V_k$. The predictive distribution can be thought of as the result of a two-step application of learning and inference, whereby a model parameter is first learned using the input training data $\mathcal{D}_k$ and is subsequently used to infer the label corresponding to the test feature input $V_k$. Note that this stochastic mapping can account for arbitrary choices of learning and inference algorithms. By optimizing over the aggregation mapping as well as over the learning and inference algorithms, we define the $\epsilon$-private predictive loss as
$$\mathcal{L}^{\epsilon} = \min_{\substack{T_{Z|X} \in \mathcal{T}:\\ (2) \text{ holds}}} \; \max_{k = 1, \ldots, K} \; \min_{Q \in \mathcal{Q}} \; \mathbb{E}\big[-\log Q(Y \mid \mathcal{D}_k, V_k)\big], \tag{3}$$
where the expectation is taken with respect to the joint distribution (1) together with the aggregation mapping $T_{Z|X}$. In (3), the aggregation mapping $T_{Z|X}$ is optimized over some specified family $\mathcal{T}$ of conditional distributions in order to minimize the worst-case predictive loss across the agents under the constraint (2). Furthermore, the inner optimization is over a class of predictive distributions $\mathcal{Q}$.
In the absence of privacy constraints (i.e., when $\epsilon = \infty$), assuming that the distribution family $\mathcal{T}$ is sufficiently large, the optimal aggregation mapping $T_{Z|X}$ puts its entire mass on the output shared feature $Z = X$. As such, under collaborative learning, each agent $k$ uses the entire feature data (i.e., $\mathcal{D}_k = (X_{\mathcal{D}}, Y_{\mathcal{D}})$), and under collaborative inference, it uses the entire test feature vector $V_k = X$. The predictive loss (3) in the absence of privacy constraints is evaluated as
$$\mathcal{L} = \max_{k = 1, \ldots, K} \; \min_{Q \in \mathcal{Q}} \; \mathbb{E}\big[-\log Q(Y \mid \mathcal{D}_k, V_k)\big]. \tag{4}$$
The predictive loss (4) represents the worst-case minimum average cross-entropy loss across all agents, which can be obtained given the information about the training data set and the test input feature [5].
3. Preliminaries and Fully Collaborative Benchmark
In this section, we first provide a brief explanation of the main information-theoretic metrics used in this work. Then, we define and derive the average predictive loss for the benchmark case in which both learning and inference are collaborative.
Information-theoretic metrics: Let $A$ and $B$ denote two (discrete or continuous) random variables with joint distribution $P_{A,B}$, and with corresponding marginals $P_A$ and $P_B$. The joint entropy of $A$ and $B$, denoted $H(A, B)$, is defined as $H(A, B) = \mathbb{E}_{P_{A,B}}[-\log P_{A,B}(A, B)]$, with $\mathbb{E}_P[\cdot]$ denoting the expectation with respect to distribution $P$. More generally, the conditional entropy of $A$ given $B$ is defined as $H(A \mid B) = \mathbb{E}_{P_{A,B}}[-\log P_{A|B}(A \mid B)]$, where $P_{A|B}$ is the conditional distribution of $A$ given $B$. By the chain rule, we have the relationship $H(A, B) = H(B) + H(A \mid B)$; we also have the property that conditioning does not increase entropy [12] (i.e., $H(A \mid B) \leq H(A)$). The mutual information $I(A; B)$ between the random variables is defined as
$$I(A; B) = H(A) - H(A \mid B).$$
Finally, for random variables $A$, $B$, and $C$ with joint distribution $P_{A,B,C}$, the conditional mutual information $I(A; B \mid C)$ between $A$ and $B$ given $C$ is defined as $I(A; B \mid C) = H(A \mid C) - H(A \mid B, C)$.
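These quantities can be computed exactly for small discrete alphabets. The following sketch, using an arbitrary illustrative pmf, checks the chain rule, the identity $I(A;B) = H(A) - H(A \mid B)$, and the fact that conditioning does not increase entropy:

```python
import numpy as np

# Arbitrary illustrative joint pmf P_{A,B} over a 2x3 alphabet.
P = np.array([[0.10, 0.25, 0.15],
              [0.20, 0.05, 0.25]])

def H(p):
    """Entropy (in nats) of a pmf, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

H_AB = H(P)                           # joint entropy H(A, B)
H_A, H_B = H(P.sum(1)), H(P.sum(0))   # marginal entropies H(A), H(B)
H_A_given_B = H_AB - H_B              # chain rule: H(A,B) = H(B) + H(A|B)
I_AB = H_A - H_A_given_B              # mutual information I(A;B)

assert H_A_given_B <= H_A + 1e-12     # conditioning does not increase entropy
print(f"H(A)={H_A:.4f}, H(A|B)={H_A_given_B:.4f}, I(A;B)={I_AB:.4f}")
```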
Private collaborative learning–collaborative inference (CL/CI): As a benchmark, we now study the predictive loss (3) for the CL/CI setting. The $\epsilon$-private predictive loss (3) of CL/CI is given as
$$\mathcal{L}^{\epsilon}_{\mathrm{CL/CI}} = \min_{T_{Z|X} \in \mathcal{T}^{\epsilon}} \; \max_{k = 1, \ldots, K} \; \min_{Q \in \mathcal{Q}} \; \mathbb{E}\big[-\log Q(Y \mid X_{\mathcal{D},k}, Z_{\mathcal{D}}, Y_{\mathcal{D}}, X_k, Z)\big], \tag{5}$$
where
$$\mathcal{T}^{\epsilon} = \Big\{ T_{Z|X} \in \mathcal{T} \,:\, \max_{k = 1, \ldots, K} I(X_k; Z \mid X_{-k}) \leq \epsilon \Big\} \tag{6}$$
is the feasible space of conditional distributions satisfying the privacy constraint (2). The following lemma presents an information-theoretic characterization of the loss $\mathcal{L}^{\epsilon}_{\mathrm{CL/CI}}$.
Lemma 1. Assume that the family $\mathcal{Q}$ comprises the set of all predictive distributions $Q(\cdot \mid \mathcal{D}_k, V_k)$. Then, the $\epsilon$-private predictive loss (5) for the CL/CI setting evaluates as
$$\mathcal{L}^{\epsilon}_{\mathrm{CL/CI}} = \min_{T_{Z|X} \in \mathcal{T}^{\epsilon}} \; \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D},k}, Z_{\mathcal{D}}, Y_{\mathcal{D}}, X_k, Z). \tag{7}$$
In addition, if $\epsilon = \infty$, and $\mathcal{T}$ includes the space of all conditional distributions $T_{Z|X}$, then the predictive loss (4) in the absence of privacy constraints for CL/CI is evaluated as
$$\mathcal{L}_{\mathrm{CL/CI}} = \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X). \tag{8}$$

Proof. For a fixed aggregation mapping $T_{Z|X}$ and an agent $k$, the predictive distribution that minimizes the inner cross-entropy term in (5), $\mathbb{E}[-\log Q(Y \mid \mathcal{D}_k, V_k)]$, is the posterior distribution $P_{Y \mid \mathcal{D}_k, V_k}$ [12], resulting in the conditional entropy term in (7). When $\epsilon = \infty$ and $\mathcal{T}$ includes the space of all conditional distributions, we have $Z = X$ and $Z_{\mathcal{D}} = X_{\mathcal{D}}$, yielding (8). □
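The first step of the proof rests on the standard decomposition of the average cross-entropy loss into a conditional entropy plus an average Kullback-Leibler divergence (Gibbs' inequality [12]): for any predictive distribution $Q$,
$$\mathbb{E}\big[-\log Q(Y \mid \mathcal{D}_k, V_k)\big] = H(Y \mid \mathcal{D}_k, V_k) + \mathbb{E}\Big[\mathrm{KL}\big(P_{Y \mid \mathcal{D}_k, V_k} \,\big\|\, Q(\cdot \mid \mathcal{D}_k, V_k)\big)\Big] \;\geq\; H(Y \mid \mathcal{D}_k, V_k),$$
with equality if and only if $Q$ coincides with the posterior $P_{Y \mid \mathcal{D}_k, V_k}$.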
4. Cost of Decentralization under Privacy Constraints
In this section, we use the benchmark predictive loss (7) observed under the ideal CL/CI setting to evaluate the cost of decentralization in the learning and/or inference phases under privacy constraints.
Lemma 2. The $\epsilon$-private predictive losses of decentralized learning and/or inference are given as
$$\mathcal{L}^{\epsilon}_{\mathrm{CL/DI}} = \min_{T_{Z|X} \in \mathcal{T}^{\epsilon}} \; \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D},k}, Z_{\mathcal{D}}, Y_{\mathcal{D}}, X_k), \tag{9}$$
$$\mathcal{L}^{\epsilon}_{\mathrm{DL/CI}} = \min_{T_{Z|X} \in \mathcal{T}^{\epsilon}} \; \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D},k}, Y_{\mathcal{D}}, X_k, Z), \tag{10}$$
$$\mathcal{L}^{\epsilon}_{\mathrm{DL/DI}} = \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D},k}, Y_{\mathcal{D}}, X_k), \tag{11}$$
where the set $\mathcal{T}^{\epsilon}$ is as defined in (6).

Proof. The result is a direct extension of Lemma 1 to CL/DI, DL/CI, and DL/DI. □
Note that the predictive loss (11) of the fully decentralized DL/DI setting does not depend on the privacy parameter $\epsilon$, since decentralization does not entail any privacy loss. Therefore, in the absence of privacy constraints, we have $\mathcal{L}_{\mathrm{DL/DI}} = \mathcal{L}^{\epsilon}_{\mathrm{DL/DI}}$, while the predictive losses in (9) and (10) evaluate as
$$\mathcal{L}_{\mathrm{CL/DI}} = \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k) \tag{12}$$
and
$$\mathcal{L}_{\mathrm{DL/CI}} = \max_{k = 1, \ldots, K} H(Y \mid X_{\mathcal{D},k}, Y_{\mathcal{D}}, X) \tag{13}$$
under the assumption of a sufficiently large $\mathcal{T}$. Furthermore, using the property that conditioning does not increase entropy [8] results in the following relation between the predictive losses of the four schemes, CL/CI, CL/DI, DL/CI, and DL/DI, in the absence of privacy constraints:
$$\mathcal{L}_{\mathrm{CL/CI}} \;\leq\; \min\big\{\mathcal{L}_{\mathrm{CL/DI}}, \mathcal{L}_{\mathrm{DL/CI}}\big\} \leq \max\big\{\mathcal{L}_{\mathrm{CL/DI}}, \mathcal{L}_{\mathrm{DL/CI}}\big\} \;\leq\; \mathcal{L}_{\mathrm{DL/DI}}. \tag{14}$$
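For instance, the leftmost inequality in (14), applied to the CL/DI scheme, can be verified by comparing (8) and (12): for every agent $k$,
$$H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X) = H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k, X_{-k}) \leq H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k),$$
since removing the conditioning on the remaining features $X_{-k}$ cannot decrease the conditional entropy; taking the maximum over $k$ preserves the inequality, and the other inequalities in (14) follow in the same way.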
The difference between the $\epsilon$-private predictive losses of the decentralized and collaborative schemes captures the cost of decentralization. Specifically, given two schemes $a, b \in \{$CL/CI, CL/DI, DL/CI, DL/DI$\}$ such that $\mathcal{L}^{\epsilon}_a \geq \mathcal{L}^{\epsilon}_b$, we define the cost of $a$ with respect to $b$ as
$$\Delta\mathcal{L}^{\epsilon}_{a/b} = \mathcal{L}^{\epsilon}_a - \mathcal{L}^{\epsilon}_b. \tag{15}$$
In the absence of privacy constraints ($\epsilon = \infty$) and assuming symmetric agents, so that the maximum in (4) is attained for any $k = 1, \ldots, K$, the cost of decentralization can be exactly characterized as in the following result.
Proposition 1. The cost of decentralization (15) for $\epsilon = \infty$ and symmetric agents can be characterized for the $k$th learning agent as detailed in Table 1, where $X_{-k} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_K)$ and $X_{\mathcal{D},-k}$ is defined analogously from the columns of $X_{\mathcal{D}}$.

Proof. We illustrate the derivation of the cost of decentralization between CL/DI and CL/CI, as the proof can be similarly completed for the remaining pairs. In the absence of privacy constraints and assuming symmetric agents, we have from (8) and (12)
$$\Delta\mathcal{L}_{\mathrm{CL/DI}/\mathrm{CL/CI}} = H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k) - H(Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X) = I(X_{-k}; Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k). \; \square$$
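As a minimal numerical check, consider the degenerate case $N = 0$ (no training data) with $K = 2$ agents, where the losses in (8) and (11)-(13) reduce to conditional entropies of $Y$ given the test features alone. The sketch below, with an arbitrary illustrative joint pmf, evaluates the losses from agent 1's perspective and verifies that the gap between decentralized and collaborative inference equals the conditional mutual information $I(X_2; Y \mid X_1)$, in line with Proposition 1:

```python
import numpy as np

# Illustrative joint pmf P(X1, X2, Y) for K = 2 agents, binary variables;
# axis order: (x1, x2, y). Chosen arbitrarily for the sake of the example.
P = np.array([[[0.15, 0.05], [0.05, 0.15]],
              [[0.05, 0.20], [0.20, 0.15]]])

def H_cond(P, cond_axes):
    """H(Y | variables on cond_axes), with Y on the last axis (in nats)."""
    axes_to_sum = tuple(a for a in (0, 1) if a not in cond_axes)
    P_joint = P.sum(axis=axes_to_sum, keepdims=True) if axes_to_sum else P
    P_cond_marg = P_joint.sum(axis=-1, keepdims=True)  # P(conditioning vars)
    with np.errstate(divide="ignore", invalid="ignore"):
        logterm = np.where(P_joint > 0, np.log(P_joint / P_cond_marg), 0.0)
    return float(-(P_joint * logterm).sum())

L_CI = H_cond(P, cond_axes=(0, 1))  # H(Y | X1, X2): collaborative inference
L_DI = H_cond(P, cond_axes=(0,))    # H(Y | X1): decentralized inference, agent 1

print(f"L_CL/CI = L_DL/CI = {L_CI:.4f} nats")
print(f"L_CL/DI = L_DL/DI = {L_DI:.4f} nats")
print(f"cost of decentralized inference I(X2; Y | X1) = {L_DI - L_CI:.4f}")
assert L_CI <= L_DI  # consistent with the ordering in (14)
```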
The results in Table 1 have intuitive interpretations. For instance, the cost $\Delta\mathcal{L}_{\mathrm{CL/DI}/\mathrm{CL/CI}} = I(X_{-k}; Y \mid X_{\mathcal{D}}, Y_{\mathcal{D}}, X_k)$ corresponds to the additional information about the label $Y$ that can be obtained from observing the features $X_{-k}$ of the other agents, given the training data $(X_{\mathcal{D}}, Y_{\mathcal{D}})$ and the local test feature $X_k$. Examples will be provided in the next section, in which the cost of decentralization is also evaluated in the presence of privacy constraints based on (7) and (9)-(11).