Article

Scalable Multi-Party Privacy-Preserving Gradient Tree Boosting over Vertically Partitioned Dataset with Outsourced Computations

1 Department of Computer Science and Electrical Engineering, Muni University, Arua P.O. Box 725, Uganda
2 Department of Computer Science, Sangmyung University, Seoul 03016, Korea
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(13), 2185; https://doi.org/10.3390/math10132185
Submission received: 1 June 2022 / Revised: 19 June 2022 / Accepted: 20 June 2022 / Published: 23 June 2022

Abstract: Due to privacy concerns, multi-party gradient tree boosting algorithms have become widely popular amongst machine learning researchers and practitioners. However, few existing works focus on vertically partitioned datasets, and those that do are either not scalable or tend to leak information. Thus, in this work, we propose SSXGB, a scalable and acceptably secure multi-party gradient tree boosting framework for vertically partitioned datasets with partially outsourced computations. Specifically, we employ an additive homomorphic encryption (HE) scheme for security. We design two sub-protocols based on the HE scheme to perform the non-linear operations associated with gradient tree boosting algorithms. Next, we propose secure training and prediction algorithms under the SSXGB framework. Then, we provide theoretical security and communication analysis for the proposed framework. Finally, we evaluate the performance of the framework with experiments using two real-world datasets.

1. Introduction

The privacy-preserving multi-party machine learning paradigm has shown promising potential in encouraging collaboration between organizations while preserving the privacy of their data [1]. The basic idea of the privacy-preserving multi-party machine learning is that each collaborating party holds a private dataset and trains a local model using the dataset. The local models from the participating parties are then aggregated to create a single and more powerful model. Hence, different organizations can jointly train a machine learning model without sharing their private datasets. Usually, privacy-preserving mechanisms such as multi-party computation, homomorphic encryption, and differential privacy are employed to improve the security of privacy-preserving multi-party machine learning frameworks.
Although privacy-preserving multi-party machine learning has attracted a lot of attention recently, the majority of the existing works focus on linear regression [2,3], logistic regression [4,5] and neural networks [6,7] over vertically and horizontally partitioned datasets. (In horizontally partitioned datasets, participants hold different subsets of samples described by the same features. For example, two regional banks may have largely disjoint user groups from their regions, but because they run similar businesses, they record similar features. In vertically partitioned datasets, participants hold the same samples described by different features. For example, a bank and an e-commerce company located in the same city are likely to share many of the same users, but since their businesses differ, they hold different features [8].)
Similar to the above methods, gradient tree boosting [9], which is one of the most popular machine learning methods, has also received considerable attention due to its effectiveness in a wide range of application areas such as fraud detection [10], feature selection [11] and product recommendation [12]. Efforts to address the privacy concerns for gradient tree boosting in multi-party settings are presented in [13,14,15,16,17,18]. The datasets in [13,14,15] are horizontally partitioned while the datasets are vertically partitioned in [16,17,18].
In this work, we focus on the latter dataset partitioning. The current privacy preservation efforts proposed for the multi-party gradient tree boosting method with vertically partitioned datasets have a number of limitations. In [18], the proposed scheme is not scalable; it is limited to only two collaborating parties. In [16,17], intermediate information is revealed during the model training. Thus, designing a scalable and yet secure gradient tree boosting scheme has remained open for investigation, and hence, we intend to answer the question of how to construct a scalable (in terms of collaborating parties) and secure XGBoost [19] over vertically partitioned datasets in this work.
Apart from the high memory usage challenge, the secure XGBoost model training requires complicated computation primitives such as division and argmax [18]. To address these challenges and build a scalable but secure XGBoost over vertically partitioned datasets, we propose the SSXGB framework that securely outsources and performs the complicated computations in encrypted form. The key idea is to allow the participants to jointly train a model by sharing their encrypted information with a server that, in turn, collaborates with a second server to securely perform further computations to complete the generation of the model. Specifically, we present an additive homomorphic encryption (HE) scheme that provides the addition (Add) and subtraction (Sub) primitives. We also present sub-protocols designed to provide additional primitives such as the multiplication (Mult) and comparisons. Next, we propose new sub-protocols based on the HE scheme for the division (Div) and argmax primitives. We employ the secure computation primitives to build the scalable and secure XGBoost model. Then, we present a secure prediction algorithm for predictions based on the trained model. We present the analysis of our framework and its implementation using real-world datasets. A summary of our contributions is presented as follows:
  • We propose sub-protocols based on an additive HE scheme used to perform primitive secure operations during a machine learning task. The sub-protocols are collaboratively executed by two non-colluding servers.
  • We design a novel scalable and privacy-preserving multi-party XGBoost training algorithm and a corresponding prediction algorithm. The algorithms are constructed under the semi-honest security assumption and there is no limit on the number of participants involved.
  • We conduct experiments using real-world datasets to demonstrate the effectiveness and efficiency of our proposed framework.
The rest of the paper is organized as follows. In Section 2, we present the related works. Section 3 contains the preliminary concepts. In Section 4, we present our proposed HE sub-protocols for non-linear operations. We present the overview of the proposed SSXGB framework in Section 5. Section 6 and Section 7 present the secure training and prediction algorithms of the SSXGB framework. In Section 8, we present theoretical security and communication analysis. Performance evaluation is presented in Section 9, and Section 10 concludes the paper.

2. Related Work

Recently, efforts devoted to multi-party machine learning research have shown a huge potential in addressing the training data scarcity problem while preserving data privacy [1,5,20,21]. However, the majority of the works focus on linear machine learning models. Little effort has been invested in researching multi-party gradient tree boosting models. Currently, multi-party gradient tree boosting frameworks can be categorized as horizontal, vertical and generic frameworks, depending on how the datasets are partitioned amongst the collaborating participants.

2.1. Horizontal Multi-Party Gradient Tree Boosting Frameworks

In horizontal multi-party gradient tree boosting frameworks, sets of samples are shared amongst the collaborating participants. Several works have adopted this approach. In [14], Ong et al. designed a multi-party gradient tree boosting framework in which the participants exchange adaptive histogram representations of their data during model learning. Liu et al. combined secret sharing with homomorphic encryption to prevent participants from dropping out and securely aggregate their gradients during XGBoost training [15]. Reference [22] employed an oblivious algorithm to prevent privacy violations at hardware enclaves during learning of a multi-party gradient tree boosting model. In [23], Yang et al. designed a multi-party tree boosting framework with anomaly detection from extremely unbalanced datasets. Reference [24] designed secure training and prediction frameworks for multi-party gradient tree boosting. A secret sharing scheme is employed for the secure training, while a key agreement scheme and an identity-based encryption and signature scheme are employed for the secure prediction framework. Unlike the above frameworks, our work focuses on vertically partitioned datasets.

2.2. Vertical Multi-Party Gradient Tree Boosting Frameworks

In vertical multi-party gradient tree boosting frameworks, dataset features are shared amongst the collaborating participants. Several existing efforts have focused on addressing concerns in this setting. In [16], Cheng et al. designed a lossless privacy-preserving multi-party gradient tree boosting framework using a homomorphic encryption scheme. The framework achieves the same accuracy as the non-federated gradient tree boosting frameworks. However, it reveals the intermediate parameters during the training process, which can lead to privacy violations. In [17,18], the authors propose secure training and prediction frameworks for privacy-preserving multi-party gradient tree boosting. However, their schemes have limited scalability, i.e., they are limited to two parties. In contrast, in our work, we propose a privacy-preserving multi-party gradient tree boosting framework that is scalable and does not expose the intermediate parameters.

2.3. Generic Multi-Party Gradient Tree Boosting Frameworks

Unlike the horizontal and vertical categories, the generic multi-party gradient tree boosting frameworks support training for generically split datasets (vertical and horizontal split datasets or their combinations). A recent work [25] proposed by Deforth et al. focuses on this category. By minimizing the number of oblivious permutation evaluations using optimization techniques and employing the Manticore multi-party computation framework [26], the authors are able to achieve a scalable and secure gradient tree boosting framework. Our work focuses on using HE instead of the multi-party computation techniques employed in this work.

3. Preliminaries

This section summarizes the gradient tree boosting framework, XGBoost, and the cryptographic foundations used to construct our proposed privacy-preserving multi-party gradient tree boosting framework.

3.1. XGBoost

XGBoost is an implementation of gradient tree boosting. It trains an additive ensemble model in a sequence of iterations [19]. At each iteration, a weaker model is learned and added to the ensemble model. Thus, the ensemble model is the sum of all the weaker models. To train a model with n samples, let the feature set and label for sample i be (X_i, y_i), and let ŷ be the model prediction. The goal is to build a model that minimizes the following objective loss function at iteration t [16,18,19]:

L = ∑_{i=1}^{n} l(y_i, ŷ_i^{(t−1)} + f_t(X_i)) + ∑_{k=1}^{t} Ω(f_k)   (1)

where ŷ_i^{(t−1)} denotes the predicted value of the ensemble model at iteration t−1, f_t(X_i) denotes the predicted value of the weaker model built at iteration t, and Ω is the regularization term associated with the number of leaf nodes and their weight values.
In XGBoost, the objective loss function is approximated as follows:

L ≃ ∑_{i=1}^{n} [ l(y_i, ŷ_i^{(t−1)}) + g_i f_t(X_i) + (1/2) h_i f_t²(X_i) ] + ∑_{k=1}^{t} Ω(f_k)   (2)

where g_i and h_i are the first- and second-order gradients of the loss function at iteration t−1, defined as:

g_i = ∂_{ŷ_i^{(t−1)}} l(y_i, ŷ_i^{(t−1)}) and h_i = ∂²_{ŷ_i^{(t−1)}} l(y_i, ŷ_i^{(t−1)})   (3)

The sums of g_i and h_i over a node's instance set I_j are computed as:

G_j = ∑_{i∈I_j} g_i and H_j = ∑_{i∈I_j} h_i   (4)

The optimal weight w_j^* for a leaf node is obtained as:

w_j^* = − (∑_{i∈I_j} g_i) / (∑_{i∈I_j} h_i + λ)   (5)

where λ is the regularizer for the leaf weights.
At each iteration, i.e., during the construction of each tree, Equation (6) is used for split decisions at each intermediate node for every possible split. The split with the highest value of L_split is chosen:

L_split = (1/2) [ (∑_{i∈I_L} g_i)² / (∑_{i∈I_L} h_i + λ) + (∑_{i∈I_R} g_i)² / (∑_{i∈I_R} h_i + λ) − (∑_{i∈I} g_i)² / (∑_{i∈I} h_i + λ) ] − γ   (6)

where γ is the regularizer for the number of leaves, and I_L and I_R are the instance sets of the left and right child nodes. Thus, G_L = ∑_{i∈I_L} g_i and H_L = ∑_{i∈I_L} h_i denote the sums of g_i and h_i over the left child node's instance space, and G_R = ∑_{i∈I_R} g_i and H_R = ∑_{i∈I_R} h_i denote the sums over the right child node's instance space.
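To make Equations (5) and (6) concrete, the short plaintext sketch below computes the optimal leaf weight and the split gain from per-sample gradients. It illustrates standard XGBoost arithmetic only, not the secure protocol, and the function names are ours.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    # Equation (5): w* = -sum(g) / (sum(h) + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Equation (6): gain of splitting the node's instance set by left_mask
    gl, hl = g[left_mask].sum(), h[left_mask].sum()
    gr, hr = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (gl**2 / (hl + lam)
                  + gr**2 / (hr + lam)
                  - (gl + gr)**2 / (hl + hr + lam)) - gamma
```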

3.2. Homomorphic Encryption (HE)

In this work, we adopt the BCP Scheme [27,28], which is an additive homomorphic encryption scheme. The scheme comprises the following algorithms:
(1) Setup(κ): For a given security parameter κ and two large primes p and q of length κ bits, the algorithm generates the public parameters (pp) and the master key (mk) as follows. First, it computes N = pq. It then randomly chooses g ∈ Z*_{N²} of order pp'qq' (s.t. g^{p'q'} = 1 + kN mod N² for some k ∈ [1, N−1]), where p' = (p−1)/2 and q' = (q−1)/2. The algorithm outputs pp as (N, g, k) and mk as (p', q').
(2) KeyGen(pp): Generates the public-secret key pairs for users. To generate the key pair for a user i, the algorithm randomly picks a_i ∈ Z_{N²} and outputs the public key (pk_i) as h_i = g^{a_i} mod N² and the secret key (sk_i) as a_i.
(3) Enc(pp, pk_i, m): Encrypts the message m ∈ Z_N under the public key pk_i. To encrypt m, the algorithm randomly chooses r ∈ Z_{N²} and outputs the ciphertext (CT) as (A, B), where

A = g^r mod N² and B = h_i^r (1 + mN) mod N².

(4) Dec(pp, sk_i, CT): Recovers the message m from CT = (A, B) using the corresponding sk_i = a_i. The recovery is performed as follows:

m = ( B/(A^{a_i}) − 1 mod N² ) / N.

Note that the above recovery is successful only for the matching pk_i–sk_i pair.
(5) mDec(pp, pk_i, mk, CT): Recovers any properly created ciphertext using the master key, i.e., the algorithm can decrypt a CT encrypted under any user's public key pk_i (so long as pk_i is legitimate). For the decryption to proceed, a_i mod N and r mod N are first computed as:

a_i mod N = ( (h_i^{p'q'} − 1 mod N²) / N ) · k^{−1} mod N

and

r mod N = ( (A^{p'q'} − 1 mod N²) / N ) · k^{−1} mod N,

where k^{−1} is the inverse of k mod N. m is then recovered from CT as:

m = ( ((B/g^{γ})^{p'q'} − 1 mod N²) / N ) · δ mod N,

where δ is the inverse of p'q' mod N and γ := a_i · r mod N.
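The following is a minimal, toy-parameter sketch of the BCP operations described above (Setup, KeyGen, Enc, Dec), together with the ciphertext addition used by the Add sub-protocol in Section 3.3. It is our own illustration, not the authors' implementation: the primes are tiny demonstration values, g is sampled by simple rejection, and mDec is omitted.

```python
import random
from math import gcd

def setup(p=23, q=47):
    # Toy safe primes (p = 2p'+1, q = 2q'+1); real deployments use >= 1024-bit keys.
    N = p * q
    pp_, qq_ = (p - 1) // 2, (q - 1) // 2
    while True:
        alpha = random.randrange(2, N * N)
        if gcd(alpha, N) != 1:
            continue
        g = pow(alpha, 2, N * N)
        t = pow(g, pp_ * qq_, N * N)        # should have the form 1 + kN mod N^2
        k = (t - 1) // N
        if (t - 1) % N == 0 and k != 0 and gcd(k, N) == 1:
            return (N, g, k), (pp_, qq_)    # pp = (N, g, k), mk = (p', q')

def keygen(pp):
    N, g, _ = pp
    a = random.randrange(1, N * N)          # secret key a_i
    return pow(g, a, N * N), a              # pk_i = h_i = g^a mod N^2, sk_i = a_i

def enc(pp, h, m):
    N, g, _ = pp
    r = random.randrange(1, N * N)
    A = pow(g, r, N * N)
    B = (pow(h, r, N * N) * (1 + m * N)) % (N * N)
    return A, B

def dec(pp, a, ct):
    N = pp[0]
    A, B = ct
    u = (B * pow(pow(A, a, N * N), -1, N * N)) % (N * N)   # equals 1 + mN mod N^2
    return (u - 1) // N

def add(pp, ct1, ct2):
    # Add sub-protocol: component-wise product yields an encryption of m1 + m2
    N = pp[0]
    return (ct1[0] * ct2[0] % (N * N), ct1[1] * ct2[1] % (N * N))

pp, mk = setup()
h, a = keygen(pp)
c = add(pp, enc(pp, h, 12), enc(pp, h, 30))
assert dec(pp, a, c) == 42
```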

3.3. BCP Sub-Protocols

To perform arithmetic operations homomorphically, Refs. [27,28,29] proposed the following sub-protocols.
(a) KeyProd: The sub-protocol transforms encryptions under the different user public keys pk_1, …, pk_n to encryptions under a joint public key pk_Σ := ∏_{i=1}^{n} pk_i. The transformation is an interactive process that involves two non-colluding servers (see [28] for the details). An encryption under the joint public key pk_Σ can only be decrypted using the sum of all the user secret keys sk_Σ = ∑_{i=1}^{n} sk_i or the master key.
(b) Add: Returns the encrypted sum of two encrypted messages. Suppose two messages m_1 and m_2 are encrypted as ⟦m_1⟧ and ⟦m_2⟧, respectively, where ⟦·⟧ denotes an encryption operation under the joint public key in our case. The Add sub-protocol sums the two ciphertexts as ⟦m_1 + m_2⟧. The fact that the BCP HE scheme is additively homomorphic under a single public key directly simplifies the Add sub-protocol: encryptions under different user public keys can first be transformed to be under the same public key using the KeyProd sub-protocol and then added. Thus, the sum of two encryptions (A, B) and (A', B') under the same public key is computed as (see [28] for the details):

(Ā, B̄) ← (A · A' mod N², B · B' mod N²).

(c) Mult: The Mult sub-protocol returns the encrypted product of two ciphertexts. The process involves an interaction between two non-colluding servers (see [28] for the details). Thus, given two ciphertexts ⟦m_1⟧ and ⟦m_2⟧, the Mult sub-protocol returns ⟦m_1 × m_2⟧.
(d) TransDec: The TransDec sub-protocol does the opposite of the KeyProd sub-protocol. It transforms encryptions under the joint public key pk_Σ to encryptions under the user public keys pk_1, …, pk_n. The transformation process involves interactions between two non-colluding servers (see [28] for the details).
(e) Neg: The Neg sub-protocol negates an encrypted message. For example, given an encryption of a message m as ⟦m⟧, the Neg sub-protocol transforms it to ⟦−m⟧ as (see [27,29] for the details):

(g^{(N−1)r} mod N², h^{(N−1)r}(1 + mN)^{N−1} mod N²) = ⟦−m⟧.

(f) Exp: Using the same principles as in (e) above, the Exp sub-protocol returns an encryption of the product of an encrypted message m and a constant κ, i.e., ⟦κm⟧. It can be computed as:

(g^{(N+κ)r} mod N², h^{(N+κ)r}(1 + mN)^{N+κ} mod N²) = ⟦κm⟧.

The correctness argument is similar to that of the Neg sub-protocol.
(g) Sub: The Sub sub-protocol returns the difference between two ciphertexts. For example, given m_1 and m_2 encrypted as ⟦m_1⟧ and ⟦m_2⟧, respectively, the Sub sub-protocol returns ⟦m_1 − m_2⟧. First, ⟦m_2⟧ is negated using the Neg sub-protocol; then Add completes the process. Thus,

⟦m_1 − m_2⟧ = Add(⟦m_1⟧, Neg(⟦m_2⟧)).

(h) LGT: The less than or greater than (LGT) sub-protocol reveals the relationship between two ciphertexts, i.e., whether m_1 ≥ m_2 or m_1 < m_2. The sub-protocol returns 1 if m_1 < m_2 and 0 otherwise. It is an adaptation of the SLT protocol in [29]. A detailed description is presented in Appendix A.
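Continuing the toy sketch from Section 3.2, the Neg, Exp and Sub operations above can be realized locally on a ciphertext (A, B) by exponentiation. This is our own illustrative code under the same assumptions, reusing the add function defined earlier.

```python
def neg(pp, ct):
    # Neg: raising both components to N-1 yields an encryption of -m mod N
    N = pp[0]
    A, B = ct
    return pow(A, N - 1, N * N), pow(B, N - 1, N * N)

def exp_const(pp, ct, kappa):
    # Exp: raising both components to N+kappa yields an encryption of kappa*m mod N
    N = pp[0]
    A, B = ct
    return pow(A, N + kappa, N * N), pow(B, N + kappa, N * N)

def sub(pp, ct1, ct2):
    # Sub: an encryption of m1 - m2 is Add(ct1, Neg(ct2))
    return add(pp, ct1, neg(pp, ct2))
```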

4. Proposed Computation Primitives

Building gradient tree boosting algorithms requires more complicated computation primitives such as division and argmax. The division sub-protocol based on the BCP scheme proposed in [29] is inefficient and unsuitable for our setting. Thus, we propose two sub-protocols, Div and Sargmax, based on the BCP HE scheme to perform the division and argmax operations, respectively.

4.1. Div

The Div sub-protocol outputs the encrypted division of two ciphertexts. Given two ciphertexts ⟦m_1⟧ and ⟦m_2⟧, the Div sub-protocol returns ⟦m_1/m_2⟧, with m_1 being the numerator and m_2 the denominator. The protocol is run interactively between two non-colluding servers, say server C and server S, as illustrated in Figure 1.
Using the Exp sub-protocol, server C first masks the ciphertexts ⟦m_1⟧ and ⟦m_2⟧ as ⟦τ_1 m_1⟧ and ⟦τ_2 m_2⟧, respectively, where τ_1, τ_2 ∈ Z_N. Server C then sends ⟦τ_1 m_1 + τ_2 m_2⟧ and ⟦τ_1 m_2⟧ to server S. The two ciphertexts are decrypted by server S. In plaintext, server S computes (1/(τ_1 m_2)) × (τ_1 m_1 + τ_2 m_2), encrypts the result and sends it back to server C. Server C extracts ⟦m_1/m_2⟧ by subtracting τ_2/τ_1 from the result received from server S. See Appendix B for the proof of correctness.
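The blinding arithmetic behind Div can be seen in plaintext with the short illustration below. It uses real numbers purely for readability; in the protocol the same steps run on ciphertexts via Exp, Add, mDec and Sub.

```python
import random

m1, m2 = 12.0, 5.0
tau1, tau2 = random.uniform(1, 100), random.uniform(1, 100)   # Server C's masks

# Server C would send these two quantities to Server S in encrypted form
masked_num = tau1 * m1 + tau2 * m2
masked_den = tau1 * m2

# Server S only ever sees the blinded ratio m1/m2 + tau2/tau1
blinded = masked_num / masked_den

# Server C removes the blinding term it knows
result = blinded - tau2 / tau1
assert abs(result - m1 / m2) < 1e-9
```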

4.2. Sargmax

The Sargmax sub-protocol returns the argument associated with the maximum of a set of encrypted values. The maximum encrypted value is identified using the LGT sub-protocol, and the argument associated with it is returned as the Sargmax sub-protocol's result. The details are shown in Algorithm 1.
Algorithm 1: Secure argmax Computation Algorithm
1: function Sargmax(dict)
2:  //Input: dict is a dictionary of encrypted values
3:  max=None, maxKey=None
4:  for key in dict.keys() do
5:   if max==None then
6:    max=dict[key], maxKey=key
7:   else
8:    //Securely compare the encrypted values; LGT returns 1 if max < dict[key]
9:    if LGT(max, dict[key])==1 then
10:     max=dict[key], maxKey=key
11:    end
12:   end
13:  end
14:  return maxKey
15: end
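A minimal Python rendering of this loop is given below; the LGT argument is assumed to be a callable comparison oracle (a stand-in for the two-server LGT sub-protocol) that returns 1 when the value encrypted in its first argument is smaller than that in its second.

```python
def sargmax(enc_dict, LGT):
    # enc_dict maps candidate keys (e.g. (p, j, k) tuples) to ciphertexts
    max_ct, max_key = None, None
    for key, ct in enc_dict.items():
        if max_ct is None or LGT(max_ct, ct) == 1:
            max_ct, max_key = ct, key
    return max_key
```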

5. Overview of Our Proposed Framework

In this section, we first describe the involved entities, followed by a security assumption and the workflow of our proposed framework.

5.1. Entities of the Framework

Our proposed framework comprises three types of entities: a set of participants, and two servers S and C, as illustrated in Figure 2.
Participants: Participants are volunteers willing to take part in multi-party gradient tree boosting model learning. In this work, each participant holds a portion of a vertically partitioned dataset. We refer to the participant holding the label feature as Label Bearing Participant (LBP). There is only one LBP. Each participant only interacts with Server C. Additionally, each participant holds a public-private key pair.
Server S: S holds pk_Σ, mk and pk_1, …, pk_n. Thus, S can decrypt any legitimately encrypted message. However, S only communicates directly with server C, i.e., it does not directly access the participants' data. It mainly helps with decrypting masked data from server C. We assume that S is honest-but-curious [30,31,32].
Server C: C holds pk_Σ and pk_1, …, pk_n. C directly communicates with all the other entities. It has access to the instance space and the encrypted intermediate parameters received from the participants. C and S then collaboratively perform computations on the received parameters to build the model. All the data sent from C to S are masked to prevent S from observing their actual contents. We also assume that C is honest-but-curious. An example of C is a cloud server.

5.2. Security Assumptions

We assume a semi-honest and non-colluding security model. In other words, all the entities follow the protocols but are curious about the other entities’ inputs and parameters. Furthermore, the entities do not collude with each other. We also assume that the communication channels between the entities are secure, i.e., no information is revealed during its transmission.

5.3. Workflow of Our Proposed Scheme

The general workflow of our proposed SSXGB training is shown in Algorithm 2. As shown in Algorithm 2, three major protocols, namely LbpXgbTrain, SBuildTree and SPredTree, are invoked during model learning. To prevent inference attacks on the label information, we adopt the second proposal of [16], in which the first tree is built by the LBP. The LbpXgbTrain protocol is thus executed by the LBP for this purpose.
Algorithm 2: Scalable and Secure XGBoost Training
1: function SxgbTrain(X, Y)
2:  //Input: X: {X_i}_{i=1}^{n} is an aggregation of X_i from n participants
3:  //Input: Y is the label borne by the LBP
4:  for t = 1, ..., T do
5:   if TreeList==Empty then
6:    LbpXgbTrain(X_lbp, Y)
7:   else
8:    Compute ⟦G_{t-1}⟧: ∪_i (⟦g_{t-1}⟧_i) and ⟦H_{t-1}⟧: ∪_i (⟦h_{t-1}⟧_i)
9:    //Construct a tree using ⟦G_{t-1}⟧ and ⟦H_{t-1}⟧
10:    F_t = SBuildTree(⟦G_{t-1}⟧, ⟦H_{t-1}⟧)
11:    //Predict using the current tree
12:    ⟦Ŷ_t⟧ = SPredTree(F_t, X)
13:    TreeList.append(F_t)
14:    ⟦Ŷ⟧ = ⟦Ŷ⟧ + ⟦Ŷ_t⟧
15:   end
16:  end
17:  return TreeList
18: end
Once the LbpXgbTrain protocol is executed, the returned parameters are encrypted and sent to server C for the rest of the participants to join the process. The two protocols SBuildTree and SPredTree are then iteratively executed by all the participants and the servers to complete the model learning process. At each iteration t, the SBuildTree is invoked to securely build a tree, while the SPredTree is invoked to make predictions using the built tree at t. Finally, the protocol returns a trained model TreeList. The details are presented in the subsequent sections.

6. Scalable and Secure Multi-Party XGBoost Building

This section presents the building of our proposed SSXGB model over the vertically partitioned dataset. We specify that all participants bear distinct sets of data features. The LBP bears the label feature. We also assume that the participants have the same samples. We emphasize that server C operates only on encrypted data and does not have direct access to the participants’ data, including the label.

6.1. The First Secure Tree Building by LBP

As stated in the previous section, to prevent participants from inferring the label information, we adopt the proposal of [16] in which the LBP builds the first tree. The LbpXgbTrain function shown in Algorithm 3 is invoked for that purpose. The LbpXgbTrain function takes as input the dataset (X_lbp, Y), where X_lbp is the feature matrix of the LBP and Y is the label information. In other words, the LBP does not require the other participants' data to build the first tree. Since there is no collaboration in building the first tree, the sub-routines ComputeBaseScore, BuildTree and PredTree, which compute the base score, build the first tree and make the initial predictions, respectively, are consistent with the mechanisms of XGBoost [19].
Algorithm 3: The First Tree Building by the LBP
1: function LbpXgbTrain(X_lbp, Y)
2:  //Input: X_lbp: {x^{i,j}_{M×N_lbp}}, where N_lbp is the number of features borne by the LBP
3:  //Input: Y: {y_i}_{i=1}^{M} is the label
4:  //Compute the base score
5:  F_0 = ComputeBaseScore(Y)
6:  TreeList = []
7:  //Initial prediction
8:  Ŷ = F_0
9:  Compute G_0: ∪_i (g_0)_i and H_0: ∪_i (h_0)_i
10:  //Construct a tree using G_0 and H_0
11:  F_1 = BuildTree(G_0, H_0)
12:  //Predict using the tree F_1
13:  Ŷ_1 = PredTree(F_1, X_lbp)
14:  Ŷ = Ŷ + Ŷ_1
15:  //Encrypt the base score and the tree node values, and update the model list
16:  TreeList.append(Enc(F_0)_{pk_lbp}, Enc(F_1)_{pk_lbp})
17:  //Encrypt the label information and the updated prediction matrix
18:  Enc(Y)_{pk_lbp}, Enc(Ŷ)_{pk_lbp}
19:  return TreeList, Enc(Y)_{pk_lbp}, Enc(Ŷ)_{pk_lbp}
20: end
Once the first tree is constructed, the LBP encrypts the base score and the node values of the tree with its public key (pk_lbp). Next, it updates the model with the encrypted base score and tree. It also encrypts the label information and the prediction matrix with its public key. Finally, it returns TreeList, Enc(Y)_{pk_lbp} and Enc(Ŷ)_{pk_lbp}. The returned parameters are sent to server C to continue with the model training.

6.2. Secure BuildTree and PredTree

Once the first tree is built by the LBP and the results are returned to server C, the rest of the participants can join to continue with the model training. First, servers C and S collaboratively transform the parameters from the LBP, encrypted under the public key pk_lbp, to encryptions under the joint public key pk_Σ using the KeyProd sub-protocol discussed in Section 3. For example, the encrypted label information Enc(Y)_{pk_lbp} is transformed to ⟦Y⟧, the encrypted prediction matrix Enc(Ŷ)_{pk_lbp} is transformed to ⟦Ŷ⟧, etc. Then server C computes the first- and second-order derivatives using the encrypted parameters for all the instances. Next, the joint model building resumes using the SBuildSPredTree protocol shown in Algorithm 4.
Algorithm 4: Secure Build and Pred Tree
1: function SBuildSPredTree(⟦G_{t-1}⟧, ⟦H_{t-1}⟧)
2:  //Input: ⟦G_{t-1}⟧: {⟦g_{t-1}⟧_i}_{i=1}^{M}, ⟦H_{t-1}⟧: {⟦h_{t-1}⟧_i}_{i=1}^{M}
3:  /* Computed by Server C */
4:  if RootNode==None then
5:   //Register the current node as the root node
6:   RootNode=CurrentNode
7:  end
8:  /* Computed at each Participant i */
9:  foreach feature j do
10:   Propose split candidates {x}_{j,0}, ..., {x}_{j,K}
11:  end
12:  foreach {x}_{j,k} do
13:   Compute ⟦G_{t-1}⟧^{j,k} and ⟦H_{t-1}⟧^{j,k}
14:   Create a lookup table and record j, k and {x}_{j,k} in the table
15:   Create a tuple (j, k)
16:  end
17:  Send the tuple T = ((j, k), ⟦G_{t-1}⟧^{j,k}, ⟦H_{t-1}⟧^{j,k}) to Server C
18:  /* Computed by Server C */
19:  j_opt, k_opt = SSplitNode(⟦G_{t-1}⟧^{j,k}, ⟦H_{t-1}⟧^{j,k}, T)
20:  /* Computed at the optimal Participant */
21:  Receive the optimal j_opt, k_opt from Server C
22:  Check the lookup table for the {x}_{j,k} associated with j_opt, k_opt
23:  Partition I based on {x}_{j,k}
24:  Record the instance space I_L with Server C
25: end
The SBuildSPredTree protocol is executed by multiple entities. The algorithm takes as input the encrypted first- and second-order derivatives computed by server C. The execution begins at server C, which assigns the current node as the root node if no root node exists. Next, each participant, including the LBP, proposes split candidate values for each of its features as in [19]. For ease of understanding, we refer to all the participants, including the LBP, as participant i. For each split candidate, each participant i computes ⟦G_{t-1}⟧^{j,k} and ⟦H_{t-1}⟧^{j,k}, which are associated with the left branch according to [19]. Each participant i also creates a lookup table to record the split candidate information. Then, for each split candidate, each participant i sends the parameter tuple T = ((j, k), ⟦G_{t-1}⟧^{j,k}, ⟦H_{t-1}⟧^{j,k}) to server C.
Server C then securely identifies the optimal split feature and value using the SSplitNode algorithm and sends the result to the participant bearing that feature (the optimal participant). The optimal participant decrypts the optimal value for the optimal feature and splits the current node's instance space accordingly. Finally, the optimal participant registers the left-branch instance space I_L with server C after the partition.

6.3. Secure Node Split Decision

From Equation (6), it can be observed that the optimal split can be obtained if the values of G_L and H_L, and G_R and H_R, are available. Hence, the secure split finding algorithm SSplitNode shown in Algorithm 5 takes as input the encrypted first and second derivatives ⟦G_{t-1}⟧ and ⟦H_{t-1}⟧ and the parameter tuple T. In this context, we simply write ⟦G⟧ and ⟦H⟧ for ⟦G_{t-1}⟧ and ⟦H_{t-1}⟧, respectively.
Algorithm 5: Securely Finding Node Split
1: function SSplitNode(⟦G_{t-1}⟧, ⟦H_{t-1}⟧, T)
2:  //Input: ⟦G_{t-1}⟧: {⟦g_{t-1}⟧_i}_{i=1}^{M}, ⟦H_{t-1}⟧: {⟦h_{t-1}⟧_i}_{i=1}^{M}, and T
3:  /* Collaboratively Computed by Server S and Server C */
4:  if CurrentNode==LeafNode then
5:   //Compute the weight of the leaf node
6:   ⟦w*⟧ = Div(Σ_i ⟦g⟧_i, (Σ_i ⟦h⟧_i + λ))
7:   ⟦ŷ⟧_i = ⟦w*⟧
8:   return ⟦Ŷ⟧
9:  end
10:  //Compute the CurrentNode's gain (cgain)
11:  ⟦G⟧ = Σ_i ⟦g⟧_i
12:  ⟦H⟧ = Σ_i ⟦h⟧_i
13:  ⟦cgain⟧ = Div(Mult(⟦G⟧, ⟦G⟧), (⟦H⟧ + λ))
14:  //Initialize the gain dictionary
15:  gainDict = {}
16:  //Enumerate all the Participants
17:  for p = 0, ..., P do
18:   //Enumerate all the features of a Participant
19:   for j = 0, ..., J do
20:    //Enumerate all the proposed thresholds
21:    for k = 0, ..., K do
22:     Receive ⟦G_L⟧ and ⟦H_L⟧ from a Participant
23:     //Compute the first derivative for the right branch
24:     ⟦G_R⟧ = Sub(⟦G⟧, ⟦G_L⟧)
25:     //Compute the second derivative for the right branch
26:     ⟦H_R⟧ = Sub(⟦H⟧, ⟦H_L⟧)
27:     //Compute gains
28:     ⟦lgain⟧ = Div((⟦G_L⟧)², (⟦H_L⟧ + λ))
29:     ⟦rgain⟧ = Div((⟦G_R⟧)², (⟦H_R⟧ + λ))
30:     ⟦gain⟧ = ⟦lgain⟧ + (Sub(⟦rgain⟧, ⟦cgain⟧))
31:     //Update the gain dictionary
32:     gainDict[(p, j, k)] = ⟦gain⟧
33:    end
34:   end
35:  end
36:  j_opt, k_opt = Sargmax_{j,k}(gainDict)
37:  return j_opt, k_opt to the optimal participant p
38:  Server C receives I_L from the optimal participant p
39:  Server C partitions its instance space into I_L and I_R
40:  Server C associates the CurrentNode with the optimal participant p as [p : Node_{j,k}]
41: end
First, the algorithm returns and stores the encrypted prediction matrix if the current node is a leaf node (shown in lines 4–9 of Algorithm 5). Otherwise, the algorithm proceeds to securely identify the optimal score. It enumerates all the participants, their features and the proposed encrypted split candidates for each feature. For each proposed split candidate, the algorithm computes an encrypted gain (shown in lines 21–30 of Algorithm 5). The encrypted gains for all the proposed split candidates are stored in a dictionary. The algorithm then executes the Sargmax primitive to identify the optimal feature j_opt and the threshold value k_opt. Next, server C sends the optimal parameters j_opt and k_opt to the optimal participant bearing the pair (j_opt, k_opt). Then server C receives I_L from the optimal participant and uses it to split its instance space into I_L and I_R. Server C then stores the current node and associates it with the optimal participant.

7. Secure Prediction

Our proposed secure prediction algorithm is collaboratively executed by all the entities, as shown in Algorithm 6. The SPredict algorithm takes as input the trained model TreeList and a record to be predicted, x^{i,j}_{1×N}, with N features. Suppose x^{i,j}_{1×N} is held by a client c with a public key pk_c. To obtain a prediction on the record, the client first encrypts the record with its public key as Enc(x^{i,j}_{1×N})_{pk_c} and sends it to server C. This preserves the privacy of the record from server C. Next, server C compares and passes the record down the tree, starting from the root node. At each node, server C identifies the participant p holding the node and the (j, k) pair associated with the node. Server C then transforms the record value for feature j to be under the public key of participant p as Enc(x^{i,j})_{pk_p} (shown in lines 12–17 of Algorithm 6) and sends Enc(x^{i,j})_{pk_p} to participant p. Next, p decrypts Enc(x^{i,j})_{pk_p} using its secret key and compares the value with the node's threshold in plaintext. Depending on the comparison result, the participant decides whether to follow the left or the right child of the current node and sends the decision to server C. Server C repeats the process until a leaf node is reached. Once a leaf node is reached, the algorithm returns the encrypted weight ⟦w⟧ of the leaf node stored at server C as its result. The final prediction result is obtained by accumulating the predictions of all the trees in TreeList.
Algorithm 6: Secure Prediction Algorithm
1: function SPredict(TreeList, x^{i,j}_{1×N})
2:  //Input: TreeList and x^{i,j}_{1×N}, where N is the number of features for the record
3:  /* Computed by all the entities */
4:  //Client c encrypts the values of the record
5:  Enc(x^{i,j}_{1×N})_{pk_c}
6:  Send the encrypted record to Server C
7:  while True do
8:   //Start from the RootNode
9:   if CurrentNode != LeafNode then
10:    Server C identifies the feature j for the split at CurrentNode
11:    Server C identifies the Participant p bearing the feature j
12:    //Transform the encryption to be under the public key of the Participant p
13:    Enc(x^{i,j} + r)_{pk_c} ← Enc(x^{i,j})_{pk_c} + Enc(r)_{pk_c}  // r ∈ Z_N
14:    Server C sends Enc(x^{i,j} + r)_{pk_c} to Server S
15:    Server S decrypts Enc(x^{i,j} + r)_{pk_c} using mDec
16:    Server S re-encrypts as Enc(x^{i,j} + r)_{pk_p} and sends it to Server C
17:    Server C extracts Enc(x^{i,j})_{pk_p} as: Enc(x^{i,j})_{pk_p} ← Enc(x^{i,j} + r)_{pk_p} − Enc(r)_{pk_p}
18:    //Collaboration with the participant bearing the optimal feature for the current node
19:    Server C sends Enc(x^{i,j})_{pk_p} to the Participant p bearing the feature j
20:    Participant p decrypts Enc(x^{i,j})_{pk_p} and compares it with the threshold value at the node
21:    Based on whether x^{i,j} is greater or less than the threshold, Participant p decides on the tree branch to follow
22:    Participant p forwards the decision to Server C to continue with the process at NextNode
23:    //Update the CurrentNode
24:    CurrentNode = NextNode
25:   else
26:    Return the encrypted weight ⟦w⟧ of the LeafNode
27:   end
28:  end

8. Analysis

8.1. Security Analysis

We consider the semi-honest (and non-colluding) model in our security analysis, i.e., we consider the scenario where all the entities adhere to the protocols but try to gather information about the other entities’ input and intermediate parameters as much as they can.

8.1.1. Security of BCP Sub-Protocols

First, we present the security analysis of all the sub-protocols used in this work. The security of the KeyProd, Add, Mult, TransDec, Sub, Exp and Neg sub-protocols under the semi-honest model has already been proven in [27,28].
Div: Similar to the other sub-protocols, the security of the Div sub-protocol is based on blinding or masking the plaintext. Given the ciphertexts (numerator and denominator), we employ the properties of the homomorphic encryption to blind the ciphertexts with random elements, which serve as keys. When server S decrypts the ciphertexts, it cannot obtain any information about the numerator and the denominator without knowing the random blinding elements; the decrypted values look random. On the other hand, since server C does not have access to the decryption key, it also obtains no information about the ciphertexts. Note that we assume the two servers do not collude. Thus, the Div sub-protocol is secure in the semi-honest and non-colluding security model.
LGT and Sargmax: The security of the LGT sub-protocol is based on the fact that server S only computes the difference between two data values. Thus, server S obtains no information about the actual data. However, server C can learn the relational information of the comparison process without learning the actual data. This is a partial information leakage and makes the sub-protocol partially secure in the semi-honest and non-colluding model. Therefore, our proposed Sargmax sub-protocol, which relies on the LGT is also partially secure in the semi-honest and non-colluding model.

8.1.2. Security of SSXGB

The security analysis of the SSXGB can be split into three parts: server S part, server C part and participant part.
Server S part: Server S does not have access to the sample and feature space. It only collaborates with Server C in performing computations. However, the computations performed by server S are on masked values. Thus, no information is leaked to server S.
Server C part: Server C has access to the sample and feature space, and it stores leaf nodes. It also knows which participant holds which feature. However, the intermediate values it has access to are encrypted, and it only performs computations on the encrypted values. Although it stores the leaf nodes, the leaf values are kept in encrypted form. Thus, no information on the actual leaf values is leaked to server C. During the node split process, illustrated by Algorithm 5, comparison results leak to server C. However, server C does not learn the actual values being compared, thus, making it difficult to re-construct the feature values from the leaked information. We, therefore, believe that the benefits of this framework outweigh this partial leakage.
Participant part: Each participant has access to the sample space for each split. Each participant knows the intermediate nodes it holds. However, each participant does not know the actual values of the intermediate parameters apart from the LBP immediately after the construction of the first tree. The participants also do not have access to the leaf nodes. Thus, no information leaks to the participants.
Since there is no information leakage in the involved entities, the proposed SSXGB is secure in the semi-honest and non-colluding model.

8.2. Communication Overhead Analysis

We analyze the communication overhead in terms of analyzing the communication costs associated with a single split. Here, we look at server C-participant and server C-server S communication costs.
Server C–participant communication cost: The server C–participant communication cost is similar to that of [16]. Given ζ as the ciphertext size, n as the number of samples for the current node, q as the number of samples in a bucket and d as the number of features held by a participant, the communication cost can be computed as 2 · ζ · d · (n/q).
Server C–server S communication cost: During each split, the server C–server S communication cost can be computed as 12 · ζ + 3 · ζ · (n/q) · D, where D is the total number of features. Our proposed SSXGB experiences a fairly heavy communication overhead during the collaborative computation of gains by the two servers.
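As a back-of-the-envelope illustration of these two formulas, the snippet below plugs in assumed values: a 1024-bit key, so a BCP ciphertext (A, B) is roughly two 2048-bit elements, and an MNIST-like setting inspired by Section 9. The parameter choices and the resulting numbers are ours and purely indicative.

```python
zeta = 2 * 2048 // 8                     # assumed bytes per ciphertext (A, B) at a 1024-bit key
n, q, d, D = 60_000, 1_000, 49, 784      # samples, bucket size, features per participant, total features

cost_participant = 2 * zeta * d * (n // q)           # Server C <-> one participant, per split
cost_servers = 12 * zeta + 3 * zeta * (n // q) * D   # Server C <-> Server S, per split
print(cost_participant, cost_servers)                # roughly 3 MB and 72 MB
```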

9. Performance Evaluation

This section presents experiments to demonstrate the effectiveness and efficiency of the proposed SSXGB.

9.1. Experimental Setup

9.1.1. Hardware and Software

All the experiments were performed using a desktop computer with Intel Core i5-6500 CPU with 3.20 GHz × 4 speed and 16 GB RAM running an Ubuntu 20.04 operating system (Canonical Ltd., London, UK). We used Python 3.8.5 and gmpy2 library for the implementation. We also used Cython 0.29.23 to speed up sections of the code. The participant and server functionalities were all executed on the same computer. Thus, latency is ignored in the experiments.

9.1.2. Datasets

We experimented using two datasets: Iris [33] and MNIST [34]. The Iris dataset comprises three classes of iris plants, each with 50 instances. Each instance bears the features of sepal length, sepal width, petal length and petal width. Thus, the dataset contains 150 instances with 4 features (150 × 4).
The MNIST dataset consists of 70,000 samples of gray-scale handwritten images of digits (0–9). The training set contains 60,000 samples, while the test set contains 10,000 samples. Each gray-scale sample has 28 × 28 (784) pixels. Thus, the training set is (60,000 × 784) and the test set is (10,000 × 784).

9.2. Evaluation of SSXGB

First, we evaluate the regression effectiveness of the proposed SSXGB with the two datasets. The effectiveness is measured in terms of regression accuracy and loss.
Setup Configuration: We simulated the two servers, four (4) participants for the Iris dataset and sixteen (16) participants for the MNIST dataset. In each case, one of the participants serves as an LBP. We partitioned the datasets vertically and shared the features between the participants, i.e., for the Iris dataset, each participant held one (1) feature and for the MNIST dataset, each participant held forty-nine (49) features (Each image sample of MNIST contains 28   ×   28   =   784 pixels, and we consider each pixel to be a feature. Thus, each image sample contains 784 features. When partitioned equally among 16 participants, each participant ends with 49 features). Since we performed regression analysis, we considered only two classes. Thus, we take “Iris-setosa” as positive for the Iris dataset and the rest of the classes as negative. To make the dataset balanced, we synthetically generated 50 more instances of the “Iris-setosa” class. Meanwhile, for the MNIST dataset, we take the digits 0–4 as positive and 5–9 as negative. During each simulation, we set the learning rate as 0.08, and the sampling rate as 0.8 for both samples and features. We set maximum depths as 6 and 4 for the MNIST and Iris datasets experiments, respectively. We employed the approximated sigmoid function in [35].
Results: Figure 3 presents the accuracy and loss against the training round for the Iris and MNIST datasets. We can observe that the proposed SSXGB converges more slowly than XGBoost and is slightly less effective, i.e., an accuracy of 0.9789 vs. 0.9826 on the MNIST dataset and 0.97712 vs. 0.97778 on the Iris dataset. The accuracy loss in SSXGB can be attributed to the use of the approximated sigmoid function and the restrictive nature of training the first tree. However, the slight loss in accuracy is acceptable.

9.3. Privacy-Preservation Computation Overhead

Since HE is crucial for privacy preservation in our proposed SSXGB, identifying a suitable configuration for the BCP scheme was paramount. Thus, we implemented the BCP scheme and measured the computation times of its algorithms and sub-protocols under different key sizes. For each operation, the experiment was repeated 20 times and we obtained the average computation times.
Results: The results are shown in Table 1. We can see that the computation times increase with the increase in key size for all the algorithms and sub-protocols. Amongst the BCP algorithms, the mDec is computationally the most demanding, while encryption is computationally the least demanding. Meanwhile, amongst the sub-protocols, the Mult is computationally the most demanding while KeyProd is computationally the least demanding. For the KeyProd, we only considered a joint public key from two users and the joint public key generation does not involve encryption of data as in [28]. Furthermore, for the TransDec, we considered only two users. Based on the shown computation times and security requirements, we considered the key size of 1024 during the implementation of the SSXGB.
Much as the HE provides the required security and utility, it significantly increases the computation overhead. Adopting a differential privacy mechanism [36] can be considered a possible solution, so long as the right balance between utility and security is identified, which can be challenging.

9.4. Efficiency of SSXGB

Finally, we investigate the efficiency of the proposed SSXGB by examining its running time. We performed the investigation under a varying number of participants. We considered the running time for training using the two datasets. Figure 4 shows the running time of training using the MNIST and Iris datasets against the varying number of participants. Since we intended to examine the running time comparatively for the datasets, and the Iris dataset has only 4 features, we limited the number of participants to only 4. In Figure 4, it can be observed that there is no significant change in running time against the varying number of participants. We can attribute this to the number of features and samples remaining almost the same under the vertically partitioned datasets; hence, the number of feature and sample operations remain fairly constant. However, the running time is higher for the MNIST dataset compared to the Iris dataset. This is mainly because the MNIST dataset has more features and samples as compared to the Iris dataset.

10. Conclusions

In this work, we proposed the SSXGB framework for scalable and privacy-preserving multi-party gradient tree boosting over vertically partitioned datasets with outsourced computations. We adopted BCP HE for secure computations and proposed sub-protocols based on the BCP HE for two non-linear operations of gradient tree boosting. Analysis of the framework shows that only minimal information is leaked to the entities under the semi-honest and non-colluding security model. However, the benefits of the framework outweigh this weakness. We also implemented the secure training algorithm of the SSXGB framework. We performed experiments using two real-world datasets. The results show that the SSXGB is scalable and reasonably effective. In the future, we shall aim to minimize the communication overhead of the proposed framework and perform more complicated experiments on more real-world datasets.

Author Contributions

Conceptualization, K.E.; methodology, K.E. and J.W.K.; software, J.W.K.; validation, K.E. and J.W.K.; formal analysis, K.E.; investigation, K.E. and J.W.K.; resources, J.W.K.; data curation, K.E. and J.W.K.; writing—original draft preparation, K.E.; writing—review and editing, J.W.K.; visualization, K.E.; supervision, J.W.K.; project administration, J.W.K.; funding acquisition, J.W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF2020R1F1A1072622).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Details of the LGT Sub-Protocol

The LGT sub-protocol is an adaptation of the SLT sub-protocol in [29]. The only difference is that we intentionally reveal the result of the comparison to server C. An illustration of the sub-protocol is shown in Figure A1. Given two ciphertexts ⟦m_1⟧ and ⟦m_2⟧ encrypted under the joint public key pk_Σ, the following procedure is collaboratively performed by servers C and S to determine whether m_1 ≥ m_2 or m_1 < m_2.
Figure A1. The LGT sub-protocol.
As in [29], server C first uses the Exp sub-protocol to multiply the ciphertexts by two, obtaining ⟦2m_1⟧ and ⟦2m_2⟧. Server C also encrypts 1 using the Enc algorithm as ⟦1⟧ and adds the result to ⟦2m_1⟧, obtaining ⟦2m_1 + 1⟧.
Server C then flips a coin s. If s = 1, server C uses the Sub sub-protocol to compute ⟦l⟧ as (⟦2m_1 + 1⟧ − ⟦2m_2⟧). Otherwise, ⟦l⟧ is computed as (⟦2m_2⟧ − ⟦2m_1 + 1⟧). Server C then sends ⟦l⟧ to server S.
Next, server S uses the mDec algorithm to decrypt ⟦l⟧ as l. L(l) is then computed by server S as in [29] (L(x) denotes the bit length of x; see [29] for the details). If L(l) > L(N)/2, server S sets u = 1; otherwise u = 0. Server S then returns u to server C.
Next, server C checks s: if s = 1, server C sets u* = u; otherwise, u* = 1 − u. If u* = 0, then m_1 ≥ m_2; otherwise m_1 < m_2.
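The wrap-around test at the heart of LGT can be illustrated in plaintext as follows. Here N is just a stand-in modulus of our choosing, and the messages are assumed to be much shorter than half the bit length of N, as required by the SLT construction.

```python
import random

N = 1 << 1024                               # stand-in modulus for the demo
def lgt_plain(m1, m2):
    s = random.randint(0, 1)                # Server C's coin flip
    l = (2*m1 + 1 - 2*m2) % N if s == 1 else (2*m2 - 2*m1 - 1) % N
    u = 1 if l.bit_length() > N.bit_length() // 2 else 0   # Server S's bit-length test
    u_star = u if s == 1 else 1 - u         # Server C removes the coin's effect
    return 1 if u_star == 1 else 0          # 1 means m1 < m2, 0 means m1 >= m2

assert lgt_plain(17, 42) == 1 and lgt_plain(42, 17) == 0 and lgt_plain(5, 5) == 0
```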

Appendix B. Proof of Correctness for the Div Sub-protocol

Consider the division of m_1 by m_2 (m_1 ÷ m_2) using the Div sub-protocol. For the purpose of this proof, we assume ⟦m_1⟧ and ⟦m_2⟧ are encrypted under the same public key. First, server C randomly selects τ_1, τ_2 ∈ Z_N. Using the Exp sub-protocol, server C generates ⟦τ_1 m_1⟧, ⟦τ_2 m_2⟧ and ⟦τ_1 m_2⟧ as follows:

⟦τ_1 m_1⟧ = (g^{τ_1 r_1} mod N², h^{τ_1 r_1}(1 + τ_1 m_1 N) mod N²)
⟦τ_2 m_2⟧ = (g^{τ_2 r_2} mod N², h^{τ_2 r_2}(1 + τ_2 m_2 N) mod N²)
⟦τ_1 m_2⟧ = (g^{τ_1 r_3} mod N², h^{τ_1 r_3}(1 + τ_1 m_2 N) mod N²)

Using the Add sub-protocol, server C computes ⟦τ_1 m_1 + τ_2 m_2⟧ as:

⟦τ_1 m_1 + τ_2 m_2⟧ = (g^{(τ_1 r_1 + τ_2 r_2)} mod N², h^{(τ_1 r_1 + τ_2 r_2)}(1 + (τ_1 m_1 + τ_2 m_2)N) mod N²)

Server C then sends ⟦τ_1 m_2⟧ and ⟦τ_1 m_1 + τ_2 m_2⟧ to server S. Using the mDec algorithm, server S decrypts ⟦τ_1 m_2⟧ and ⟦τ_1 m_1 + τ_2 m_2⟧ as τ_1 m_2 and τ_1 m_1 + τ_2 m_2, respectively. In plaintext, server S then computes (m_1/m_2 + τ_2/τ_1) as:

(m_1/m_2 + τ_2/τ_1) = (1/(τ_1 m_2)) · (τ_1 m_1 + τ_2 m_2)

Next, server S encrypts (m_1/m_2 + τ_2/τ_1) to ⟦m_1/m_2 + τ_2/τ_1⟧ as:

⟦m_1/m_2 + τ_2/τ_1⟧ = (g^r mod N², h^r(1 + (m_1/m_2 + τ_2/τ_1) · N) mod N²)

and sends the result to server C.
After receiving ⟦m_1/m_2 + τ_2/τ_1⟧, server C computes τ_2/τ_1 in plaintext and encrypts it as ⟦τ_2/τ_1⟧. Server C extracts ⟦m_1/m_2⟧ using the Sub sub-protocol as:

⟦m_1/m_2⟧ = Sub(⟦m_1/m_2 + τ_2/τ_1⟧, ⟦τ_2/τ_1⟧).

References

  1. Gong, M.; Feng, J.; Xie, Y. Privacy-enhanced multi-party deep learning. Neural Netw. 2020, 121, 484–496.
  2. Cock, M.d.; Dowsley, R.; Nascimento, A.C.; Newman, S.C. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, Denver, CO, USA, 12–16 October 2015; pp. 3–14.
  3. Hall, R.; Fienberg, S.E.; Nardi, Y. Secure multiple linear regression based on homomorphic encryption. J. Off. Stat. 2011, 27, 669.
  4. Kim, M.; Song, Y.; Wang, S.; Xia, Y.; Jiang, X. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Med. Inform. 2018, 6, e19.
  5. Aono, Y.; Hayashi, T.; Phong, L.T.; Wang, L. Privacy-preserving logistic regression with distributed data sources via homomorphic encryption. IEICE Trans. Inf. Syst. 2016, 99, 2079–2089.
  6. Zheng, L.; Chen, C.; Liu, Y.; Wu, B.; Wu, X.; Wang, L.; Wang, L.; Zhou, J.; Yang, S. Industrial scale privacy preserving deep neural network. arXiv 2020, arXiv:2003.05198.
  7. Phong, L.T.; Phuong, T.T. Privacy-preserving deep learning via weight transmission. IEEE Trans. Inf. Forensics Secur. 2019, 14, 3003–3015.
  8. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19.
  9. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  10. Minastireanu, E.A.; Mesnita, G. Light gbm machine learning algorithm to online click fraud detection. J. Inform. Assur. Cybersecur. 2019, 2019, 263928.
  11. Punmiya, R.; Choe, S. Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing. IEEE Trans. Smart Grid 2019, 10, 2326–2329.
  12. Wang, Y.; Feng, D.; Li, D.; Chen, X.; Zhao, Y.; Niu, X. A mobile recommendation system based on logistic regression and gradient boosting decision trees. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1896–1902.
  13. Li, Q.; Wen, Z.; He, B. Practical federated gradient boosting decision trees. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4642–4649.
  14. Ong, Y.J.; Zhou, Y.; Baracaldo, N.; Ludwig, H. Adaptive Histogram-Based Gradient Boosted Trees for Federated Learning. arXiv 2020, arXiv:2012.06670.
  15. Liu, Y.; Ma, Z.; Liu, X.; Ma, S.; Nepal, S.; Deng, R. Boosting privately: Privacy-preserving federated extreme boosting for mobile crowdsensing. arXiv 2019, arXiv:1907.10218.
  16. Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; Yang, Q. Secureboost: A lossless federated learning framework. arXiv 2019, arXiv:1901.08755.
  17. Feng, Z.; Xiong, H.; Song, C.; Yang, S.; Zhao, B.; Wang, L.; Chen, Z.; Yang, S.; Liu, L.; Huan, J. Securegbm: Secure multi-party gradient boosting. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 1312–1321.
  18. Fang, W.; Chen, C.; Tan, J.; Yu, C.; Lu, Y.; Wang, L.; Wang, L.; Zhou, J.; Liu, A.X. A hybrid-domain framework for secure gradient tree boosting. arXiv 2020, arXiv:2005.08479.
  19. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  20. Phong, L.T.; Aono, Y.; Hayashi, T.; Wang, L.; Moriai, S. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Secur. 2017, 13, 1333–1345.
  21. Shokri, R.; Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1310–1321.
  22. Leung, C. Towards Privacy-Preserving Collaborative Gradient Boosted Decision Tree Learning. 2020. Available online: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-100.pdf (accessed on 4 August 2021).
  23. Yang, M.; Song, L.; Xu, J.; Li, C.; Tan, G. The tradeoff between privacy and accuracy in anomaly detection using federated XGBoost. arXiv 2019, arXiv:1907.07157.
  24. Wang, Z.; Yang, Y.; Liu, Y.; Liu, X.; Gupta, B.B.; Ma, J. Cloud-based federated boosting for mobile crowdsensing. arXiv 2020, arXiv:2005.05304.
  25. Deforth, K.; Desgroseilliers, M.; Gama, N.; Georgieva, M.; Jetchev, D.; Vuille, M. XORBoost: Tree boosting in the multiparty computation setting. Cryptol. ePrint Arch. Report 2021/432. 2021. Available online: https://eprint.iacr.org/2021/432 (accessed on 15 June 2022).
  26. Carpov, S.; Deforth, K.; Gama, N.; Georgieva, M.; Jetchev, D.; Katz, J.; Leontiadis, I.; Mohammadi, M.; Sae-Tang, A.; Vuille, M. Manticore: Efficient framework for scalable secure multiparty computation protocols. Cryptol. ePrint Arch. Paper 2021/200. 2021. Available online: https://eprint.iacr.org/2021/200 (accessed on 15 June 2022).
  27. Bresson, E.; Catalano, D.; Pointcheval, D. A simple public-key cryptosystem with a double trapdoor decryption mechanism and its applications. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security, Taipei, Taiwan, 30 November–4 December 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 37–54.
  28. Peter, A.; Tews, E.; Katzenbeisser, S. Efficiently outsourcing multiparty computation under multiple keys. IEEE Trans. Inf. Forensics Secur. 2013, 8, 2046–2058.
  29. Liu, X.; Deng, R.H.; Choo, K.R.; Weng, J. An efficient privacy-preserving outsourced calculation toolkit with multiple keys. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2401–2414.
  30. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. Available online: http://proceedings.mlr.press/v54/mcmahan17a (accessed on 20 August 2021).
  31. Truex, S.; Baracaldo, N.; Anwar, A.; Steinke, T.; Ludwig, H.; Zhang, R.; Zhou, Y. A hybrid approach to privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, London, UK, 15 November 2019; pp. 1–11.
  32. Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H.B.; Patel, S.; Ramage, D.; Segal, A.; Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 1175–1191.
  33. Fisher, R.A. Iris Data Set. 1936. Available online: https://archive.ics.uci.edu/ml/datasets/iris (accessed on 7 June 2021).
  34. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  35. Chen, T.; Zhong, S. Privacy-preserving backpropagation neural network learning. IEEE Trans. Neural Netw. 2009, 20, 1554–1564.
  36. Kim, J.W.; Edemacu, K.; Kim, J.S.; Chung, Y.D.; Jang, B. A survey of differential privacy-based techniques and their applicability to location-based services. Comput. Secur. 2021, 111, 102464.
Figure 1. Illustration of the Div sub-protocol.
Figure 2. Entities of our proposed framework.
Figure 3. Training convergence with the Iris and MNIST datasets.
Figure 4. Running time of the SSXGB with the Iris and MNIST datasets.
Table 1. Computation times of BCP algorithms and sub-protocols.

Operation    512-bit key    1024-bit key    2048-bit key
Enc          10.61 ms       12.02 ms        16.98 ms
Dec          18.27 ms       29.13 ms        38.64 ms
mDec         29.34 ms       40.96 ms        63.72 ms
KeyProd      0.02 ms        0.05 ms         0.11 ms
Add          8.86 ms        10.23 ms        12.61 ms
Mult         180.74 ms      259.97 ms       310.03 ms
TransDec     161.20 ms      196.69 ms       283.74 ms
Exp          2.04 ms        2.32 ms         4.01 ms
Sub          8.93 ms        12.06 ms        13.02 ms
LGT          120.17 ms      143.72 ms       171.05 ms
Div          140.29 ms      183.18 ms       251.67 ms