Article

Improved Design and Application of Security Federation Algorithm

School of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(6), 1375; https://doi.org/10.3390/electronics12061375
Submission received: 17 February 2023 / Revised: 8 March 2023 / Accepted: 11 March 2023 / Published: 13 March 2023
(This article belongs to the Special Issue Intelligent Analysis and Security Calculation of Multisource Data)

Abstract

(1) Background: The geographical and environmental factors involved in each local model can affect the accuracy and practicability of the security federation model. To avoid this, a corresponding weight is set for each local model, and the local model parameters and the weights participate in the calculation together. (2) Methods: The improved model is applied to the income evaluation of taxi drivers. Multiple linear regression is used to fit the local model parameters, and the loss function value is calculated. Then, according to the improved security federation algorithm, the model parameters and the local model weights are encrypted with the Paillier homomorphic encryption algorithm, and the encrypted parameter information is uploaded to the aggregation server for the aggregate average. (3) Results: The experimental results show that after 1000 iterations, the accuracy curve converges in the interval [0.93, 0.97]; the mean accuracy is 94.27%, and the mean loss function value is 1.0886. Under the same conditions, the mean loss function value calculated by the traditional model is 1.9910. (4) Conclusions: Both the model and the data show that the accuracy of the improved model is higher and that it better reflects the income of taxi drivers.

1. Introduction

With the continuous evolution of new business forms and new service modes, large Internet companies have accumulated a large amount of data through collection, retention, exchange, and derivation in the process of serving users. In the ubiquitous network environment, frequent cross-border, cross-system, and cross-ecosystem interaction of data has become the norm, driven by information services. This increases the intentional or unintentional retention of private information in different information systems and results in information leakage [1]. If data are continually exposed, people gradually become “transparent people”. To prevent information leakage, many enterprises, organizations, and individuals use data only within their own domains; they isolate the large amounts of data they store and gradually form “data islands” [2]. This increasingly common phenomenon inhibits progress toward “intelligent interconnection of all things and ubiquitous sharing of information”. Therefore, the problem of data security urgently needs to be solved [3]. Trusted federated learning addresses the data protection problem first. Trusted federated learning is a secure and trusted multiparty machine learning paradigm that protects privacy well when solving practical engineering problems. However, within the federated learning framework, each participant or local model operates in a different scenario (economic environment, geographical location, population density, etc.), which creates differences between the models and gives them certain “external characteristics”; this poses a threat and a hidden danger to data protection. Therefore, how to adjust the federated learning algorithm to ensure that the model information is not attacked while the loss function is minimized has become the bottleneck of research.
In recent years, scholars at home and abroad have become increasingly enthusiastic about federated learning, and advocacy for it continues to grow. In 2019, Professor Yang Qiang proposed the concept of federated learning, including horizontal federated learning, vertical federated learning, and federated transfer learning [4]. To improve the accuracy of ground vehicle GPS positioning, Devon established a horizontal federated learning model and compared the averaging method, the maximum likelihood estimation method, the covariance trajectory optimization method, and the covariance intersection method; as the number of systems increases, the position error decreases (the loss function converges) and higher robustness is obtained [5]. Baranda used horizontal federated learning to describe and categorize athletes’ spine postures and to identify types of sagittal spine morphologies [6]. Because 5G networks process edge computing resources under power supply constraints, Baghban proposed a joint multidimensional fractional knapsack algorithm to reduce operating costs, saving 50% of energy consumption compared with nonfederated algorithms [7]. Chen Shaoqi proposed FL-QSAR, a prototype platform for collaborative drug discovery QSAR modeling based on federated learning; using this collaborative privacy-preserving learning framework, he broke down barriers between pharmaceutical institutions and promoted cooperative drug development [8]. Liu Yijing used the distributed learning model of federated learning to propose an efficient RAN slicing device association scheme that improves network throughput while reducing switching costs; the scheme achieves significant performance improvements in network throughput and communication efficiency [9]. Topaloglu proposed a model to assess data contributions in federated learning implementations, seeking convenience and adaptability with respect to computational constraints and costs [10]. Zhu Hangyu proposed a neural architecture search method based on reinforcement learning, evolutionary algorithms, and gradients [11]. Rajendran showed that the specific processes for improving performance vary with the ML model and with how federated learning is implemented, and demonstrated that the order of institutions during training does affect the overall performance improvement [12]. Subramanya designed an exploratory visual analysis system for the horizontal federated learning process that assesses each client’s contribution, and verified the effectiveness of the system [13]. Zhu Xiao proposed a horizontal federated PCA differential privacy data publishing algorithm, which effectively protects the privacy of local data and published data and has higher usability than similar algorithms [14]. Lau proposed a multicriteria risk assessment system based on federated learning, which combines federated learning with the best–worst method (BWM), measures enterprise cold chain risk under a proposed risk hierarchy, and enhances autonomy in assessing cold chain risk [15]. Liu Yinghui proposed an asynchronous convergent federated learning method that considers staleness coefficients; a blockchain network was used instead of the classic central server to aggregate a global model with high accuracy [16].
Ge Ning designed federated support vector machine and random forest algorithm models, built a production line fault prediction model based on federated learning, and showed that federated learning can replace centralized learning for fault prediction [17]. The above studies all illustrate the differences between edge computing and federated computing and demonstrate applications of federated learning.
To study the income of yellow taxi drivers in New York City, the traditional horizontal federated learning algorithm is selected. However, because there are certain differences among the local models, letting each local model participate directly in the global evaluation would reduce the universality of the model. How to adjust the security federation model to minimize the loss function has therefore become a bottleneck problem in the research.
Based on horizontal federated learning, this paper improves the security federation algorithm. Before the local model parameters are uploaded to the aggregation server, a weight is added for each participant, which not only eliminates the differentiation of the model parameters but also provides a “secondary protection” for the model data. The weight is encrypted together with the local model parameters and uploaded to the aggregation server. The weighted average of the global model parameters is based on the reweighting of the local model parameters, realizing a standardized description of the practical problem. The improved security federation model is then applied to the income evaluation of yellow taxi drivers in New York City, and its advantages are demonstrated by comparing its loss function value with that of the traditional model.

2. Horizontal Federated Learning Framework

Horizontal federated learning refers to the case in which the data sets of all participants share the same feature space but different sample spaces, similar to the horizontal partitioning of data in a table view. For example, two urban commercial banks in different regions have customer groups with only a small intersection: the IDs in their data sets differ, but the business types are very similar, so the feature spaces of their data sets are the same [4]. The formal description is
$$x_i = x_j,\quad y_i = y_j,\quad I_i \neq I_j,\qquad \forall D_i, D_j,\; i \neq j$$
where $D_i$ and $D_j$ represent the data sets owned by party $i$ and party $j$, respectively. We assume that the feature space and label space pairs of the two parties, namely $(x_i, y_i)$ and $(x_j, y_j)$, are the same, whereas the customer ID spaces of the two parties do not intersect, or their intersection is very small. The specific feature space and sample space of horizontal federated learning are shown in Figure 1:

2.1. Horizontal Federated Learning Training Process

A typical client–server architecture for a horizontal federated learning system is shown in Figure 2.
The training process of a horizontal federated learning system is usually divided into four steps [18]:
Step 1: Each participant calculates the gradient of the model locally, uses encryption technologies such as homomorphic encryption, differential privacy, or secret sharing to mask the gradient information, and sends the masked results to the aggregation server.
Step 2: The server performs security aggregation operations (such as using a weighted average based on homomorphic encryption).
Step 3: The server sends the aggregated results to all participants.
Step 4: Each participant decrypts the gradient after receiving it and updates their model parameters with the decrypted gradient results.
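To make these four steps concrete, the following Python sketch runs one such training round for a simple squared-error model. The Client class, the plain averaging that stands in for secure aggregation, and the omission of the masking step are simplifying assumptions for illustration, not the protocol itself.

import numpy as np

class Client:
    def __init__(self, X, y):
        self.X, self.y = X, y
    def local_gradient(self, w):
        # Step 1: gradient of a squared-error model on this client's local data
        err = self.X @ w - self.y
        return self.X.T @ err / len(self.y)

def training_round(clients, w, eta=0.1):
    # Step 1: each client computes its gradient (masking with homomorphic encryption,
    # differential privacy, or secret sharing is omitted in this plain sketch)
    grads = [c.local_gradient(w) for c in clients]
    # Step 2: the server aggregates; a simple average stands in for secure aggregation
    agg = np.mean(grads, axis=0)
    # Steps 3-4: the aggregate is broadcast and every client applies the same update
    return w - eta * agg

# Synthetic example data, purely for demonstration
rng = np.random.default_rng(0)
clients = [Client(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(100):
    w = training_round(clients, w)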

2.2. Global Model

The performance of the horizontal federated learning model is obtained by testing all participants on the test data set. The model performance can be expressed as precision, accuracy, recall, etc. [19].
During and after the model training of horizontal federated learning, the global model performance is obtained as follows:
Step 1: The $k$-th participant uses its local test data set to evaluate the existing horizontal federated learning model. For a binary classification task, this step produces the local model test result $(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)$; every party $k = 1, 2, \ldots, K$ performs this operation.
Step 2: The $k$-th participant sends the local model prediction result $(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)$ to the coordinator; every party $k = 1, 2, \ldots, K$ performs this operation.
Step 3: After collecting the local model prediction results $\{(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)\}_{k=1}^{K}$ of the $K$ participants, the coordinator can calculate the global model performance. For example, for a binary classification task, the global recall rate can be calculated as $\sum_{k=1}^{K} N_{TP}^k \big/ \sum_{k=1}^{K}\left(N_{TP}^k + N_{FN}^k\right)$.
Step 4: The coordinator sends the calculated global model performance to all participants.
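As an illustration of Step 3, the snippet below combines per-participant confusion-matrix counts into the global recall; the counts themselves are made-up example values, not experimental data.

# Made-up example counts per participant (TP, FP, TN, FN)
local_results = [
    {"TP": 80, "FP": 10, "TN": 95, "FN": 15},   # participant 1
    {"TP": 60, "FP": 12, "TN": 70, "FN": 8},    # participant 2
    {"TP": 55, "FP": 9,  "TN": 88, "FN": 11},   # participant 3
]
global_recall = (sum(r["TP"] for r in local_results)
                 / sum(r["TP"] + r["FN"] for r in local_results))
print(f"global recall = {global_recall:.4f}")   # ~0.8515 for these example counts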

2.3. Federated Averaging

The federated averaging algorithm applies to any finite-sum loss function of the following form:
$$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$
where $n$ represents the number of training data points, and $w \in \mathbb{R}^d$ represents the model parameters of dimension $d$.
For machine learning, $f_i(w) = l(x_i, y_i; w)$ is generally selected, where $l(x_i, y_i; w)$ represents the loss obtained by predicting sample $(x_i, y_i)$ with the given model parameter $w$, and $x_i$ and $y_i$ represent the $i$-th training data point and its label, respectively.
Suppose there are $K$ participants. In horizontal federated learning, let $D_k$ represent the data set owned by the $k$-th participant and $P_k$ represent the index set of data points held by client $k$.
Let $n_k = |P_k|$ represent the cardinality of $P_k$, so the $k$-th participant holds $n_k$ data points. When there are $K$ participants in total, the loss function is
$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$$
$$F_k(w) = \frac{1}{n_k}\sum_{i \in P_k} f_i(w)$$
For many models, the computational cost is small compared with the communication cost, so additional computation can be used to reduce the number of communication rounds required to train the model. There are two main ways to increase computation [4]:
Increase parallelism: let more participants train the model independently between communication rounds.
Increase computation at each participant: each participant performs more complex computation between two communication rounds, such as multiple local model update iterations, rather than just a simple computation such as a gradient calculation on a single batch.
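The following FedAvg-style sketch illustrates both ideas: each participant runs several local epochs of gradient descent between communication rounds, and the server averages the returned parameters weighted by $n_k/n$. The function names, learning rate, and epoch count are illustrative assumptions rather than the algorithm's prescribed settings.

import numpy as np

def local_update(X, y, w, eta=1e-2, epochs=5):
    # Extra local computation between communication rounds (multiple local iterations).
    # eta and epochs are illustrative hyperparameters.
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - eta * grad
    return w

def fedavg_round(datasets, w_global, eta=1e-2, epochs=5):
    # datasets: list of (X_k, y_k) pairs, one per participant
    n = sum(len(y) for _, y in datasets)
    w_new = np.zeros_like(w_global)
    for X, y in datasets:
        w_k = local_update(X, y, w_global.copy(), eta, epochs)
        w_new += (len(y) / n) * w_k             # weight each participant by n_k / n
    return w_new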

3. Scheme Design Based on Weighted Federated Averaging

3.1. Objectives and Requirements

To better protect local data, the improved model described in this section takes into account the user data quality and potential security threats in the aggregation process, introduces weight coefficients into the traditional federated average algorithm, and proposes a new client–server architecture and scheme based on a weighted federated average.
In theory, the performance of a model, and whether it will be adopted, is evaluated from two aspects: accuracy and recall. Considering the particularity of the model and its application scenarios, the improved client–server framework of horizontal federated learning based on a weighted average should pay attention to privacy, efficiency, and security.

3.2. Specific Scheme Design

The newly introduced weight is calculated according to the contribution of the model parameters trained by the user and is modified according to the number of training iterations and the learning rate of each iteration. Therefore, the weight parameters obtained for the same data trained under different models differ. The specific improvement scheme is as follows.

3.2.1. Identification and Authentication

Identity authentication is an important prerequisite for the whole training process and accompanies it throughout. The key information issued by a third party to the participants and the aggregation server allows the server to determine which clients are participating in the training and to avoid malicious and adversarial attacks that would affect the global model training. The participant’s user identity information or data characteristics will exhibit multiple groups of characteristic attributes. The key pair used between the user and the server is generated by the key generation center under an identity-based cryptosystem [16].

3.2.2. Model Initialization

After identity authentication and key-pair distribution are completed, the aggregation server broadcasts the initialized local model parameters to all authenticated participants and determines the training model and goal of each user. Each participant trains its model with the gradient descent method to complete the local update of the parameters. In the data aggregation process, because of differences in data quality, the weight parameters act as a “seasoning agent” that lets the model parameters play their full role.
This scheme involves protecting the participants’ data. Because the whole process requires the data to be encrypted and then the encrypted ciphertexts to be added or multiplied, the Paillier algorithm is adopted in this section. The improved scheme proposed here adds weights, and the aggregation method is a weighted summation of the data parameters; the Paillier algorithm is exactly such an additively homomorphic encryption algorithm.
The encryption and decryption key for Paillier homomorphic encryption needs to be generated during initialization:
(a)
Key generation
Select two large prime numbers $p, q$.
Calculate $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$.
Select an integer $g \in \mathbb{Z}_{N^2}^{*}$ such that $\gcd\left(L(g^{\lambda} \bmod N^2),\, N\right) = 1$, that is, the two are coprime, where $L(u) = \frac{u-1}{N}$.
The public key is $(N, g)$; the private key is $\lambda$.
(b)
Encryption (using public key)
Select a random number; r Z N * is the random source of probabilistic encryption.
Plaintext m corresponds to ciphertext:
$$c = \mathrm{Enc}(m, r) = g^{m} r^{N} \bmod N^2$$
(c)
Decryption (using private key)
The plaintext $m$ corresponding to ciphertext $c$ is
$$m = \mathrm{Dec}(c) = \frac{L\left(c^{\lambda} \bmod N^2\right)}{L\left(g^{\lambda} \bmod N^2\right)} \bmod N$$
The above is the specific process of key generation, encryption, and decryption with the Paillier additively homomorphic encryption algorithm. $\mathrm{lcm}(a, b)$ denotes the least common multiple of $a$ and $b$, and $\gcd(a, b)$ denotes the greatest common divisor of $a$ and $b$.
The security of the algorithm rests on the difficulty of factoring large integers, that is, on the infeasibility of factoring the large integer $N$ to obtain its two prime factors $p, q$.
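The following is a minimal, illustrative Python sketch of the key generation, encryption, and decryption steps above. The tiny primes, the choice g = N + 1, and all function names are assumptions made for readability; a real deployment would use large random primes and a vetted cryptographic library rather than this toy code (Python 3.8+ is assumed for the modular inverse via pow).

import math
import random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def L(u, N):
    return (u - 1) // N

def keygen(p=101, q=113):
    # Toy primes for illustration only; real keys use primes of 1024 bits or more.
    N = p * q
    lam = lcm(p - 1, q - 1)
    g = N + 1                                   # a common valid choice of g in Z*_{N^2}
    mu = pow(L(pow(g, lam, N * N), N), -1, N)   # modular inverse of L(g^lambda mod N^2)
    return (N, g), (lam, mu)

def encrypt(m, pub):
    N, g = pub
    while True:
        r = random.randrange(1, N)              # random source of probabilistic encryption
        if math.gcd(r, N) == 1:
            break
    return (pow(g, m, N * N) * pow(r, N, N * N)) % (N * N)   # c = g^m * r^N mod N^2

def decrypt(c, pub, priv):
    N, _ = pub
    lam, mu = priv
    return (L(pow(c, lam, N * N), N) * mu) % N  # m = L(c^lambda mod N^2) / L(g^lambda mod N^2) mod N

pub, priv = keygen()
assert decrypt(encrypt(42, pub), pub, priv) == 42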

3.2.3. Training Local Model

After the model is initialized, the participants use their local data sets to compute the local optimum of the model according to the model parameters broadcast by the aggregation server. Before local model training, the aggregation server broadcasts to all participants the initialization parameter group $(W_g^0, \eta, T)$ and the objective function $f(x)$:
$$f_{w,b}(x) = \sum_{k=1}^{K}\left(w_k x_k + b_0\right)$$
where $K$ represents the number of features in the user's data, $b_0$ is the intercept (threshold), $W_g^0$ represents the initialization parameters, $\eta$ is the learning rate, and $T$ is the number of iterations. The loss function $L(f_{w,b}(x), y)$ is
$$L(f_{w,b}(x), y) = \frac{1}{2n}\sum_{i=1}^{n}\left(f_{w,b}(x_i) - y_i\right)^2 = \frac{1}{2n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}\left(w_k x_{i,k} + b_0\right) - y_i\right)^2$$
To make the model optimal, multiple iterations are needed to minimize the value of the loss function; at that point, the difference between the objective function value and the true value is smallest, the accuracy of the model is highest, and the local model parameters are optimal. The objective and its solution process are [17]
$$\arg\min_{w,b} L(f_{w,b}(x), y) = \arg\min_{w,b}\frac{1}{2n}\sum_{i=1}^{n}\left(f_{w,b}(x_i) - y_i\right)^2$$
Find the partial derivative and the gradient of the loss function:
$$g_{w_k} = \frac{\partial L(f_{w,b}(x), y)}{\partial w_k} = \frac{\partial}{\partial w_k}\left(\frac{1}{2n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}\left(w_k x_{i,k} + b_0\right) - y_i\right)^2\right) = \left(f_{w,b}(x) - y\right)\frac{\partial f_{w,b}(x)}{\partial w_k} = \left(f_{w,b}(x) - y\right)x_k$$
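As a concrete illustration of this local step, the following Python sketch fits the multiple linear regression by gradient descent on one participant's data. The function name, learning rate, and epoch count are illustrative assumptions, not the authors' implementation.

import numpy as np

def local_train(X, y, w, b, eta=1e-3, epochs=10):
    # X: (n, K) feature matrix; y: (n,) targets; w: (K,) weights; b: scalar intercept b_0
    # eta and epochs are illustrative hyperparameters
    n = len(y)
    for _ in range(epochs):
        pred = X @ w + b               # f_{w,b}(x)
        err = pred - y                 # f_{w,b}(x) - y
        w = w - eta * (X.T @ err) / n  # gradient step for each w_k: (f - y) * x_k, averaged
        b = b - eta * err.mean()       # gradient step for the intercept
    loss = 0.5 * np.mean((X @ w + b - y) ** 2)   # L(f_{w,b}(x), y)
    return w, b, loss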

3.2.4. Upload Local Model Parameters

Because the participants’ data quality differs, it is necessary to assign a weight to each user so that the “parameter advantage” can be brought into play in the federated aggregation of the global model. The weight is calculated as
$$\psi_i = \delta\sum_{j=1}^{d}\left(\frac{w_j^i - w_{g,j}^i}{\sigma_j}\right)^2$$
where $\psi_i$ represents the parameter weight of the user in the $i$-th iteration, $d$ represents the number of parameters in each group of data, $w_j^i - w_{g,j}^i$ represents the difference between the $j$-th local parameter and the $j$-th global parameter in the $i$-th iteration, $\sigma_j$ represents the standard deviation of the $j$-th parameter, and $\delta$ is used to adjust the weight value.
Next, using the key group received in the initialization stage, the user encrypts the weighted parameters with the Paillier homomorphic encryption algorithm to generate the ciphertext [20]:
$$[\![w_{i,j}^k]\!] = \mathrm{Enc}\left(\psi_i w_{i,j}^k,\, (g, N)\right)$$
Then, the ciphertext is uploaded to the aggregation server, and the aggregation server calculates the global model parameters. The aggregation server needs to verify the message source and record the time used by each user in this model training to select the aggregation object. After calculating the ciphertext, the participants need to construct a secure message group to complete the interaction of ciphertext [21]. The detailed structure is as follows:
$$M = S_K\left([\![w_{i,j}^k]\!] \,\|\, ID_i \,\|\, nonce\right)$$
Among them, $[\![w_{i,j}^k]\!]$ is the ciphertext of the weighted parameter, $ID_i$ is the user label, and $nonce$ is the fresh random number generated in the authentication and key negotiation stage.
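The client-side procedure in this subsection can be sketched as follows in Python, reusing the toy Paillier functions from Section 3.2.2. The fixed-point scaling factor, the message layout, and all names are illustrative assumptions (Paillier encrypts integers, so the weighted parameters are scaled and rounded before encryption).

import numpy as np

SCALE = 10**6   # fixed-point scaling factor (assumption); Paillier operates on integers

def client_weight(w_local, w_global, sigma, delta=1.0):
    # psi_i = delta * sum_j ((w_j - w_g,j) / sigma_j)^2
    return delta * float(np.sum(((w_local - w_global) / sigma) ** 2))

def build_message(w_local, psi, pub, client_id, nonce):
    # Encrypt each weighted parameter psi_i * w_{i,j} and bundle it with ID_i and the nonce
    # (message layout is an assumption made for illustration).
    cipher = [encrypt(int(round(psi * wj * SCALE)), pub) for wj in w_local]
    return {"cipher": cipher, "id": client_id, "nonce": nonce}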

3.2.5. Update Global Model Parameters

The aggregation server aggregates the ciphertext parameters $[\![w_{i,j}^t]\!]$ uploaded by the users, which were encrypted with the Paillier homomorphic encryption algorithm, and completes the calculation of the global parameters:
$$[\![w_{g,j}^t]\!] = \prod_{i=1}^{S}\mathrm{Enc}\left(\psi_i w_{i,j}^t\right) = \mathrm{Enc}\left(\psi_1 w_{1,j}^t\right)\cdot\mathrm{Enc}\left(\psi_2 w_{2,j}^t\right)\cdots\mathrm{Enc}\left(\psi_S w_{S,j}^t\right) = \mathrm{Enc}\left(\psi_1 w_{1,j}^t + \psi_2 w_{2,j}^t + \cdots + \psi_S w_{S,j}^t\right) = \mathrm{Enc}\left(\sum_{i=1}^{S}\psi_i w_{i,j}^t\right)$$
After this calculation, the global model has been trained for the round. To deal with differences in data quality, the weight parameter adjusts each participant’s proportion in the aggregate average. The aggregation server broadcasts the final global model results to the users; each user decrypts the ciphertext with the key distributed by the third party and updates the local model. If the training reaches the maximum time, the accuracy reaches the preset range, or the number of iterations reaches the upper limit, the training stops; otherwise, training continues. The specific flow chart is shown in Figure 3.
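A minimal server-side sketch of this aggregation, again reusing the toy Paillier functions and the message layout assumed above: multiplying the ciphertexts of the weighted parameters yields the encryption of their sum, which a participant holding the private key can later decrypt and rescale.

def aggregate_ciphertexts(messages, pub, j):
    # prod_i Enc(psi_i * w_{i,j}) = Enc(sum_i psi_i * w_{i,j})  (additive homomorphism)
    # messages are the dicts produced by build_message in the sketch above
    N = pub[0]
    acc = 1
    for msg in messages:
        acc = (acc * msg["cipher"][j]) % (N * N)
    return acc

# A participant later recovers the plaintext global parameter (up to the scaling factor):
# w_g_j = decrypt(aggregate_ciphertexts(messages, pub, j), pub, priv) / SCALE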

4. Experimental Analysis

4.1. Experimental Data

The data were obtained from the official website of the Taxi & Limousine Commission (TLC) in the United States. They consist of the order and trajectory data of yellow taxis in New York City during the first 14 days of January 2022; the yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data content is shown in Table 1.

4.2. Data Preprocessing

The original data were format-converted to eliminate erroneous, invalid, and redundant records. The “Total_amount” column was sorted and the rows with non-positive values in this column were deleted, because a total cost lower than 0 is invalid data. The row containing the abnormally high taxi fare (value: 401,095.62) was also deleted to ensure that the total cost lies within a reasonable range. Orders are indexed by their start time, and records whose order start time is not between 1 January 2022 and 14 January 2022 were excluded. After data preprocessing, the original 1,048,576 records were reduced to 1,036,791. The overall income of yellow taxis in New York City is shown in Figure 4.
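A small pandas sketch of these preprocessing rules, assuming a CSV export with the column names listed in Table 1; the file name is hypothetical.

import pandas as pd

df = pd.read_csv("yellow_taxi_jan2022.csv", parse_dates=["lpep_pickup_datetime"])  # hypothetical file name
df = df[df["Total_amount"] > 0]                      # drop rows with non-positive total cost
df = df[df["Total_amount"] != 401095.62]             # drop the single abnormally high fare
start_ok = (df["lpep_pickup_datetime"] >= "2022-01-01") & (df["lpep_pickup_datetime"] < "2022-01-15")
df = df[start_ok]                                    # keep orders that started 1-14 January 2022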

4.3. Building Local Models

The data used in this paper are the yellow taxi records of 14 days in New York City. Each day’s data constitute one local model, equivalent to one “client” in the federated computation, and the boundary between clients is determined by the order start time. The data for the whole day of 1 January were extracted, and the drivers’ income is shown in Figure 5.
Unlike in China, passengers in New York City pay various fees when taking taxis, such as tips, tolls, taxes, additional surcharges, congestion fees, and improvement fees. Because tips are highly uncertain, they are difficult to model; the other fees are small, fixed amounts that rarely affect the overall income level. Therefore, when establishing the local model, the order time, the number of passengers, the trip distance, and the actual taxi cost are fitted by multiple linear regression. The parameter information of the local models is shown in Table 2.
Among them, the intercept value in Table 2 is $b_0$, and the other values are the local model parameter weights $w_k$. The gradient descent method is adopted for model training because the gradient is
$$g_{w_k} = \left(f_{w,b}(x) - y\right)\frac{\partial f_{w,b}(x)}{\partial w_k} = \left(f_{w,b}(x) - y\right)x_k$$
Therefore, when the value of $f_{w,b}(x) - y$ approaches 0, the model reaches its optimal state. For the above local model parameters, the loss function $L(f_{w,b}(x), y)$ of each model is calculated by substituting the specific parameters into the loss function formula; the loss value of each model is shown in Table 3.

4.4. Local Parameter Upload

To overcome the differences between the local models and better protect the parameter information from being leaked, the server completes the training with stable participants and defines a dedicated weight for each participant so that every participant can take part in the federated aggregation on an equal footing. According to the weight calculation formula, the weight value of the first iteration of each client is obtained, as shown in Table 4.
According to the calculation formula of global model parameters:
$$[\![w_{g,j}^t]\!] = \prod_{i=1}^{S}\mathrm{Enc}\left(\psi_i w_{i,j}^t\right) = \mathrm{Enc}\left(\psi_1 w_{1,j}^t\right)\cdot\mathrm{Enc}\left(\psi_2 w_{2,j}^t\right)\cdots\mathrm{Enc}\left(\psi_S w_{S,j}^t\right) = \mathrm{Enc}\left(\sum_{i=1}^{S}\psi_i w_{i,j}^t\right),\qquad w_{g,j}^t \leftarrow \psi_1 w_{1,j}^t + \psi_2 w_{2,j}^t + \cdots + \psi_S w_{S,j}^t = \sum_{i=1}^{S}\psi_i w_{i,j}^t$$
the plaintext weighted parameters participating in the global model are obtained. After adjustment with $\delta$, the results are shown in Table 5.
For plaintexts $m_1$ and $m_2$, additive homomorphic encryption gives
$$E(m_1) = g^{m_1} r_1^{N} \bmod N^2$$
$$E(m_2) = g^{m_2} r_2^{N} \bmod N^2$$
$$E(m_1)\cdot E(m_2) = g^{m_1 + m_2}\left(r_1 r_2\right)^{N} \bmod N^2$$
$$E(m_1 + m_2) = g^{m_1 + m_2} r_1^{N} \bmod N^2$$
$$D\left(E(m_1)\cdot E(m_2)\right) = m_1 + m_2 = D\left(E(m_1 + m_2)\right)$$
The above is the additive homomorphic property. The participants’ weights and local model parameters in this paper are uploaded to the aggregation server, so according to the above property:
$$[\![w_i^k]\!] = E\left(\psi_i w_{i,j}^k,\, (g, N)\right) = g^{\psi_i w_i^t} r_i^{N} \pmod{N^2}$$
$$[\![w_i^k]\!]^{\lambda} = g^{\psi_i w_i^t \lambda}\, r_i^{\lambda N} \pmod{N^2}$$
where $\lambda$ is the private key. From the definition of the Paillier key, the plaintext is obtained:
$$w_i^k = \frac{L\left([\![w_i^k]\!]^{\lambda}\right)}{L\left(g^{\lambda}\right)} = \frac{N\alpha\,\psi_i w_i^t\,\lambda}{N\alpha\,\lambda} = \psi_i w_i^t$$
Because the public and private keys change in each iteration, the encrypted ciphertext values also change each time. In addition, the ciphertexts are very long, so only the ciphertexts of two parameter values are listed here, as shown in Table 6.
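As a small sanity check of this property, the toy Paillier sketch from Section 3.2.2 can be used to verify that the product of two ciphertexts decrypts to the sum of the plaintexts; the values below are illustrative only.

c1 = encrypt(7, pub)                      # illustrative plaintexts
c2 = encrypt(35, pub)
product = (c1 * c2) % (pub[0] ** 2)       # multiply ciphertexts modulo N^2
assert decrypt(product, pub, priv) == 42  # D(E(m1) * E(m2)) = m1 + m2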

4.5. Update Model Parameters

After the local model parameters and client weights are uploaded to the aggregation server, the aggregation server performs normal aggregation operations according to the protocol requirements and completes the aggregation calculation of local model parameters in the ciphertext state.
During each iteration, the algorithm selects a fraction $\rho$ of the participants. When $\rho = 1$, the gradients of all data held by all participants are used as a full batch. In the $t$-th iteration of the global model weight update, each selected participant $k$ calculates $g_k = \nabla F_k(w_t)$, that is, the average gradient of its local data under the model with current parameters $w_t$; the server then aggregates the gradients and calculates the loss function. The aggregation server completes the aggregate averaging operation and, according to the trusted key pair from the trusted third party, broadcasts the averaged parameters to each participant for updating each local model:
$$w_{t+1} \leftarrow w_t - \eta\sum_{k=1}^{K}\frac{n_k}{n}\, g_k$$
Among them, η represents the learning rate.
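A minimal sketch of this server-side update, assuming the per-client gradients g_k and data sizes n_k have already been collected; the function name is illustrative.

import numpy as np

def global_update(w_t, grads, sizes, eta):
    # w_{t+1} = w_t - eta * sum_k (n_k / n) * g_k
    n = sum(sizes)
    weighted = sum((n_k / n) * np.asarray(g_k) for g_k, n_k in zip(grads, sizes))
    return w_t - eta * weighted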

4.6. Iteration Results and Analysis

Firstly, the plaintext data participate in the calculation, and the plaintext data are then encrypted with Paillier homomorphic encryption. After encryption, the decrypted results of the calculation on the ciphertext correspond 100% to the plaintext results, which demonstrates the availability and accuracy of Paillier and is in line with the original idea of privacy computing. The aggregated and averaged ciphertexts are (in part)
21595496292932768166433183401932295686888324907433148045378518021694888033094776148514 **** 224136040722433222096253194275188298103856990576077698.
314223982137955375407799678571118500191580455433313129110711964229401836733344558 **** 6051796337913784636444060936490979695777088055406290942001076.
691412498812284842471944703714880581308999024752208726862352964760298736105 *** 48251591476989509945605198810645098425292042433130466743904791.
The plaintext data after decryption of the above ciphertext are 0.00020918338242142856, 0.11335807857142856, 2.330568062142857.
Next, in the whole iterative calculation process, 10 and 1000 iterations are selected for study, and their convergence, accuracy, and loss are discussed. Because ciphertext computation is expensive and the cost of data communication is high, the number of iterations is first set to 10. The iteration is shown in Figure 6.
The stars in Figure 6 represent the accuracy values of each iteration of the model. Convergence is clearly difficult to achieve within 10 iterations, but the accuracy of the global model is rising; by the 10th iteration, the accuracy has reached 72.31%. To see the convergence of the whole model intuitively, the number of iterations is changed to 1000; the result is shown in Figure 7.
In Figure 7, the stars represent the accuracy values of each iteration of the model; the model accuracy converges and reaches a very high level. Over the iterations, the accuracy fluctuates within the range [0.93, 0.97], with a maximum value of 96.24%, and the accuracy after the 1000th iteration is 94.27%. At the beginning of the iterations, the accuracy curve again shows an upward trend, which corresponds to Figure 6. To study the convergence rate of the model, the first 100 iterations of Figure 7 are shown in Figure 8.
As can be seen in Figure 8, where the stars again represent the accuracy values of each iteration, the accuracy of the model generally rises during the first 100 iterations, reflecting an increasingly close fit between the experimental values and the real values. The model calculation is judged to have converged after approximately 80 iterations.
Finally, calculate the loss function value, which represents the numerical loss of experimental data after iterative experiments. In the process of 1000 iterations, each iteration produces a loss value, and the distribution of all loss values is shown in Figure 9.
As can be seen from Figure 9, where the stars represent the loss value of each iteration, most of the loss values are distributed between 0.5 and 1.5, which is lower than that of the local models. To illustrate this advantage more intuitively, the average of the 1000 loss values is taken as the performance index after the model calculation. The traditional security federation model is calculated in the same way, that is, each local model is not assigned a weight and the local model parameters are directly encrypted to participate in the aggregate averaging. Since Figure 9 shows the good stability of the model, the average value of the loss function represents the loss level. The final average loss of the traditional model is 1.9910, and the average loss of the improved model is 1.0886, both lower than the loss values of the local model parameters. See Figure 10 for details.
In Figure 10, the hollow dots represent the average loss of 1.9910 calculated by the traditional security federation model, the hollow boxes represent the average loss of 1.0886 calculated by the improved security federation model, and the blue solid line represents the actual loss value of each local model. Numerically, the weighted, improved security federation model is lower than the traditional model and lower than every local model. By contrast, the loss of the traditional federated model fluctuates above and below the local model losses, and its average is close to theirs.
The experiments prove that, by adding and setting local model weights, the model under double encryption is closer to the optimal level of federated computing than traditional horizontal federated learning, is more helpful for predicting and evaluating the income of yellow taxi drivers in New York City, and better reflects the robustness and adaptability of the improved model.

5. Conclusions

This paper improved the security federation model within the horizontal federated learning framework: a weight value is added to the local model of each participant, the local model parameters are encrypted, and they are uploaded to the aggregation server for the subsequent calculation. Setting the weight value provides a “secondary protection” for the data and model parameters of each client and also eliminates the differences between local models. The improved secure federated averaging algorithm was presented in five parts: identification and authentication, model initialization, local model training, local parameter upload, and model parameter update. The algorithm was applied to the income of yellow taxi drivers in New York City: a global model was built for the 14-day income of the city’s taxi drivers, a local model was built for each day’s income, the aggregate average was computed on the aggregation server, and the gradient descent iterative method was adopted. The accuracy of the model reached its highest level and the loss reached its lowest value, so the model can better reflect the income of yellow taxi drivers in New York City. However, the research in this paper is not comprehensive enough. For example, when taking a taxi in the United States, passengers pay not only the fare but also tips, taxes, extra fees, and so on; modeling these is too uncertain to study the actual amount paid by passengers, so the research takes the driver’s income as its object. In addition, the original data contain many invalid and relatively discrete records, which may introduce some deviation into the research results; this will be further improved in the future.

Author Contributions

Conceptualization, X.Y. and J.X.; methodology, X.Y.; software, X.Y. and J.X.; validation, X.Y.; formal analysis, X.Y.; investigation, X.Y.; resources, Y.L.; data curation, X.Y.; writing—original draft preparation, X.Y.; writing—review and editing, Y.L.; visualization, X.Y. and T.H.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61972334).

Data Availability Statement

The data were selected from a public data set on the Kaggle website.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (No. 61972334).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Luan, Z.X. Research on the security of cloud computing data in the Internet era. Comput. Era 2021, 2021, 35–37+41. [Google Scholar]
  2. Yang, J.G. The Generative Mechanism and Management of Ethical Dilemmas of Privacy Protection in the Era of Big Data. Jiangsu Soc. Sci. 2021, 2021, 142–150+243. [Google Scholar]
  3. Xiong, C.L.; Tong, Y.Q. Boundary and balance: Data governance path and thinking in personal privacy information protection. Sci. Technol. Commun. 2021, 13, 64–68. [Google Scholar]
  4. Yang, Q.; Liu, Y.; Chen, T.J.; Tong, Y.X. Federated Machine Learning. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19. [Google Scholar] [CrossRef]
  5. Devon, D.; Holzer, T.; Sarkani, S. Innovation-Based Fusion of Multiple Satellite Positioning Systems for Minimizing Uncertainty. IEEE Syst. J. 2019, 13, 928–939. [Google Scholar] [CrossRef]
  6. Sainz De Baranda, P.; Cejudo, A.; Jesus Moreno-Alcaraz, V.; Teresa Martinez-Romero, M.; Aparicio-Sarmiento, A.; Santonja-Medina, F. Sagittal spinal morphotype assessment in 8 to 15 years old Inline Hockey players. PeerJ 2020, 8, E8229. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Baghban, H.; Huang, C.; Hsu, C. Resource provisioning towards OPEX optimization in horizontal edge federation. Comput. Commun. 2020, 158, 39–50. [Google Scholar] [CrossRef]
  8. Chen, S.Q.; Xue, D.Y.; Chuai, G.H.; Yang, Q.; Liu, Q. FL-QSAR: A federated learning-based QSAR prototype for collaborative drug discovery. Bioinformatics 2020, 36, 5492–5498. [Google Scholar] [CrossRef] [PubMed]
  9. Liu, Y.J.; Feng, G.; Sun, Y.; Qin, S.; Liang, Y.C. Device Association for RAN Slicing Based on Hybrid Federated Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2020, 69, 15731–15745. [Google Scholar] [CrossRef]
  10. Topaloglu, M.Y.; Morrell, E.M.; Rajendran, S.; Topaloglu, U. In the Pursuit of Privacy: The Promises and Predicaments of Federated Learning in Healthcare. Front. Artif. Intell. 2021, 4, 746497. [Google Scholar] [CrossRef]
  11. Zhu, H.Y.; Zhang, H.Y.; Jin, Y.C. From federated learning to federated neural architecture search: A survey. Complex Intell. Syst. 2021, 7, 639–657. [Google Scholar] [CrossRef]
  12. Rajendran, S. Cloud-Based Federated Learning Implementation Across Medical Centers. JCO Clin. Cancer Inform. 2021, 5, 1–11. [Google Scholar] [CrossRef] [PubMed]
  13. Subramanya, T.; Riggio, R. Centralized and Federated Learning for Predictive VNF Autoscaling in Multi-Domain 5G Networks and Beyond. IEEE Trans. Netw. Serv. Manag. 2021, 18, 63–78. [Google Scholar] [CrossRef]
  14. Zhu, X.; Yang, G. PCA differential privacy data publishing algorithm in horizontal federated learning. Appl. Res. Comput. 2022, 39, 236–239+248. [Google Scholar]
  15. Lau, H.; Henry, T.; Yung, P.N.; Dilupa, L.; Caeman, K.M. Risk quantification in cold chain management: A federated learning-enabled multi-criteria decision-making methodology. Ind. Manag. Data Syst. 2021, 121, 1684–1703. [Google Scholar] [CrossRef]
  16. Liu, Y.H.; Qu, Y.Y.; Xu, C.H.; Hao, Z.C.; Gu, B. Blockchain-Enabled Asynchronous Federated Learning in Edge Computing. Sensors 2021, 21, 3335. [Google Scholar] [CrossRef] [PubMed]
  17. Pan, Y.H.; Zhou, P.; Agrawal, A.; Wang, Y.H. New insights into the methods for predicting ground surface roughness in the age of digitalisation. Precis. Eng. 2021, 67, 393–418. [Google Scholar] [CrossRef]
  18. Zhu, P.K. Research on Federated Learning Model and Algorithm based on Mobile Edge Computing. Master’s Thesis, Nanjing University of Posts and Telecommunications, Nanjing, China, 2021. [Google Scholar]
  19. Lu, Y.L. Research on data Privacy Protection and Secure Data Sharing Methods. Ph.D. Dissertation, Beijing University of Posts and Telecommunications, Beijing, China, 2020. [Google Scholar]
  20. Alloghani, M.; Alani, M.M.; Al-Jumeily, D. A systematic review on the status and progress of homomorphic encryption technologies. J. Inform. Secur. Appl. 2019, 48, 2019. [Google Scholar] [CrossRef]
  21. Zhou, Q.; Lu, S.; Cui, Y. Quantum Search on Encrypted Data Based on Quantum Homomorphic Encryption. Sci. Rep. 2020, 10, 5135. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Transverse federated learning feature space and sample space.
Figure 2. Horizontal federation learning client–server architecture.
Figure 3. Horizontal federation learning client–server architecture.
Figure 4. Revenue of yellow taxis in New York City in January.
Figure 5. Driver income in New York City on 1 January.
Figure 6. Accuracy variation diagram of 10 iterations.
Figure 7. Accuracy variation diagram of 1000 iterations.
Figure 8. Accuracy variation diagram of the first 100 iterations.
Figure 9. Loss value of 1000 iterations.
Figure 10. Comparison between the average loss of the improved model and the loss of the traditional model.
Table 1. Order Data and Track Data Information.
Order Data Label | Description | Example
lpep_pickup_datetime | The date and time when the meter was engaged. | 1 January 2022 0:00:08
lpep_dropoff_datetime | The date and time when the meter was disengaged. | 1 January 2022 0:14:14
Passenger_count | The number of passengers in the vehicle. | 1
Trip_distance | The elapsed trip distance in miles reported by the taximeter. | 7.94
Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending it to the vendor. | N
PULocationID | TLC Taxi Zone in which the taximeter was engaged. | 138
DOLocationID | TLC Taxi Zone in which the taximeter was disengaged. | 255
RateCodeID | The final rate code in effect at the end of the trip. | 1
Payment_type | A numeric code signifying how the passenger paid for the trip. | 1
Fare_amount | The time-and-distance fare calculated by the meter. | 23
Extra | Miscellaneous extras and surcharges. | 0.5
MTA_tax | USD 0.50 MTA tax that is automatically triggered based on the metered rate in use. | 0.5
Improvement_surcharge | USD 0.30 improvement surcharge assessed on hailed trips at the flag drop. | 0.3
Tip_amount | This field is automatically populated for credit card tips. Cash tips are not included. | 4.86
Tolls_amount | Total amount of all tolls paid in the trip. | 0
Total_amount | The total amount charged to passengers (does not include cash tips). | 30.41
congestion_surcharge | Automatically allocated according to the measurement rate in use. | 2.5
Table 2. Parameter table of local training model.
i | Intercept Value | Intercept Std. Error | Time Value | Time Std. Error | Passenger_Count Value | Passenger_Count Std. Error | Trip_Distance Value | Trip_Distance Std. Error | Adjusted R Square
1 | 4.4963 | 0.0491 | 7.90296 × 10−5 | 6.58141 × 10−6 | 0.10346 | 0.02346 | 2.48404 | 0.00528 | 0.78987
2 | 4.99378 | 0.06705 | 6.31535 × 10−5 | 9.45205 × 10−6 | 0.19667 | 0.03366 | 2.45348 | 0.00633 | 0.72942
3 | 6.84776 | 0.06752 | 3.54939 × 10−4 | 1.34833 × 10−5 | 0.23423 | 0.03719 | 1.81088 | 0.00609 | 0.57277
4 | 4.59964 | 0.04311 | 9.01216 × 10−5 | 7.48631 × 10−6 | 0.09247 | 0.02371 | 2.51129 | 0.00495 | 0.78597
5 | 6.72315 | 0.06082 | 5.35508 × 10−4 | 1.53895 × 10−5 | 0.19601 | 0.034 | 1.71246 | 0.00624 | 0.53545
6 | 7.22593 | 0.05737 | 4.27444 × 10−4 | 1.23054 × 10−5 | 0.23779 | 0.03221 | 1.49912 | 0.00592 | 0.47019
7 | 6.06 | 0.05529 | 2.35914 × 10−4 | 1.08658 × 10−5 | 0.16382 | 0.03021 | 1.85406 | 0.0064 | 0.5601
8 | 4.49096 | 0.03552 | 6.85826 × 10−5 | 6.55283 × 10−6 | 0.03827 | 0.01807 | 2.52072 | 0.00454 | 0.79646
9 | 4.42007 | 0.05028 | 6.21412 × 10−5 | 9.73679 × 10−6 | 0.03261 | 0.02572 | 2.52484 | 0.00563 | 0.76983
10 | 5.19199 | 0.04877 | 2.51099 × 10−4 | 1.22026 × 10−5 | 0.08173 | 0.02748 | 2.31687 | 0.00609 | 0.68499
11 | 4.66912 | 0.03696 | 1.96368 × 10−4 | 8.35207 × 10−6 | 0.04615 | 0.02082 | 2.48739 | 0.00515 | 0.76779
12 | 5.04285 | 0.04058 | 1.85677 × 10−4 | 9.20457 × 10−6 | 0.03056 | 0.02266 | 2.40991 | 0.00571 | 0.70677
13 | 5.0685 | 0.03517 | 1.50399 × 10−4 | 7.11874 × 10−6 | 0.04386 | 0.01937 | 2.42489 | 0.00491 | 0.75656
14 | 4.87388 | 0.03509 | 1.51473 × 10−4 | 6.93594 × 10−6 | 0.02742 | 0.01925 | 2.51615 | 0.00463 | 0.79363
Table 3. Loss values for each local model.
Client | 1 | 2 | 3 | 4 | 5
Loss | 1.983 | 1.769 | 3.673 | 1.977 | 2.790
Client | 6 | 7 | 8 | 9 | 10
Loss | 2.611 | 3.168 | 1.428 | 2.013 | 2.014
Client | 11 | 12 | 13 | 14
Loss | 1.342 | 1.849 | 1.491 | 1.386
Table 4. The weight value of the first iteration of each client.
Client | 1 | 2 | 3 | 4 | 5 | 6 | 7
Weight | 1.107 | 1.097 | 0.987 | 0.990 | 1.018 | 0.999 | 1.148
Client | 8 | 9 | 10 | 11 | 12 | 13 | 14
Weight | 0.957 | 1.135 | 1.066 | 1.073 | 1.022 | 0.947 | 0.941
Table 5. Global model plaintext weight parameters.
i | ψ_i | w^t_{i,1} | w^t_{i,2} | w^t_{i,3}
1 | 0.595 | 0.047 | 0.062 | 1.477
2 | 0.862 | 0.054 | 0.169 | 2.114
3 | 1.367 | 0.485 | 0.320 | 2.475
4 | 0.756 | 0.068 | 0.070 | 1.899
5 | 1.513 | 0.810 | 0.296 | 2.590
6 | 1.232 | 0.527 | 0.293 | 1.847
7 | 0.947 | 0.223 | 0.155 | 1.755
8 | 0.685 | 0.047 | 0.026 | 1.726
9 | 0.858 | 0.053 | 0.028 | 2.166
10 | 1.145 | 0.288 | 0.094 | 2.653
11 | 0.779 | 0.153 | 0.036 | 1.936
12 | 0.901 | 0.167 | 0.028 | 2.171
13 | 0.752 | 0.000 | 0.033 | 1.823
14 | 0.737 | 0.000 | 0.020 | 1.855
Table 6. Ciphertext of the parameter value.
Ciphertext
Plaintext: 7.90296 × 10−5, Local model parameters
467955223102456212020705609866270348510893239149668473328424631046163690822437508711878985196197980450966500960003705334974280170339655750810219089129396085992598258672579610836765861787210473675224154209738857222440023315300692968047087231394512980730723977907690271899363662246165478289615153218232574561057949086114385110846695015350995138765746145605015279777950573110994544008683384022603114608651756658952592661266736411408161426646791173899631148278778420886130542466141732139434402701109610305834327986770394754422293491129830004959597471528240922862992766963399797494953223572922119780921343929602123516226546764815183789251169773407702535753226691046130282688248672437264190969732509417383856881643956027630275658489463231296913649226445921957970809319310748381576204501832811777209686852160599382374030694637282942599348941011907550986218863974727800258866828106664495631023416369971144863957435198236651427580289916915911244629558444858595888149638159997776130063235855390386100640660678856046849103703733840941266568709071793346814024807420936630963924283997628396208879073040125624428043948679461395648474879664719912865166134464521421720547985762205981822665782223794666807841734568143717078244584656216644190890183947
Plaintext: 1.107, Client weight
542668593891279026805624626146758625291153100936160670826053091065537229106182546052530540548774787072469124285407047736078487187066518276308290145578375312427125064570317400308530975919384618178036639860621593561533653322880977773419073086270457962531239215185735242020884636600391253929303829227690160632055355635447880958627580397656941016377197145408585526438953020193468957510929216689654165366329926337133784266205374661291090120351504336537932240149806720969730646953841277831552412094932985865479498608414619775181793358986748834334437846079939485846482496495629944011541746380639671509684239095754321959820318102974126215154739058291879812942742334500648651994495566709143482896700506958257481598793473675550399220882623673283793892416770530776917018796048658611303595729306564189333242635729140252800198129123397698494506725979944622028635443835992201975763460443431013747412861376819344647015089996916136618367623998327507099656125843558450496905169872123196108506956250733148418802328952590503257118421329674172634168153904874295407870248711968545420876320203448027987466871897975189509054334654474673284693407101626929792802446523404210423849205563988241679250362814657747157532701181352125841949375211847222624206208994
p: 142562846575319296 ** 7506178382976203844105070937
q: 14439136290904674701 ** 44497097022268651346246713
g: 20584843717203680702 *** 1969859265641225568080082
Where p, q, and g are key information.
