Differentially Private XGBoost Algorithm for Traceability of Rice Varieties

Yu, Runzhong; Yang, Wu; Yang, Chengyi

doi:10.3390/app122111037


by Runzhong Yu ^1,2,†, Wu Yang ^1,* and Chengyi Yang ^3,†

¹ Information Security Research Center, Harbin Engineering University, Harbin 165001, China

² College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China

³ Institute of AI for Education, East China Normal University, Shanghai 200062, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Appl. Sci. 2022, 12(21), 11037; https://doi.org/10.3390/app122111037

Submission received: 9 September 2022 / Revised: 17 October 2022 / Accepted: 21 October 2022 / Published: 31 October 2022


Abstract:

Privacy protection in agricultural traceability has received increasing attention. Most existing methods protect the original data only from the perspective of cryptography and ignore the availability of the protected information. In fact, after data are processed by cryptography, blockchain, and other technologies, they cannot be used directly for machine learning model training. Differential privacy therefore has great potential for privacy protection in agricultural traceability, because it allows data to participate in classification tasks under privacy protection. In this paper, we propose an ensemble algorithm for agricultural traceability called Differentially Private XGBoost (DP-XGB), which protects the privacy of the original data during training and obtains high model utility with a small sample size. We inject Gaussian noise into the gradient operator and the Hessian operator of the original XGBoost and give a method for calculating the resulting privacy budget. Experiments show that our method effectively obtains differential privacy guarantees and achieves very high classification accuracy when the noise is small.

Keywords:

differential privacy; variety traceability; rice data safety; machine learning

1. Introduction

Rice is a major cereal crop, and its safety bears on human health and the global economy. Tracing the variety, origin, and authenticity of rice has become an active research topic worldwide [1,2]. Spectroscopy [3], stable isotope analysis [4], volatile matter analysis [5], mineral element analysis [6], and metabolomics [7], combined with data mining, are often used to identify and protect the authenticity of the variety and origin of rice and to ensure its safety and quality [8,9].

In the process of traceability, blockchain [10], partial least squares [11], cluster analysis [12], deep learning [13], neural network [14], and other methods were used for discriminant modeling, in order to realize the traceability of origin or variety through the established model.

However, when such a model is used to trace variety or origin, the feature information can easily be leaked or falsified, so that the feature data underlying the model are lost or the discriminant model becomes invalid. This poses a serious threat to the quality and authenticity of the variety and origin of rice. It is therefore urgent to protect the privacy of the data in order to ensure both the authenticity of the data and the effectiveness of the model.

To solve the privacy problems that arise when building a traceability discrimination model, a variety of privacy protection methods have been proposed. For example, to secure agricultural information during transmission, Qu [15] proposed a composite chaotic key design that retains the original chaotic properties while better meeting the needs of information encryption [16,17].

Because these methods are computationally complex, they require a large number of encryption and decryption operations, which is costly, limits computational efficiency, and restricts their practical application. Differential privacy [18] is widely recognized as a rigorous privacy protection method whose guarantees and robustness have solid theoretical support. It prevents an adversary from inferring the presence of specific records by adding noise to query results, thereby protecting data privacy and information security. For example, Chukkapalli et al. [19] proposed a framework in which the privacy of individual smart-farm owners is protected by perturbing the data with noise such as white Gaussian noise. Because this method adds noise directly to the data, the authenticity and integrity of the data cannot be guaranteed after perturbation.

In this work, we propose an XGBoost-based rice variety traceability model with a differential privacy perturbation mechanism, which protects the traceability feature data. The experimental results show that, when the noise is small, the differential privacy perturbation mechanism has little effect on the accuracy of the discriminant model while helping to ensure the authenticity and integrity of the rice variety data and the discriminant model.

Our main contribution can be summarized as follows:

(1) We propose a differential privacy XGBoost algorithm, which realizes the traceability of rice varieties under privacy protection, and measures the risk of privacy leakage.

(2) We propose a privacy protection method by adding Gaussian noise to the first derivative and the second derivative of the Taylor expansion, and can adjust the privacy protection level by changing the parameters.

(3) We carried out experiments on rice variety traceability with our privacy protection algorithm, which further confirmed the effectiveness and realizability of the method we proposed.

2. Related Work

2.1. Agricultural Traceability

To ensure the quality of agricultural products, the traceability of agricultural product information is becoming more and more important [20,21]. The key to the traceability of agricultural product information is how to ensure the accuracy and integrity of the traceability information. A large number of intelligent agricultural information traceability systems with their own characteristics have emerged to enable the public to easily query and track the whole process information of agricultural products from planting to listing [22]. In this process, the security of data is particularly important. It is necessary to ensure that the data collected in all links are accurate and unaltered [23]. Only in this way, the whole process of information traceability can be truly achieved, and the final consumers can accurately obtain the relevant information on agricultural products.

2.2. Encryption Techniques

Network information encryption based on cryptography is a common method to ensure the security of the information used by traceability system, including a privacy protection method based on symmetric encryption (e.g., data encryption standard, DES) and asymmetric encryption (such as RSA) [24]. The privacy protection method based on symmetric encryption can encrypt and protect the basic data well, and the hardware requirement is low, and the encryption speed is fast. However, it is not suitable for data encryption in the traceability model because it cannot prevent data tampering before encryption. The encryption function of the privacy protection method based on asymmetric encryption is strong, but the encryption hardware requirements are high and the encryption speed is slow. Moreover, these two methods cannot prevent illegal tampering of data by legitimate personnel inside the traceability model, so they are not suitable for data encryption processing in the traceability model.

The other method is the protection method based on hierarchical security policy (such as RBAC) [25]. This method can assign different information processing rights according to different personnel attributes, but it cannot completely prevent high-level personnel from illegally tampering with relevant agricultural product data, so it is not suitable for data encryption processing in the traceability system.

2.3. Blockchain Technology

As a new technology in the field of data security [26], blockchain technology has been developing rapidly in recent years, and its application is becoming more and more extensive [27]. Kamble et al. [28] identified traceability as the most important enabler of blockchain technology in agricultural supply chain applications, and provided a reference for blockchain technology development strategy to ensure a real-time driven agricultural supply chain. Behenke et al. [29] proposed the necessary boundary conditions before the application of blockchain technology, including traceability process, unified interface standards, joint platform, and independent governance. Ho et al. [30] built a parts logistics management platform based on blockchain technology to improve the quality of logistics data and ensure the security of information sharing.

It can be said that blockchain technology ensures the sharing and security of all links of data circulation through decentralized core algorithms. At the same time, in the whole blockchain system, each node can be used as the central node, and each computing node is independent of the other, so that even if a node fails, the security and integrity of the whole data can be guaranteed. However, the open and transparent nature of blockchain has led to the problem of user identity and data privacy disclosure.

2.4. Differential Privacy

As one of the most popular privacy protection methods, differential privacy [18] has solid mathematical support. Maruseac et al. [31] abstract the dataset workflow as a set of points in a multi-dimensional space and use differential privacy to protect the path confidentiality of traceability information. The differential privacy guarantee is completely independent of the attacker's background knowledge. After centralized collection, random noise is added to distort sensitive data before release. Because the noised data still retain some statistical characteristics of the original data [32], differential privacy is well suited to building a traceability system that protects data privacy.

2.5. XGBoost

XGBoost is a popular classifier that is widely used in data mining [33]. Wang et al. [34] proposed a distributed privacy-preserving boosting algorithm suitable for various classifiers. The algorithm uses local differential privacy (LDP) as a building block and builds the base learner from the aggregation of the perturbed data shares, so that the privacy of the participating data owners is well protected.

For better data mining, many people combine privacy protection with a classifier [35,36], for example, Patil et al. proposed the DiffPRF algorithm [35], which is mainly a combination of differential privacy protection and random forest, but the algorithm can only process discrete features, which need to be discretized before processing continuous features.

This paper proposed a privacy-protected XGBoost algorithm for rice variety traceability. Compared with the random forest algorithm, this algorithm has better performance in preventing overfitting. The results show that the fusion of the XGBoost algorithm and differential privacy protection method can not only maintain data privacy, but also analyze the privacy-protected data, which can well solve the privacy protection requirements in the process of rice variety traceability.

3. Methodology

3.1. Differential Privacy

In this section, we review some basic concepts of differential privacy, which lead to our formulation of the agricultural privacy protection problem. Differential privacy is a representative method in privacy-preserving machine learning, and its most common definition can be stated as follows.

Definition 1 (Differential Privacy [18]). Suppose $o$ is a set of outcomes that satisfies $o \subseteq O$, where $O$ is the output space. A randomized algorithm $M$ is said to preserve $(ϵ, δ)$-differential privacy if, for any two adjacent datasets that differ in only a single record,

\begin{matrix} \Pr [M (D) \in o] \leq e^{ϵ} \Pr [M (D^{'}) \in o] + δ , \end{matrix}

(1)

where $D, D^{'} \in \mathcal{D}$ are the adjacent datasets and δ is a failure probability. Let $f$ denote the algorithm to be protected. The largest difference in its output caused by adjacent datasets is described by the sensitivity, defined as follows.

Definition 2 ($l_{p}$-sensitivity [37]). Suppose $f : \mathcal{D} \to \mathbb{R}$ is an algorithm, where $\mathcal{D}$ and $\mathbb{R}$ represent the dataset space and the real number space, respectively. The $l_{p}$-sensitivity of $f$ is defined as

\begin{matrix} Δ_{p} = \max_{D, D^{'}} {∥ f (D) - f (D^{'}) ∥}_{p} , \end{matrix}

(2)

where the maximum is taken over all pairs of adjacent datasets, ${∥ \cdot ∥}_{p}$ represents the p-norm, and p can be any positive real number. With these preliminaries, we can define concrete privacy protection mechanisms. The Gaussian mechanism realizes differential privacy protection by injecting Gaussian noise, and its definition can be described as follows.

Definition 3 (Gaussian Mechanism [38]). Suppose an algorithm $f$ satisfies $∥ f (D) - f (D^{'}) ∥_{2} \leq Δ_{2}$ for any adjacent datasets $D, D^{'}$. Then the Gaussian mechanism

\begin{matrix} M (D) = f (D) + N (0, σ^{2}) \end{matrix}

(3)

with scale

\begin{matrix} σ \geq \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{ϵ} \end{matrix}

(4)

satisfies $(ϵ, δ)$-DP.
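As a minimal sketch of Definition 3, the following Python snippet calibrates the noise scale from Equation (4) and perturbs a scalar query result with noise of variance $σ^{2}$. The function names `gaussian_sigma` and `gaussian_mechanism` and the example values are illustrative assumptions, not part of the original method. Note that the classical analysis behind Equation (4) is usually stated for ϵ < 1.

```python
import math
import random


def gaussian_sigma(epsilon: float, delta: float, l2_sensitivity: float) -> float:
    """Smallest noise scale permitted by Equation (4)."""
    if epsilon <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(value: float, sigma: float, rng: random.Random) -> float:
    """Release value + N(0, sigma^2), as in Equation (3)."""
    return value + rng.gauss(0.0, sigma)


# Hypothetical example values: a query with l2-sensitivity 1.
rng = random.Random(0)
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, l2_sensitivity=1.0)
noisy = gaussian_mechanism(10.0, sigma, rng)
print(f"sigma = {sigma:.4f}, noisy release = {noisy:.4f}")
```

A smaller ϵ or δ yields a larger σ, which is the trade-off between privacy and utility that the later sections explore.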

3.2. XGBoost Algorithm

Suppose $D = \{(x_{i}, y_{i})\}, i = 1, 2, \dots, N$ is the training set, where $x_{i} \in \mathbb{R}^{d}$ represents the features, $y_{i}$ represents the label, and d is the dimension of the features. We develop a new XGBoost algorithm named DP-XGB that obtains an $(ϵ, δ)$-differential privacy guarantee. To this end, we first define the basic model of the boosting method. A strong classifier composed of K weak classifiers can be described by the following expression

\begin{matrix} {\hat{y}}_{i} = ϕ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i}), f_{k} \in F . \end{matrix}

(5)

where $F = \{f (x) = w_{q (x)}\}$ $(q : \mathbb{R}^{d} \to T, w \in \mathbb{R}^{T})$ is the hypothesis space of the weak classifiers, q represents the structure of each weak classifier (the mapping from a sample to a leaf index), and T is the number of leaf nodes in a decision tree or regression tree. The objective function to be minimized is

\begin{matrix} L = \sum_{i = 1}^{N} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k}) , \end{matrix}

(6)

where $Ω (f) = γ T + \frac{1}{2} λ {∥ w ∥}^{2}$ is a regularization term that limits the complexity of the model and thus guards against overfitting. Expanding the loss to second order with a Taylor expansion at the t-th iteration, Equation (6) can be approximated as

\begin{matrix} L ≃ \sum_{i = 1}^{N} [l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + \sum_{k = 1}^{K} Ω (f_{k}) . \end{matrix}

(7)

where $g_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$ and $h_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$ represent the first-order and second-order derivatives of the loss function.
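To make $g_{i}$ and $h_{i}$ concrete, the sketch below evaluates them for the binary logistic loss. The choice of loss is an assumption made only for illustration, since the text does not fix a particular $l$, and the function names are hypothetical.

```python
import math
from typing import Tuple


def sigmoid(z: float) -> float:
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_logistic(y: float, raw_score: float) -> Tuple[float, float]:
    """First and second derivative of the logistic loss w.r.t. the raw score.

    Assumed loss: l(y, s) = -[y log p + (1 - y) log(1 - p)] with p = sigmoid(s),
    where s plays the role of the previous-round prediction.
    """
    p = sigmoid(raw_score)
    g = p - y          # first-order derivative
    h = p * (1.0 - p)  # second-order derivative, always in (0, 0.25]
    return g, h


# A positive sample whose current raw score is 0 (predicted probability 0.5).
g, h = grad_hess_logistic(y=1.0, raw_score=0.0)
print(g, h)
```

For this loss the Hessian is bounded by 0.25 while the gradient can approach 1 in absolute value, which is one way to see why the first-order statistic tends to be larger in magnitude.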

Since $l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$ is obtained at the $(t - 1)$-th iteration, it is a constant at the t-th iteration and can be omitted. Let $I_{j} = \{i | q (x_{i}) = j\}$ be the set of all samples in the j-th leaf node. Thus, we can obtain the expression of the loss function as follows

{\tilde{L}}^{(t)} = \sum_{j = 1}^{T} [(\sum_{i \in I_{j}} g_{i}) w_{j} + \frac{1}{2} (\sum_{i \in I_{j}} h_{i} + λ) w_{j}^{2}] + γ T .

(8)

To obtain the optimal leaf weight $w_{j}^{*}$, we take the derivative of Equation (8) with respect to $w_{j}$ and set it equal to zero. Writing $\mathrm{Obj}^{(t)}$ for the objective ${\tilde{L}}^{(t)}$, we have

\{\begin{matrix} w_{j}^{*} = - \frac{G_{j}}{H_{j} + λ}, \\ \mathrm{Obj}^{(t) *} = \min \mathrm{Obj}^{(t)} \approx γ T - \frac{1}{2} \sum_{j = 1}^{T} \frac{G_{j}^{2}}{H_{j} + λ} . \end{matrix}

(9)

where $G_{j}$ and $H_{j}$ are the cumulative values of the first derivative and the second derivative on the j-th leaf node, respectively, that is, $G_{j} = \sum_{i \in I_{j}} g_{i}$ and $H_{j} = \sum_{i \in I_{j}} h_{i}$. Here $\mathrm{Obj}^{(t) *}$ represents the optimal objective function at the t-th iteration, which is obtained by substituting the optimal weight into the objective function.
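The closed forms in Equation (9) can be checked numerically. The snippet below is a small sketch with made-up per-leaf sums; the helper names `leaf_weight` and `optimal_objective` are hypothetical.

```python
from typing import Sequence, Tuple


def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal weight of one leaf: w* = -G / (H + lambda)."""
    return -G / (H + lam)


def optimal_objective(
    leaves: Sequence[Tuple[float, float]], lam: float, gamma: float
) -> float:
    """Optimal objective: gamma * T - 1/2 * sum_j G_j^2 / (H_j + lambda)."""
    structure_score = sum(G * G / (H + lam) for G, H in leaves)
    return gamma * len(leaves) - 0.5 * structure_score


# Two leaves with illustrative (G_j, H_j) sums.
leaves = [(-2.0, 1.0), (3.0, 2.0)]
weights = [leaf_weight(G, H, lam=1.0) for G, H in leaves]
objective = optimal_objective(leaves, lam=1.0, gamma=0.1)
print(weights, objective)
```

A leaf with a negative cumulative gradient receives a positive weight, so the next tree moves predictions in the direction that lowers the loss.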

3.3. Differentially Private XGBoost

Through the above derivation process, we have obtained the optimal weight and objective function. However, this boosting algorithm process does not have the property of privacy protection. The implication is that some privacy attack algorithms can steal sensitive information about input features from the training set.

To protect the privacy of information in the original data, we consider using the Gaussian mechanism. We inject noise into the first derivative and the second derivative of the loss function (Figure 1), making it more difficult for the attacker to infer the training set information through the intermediate operator or the operation result.

\{\begin{matrix} G \leftarrow \tilde{G} + N (0, σ_{G}^{2}), \\ H \leftarrow \tilde{H} + N (0, σ_{H}^{2}) . \end{matrix}

(10)

where $\tilde{G} = \sum_{i \in I_{j}} g_{i}$ and $\tilde{H} = \sum_{i \in I_{j}} h_{i}$ are the cumulative values of the first derivative and the second derivative on a leaf node before perturbation, G and H are their privacy-preserving counterparts, and $σ_{G}$ and $σ_{H}$ are the scale parameters of the noise injected into $\tilde{G}$ and $\tilde{H}$, respectively.

If we continue to split the subtree at the leaf node, we obtain the cumulative gradient (first derivative of the loss) and Hessian (second derivative of the loss) on the left and right child nodes. We denote them as ${\tilde{G}}_{L}$, ${\tilde{H}}_{L}$ and ${\tilde{G}}_{R}$, ${\tilde{H}}_{R}$; they satisfy ${\tilde{G}}_{L} + {\tilde{G}}_{R} = \tilde{G}$ and ${\tilde{H}}_{L} + {\tilde{H}}_{R} = \tilde{H}$. We subtract the objective function after the split from the objective function before the split and define the difference as Gain:

\begin{matrix} \mathrm{Gain} = \mathrm{Obj}_{before}^{*} - \mathrm{Obj}_{after}^{*} = \frac{1}{2} [\frac{{\tilde{G}}_{L}^{2}}{{\tilde{H}}_{L} + λ} + \frac{{\tilde{G}}_{R}^{2}}{{\tilde{H}}_{R} + λ} - \frac{{({\tilde{G}}_{L} + {\tilde{G}}_{R})}^{2}}{{\tilde{H}}_{L} + {\tilde{H}}_{R} + λ}] - γ . \end{matrix}

(11)

Since Gain is only used to compare candidate splits, we can drop the constant factor $\frac{1}{2}$ and the constant offset γ without changing the ranking of the candidates:

\begin{matrix} \mathrm{Gain} = \frac{{\tilde{G}}_{L}^{2}}{{\tilde{H}}_{L} + λ} + \frac{{\tilde{G}}_{R}^{2}}{{\tilde{H}}_{R} + λ} - \frac{{({\tilde{G}}_{L} + {\tilde{G}}_{R})}^{2}}{{\tilde{H}}_{L} + {\tilde{H}}_{R} + λ} \end{matrix}

(12)
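The following sketch evaluates the gain of one candidate split. It computes the gain as the optimal objective before the split minus the optimal objective after it, both taken from Equation (9), so a positive value means that the split lowers the objective. The function names and the numbers are illustrative assumptions.

```python
def leaf_score(G: float, H: float, lam: float) -> float:
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return G * G / (H + lam)


def split_gain(
    G_L: float, H_L: float, G_R: float, H_R: float, lam: float, gamma: float
) -> float:
    """Objective before the split minus objective after the split."""
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (children - parent) - gamma


# Children with illustrative sums (G_L, H_L) = (-2, 1) and (G_R, H_R) = (3, 2).
gain = split_gain(-2.0, 1.0, 3.0, 2.0, lam=1.0, gamma=0.1)
print(gain)
```

With these numbers the unsplit node has objective 0.1 - 0.5 * 0.25 = -0.025 and the split node has objective -2.3, so the gain is their difference.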

Then we can outline the main steps of our Differentially Private XGBoost Algorithm in Algorithm 1.

3.4. Privacy Computing

Since the boosting algorithm needs to be calculated through multiple iterations, we need to calculate the cumulative value of the privacy budget through composition theorem.

Lemma 1 (Basic Composition [39]). Let each of a series of randomized algorithms $M_{i}, i = 1, \dots, t$ satisfy $(ϵ_{i}, δ_{i})$-differential privacy. Then $M = \{M_{1}, \dots, M_{t}\}$ satisfies $(\sum_{i = 1}^{t} ϵ_{i}, \sum_{i = 1}^{t} δ_{i})$-differential privacy.

According to the property of the Gaussian mechanism (see Equation (4)), the privacy budget consumed by injecting Gaussian noise with scale parameter $σ$ once is

ϵ = \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{σ} .

Then, according to the basic composition theorem, the cumulative privacy budget over t iterations is

\begin{matrix} \sum_{i = 1}^{t} \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{σ} = t \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{σ} . \end{matrix}

(13)
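As a rough illustration of Equation (13), the sketch below computes the per-injection budget and accumulates it with basic composition (Lemma 1). The function names and the parameter values (σ = 10, δ = 10⁻⁵, Δ₂ = 1, t = 20) are assumptions chosen only for the example.

```python
import math
from typing import Tuple


def per_step_epsilon(sigma: float, delta: float, l2_sensitivity: float) -> float:
    """Budget of one Gaussian noise injection, obtained by inverting Equation (4)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / sigma


def basic_composition(eps_step: float, delta_step: float, t: int) -> Tuple[float, float]:
    """Lemma 1 for t identical steps: budgets and failure probabilities add up."""
    return t * eps_step, t * delta_step


eps_step = per_step_epsilon(sigma=10.0, delta=1e-5, l2_sensitivity=1.0)
total_eps, total_delta = basic_composition(eps_step, 1e-5, t=20)
print(f"per step: {eps_step:.4f}, after 20 steps: {total_eps:.4f}, delta: {total_delta:.1e}")
```

The total grows linearly in t, which is why a tighter composition bound is of interest for boosting with many rounds.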

Algorithm 1 Differentially Private XGBoost Algorithm

Input:

The set of instances for the current node I;

The customized scales of Gaussian noise $σ_{G}$, $σ_{H}$;

Output:

A privacy-preserving decision tree that satisfies $(ϵ, δ)$-DP, with cumulative privacy budget $ϵ$ and failure probability $δ$.

1: $\tilde{G} \leftarrow \sum_{i \in I} g_{i}, \tilde{H} \leftarrow \sum_{i \in I} h_{i}$
2: Inject noise into the gradient: $G \leftarrow \tilde{G} + N (0, σ_{G}^{2})$
3: Inject noise into the Hessian: $H \leftarrow \tilde{H} + N (0, σ_{H}^{2})$
4: for each $k \in [1, m]$ do
5:     $G_{L} \leftarrow 0$, $H_{L} \leftarrow 0$
6:     for j in $sorted (I, by\ x_{j k})$ do
7:         Compute the gradient of the loss function $g_{j} \leftarrow \partial_{{\hat{y}}_{j}^{(t - 1)}} l (y_{j}, {\hat{y}}_{j}^{(t - 1)})$
8:         Compute the Hessian of the loss function $h_{j} \leftarrow \partial_{{\hat{y}}_{j}^{(t - 1)}}^{2} l (y_{j}, {\hat{y}}_{j}^{(t - 1)})$
9:         $G_{L} \leftarrow G_{L} + g_{j}$, $H_{L} \leftarrow H_{L} + h_{j}$
10:         $G_{R} \leftarrow G - G_{L}$, $H_{R} \leftarrow H - H_{L}$
11:         $\mathrm{Gain} \leftarrow \frac{G_{L}^{2}}{H_{L} + λ} + \frac{G_{R}^{2}}{H_{R} + λ} - \frac{{(G_{L} + G_{R})}^{2}}{H_{L} + H_{R} + λ}$
12:         Determine whether to split the leaf node according to $\mathrm{Gain}$
13:     end for
14: end for
15: return the privacy-preserving decision tree, the cumulative privacy budget $ϵ$, and the failure probability $δ$
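The split search of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading under several assumptions rather than the authors' implementation: the derivatives are passed in precomputed, the gain keeps the factor 1/2 and the offset γ so that "gain > 0" is a meaningful stopping rule, thresholds are midpoints between neighbouring feature values, and candidates whose noisy denominators are not positive are skipped. All names are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def best_split_dp(
    xs: List[List[float]],
    g: List[float],
    h: List[float],
    sigma_g: float,
    sigma_h: float,
    lam: float,
    gamma: float,
    rng: random.Random,
) -> Optional[Tuple[int, float, float]]:
    """Return (feature index, threshold, gain) of the best split, or None."""
    # Steps 1-3: accumulate the node totals, then perturb each of them once.
    G = sum(g) + rng.gauss(0.0, sigma_g)
    H = sum(h) + rng.gauss(0.0, sigma_h)
    if H + lam <= 0.0:
        return None  # assumption: give up if noise makes the denominator invalid
    parent_score = G * G / (H + lam)

    best: Optional[Tuple[int, float, float]] = None
    for k in range(len(xs[0])):  # step 4: loop over features
        order = sorted(range(len(xs)), key=lambda i: xs[i][k])
        G_L = H_L = 0.0
        for pos in range(len(order) - 1):  # step 6: scan instances in sorted order
            j, nxt = order[pos], order[pos + 1]
            G_L += g[j]  # step 9: left sums use the per-instance derivatives
            H_L += h[j]
            if xs[j][k] == xs[nxt][k]:
                continue  # no threshold separates equal feature values
            G_R, H_R = G - G_L, H - H_L  # step 10: right sums from noisy totals
            if H_L + lam <= 0.0 or H_R + lam <= 0.0:
                continue
            children = G_L * G_L / (H_L + lam) + G_R * G_R / (H_R + lam)
            gain = 0.5 * (children - parent_score) - gamma
            if gain > 0.0 and (best is None or gain > best[2]):
                best = (k, 0.5 * (xs[j][k] + xs[nxt][k]), gain)
    return best


# Toy node: one feature, four instances with made-up derivatives.
xs = [[0.0], [1.0], [2.0], [3.0]]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
print(best_split_dp(xs, g, h, 0.001, 0.001, lam=1.0, gamma=0.0, rng=random.Random(0)))
```

Because the right-hand sums are obtained by subtraction from the noisy totals, a single pair of noise draws per node affects every candidate split of that node.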

Although the cumulative privacy budget over multiple iterations can be calculated through Equation (13), basic composition is usually not tight, which means that we are likely to overestimate the cumulative privacy budget. Therefore, we introduce the advanced composition theorem to obtain a tighter privacy budget.

Lemma 2 (Advanced Composition [40]). Suppose $ϵ, δ \geq 0$, $δ^{'} > 0$, and each $M_{i}, i = 1, \dots, t$ satisfies $(ϵ, δ)$-differential privacy. Then $M = \{M_{1}, \dots, M_{t}\}$ satisfies $(ϵ^{'}, t δ + δ^{'})$-differential privacy with

\begin{matrix} ϵ^{'} = \sqrt{2 t \ln (2 / δ^{'})} \cdot ϵ + t ϵ (e^{ϵ} - 1) . \end{matrix}

(14)

Similarly, substituting the per-iteration budget into Lemma 2 gives the cumulative privacy budget over t iterations as

\begin{matrix} \sqrt{2 t \ln (2 / δ^{'})} \cdot \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{σ} + t \sqrt{2 \ln (\frac{1.25}{δ})} \cdot \frac{Δ_{2}}{σ} (e^{\sqrt{2 \ln (1.25 / δ)} \cdot \frac{Δ_{2}}{σ}} - 1) \end{matrix}

(15)

Compared with Equation (13), the advantage of Equation (15) becomes increasingly apparent as the number of iterations grows, provided that the per-iteration budget is small.
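To see when advanced composition actually helps, the sketch below compares the two bounds for a fixed per-iteration budget. It assumes the standard form of the theorem, in which the per-iteration ϵ multiplies the square-root term, and keeps the constant $\ln (2 / δ^{'})$ used in this section. The function names and the values ϵ = 0.05 and δ′ = 10⁻⁵ are illustrative assumptions.

```python
import math


def basic_epsilon(eps_step: float, t: int) -> float:
    """Cumulative budget under basic composition."""
    return t * eps_step


def advanced_epsilon(eps_step: float, t: int, delta_prime: float) -> float:
    """Cumulative budget under advanced composition (assumed standard form)."""
    sqrt_term = math.sqrt(2.0 * t * math.log(2.0 / delta_prime)) * eps_step
    linear_term = t * eps_step * (math.exp(eps_step) - 1.0)
    return sqrt_term + linear_term


for t in (10, 100, 1000):
    b = basic_epsilon(0.05, t)
    a = advanced_epsilon(0.05, t, delta_prime=1e-5)
    print(f"t = {t:4d}: basic = {b:7.3f}, advanced = {a:7.3f}")
```

For few iterations the square-root term dominates and basic composition gives the smaller value, so in practice one would take the minimum of the two bounds; the advanced bound wins once t is large and the per-iteration budget stays well below 1.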

4. Experiments

4.1. Data Collection and Preprocessing

The experimental samples were selected during the rice harvest period in 2020, and 340 samples including Qijing 11, Longyang 16, Longdao 18, and Longqingdao 21 were collected by a five-point field sampling method. Sample information is shown in Table 1.

The flow chart is shown in Figure 2. The collected samples were air-dried in the laboratory, cleaned of impurities, sorted, husked into brown rice, and milled into white rice. The processing precision met the first-grade national standard. The moisture content of the samples was kept below 14%, and the samples were sealed in ziplock bags and stored in a refrigerator at 4 ℃. For measurement, the external detector of the diffuse Fourier near-infrared spectrometer is connected and the rotating measuring device is placed on top. The sample is put into the optical plate, and the bottom of the plate is cleaned with lens paper. During the test, the rotating measuring device must rotate at a constant speed.

To reduce interference from the instrument itself, the instrument must be preheated for 2 h before measurement. The ambient temperature of the experiment is (25 ± 1) °C, the relative humidity is 25–35%, the spectral wavenumber range is 12,000∼4000 cm$^{-1}$, and the resolution is 8 cm$^{-1}$. The scan is repeated 64 times. The signal is calibrated and the peak level is saved to eliminate the noise generated by the operation of the instrument.

4.2. Analysis of Near-Infrared Spectrogram of Rice

The absorption of the near-infrared spectrum is mainly caused by the change of vibration state inside the molecule. The frequency characteristics of molecular vibration determine the information range and characteristics of near-infrared spectrum analysis.

China has many rice varieties, and their genes differ. These genetic differences lead to subtle differences in internal chemical composition, such as the sugar, protein, and fat contents, which in turn produce marked differences in taste and quality and in the absorption of the near-infrared spectrum. Figure 3 shows the near-infrared spectra of the four varieties of rice.

It can be seen from Figure 3 that the four varieties of rice samples have absorption peaks in the same wavenumber ranges, but the absorption intensities differ. The absorption peaks in the wavenumber range of 9000∼8000 cm$^{-1}$ represent CH$_{3}$ in aliphatic hydrocarbons. The absorption peaks in the wavenumber range of 7000∼6000 cm$^{-1}$ carry information on free NH groups, which can reflect differences in the type and content of amino acids in the sample. The absorption peak in the wavenumber range of 5000∼4000 cm$^{-1}$ lies in the C-H first combination band region, including free OH, CH$_{2}$, and CH$_{3}$, which can characterize the protein and amylose contents in rice samples.

4.3. Evaluation on Privacy

Next, we study the effectiveness of our DP-XGB algorithm at the level of privacy protection. We compare the changes in the privacy budget with the number of iterations under the Gaussian mechanism with different scale parameters. The results are shown in Figure 4a. We find that the privacy budget decreases with the increasing intensity of Gaussian noise, which indicates that the privacy guarantee is strengthened after this change.

In addition, we conducted experiments to continuously increase the scale parameters under different failure probabilities. The results in Figure 4b not only show that the privacy budget decreases with the increase of scale parameters, but also show that appropriately tolerating higher failure probability can reduce the privacy budget to a certain extent. Therefore, we can adjust the scale and failure probability appropriately to ensure differential privacy within a reasonable range.

4.4. Evaluation on Utility

We compared the classification performance of DP-XGBoost with that of several ensemble learning methods equipped with the Gaussian mechanism, including (1) Random Forest (RF), the most popular bagging method built from decision trees, and (2) AdaBoost, the most common boosting algorithm composed of decision trees. We required that the depth of each tree not exceed 5. We compare the performance of these algorithms on 3-class and 4-class classification tasks.

In the 3-class tasks, the highest accuracy is 100% for both AdaBoost and DP-XGBoost when $σ$ = 0.001, higher than that of Random Forest. When $σ$ = 0.002, DP-XGBoost achieves an accuracy of 96.8%, higher than the 93.5% of RF and the 91.9% of AdaBoost. In the 4-class tasks, DP-XGBoost has the highest accuracy for both $σ$ = 0.001 and $σ$ = 0.002. These results show that DP-XGBoost matches or outperforms Random Forest and AdaBoost on the task of privacy-preserving rice traceability (see Table 2).

4.5. Ablation Study

To better explore the impact of differential privacy on the utility of our algorithm, we conduct an ablation study on the first-order operator and the second-order operator. We find that when the differential privacy guarantee of the first-order operator is removed, the classification accuracy of the model does not show a significant downward trend as the noise intensity increases. However, when the differential privacy guarantee of the second-order operator is removed, the trend of the curve is similar to that obtained when the differential privacy guarantee is applied to the first-order operator and the second-order operator at the same time.

These observations show that the main cause of the decline in model utility is the differential privacy protection applied to the first-order operator. This is an empirical phenomenon that arises because the absolute value of the first-order operator is usually much larger than that of the second-order operator. In practice, the scale $σ$ should be set below 0.4 to ensure that the accuracy of the rice traceability model exceeds 94%, which corresponds to a privacy budget of $ϵ = 12.11$. Nevertheless, the privacy protection of the first-order operator is far more important than that of the second-order operator, because it is easier to infer or recover information about the original data from the first-order operator [41] (see Figure 5).
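The reported budget can be related to the single-injection formula from Section 3.4. The check below uses δ = 10⁻⁵ and Δ₂ = 1 for a single noise injection; these two values are assumptions, since the text does not state them, but under them σ = 0.4 gives a budget that rounds to the reported figure. The function name is hypothetical.

```python
import math


def single_release_epsilon(sigma: float, delta: float, l2_sensitivity: float) -> float:
    """Budget of one Gaussian noise injection, obtained by inverting Equation (4)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / sigma


# Assumed values: delta = 1e-5 and unit l2-sensitivity, one injection only.
eps = single_release_epsilon(sigma=0.4, delta=1e-5, l2_sensitivity=1.0)
print(f"{eps:.2f}")
```

Any larger σ lowers this value further, and each additional noise injection adds to it according to the composition results above.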

5. Conclusions

In this paper, we propose a private machine learning method for the traceability of rice varieties, named DP-XGB. We suggest applying differential privacy as a measure of the risk of privacy leakage and endowing the model with privacy protection through the Gaussian mechanism. More specifically, we perturb the first and second derivatives of the loss function generated in the iterative process of the XGBoost model with random noise that follows a Gaussian distribution, so that the model obtains a privacy guarantee in the traceability task. Experiments on the data we collected verified that DP-XGB has the privacy-preserving property of a differential privacy guarantee. In addition, we give a method to calculate the cumulative privacy budget and show how to achieve different levels of differential privacy guarantee by adjusting parameters.

Author Contributions

Conceptualization, R.Y. and C.Y.; methodology, R.Y. and C.Y.; software, R.Y. and C.Y.; validation, R.Y. and C.Y.; resources, W.Y.; data curation, R.Y.; writing-original draft, R.Y. and C.Y.; writing-review and editing, W.Y.; supervision, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R & D Program of China of 2018YFE0206300; The Central Government for the Reform and Development of Local Universities in Heilongjiang Province of 2020YQ16; Heilongjiang BaYi Agricultural University for San Heng San Zong of ZRCQC201906.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Source code can be accessed from https://github.com/yrzsuper/DP-XGBoost.

Acknowledgments

We sincerely thank the Key Laboratory of Agro-products Processing and Quality Safety of Heilongjiang Province and the deep learning team from the Cross-Innovation Laboratory at East China Normal University for their full support.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

FT-NIR	Fourier transform near infrared
DP	Differential Privacy

Figure 1. Schematic diagram of the DP-XGBoost model. We inject Gaussian noise with scales $σ_{G}$ and $σ_{H}$ into the first derivative (gradient) and the second derivative (Hessian), respectively, so that the ensemble tree model is protected by DP. The gradient and Hessian are used to calculate the objective function, and the training process aims to minimize the objective function.
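The noise injection described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the repository named in the Data Availability Statement). It assumes a binary log loss for simplicity, whereas the experiments use three-class and four-class tasks, and the function name `noisy_grad_hess` and its parameters are hypothetical.

```python
import math
import random

def noisy_grad_hess(labels, logits, sigma_g, sigma_h, rng):
    """Per-sample gradient and Hessian of the binary log loss,
    each perturbed with zero-mean Gaussian noise.

    Assumption: binary log loss stands in for the paper's
    multi-class objective; sigma_g and sigma_h play the role of
    the noise scales for the gradient and the Hessian.
    """
    grads, hesses = [], []
    for y, z in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the raw score
        g = p - y                         # first derivative of the log loss
        h = p * (1.0 - p)                 # second derivative of the log loss
        grads.append(g + rng.gauss(0.0, sigma_g))   # noisy gradient
        hesses.append(h + rng.gauss(0.0, sigma_h))  # noisy Hessian
    return grads, hesses

# Illustrative call with the smaller scale reported in Table 2.
rng = random.Random(0)
grads, hesses = noisy_grad_hess(
    labels=[1, 0, 1, 0],
    logits=[0.0, 0.0, 2.0, -1.0],
    sigma_g=0.001,
    sigma_h=0.001,
    rng=rng,
)
```

In a boosting library, values like these would be returned from a custom objective in place of the exact derivatives. A noisy Hessian can become negative when the noise scale is large relative to `p * (1 - p)`, so a practical implementation may need to handle that case. The sketch leaves it untreated.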

Figure 2. Flow chart of data collection.

Figure 3. Original Fourier transform near-infrared (FT-NIR) spectrograms of four rice varieties. Sub-figures (a–d) correspond to Longdao 18, Longqingdao 21, Longyang 16, and Qijing 11, respectively.

Figure 4. Privacy performance test of our DP-XGB under different parameters. Sub-figure (a) shows the privacy budget as the number of iterations increases under different scale parameters, with $δ = 10^{-5}$. Sub-figure (b) shows the privacy budget as the scale increases under different failure probabilities when the number of iterations is 100.

Figure 5. Relationship between scale parameters and test accuracy on the rice traceability task using multi-class differentially private XGBoost, with $δ = 10^{-5}$ and 100 iterations. Sub-figures (a,b) show the results of the three-class and four-class tasks, respectively.

Table 1. Rice sampling information for traceability experiments under differential privacy guarantee.

Variety	Origin	Amount	Number
Longyang 16	Wuchang, Heilongjiang, China	50 kg	LY16
Longdao 18	Harbin, Heilongjiang, China	50 kg	LD18
Longqingdao 21	Daqing, Heilongjiang, China	50 kg	LQ21
Qijing 11	Daqing, Heilongjiang, China	50 kg	QJ11

Table 2. Comparative experiment of privacy-preserving algorithms on the task of rice traceability, where GM is the abbreviation of Gaussian mechanism.

Type	Scale	RF+GM	AdaBoost+GM	DP-XGB
3-class	$σ = 0.001$	96.8%	100.0%	100.0%
3-class	$σ = 0.002$	93.5%	91.9%	96.8%
4-class	$σ = 0.001$	81.7%	80.5%	82.9%
4-class	$σ = 0.002$	69.5%	73.2%	76.8%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
