1. Introduction
Machine learning, as a critical tool for data utilization, has become the engine driving transformation across various industries. Through data modeling, it empowers computers with the ability to learn autonomously and to predict, finding wide applications in fields like financial risk assessment [1], autonomous driving [2], AI-driven healthcare [3], smart manufacturing [4], and more. In today's digital era, safeguarding personal privacy has emerged as a crucial challenge in the information technology domain. Despite the rapid advancements in big data analysis and machine learning technologies, safeguarding sensitive user information remains a significant challenge. It is vital to prioritize the protection of personal information while extracting valuable insights and knowledge from massive datasets.
Traditional methods of data privacy protection, such as anonymization [5] and de-identification, have become increasingly vulnerable and prone to exploitation by advanced data analysis techniques. In this context, differential privacy [6] has garnered extensive attention as a rigorous privacy protection framework. Differential privacy offers a mathematically rigorous guarantee that no specific individual's information can be inferred during data analysis, even by an adversary who knows all other individuals' data. This robust privacy-preserving property has facilitated the widespread application of differential privacy in sensitive environments.
In the realm of regression analysis, numerous studies have been conducted based on differential privacy. Chaudhuri et al. [7] proposed several differentially private regression analyses, but these are limited by the requirement that the objective function be convex and twice differentiable. Lei [8] introduced differentially private M-estimators based on a maximum likelihood estimation framework, generating noisy multidimensional histograms using the Laplace mechanism and computing regression results from synthetic data; however, this method is only applicable to low-dimensional data. To handle high-dimensional data, Zhang et al. [9] proposed a noise-addition method based on the functional mechanism, which enforces $\epsilon$-differential privacy by perturbing the objective function of the optimization problem instead of its results. Kifer et al. [10] focused on high-dimensional sparse regression problems, presenting a differentially private convex empirical risk minimization (ERM) method. Smith [11] combined the Laplace mechanism and the exponential mechanism into a general differential privacy framework and studied the asymptotic properties of differentially private algorithms for statistical inference, although the framework is limited to bounded output spaces. Barrientos et al. [12] designed algorithms for differentially private estimators of the significance levels and signs of regression coefficients. Cai et al. [13] studied the trade-off between statistical accuracy and privacy in mean estimation and linear regression in both classical low-dimensional and modern high-dimensional settings, proposing a novel private iterative hard-thresholding algorithm for high-dimensional linear regression. In the field of logistic regression, the existing literature also discusses differential privacy. Chaudhuri and Monteleoni [14] discussed the trade-off between privacy and learnability in designing algorithms for learning from private databases. They proposed a privacy-preserving logistic regression algorithm that resolves the sensitivity problem in regularized logistic regression: one variant adds noise to the learned classifier in proportion to its sensitivity, while a second, based on perturbing the objective, is independent of the function's sensitivity and can be used for a class of convex loss functions. Khanna et al. [15] introduced a differentially private method for sparse logistic regression that maintains hard-zero coefficients, first training a non-private LASSO logistic regression model to determine the appropriate number of non-zero coefficients for the final model. Xu et al. [16] combined the functional mechanism with decision-boundary fairness to develop a differentially private and fair logistic regression model that ensures both privacy preservation and fairness while maintaining good utility. Fan et al. [17] proposed a privacy-preserving logistic regression algorithm (PPLRA) that uses homomorphic encryption to prevent data privacy leakage, shifting the majority of computational tasks to the cloud to enhance efficiency while safeguarding data privacy. Ji et al. [18] modified the update steps of the Newton–Raphson method and presented a differentially private distributed logistic regression model based on both public and private data; leveraging a public dataset improves practicality while the private dataset remains protected under stringent privacy guarantees.
Simultaneously, many practical applications face a common challenge: data in the target domain (the domain of the target task) are scarce or expensive to obtain, while ample labeled data may be available in a source domain (often a related but not identical task domain). Transfer learning has emerged to overcome this data scarcity. Its core idea is to transfer knowledge learned from the source domain to enhance learning performance in the target domain.
The concept of transfer learning originated in the field of machine learning. In 1993, Pratt [19] introduced a neural network-based transfer learning method that uses weights from a network trained on related source tasks to expedite learning on the target problem, exploring how to transfer knowledge across tasks to improve learning effectiveness. Subsequently, in 1995, Thrun [20] first proposed the concept of "knowledge transfer" and investigated methods for sharing knowledge between different tasks; this work is considered one of the pioneering contributions to transfer learning, laying the foundation for subsequent research. Building upon these papers, many researchers have conducted in-depth studies. Currently, however, transfer learning is mostly applied to classification, including tasks such as image classification [21], text classification [22], and time series classification [23]. In the regression analysis domain, there is also related literature. Li et al. [24] proposed a two-step method, called Trans-Lasso, for estimation and prediction with transfer learning in high-dimensional linear regression models, including a detection algorithm that consistently identifies transferable but unknown source datasets and a contrastive regularized estimator anchored to the target task. They demonstrated the robustness of Trans-Lasso against information-free auxiliary samples and its efficiency in knowledge transfer even when the set of useful auxiliary samples is unknown, marking an important step for transfer learning in regression. Bastani [25] proposed a two-step procedure for transfer learning in high-dimensional linear regression using a single source dataset, without needing to detect transferable source datasets among candidates. Tian and Feng [26] extended this method to generalized linear models, introducing an algorithm to construct confidence intervals for each coefficient component, with accompanying theoretical results. Yang et al. [27] combined source and target datasets using a two-layer linear neural network, exactly characterizing the asymptotic limits of the transfer learning prediction risk in high-dimensional linear models. Zhou et al. [28] proposed doubly robust transfer learning to address label scarcity and covariate shift in the target task. Additionally, Lin and Reimherr [29] studied transfer learning for functional linear regression models and established the optimal convergence rate of the excess risk. Takada and Fujisawa [30] proposed a method that transfers knowledge from the source domain to the target domain through high-dimensional $\ell_1$ regularization: in addition to the ordinary $\ell_1$ penalty, it incorporates an $\ell_1$ penalty on the difference between the source and target parameters. This approach induces sparsity both in the estimate itself and in its change from the source estimate, enjoys tight estimation error bounds in stationary environments, and leaves the estimate invariant to the source estimate under small residuals; even when the source estimate is unreliable due to non-stationarity, the method still yields consistent estimates.
While differential privacy and transfer learning have each demonstrated effectiveness in real-world applications, combining the two paradigms and applying them to logistic regression models remains a challenging and compelling area of research. Combining transfer learning with differential privacy addresses the fundamental problem of leveraging knowledge from one domain to enhance learning in another while ensuring the privacy of sensitive information within the model.
This paper proposes a method that combines differential privacy with transfer learning and applies it to logistic regression models. Our approach both protects individual privacy and facilitates knowledge transfer between the source and target domains. Specifically, (1) we protect individual privacy by introducing differential privacy noise into the input data, and (2) we design a transfer learning strategy that achieves knowledge transfer by sharing feature representations between the source and target domains. Our experimental results demonstrate that the method achieves good performance in the target domain while preserving individual privacy, confirming its effectiveness and feasibility in practical applications.
Figure 1 provides a general overview of the proposed algorithm’s conceptual framework in this paper.
The structure of this paper is as follows. In Section 2, we present a setup for transfer learning using logistic regression with a focus on differential privacy, and we propose a transferable source domain detection algorithm based on cross-validation. In Section 3, we conduct numerical simulations and an empirical analysis. Finally, in Section 4, we summarize our findings and suggest future research directions.
2. Method
2.1. Functional Mechanism Differential Privacy Method Based on the Logistic Regression Model
We first consider the functional mechanism (FM) applied to the logistic regression model; the FM is an extension of the Laplace mechanism in differential privacy. This privacy-preserving method does not inject noise directly into the regression results; instead, it ensures privacy by perturbing the optimization objective of the regression analysis.
The Laplace mechanism is a randomization technique in differential privacy that adds noise to query results using samples from the Laplace distribution. The probability density function of the Laplace distribution is $p(x \mid b) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right)$, where $b$ is the scale parameter. The Laplace distribution, being centered around zero with heavy tails, is suitable for introducing random noise. The fundamental idea of the Laplace mechanism is to maintain the usability of data analysis results while introducing moderate noise, so that changes in any individual's data become difficult to trace; the mechanism thus balances privacy protection and data utility. Its mathematical representation is $\mathcal{M}(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\epsilon}\right)$, where $f(D)$ is the query result, $\Delta f$ is the sensitivity of the query, and $\epsilon$ is the privacy parameter.
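To make the mechanism concrete, the following minimal Python sketch (the language used for our experiments) applies the Laplace mechanism to a counting query; the dataset and query here are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng=None):
    """Perturb a query result with Laplace noise of scale sensitivity / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    return query_result + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query has sensitivity 1: adding or removing one tuple
# changes the count by at most 1.
data = np.array([1, 0, 1, 1, 0, 1])  # hypothetical binary attribute
private_count = laplace_mechanism(data.sum(), sensitivity=1.0, epsilon=0.5)
```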
Let $D$ be a dataset containing $n$ tuples $t_1, t_2, \dots, t_n$ with attributes $X_1, X_2, \dots, X_d, Y$. For each tuple $t_i = (x_{i1}, x_{i2}, \dots, x_{id}, y_i)$, we assume that $\|x_i\|_2 \le 1$, which can be ensured by normalization. Our objective is to construct a regression model from $D$ that allows us to predict the value of any tuple on $Y$ based on the values of $X_1, \dots, X_d$. In other words, our goal is to obtain a function parameterized by a vector $w$ that takes $x_i$ as input and outputs a prediction of $y_i$ as accurately as possible.
For logistic regression, assuming that the attribute $Y$ in $D$ takes values in the Boolean domain $\{0, 1\}$, logistic regression on $D$ returns a predictive function with probability $P\left(y_i = 1 \mid x_i\right) = \frac{\exp\left(x_i^\top w\right)}{1 + \exp\left(x_i^\top w\right)}$ for predicting $y_i$, where $w$ is a vector of $d$ real numbers. This can be achieved by minimizing the cost function
$$f_D(w) = \sum_{i=1}^{n} \left[ \log\left(1 + \exp\left(x_i^\top w\right)\right) - y_i\, x_i^\top w \right].$$
To ensure privacy, we require that the regression analysis be performed using an algorithm that satisfies $\epsilon$-differential privacy. A randomized algorithm $M$ satisfies $\epsilon$-differential privacy if and only if, for any output $O$ of $M$ and any two neighboring databases $D_1$ and $D_2$ (differing in a single tuple), we have
$$\Pr\left[M(D_1) = O\right] \le e^{\epsilon} \cdot \Pr\left[M(D_2) = O\right].$$
If $M$ satisfies $\epsilon$-differential privacy, then the probability distribution of its output remains almost the same for any two input databases that differ in only one tuple.
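As a quick numerical illustration of this definition (a sketch under the counting-query assumption above, not part of the proposed method), one can verify that the Laplace mechanism's output densities on two neighboring counts never differ by more than a factor of $e^{\epsilon}$:

```python
import numpy as np

epsilon, sensitivity = 0.5, 1.0
scale = sensitivity / epsilon

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution centered at mu with scale b."""
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Neighboring databases whose true counts differ by exactly the sensitivity.
outputs = np.linspace(0.0, 20.0, 201)
ratio = laplace_pdf(outputs, 10.0, scale) / laplace_pdf(outputs, 11.0, scale)
assert np.all(ratio <= np.exp(epsilon) + 1e-12)  # the epsilon-DP density bound holds
```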
The FM does not inject noise into $f_D(w)$ directly. Instead, it perturbs the objective function $f_D(w)$ and then releases the model parameters $\bar{w}$ that minimize the perturbed objective function $\bar{f}_D(w)$. Here, we utilize a polynomial representation of $f_D(w)$. As $w$ is a vector containing values $w_1, \dots, w_d$, let $\phi(w)$ denote a product of $w_1, \dots, w_d$, defined as $\phi(w) = w_1^{c_1} w_2^{c_2} \cdots w_d^{c_d}$, where $c_1, \dots, c_d \in \mathbb{N}$. Let $\Phi_j$ represent the set of all such products of degree $j$, i.e., $\Phi_j = \left\{ w_1^{c_1} \cdots w_d^{c_d} \;\middle|\; \sum_{l=1}^{d} c_l = j \right\}$. According to the Stone–Weierstrass theorem, any continuous differentiable function can always be written as a (potentially infinite) polynomial in $w_1, \dots, w_d$; i.e., for some $J \in [0, \infty]$ we have $f(t_i, w) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi t_i}\, \phi(w)$, where $\lambda_{\phi t_i} \in \mathbb{R}$ represents the coefficient of $\phi(w)$ in the polynomial.
Let $D$ and $D'$ be any two neighboring databases, and let $f_D(w)$ and $f_{D'}(w)$ represent the objective functions of the regression analysis on $D$ and $D'$, respectively, with polynomial representations
$$f_D(w) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i \in D} \lambda_{\phi t_i}\, \phi(w), \qquad f_{D'}(w) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i' \in D'} \lambda_{\phi t_i'}\, \phi(w).$$
Thus, the sensitivity of the coefficient representation is bounded by
$$\Delta = 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left| \lambda_{\phi t} \right|.$$
For each $\phi \in \Phi_j$ with $j = 1, \dots, J$, let
$$\bar{\lambda}_{\phi} = \sum_{t_i \in D} \lambda_{\phi t_i} + \mathrm{Lap}\!\left(\frac{\Delta}{\epsilon}\right).$$
This yields the perturbed objective function $\bar{f}_D(w) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \bar{\lambda}_{\phi}\, \phi(w)$, and the minimizing solution $\bar{w}$ can be computed from it.
Because FM-based differentially private methods require the polynomial form of the objective function to contain only terms of bounded degree, and the logistic regression objective does not meet this condition, Zhang et al. [9] proposed an approach based on Taylor expansion to derive an approximate polynomial form of the objective function, demonstrating its effectiveness in achieving differential privacy. Therefore, in this paper, we adopt this validated approach. Expanding the logistic loss to second order around zero gives the approximate objective
$$\hat{f}_D(w) = \sum_{i=1}^{n} \left[ \log 2 + \left(\tfrac{1}{2} - y_i\right) x_i^\top w + \tfrac{1}{8} \left(x_i^\top w\right)^2 \right].$$
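The pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than our full implementation: it assumes $\|x_i\|_2 \le 1$ and $y_i \in \{0,1\}$, uses the coefficient sensitivity bound $\Delta = d^2/4 + d$ implied by the degree-2 surrogate in [9], and projects the noisy quadratic form onto the positive semidefinite cone, which is one practical way to keep the perturbed objective bounded below.

```python
import numpy as np

def fm_logistic_regression(X, y, epsilon, rng=None):
    """Differentially private logistic regression via the functional mechanism.

    Perturbs the coefficients of the degree-2 Taylor surrogate
        f(x_i, w) ~= log 2 + (1/2 - y_i) x_i^T w + (1/8) (x_i^T w)^2
    and returns the minimizer of the perturbed objective.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    lam1 = ((0.5 - y)[:, None] * X).sum(axis=0)  # degree-1 coefficients
    lam2 = X.T @ X / 8.0                         # degree-2 coefficients
    delta = d**2 / 4.0 + d                       # assumed sensitivity bound from [9]
    lam1 = lam1 + rng.laplace(scale=delta / epsilon, size=lam1.shape)
    lam2 = lam2 + rng.laplace(scale=delta / epsilon, size=lam2.shape)
    lam2 = (lam2 + lam2.T) / 2.0                 # re-symmetrize after noising
    # Project onto the PSD cone so the quadratic surrogate stays convex.
    eigval, eigvec = np.linalg.eigh(lam2)
    lam2 = eigvec @ np.diag(np.clip(eigval, 1e-6, None)) @ eigvec.T
    # Closed-form minimizer of lam1^T w + w^T lam2 w.
    return -0.5 * np.linalg.solve(lam2, lam1)
```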
2.2. Regression Transfer Learning Based on Differential Privacy
In this paper, we address multi-source domain transfer learning. Consider a target dataset $(X^{(0)}, y^{(0)})$ and $K$ source domain datasets $\{(X^{(k)}, y^{(k)})\}_{k=1}^{K}$, where $X^{(k)} \in \mathbb{R}^{n_k \times p}$ and $y^{(k)} \in \{0, 1\}^{n_k}$. Here, $x_i^{(k)}$ and $y_i^{(k)}$ represent the $i$-th row of $X^{(k)}$ and the $i$-th element of $y^{(k)}$, respectively. The objective is to transfer useful information from the source data to enhance the model's performance on the target data. We assume that the relationship between the independent and dependent variables in both the target and source data follows the logistic model
$$P\left(y_i^{(k)} = 1 \mid x_i^{(k)}\right) = \frac{\exp\left(x_i^{(k)\top} w^{(k)}\right)}{1 + \exp\left(x_i^{(k)\top} w^{(k)}\right)}, \qquad k = 0, 1, \dots, K.$$
As $k = 0, 1, \dots, K$ varies, distinct coefficient vectors $w^{(k)}$ arise. We denote the target parameter by $\beta = w^{(0)}$ and assume that it is sparse in the zero norm: among the $p$ variables, only $s$ contribute to predicting the response variable, where $s \ll p$. The similarity between the coefficients $w^{(k)}$ of the $k$-th source domain and $\beta$ determines the usefulness of the $k$-th source domain in predicting the target domain: the $k$-th source domain is more helpful when its coefficients $w^{(k)}$ are closer to $\beta$. We measure the difference between the $k$-th source domain and the target domain as $\delta^{(k)} = \beta - w^{(k)}$. Consequently, we can derive the information set $\mathcal{A}_h = \left\{ k : \left\|\delta^{(k)}\right\|_1 \le h \right\}$. In terms of the $\ell_1$ norm, if $\|\delta^{(k)}\|_1 \le h$, we refer to the $k$-th source as $h$-transferable; if $\|\delta^{(k)}\|_1 > h$, we refer to it as $h$-non-transferable. It is evident that a smaller $h$ implies greater benefit from these source domains in the context of transfer learning.
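In code, checking $h$-transferability is simply an $\ell_1$-ball membership test; the sketch below assumes the coefficient vectors are available as NumPy arrays.

```python
import numpy as np

def transferable_set(beta, source_coefs, h):
    """Return the indices k with ||beta - w^(k)||_1 <= h, i.e., the set A_h."""
    return [k for k, w_k in enumerate(source_coefs, start=1)
            if np.abs(beta - w_k).sum() <= h]
```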
In differential privacy-based logistic regression transfer learning, the algorithm for the case where $\mathcal{A}_h$ is already known is referred to as the Oracle algorithm. Our proposed algorithm follows the principles used by Bastani [25], Li et al. [24], and Zhang et al. [9], among others, and we call it the differential privacy transfer learning algorithm. The main idea is to transfer information from the transferable sources to obtain a rough estimator in the first step, and then to use the target data to correct its bias in the second step. To ensure privacy protection, Laplace noise is added to the data in both steps.
In the first step of transfer learning, parameter estimation is required:
$$\hat{w} = \arg\min_{w} \left\{ \frac{1}{n_0 + n_{\mathcal{A}_h}} \sum_{k \in \{0\} \cup \mathcal{A}_h} \sum_{i=1}^{n_k} \left[ \log\left(1 + \exp\left(x_i^{(k)\top} w\right)\right) - y_i^{(k)} x_i^{(k)\top} w \right] + \lambda_w \|w\|_1 \right\},$$
where $n_{\mathcal{A}_h} = \sum_{k \in \mathcal{A}_h} n_k$. We simplify the loss using the quadratic approximation of Section 2.1, adding Laplace noise to its degree-1 and degree-2 coefficients to obtain their privacy-protected counterparts. As the value of the objective depends solely on these coefficients, the expression is modified to
$$\hat{w} = \arg\min_{w} \left\{ \bar{f}_{\mathcal{A}_h}(w) + \lambda_w \|w\|_1 \right\},$$
where $\bar{f}_{\mathcal{A}_h}(w)$ is the perturbed pooled objective. We thus perform differential privacy and feature selection at the same time; the value of $\lambda_w$ is obtained through cross-validation.
Likewise, in the second step of transfer learning, parameter correction is required:
$$\hat{\delta} = \arg\min_{\delta} \left\{ \frac{1}{n_0} \sum_{i=1}^{n_0} \left[ \log\left(1 + \exp\left(x_i^{(0)\top} (\hat{w} + \delta)\right)\right) - y_i^{(0)} x_i^{(0)\top} (\hat{w} + \delta) \right] + \lambda_\delta \|\delta\|_1 \right\}.$$
We again simplify the loss via the quadratic approximation, adding Laplace noise to its degree-1 and degree-2 coefficients to obtain their privacy-protected counterparts. As the value of the objective depends solely on these coefficients, the expression is modified to
$$\hat{\delta} = \arg\min_{\delta} \left\{ \bar{f}_{0}(\hat{w} + \delta) + \lambda_\delta \|\delta\|_1 \right\},$$
where $\bar{f}_{0}$ is the perturbed target objective, and the final estimator is $\hat{\beta} = \hat{w} + \hat{\delta}$. We simultaneously conduct differential privacy and feature selection, where $\lambda_\delta$ is obtained through cross-validation.
The aforementioned approach is denoted as the Oracle trans DPLR algorithm.
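A condensed sketch of the two-step procedure is given below. It reuses the coefficient-noising idea of Section 2.1 through the hypothetical helper `noisy_surrogate`, and handles the $\ell_1$ penalties with cvxpy, the solver library used in our experiments; it is an illustration under these assumptions, not the exact implementation.

```python
import numpy as np
import cvxpy as cp

def noisy_surrogate(X, y, epsilon, rng):
    """Privacy-protected coefficients of the degree-2 surrogate (see Section 2.1)."""
    d = X.shape[1]
    delta = d**2 / 4.0 + d  # assumed sensitivity bound
    lam1 = ((0.5 - y)[:, None] * X).sum(axis=0) + rng.laplace(scale=delta / epsilon, size=d)
    lam2 = X.T @ X / 8.0 + rng.laplace(scale=delta / epsilon, size=(d, d))
    lam2 = (lam2 + lam2.T) / 2.0
    eigval, eigvec = np.linalg.eigh(lam2)  # PSD projection keeps the problem convex
    return lam1, eigvec @ np.diag(np.clip(eigval, 1e-6, None)) @ eigvec.T

def oracle_trans_dplr(target, sources, epsilon, lam_w, lam_delta, seed=0):
    """Two-step differentially private transfer learning (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    X0, y0 = target
    d = X0.shape[1]
    # Step 1: rough estimate from the pooled target and transferable source data.
    X_pool = np.vstack([X0] + [X for X, _ in sources])
    y_pool = np.concatenate([y0] + [y for _, y in sources])
    l1, L2 = noisy_surrogate(X_pool, y_pool, epsilon, rng)
    w = cp.Variable(d)
    cp.Problem(cp.Minimize((l1 @ w + cp.quad_form(w, cp.psd_wrap(L2))) / len(y_pool)
                           + lam_w * cp.norm1(w))).solve()
    w_hat = w.value
    # Step 2: bias correction on the target data only.
    l1, L2 = noisy_surrogate(X0, y0, epsilon, rng)
    delta = cp.Variable(d)
    beta = w_hat + delta
    cp.Problem(cp.Minimize((l1 @ beta + cp.quad_form(beta, cp.psd_wrap(L2))) / len(y0)
                           + lam_delta * cp.norm1(delta))).solve()
    return w_hat + delta.value
```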
2.3. Transferable Source Detection
In the preceding discussion, we assumed that the transferable set $\mathcal{A}_h$ was known. In practical applications, however, this assumption can be difficult to satisfy. Simply pooling all sources with the target data might not improve model performance; instead, it can lead to negative transfer and a decline in learning performance on the target task. To avoid negative transfer, this paper uses a simple, data-driven method in the spirit of Tian and Feng [26] to detect transferable sources for information transfer. First, the target data are split into three folds. The average loss $\hat{L}^{(0)}$ of a fit using only target data is computed through cross-validation. Next, Algorithm 1 is applied to each single source domain combined with the target training folds, and estimated coefficients are obtained; the average loss $\hat{L}^{(k)}$ is then computed on the held-out folds. Finally, the difference between these two losses is compared against a predefined threshold: if the difference falls below the threshold, the source domain is added to the set $\hat{\mathcal{A}}$.
Algorithm 1: Oracle Trans DPLR |
Input: target data $(X^{(0)}, y^{(0)})$, all $h$-transferable source data $\{(X^{(k)}, y^{(k)})\}_{k \in \mathcal{A}_h}$, penalty parameters $\lambda_w$ and $\lambda_\delta$, Laplace noise scale $\Delta/\epsilon$. Output: the estimated coefficient vector $\hat{\beta}$. Step 1: Compute the privacy-protected coefficients of the approximate pooled objective over $\{0\} \cup \mathcal{A}_h$. Step 2: Solve for $\hat{w}$ from the perturbed pooled objective with the $\lambda_w$ penalty. Step 3: Let $\hat{\delta}$ be the minimizer of the perturbed target objective with the $\lambda_\delta$ penalty. Step 4: Output $\hat{\beta} = \hat{w} + \hat{\delta}$. |
To facilitate notation, and assuming $n_0$ is divisible by 3, the average loss over the $r = 1, 2, 3$ folds of the target dataset for any estimated parameter $\hat{\beta}$ can be formally defined as
$$\hat{L}^{(0)}(\hat{\beta}) = \frac{1}{3} \sum_{r=1}^{3} \frac{3}{n_0} \sum_{i \in I_r} \left[ \log\left(1 + \exp\left(x_i^{(0)\top} \hat{\beta}\right)\right) - y_i^{(0)} x_i^{(0)\top} \hat{\beta} \right],$$
where $I_r$ denotes the index set of the $r$-th fold.
Algorithm 2 illustrates the specific details.
Algorithm 2: Transferable Source Detection |
Input: target data $(X^{(0)}, y^{(0)})$, all source data $\{(X^{(k)}, y^{(k)})\}_{k=1}^{K}$, a threshold $\epsilon_0$. Output: the detected transferable set $\hat{\mathcal{A}}$. Step 1: Split the target data into three folds and compute the cross-validated loss $\hat{L}^{(0)}$ of the target-only fit. Step 2: For each $k = 1, \dots, K$, run Algorithm 1 on the $k$-th source with the target training folds and compute the cross-validated loss $\hat{L}^{(k)}$. Step 3: Set $\hat{\mathcal{A}} = \{k : \hat{L}^{(k)} - \hat{L}^{(0)} \le \epsilon_0\}$. Step 4: Output $\hat{\mathcal{A}}$. |
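A sketch of the detection logic follows; `logistic_loss` implements the average loss defined above, `oracle_trans_dplr` is the sketch from Section 2.2, and the fixed `threshold` is a simplification of the data-driven threshold used in our experiments.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Average logistic loss: log(1 + exp(x^T beta)) - y * x^T beta."""
    u = X @ beta
    return float(np.mean(np.logaddexp(0.0, u) - y * u))

def detect_transferable_sources(target, sources, epsilon, lam, threshold, seed=0):
    """Sketch of Algorithm 2: keep sources whose CV loss stays close to the target-only loss."""
    X0, y0 = target
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y0)), 3)

    def cv_loss(fit_sources):
        losses = []
        for r in range(3):
            train = np.concatenate([folds[i] for i in range(3) if i != r])
            beta = oracle_trans_dplr((X0[train], y0[train]), fit_sources,
                                     epsilon, lam, lam, seed=seed)
            losses.append(logistic_loss(beta, X0[folds[r]], y0[folds[r]]))
        return np.mean(losses)

    loss_target_only = cv_loss([])
    return [k for k, src in enumerate(sources, start=1)
            if cv_loss([src]) - loss_target_only <= threshold]
```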
3. Simulation Study
In this section, we validate the effectiveness of the proposed algorithm through multiple simulation experiments and a real-data study. Some parameters and formula configurations in the data-generation phase were inspired by Tian and Feng [26]. In the simulation part, we first compare the fitting results under different values of $h$ (the maximum deviation between the source and target coefficients in $\mathcal{A}_h$) and $|\mathcal{A}_h|$ (the cardinality of $\mathcal{A}_h$) for naive DPLR, which uses only target data, and the Oracle trans DPLR proposed in Algorithm 1, under three error settings. Subsequently, after introducing $h$-non-transferable sources, we compare naive DPLR using only target data, the Oracle trans DPLR of Algorithm 1, all_trans DPLR, which applies transfer learning to all source and target data, and trans DPLR, which incorporates transferable source detection. In the empirical study, we explore naive DPLR, trans DPLR, and all_trans DPLR. All experiments were implemented in Python 3.12, with optimization problems solved using the cvxpy library.
3.1. Known Transferable Source Domain
This article considers the following sparse logistic regression setting, where the target and source datasets are independently generated from Equation (19):
$$\tilde{y}_i^{(k)} = x_i^{(k)\top} w^{(k)} + \epsilon_i^{(k)}, \qquad k = 0, 1, \dots, K. \tag{19}$$
Here, the covariate vectors $x_i^{(k)}$ are drawn independently from a common distribution, and we transform the target variable $Y$ into a binary attribute by mapping values above a predefined threshold to 1 and values below or equal to the threshold to 0. Therefore, when employing the logistic regression model for classification, if the predicted probability exceeds 0.5, we predict $Y$ as 1; otherwise, it is predicted as 0. The sample sizes $n_k$, the dimension $p$, and the sparsity level $s$ are held fixed across settings, and the coefficient vector $\beta$ for the target domain is sparse with $s$ nonzero components. For simplicity, let $R^{(k)}$ be a $p$-dimensional vector with each component generated as 1 or −1 with probability 0.5. For the transferable source datasets, we consider $k \in \mathcal{A}_h$, and the coefficients are given by $w^{(k)} = \beta + (h/p) R^{(k)}$, ensuring $\|w^{(k)} - \beta\|_1 = h$. Three error distributions are considered:
Standard normal distribution: $\epsilon_i^{(k)} \sim N(0, 1)$;
$t$ distribution: $\epsilon_i^{(k)}$ follows a Student's $t$ distribution, giving heavier tails than the normal;
Mixture of normal distributions: $\epsilon_i^{(k)}$ is drawn from a two-component mixture of normal distributions.
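For concreteness, the following sketch generates one target dataset and $K$ transferable source datasets under the setup above; the sample sizes, dimension, sparsity, nonzero coefficient value, degrees of freedom of the $t$ error, and mixture components are placeholders rather than the exact values used in our experiments.

```python
import numpy as np

def make_dataset(n, p, coef, rng, noise="normal", threshold=0.0):
    """Draw (X, y): y* = x^T coef + error, then binarize y* at the threshold."""
    X = rng.standard_normal((n, p))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # keep ||x_i||_2 <= 1
    if noise == "normal":
        err = rng.standard_normal(n)
    elif noise == "t":
        err = rng.standard_t(df=4, size=n)  # heavier tails; df is a placeholder
    else:  # two-component normal mixture (placeholder components)
        comp = rng.random(n) < 0.5
        err = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(0.0, 2.0, n))
    y_star = X @ coef + err
    return X, (y_star > threshold).astype(float)

rng = np.random.default_rng(0)
n0, p, s, h, K = 100, 50, 5, 5, 3  # placeholder sizes
beta = np.concatenate([0.5 * np.ones(s), np.zeros(p - s)])  # placeholder coefficients
X0, y0 = make_dataset(n0, p, beta, rng)
sources = []
for _ in range(K):
    R = rng.choice([-1.0, 1.0], size=p)  # Rademacher perturbation
    sources.append(make_dataset(2 * n0, p, beta + (h / p) * R, rng))  # ||w_k - beta||_1 = h
```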
To balance classification accuracy and privacy protection, we compared the classification accuracy of the logistic regression models before and after applying transfer learning under two different privacy budgets $\epsilon$ (a larger and a smaller one), with the transfer setting held fixed. The results are shown in Table 1.
As can be seen from Table 1, as the amount of added noise decreases (i.e., as $\epsilon$ increases), the accuracy of the model both before and after transfer learning gradually improves. Under the smaller privacy budget, the accuracy before and after transfer learning decreases relative to the larger budget, but the decrease remains within an acceptable range, and the model's performance can still meet the needs of practical applications. Considering the need for privacy protection, we adopt the smaller privacy budget in the subsequent experiments.
For each scenario, the experiment was repeated 100 times, and the average $\ell_2$ error of the estimated parameter $\hat{\beta}$ was used as the model evaluation metric. Each penalty parameter $\lambda$ was chosen by cross-validation. The average relative estimation error of the Oracle algorithm under the three error settings is illustrated in Figure 2.
Here, we define the relative error of the Oracle algorithm as
$$\text{relative error} = \frac{\left\|\hat{\beta}_{\text{Oracle}} - \beta\right\|_2 - \left\|\hat{\beta}_{\text{naive}} - \beta\right\|_2}{\left\|\hat{\beta}_{\text{naive}} - \beta\right\|_2}.$$
Therefore, negative values indicate that the Oracle trans DPLR algorithm outperformed the naive DPLR algorithm, with smaller values (represented by lighter colors in Figure 2) indicating better performance.
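This metric is straightforward to compute once both estimators are available (a sketch matching the reconstruction above):

```python
import numpy as np

def relative_error(beta_true, beta_oracle, beta_naive):
    """Negative values mean Oracle trans DPLR beat naive DPLR."""
    err_oracle = np.linalg.norm(beta_oracle - beta_true)
    err_naive = np.linalg.norm(beta_naive - beta_true)
    return (err_oracle - err_naive) / err_naive
```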
A detailed analysis of Figure 2 shows that the Oracle trans DPLR algorithm holds a significant performance advantage over the naive DPLR algorithm across a wide range of combinations of $h$ and transferable-set sizes $|\mathcal{A}_h|$, underscoring its efficiency in complex data scenarios. In particular, as the number of transferable data sources increased, the performance of Oracle trans DPLR improved markedly. At the same time, as the parameter $h$ increased, the problem became harder, and the estimation error of Oracle trans DPLR rose accordingly. Notably, the algorithm maintained stable prediction performance across the different error distributions, which supports the robustness and generalization capability of its design.
3.2. Unknown Transferable Source Domain
First, we set the total number of source domains to 20, i.e., $K = 20$. We then construct transferable and non-transferable source domains for a range of values of $|\mathcal{A}_h|$. For the transferable source domains, we keep the logistic regression parameters as defined in the previous section. For each non-transferable source domain, we randomly select a subset $H^{(k)}$ of size $s$ from $\{1, \dots, p\}$ and set the $j$-th component of the logistic regression parameter $w^{(k)}$ so that it deviates substantially from $\beta_j$ on $H^{(k)}$, making $\|w^{(k)} - \beta\|_1 > h$.
All the other parameters remain the same as in the previous section. This paper compares the estimated values of the following algorithms:
Naive DPLR: Conducting differentially private logistic regression using only target domain data.
Trans DPLR: Performing differentially private transfer learning using Algorithm 1 and Algorithm 2.
All_trans DPLR: Conducting differentially private transfer learning using all available source domains.
Oracle trans DPLR: Conducting differentially private transfer learning based on Algorithm 1, given known transferable source domains.
Naive LR: Conducting logistic regression using only target domain data.
Trans LR: Performing non-private transfer learning using the transfer learning algorithm together with the transferable source detection algorithm.
For the six algorithms, we repeated the experiment 100 times and calculated the average error. Simultaneously, we treated the transferable source domain detection as a binary classification problem, and we computed model evaluation metrics for various scenarios.
Based on Figure 3, it is evident that when the differential privacy mechanism was not introduced into the transfer learning framework, the prediction errors of the two non-private algorithms were significantly lower than those of their differential privacy-enhanced versions. This stark difference highlights the accuracy loss caused by data perturbation (i.e., adding noise) as a means of privacy protection. However, this trade-off is a necessary sacrifice to ensure privacy, aligning closely with the focus of our study. Within the differential privacy framework, the Oracle trans DPLR algorithm consistently demonstrated optimal performance: it not only fully leveraged all transferable source data but also excluded potential noise sources, i.e., non-transferable source data, thereby ensuring efficient and accurate knowledge transfer. The trans DPLR algorithm also performed excellently, almost perfectly replicating the performance of Oracle trans DPLR; this reflects the effectiveness and accuracy of our proposed transferable source detection algorithm. Further analysis revealed that when the transferable source set $|\mathcal{A}_h|$ was small, the all_trans DPLR algorithm struggled, even performing worse than naive DPLR. This phenomenon demonstrates the existence and significant negative impact of negative transfer, further validating the importance and effectiveness of our transferable source detection strategy. As $|\mathcal{A}_h|$ increased, the performance of all_trans DPLR gradually improved, eventually reaching a level comparable to Oracle trans DPLR and trans DPLR at $|\mathcal{A}_h| = K = 20$. Additionally, as the problem complexity $h$ increased, the errors of all four differentially private algorithms showed a significant upward trend.
3.3. A Real-Data Study
Our study focuses on the second-hand car market, aiming to accurately predict used car prices through an in-depth analysis of a real dataset containing over 50,000 records and encompassing 31 initial variables. The data preprocessing steps included handling missing values, addressing outliers, performing feature engineering, and conducting feature selection, which expanded the original dataset’s feature variables to 70, significantly enhancing the data foundation for the model.
To optimize the efficiency of applying differential privacy techniques, we normalized the predictor variables and transformed the response variable into a binary form, mapping values exceeding a predefined threshold to 1 and other values to 0, thereby simplifying the complexity of the analysis and meeting specific needs. Descriptive statistical analysis of the data showed a class imbalance in the “bodyType” feature, with the number of luxury sedan samples being the highest (32,291) and the number of concrete mixer samples being the lowest (799). Consequently, the concrete mixer samples were designated as target domain data, while samples of other body types were considered as source domain data.
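The normalization and binarization steps can be sketched as follows; the column name `price` and the threshold are hypothetical stand-ins for the actual dataset fields.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, price_threshold: float):
    """Standardize predictors, scale rows to ||x_i||_2 <= 1, and binarize the response."""
    y = (df["price"] > price_threshold).astype(float)  # hypothetical column name
    X = df.drop(columns=["price"]).to_numpy(dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize each column
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    return X, y.to_numpy()
```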
To evaluate the performance of the transfer learning algorithms, 5-fold cross-validation was used in the experiments. After applying Algorithm 2 for source domain detection across all source domains, vehicles with body types 2, 3, 4, and 6 exhibited the least error; these body types were therefore selected as transferable source domains for subsequent experiments. The evaluation metric used was the misclassification rate of logistic regression, providing an objective measure of model performance (Figure 4).
We used the following algorithm for the subsequent experiment:
Naive DPLR: Conducting differentially private logistic regression using only target domain data.
Trans DPLR: Performing differentially private transfer learning using Algorithms 1 and 2.
All_trans DPLR: Conducting differentially private transfer learning using all available source domains.
Naive LR: Conducting logistic regression using only target domain data.
Trans LR: Performing transfer learning using a transfer learning algorithm and a transferable source detection algorithm.
The experimental results indicate that, although the introduction of differential privacy had some impact on the accuracy of transfer learning, it effectively safeguarded data privacy, and the accuracy loss remained within an acceptable range. Notably, the trans DPLR algorithm significantly outperformed the other algorithms in predictive capability for the target body type, highlighting its unique advantage in integrating diverse data sources. However, the overall performance of transfer learning using all source domains was suboptimal, primarily due to the negative effects of non-transferable sources. Nonetheless, integrating data from different vehicle types with the trans DPLR algorithm significantly enhanced predictive accuracy for the target body type, validating the effectiveness of our transfer learning algorithm in cross-domain data utilization. From a practical perspective, the findings of this study have positive implications for the second-hand car market. Accurate prediction of used car prices boosts market participants' confidence in transactions, fostering healthy and orderly market development; moreover, the application of differential privacy technology provides robust protection for user data, meeting international data protection standards such as the GDPR. Additionally, this study emphasizes the importance of identifying and eliminating negative transfer, further enhancing the practical effectiveness of the model by precisely selecting transferable source domains. In summary, our experimental results demonstrate the superiority of our transfer learning algorithm: rooted in a functional mechanism for differential privacy and utilizing transferable source domains, it not only mitigates privacy risks during the transfer process but also improves predictive capability within the target domain.