1. Introduction
With the rapid development of information technology, data have become an essential resource in modern society. However, the efficient utilization of data has also introduced serious security risks [1,2,3], particularly in fields such as healthcare, finance, and economics. Data security issues not only concern individual privacy but may also threaten the healthy development of industries and even social stability [4]. In healthcare scenarios, data such as medical images, health records, and diagnostic information are highly sensitive [5,6]. If such information is leaked, it can be misused for identity theft, insurance fraud, and other illegal activities [7]. Furthermore, medical data have significant research value, and unprotected data can be exploited without authorization, leading to intellectual property violations [8]. In banking and financial scenarios, leaked or stolen transaction records, account balances, and other financial data can cause economic losses for individuals and companies and may even trigger a credit crisis [9]. For example, stolen high-frequency trading data can enable market manipulation, destabilizing the entire financial system [10]. In the field of economic data protection, the operational data of enterprises and nations serve as a crucial basis for decision making [11]. If these data are leaked, competitors can exploit them, weakening market competitiveness; at the national level, leaked economic data can become a strategic tool for adversaries, threatening national economic security [12]. Therefore, effectively protecting data during sharing and analysis has become a key societal concern [5]. Data protection technologies not only ensure individual privacy and security but also support the healthy development of industries, further promoting the deep exploration and application of data value [13].
In the realm of data protection technologies, traditional solutions fall mainly into two categories: encryption-based methods and data obfuscation-based methods [14]. Fully Homomorphic Encryption (FHE) [15] is a theoretically complete encryption method that allows computations on encrypted data without decryption [16]. It ensures high data security through mathematical transformations and represents a significant breakthrough in protecting data privacy. However, the computational overhead of FHE is extremely high, limiting its practical applicability [17]. For instance, in real-time medical data analysis, the high latency of FHE can fail to meet clinical diagnosis time requirements. Differential privacy (DP) [18] is a widely applied privacy protection technique that introduces noise into the data to protect individual privacy [19]. A major issue with DP, however, is that the added noise inevitably degrades data utility, reducing the accuracy of downstream analysis [20]. For example, in financial forecasting, DP may obscure key details of stock price fluctuations, reducing the performance of prediction models. In summary, traditional encryption and obfuscation techniques exhibit a clear conflict between data security and system performance: encryption methods offer high security but poor performance, while obfuscation methods achieve better performance at the cost of some security. This conflict is particularly prominent in modern application scenarios, necessitating new techniques that balance both.
In recent years, with the rapid development of deep learning, deep learning-based models have gradually become important tools in data protection [21,22]. Deep learning has demonstrated powerful capabilities in feature extraction, pattern recognition, and complex task optimization, providing new ideas for data security protection [23,24,25]. Amanullah et al. [26] proposed a deep learning- and big data-based framework to enhance the security of IoT devices, particularly their adaptability and response speed under various network attacks, achieving an accuracy of 98.7%, a precision of 97.5%, and a recall of 96.2%. Chen et al. [27] proposed several deep learning-based network security protection methods aimed at enhancing the defense capabilities and response speeds of smart-city systems (e.g., traffic, energy, and healthcare) facing cyber-attacks. Their approach achieved an accuracy of 95.4%, a false positive rate of 2.3%, and an average response time of 0.45 s. However, deep learning model training requires large amounts of labeled data and may not perform efficiently on devices with limited computational resources.
Ferrag et al. [28] introduced a new IoT network security protection method combining federated learning and deep learning, enabling training across multiple IoT nodes without data centralization and thus protecting user privacy. The method achieved a detection accuracy of 97.3%, with training efficiency improved by 20% and communication overhead reduced by 15%. However, its performance may be suboptimal in high-latency network environments, and it still faces computational and communication burdens on resource-constrained devices. Kshirsagar et al. [29] proposed a deep learning-based multimedia security solution encompassing identity authentication, encryption, and information hiding to enhance multimedia data security. Experiments showed an authentication accuracy of 98.2%, an encryption speed of 200 MB/s, and an information-hiding robustness of 95.7%. However, the method demands substantial computational resources and may face performance bottlenecks when processing high-resolution data. Rahman et al. [30] studied the security threat posed by adversarial samples to deep learning models for COVID-19 and proposed protection methods to enhance model robustness. Their experiments showed that the original model achieved an accuracy of 95.6%, which dropped to 68.3% under adversarial attacks. This method also increases computational overhead and may not be practical on resource-limited devices. More broadly, deep learning models can learn the semantic features of data and abstract them into higher-level representations, significantly reducing the risk of direct leakage of the original data [31].
Despite the numerous tools that deep learning offers for data security, its application in privacy protection still faces challenges. For instance, deep learning models themselves may become a vector for privacy leakage (e.g., through model inversion attacks), and the high computational costs of both training and inference still limit their large-scale application in practical scenarios. In response to these challenges, this paper proposes a data obfuscation security framework based on probability density and information entropy. By combining probability distribution feature extraction with dynamic control of information complexity, the framework balances data security and system performance. Compared to existing methods, its innovations include the following aspects:
Probability density extraction module: This module is designed to extract probability distribution characteristics from raw data. Compared to traditional random noise methods, it can more accurately capture data characteristics, thereby reducing the performance loss in downstream tasks.
Information entropy fusion module: This module quantifies the data’s complexity using information entropy and dynamically adjusts the strength of data obfuscation by combining Shannon entropy and conditional entropy. It can effectively protect sensitive information while ensuring the preservation of core data features.
Multi-scenario validation: The framework’s generality and effectiveness are validated in two typical application scenarios, healthcare and finance. Experiments demonstrate that the proposed method outperforms existing approaches in terms of accuracy, throughput, and security.
Balance between performance and security: The proposed method significantly improves system efficiency while achieving efficient obfuscation, reducing the computational overhead of encryption and decryption processes and making it suitable for real-time data analysis scenarios.
In conclusion, the proposed method provides a new approach to data protection across multiple scenarios by balancing data security and performance.
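As a minimal illustration of the idea behind the information entropy fusion module, the sketch below scales a Gaussian perturbation by a histogram-based Shannon entropy estimate, so that more complex (higher-entropy) data tolerate stronger obfuscation. The function names, the histogram estimator, and the scaling rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def shannon_entropy(x, bins=32):
    """Estimate Shannon entropy (in bits) of a 1-D sample via a histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_modulated_obfuscation(x, base_sigma=0.1, bins=32, rng=None):
    """Scale Gaussian perturbation by normalized entropy: low-entropy
    (highly structured) data receive lighter noise, preserving core
    features; high-entropy data tolerate heavier noise."""
    rng = np.random.default_rng(rng)
    h = shannon_entropy(x, bins)
    h_max = np.log2(bins)  # upper bound for a histogram with `bins` cells
    sigma = base_sigma * (h / h_max)
    return x + rng.normal(0.0, sigma, size=x.shape)
```

In the actual framework, the probability density extraction module would replace the simple histogram, and conditional entropy would also enter the adjustment rule.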
4. Results and Discussion
4.1. Implementation Details and Configurations of Baseline Methods
FHE was implemented using the widely recognized HElib library, which supports arithmetic operations on encrypted data. The encryption parameters were set to ensure a balance between computational efficiency and security strength, with a key size of 2048 bits. For operations involving addition and multiplication, the ciphertext modulus was carefully chosen to avoid excessive noise accumulation. During the experiments, encrypted medical and financial data were processed directly without decryption, following the standard FHE workflow. For DP, we employed the Laplace mechanism to ensure ε-differential privacy, where ε was set to 0.5 for a moderate privacy–utility trade-off. Noise was added to each query result based on the sensitivity of the function being evaluated. For medical image classification, pixel intensity values were perturbed; for financial predictions, noise was applied to key features such as stock prices and trading volumes. The resulting data retained privacy guarantees but at the expense of reduced accuracy due to noise interference.
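The Laplace mechanism used for the DP baseline can be sketched as follows; `laplace_mechanism` is an illustrative helper, with ε = 0.5 matching the experimental setting, not the baseline's actual code.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Add Laplace(0, sensitivity / epsilon) noise to a query result,
    providing epsilon-differential privacy for that query."""
    rng = np.random.default_rng(rng)
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# Example: perturb pixel intensities with sensitivity 1 and epsilon 0.5,
# as in the medical image experiments described above.
pixels = np.array([120.0, 64.0, 255.0])
noisy_pixels = laplace_mechanism(pixels, sensitivity=1.0, epsilon=0.5, rng=0)
```

The larger the sensitivity or the smaller ε, the larger the noise scale, which is the source of the accuracy loss discussed in the results.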
Secure Multi-party Computation (SMC; referred to as MPC in the results) was implemented using a custom setup in a distributed environment. Data were partitioned into shares and distributed among three parties, each performing partial computations. The computations were securely aggregated to reconstruct the final results. The communication protocol followed the GMW protocol, ensuring privacy during the intermediate steps. Federated Learning (FL) was implemented with models trained locally on multiple nodes. Each node processed a subset of the data and shared gradient updates with a central server. The Federated Averaging (FedAvg) algorithm was employed to aggregate these updates into a global model. To ensure privacy, we applied differential privacy at the gradient level, introducing noise during transmission. The FL setup used five client nodes and a communication round limit of 50. All baseline methods were tested on the same hardware and software environment to ensure consistency. The experiments were conducted using an NVIDIA A100 GPU with 80 GB of memory, leveraging PyTorch 1.12 for implementation.
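The Federated Averaging step of the FL baseline can be sketched as below. The optional Gaussian noise stands in for the gradient-level differential privacy described above; all parameter choices are illustrative, not the baseline's actual configuration.

```python
import numpy as np

def federated_average(client_updates, client_sizes, noise_sigma=0.0, rng=None):
    """FedAvg: size-weighted mean of per-client parameter lists.
    `client_updates` is a list of clients, each a list of layer arrays.
    `noise_sigma > 0` adds Gaussian noise to the aggregate, loosely
    mimicking gradient-level DP (an illustrative simplification)."""
    rng = np.random.default_rng(rng)
    total = sum(client_sizes)
    averaged = []
    for layer in zip(*client_updates):  # iterate layer-by-layer across clients
        agg = sum(w * (n / total) for w, n in zip(layer, client_sizes))
        if noise_sigma > 0:
            agg = agg + rng.normal(0.0, noise_sigma, size=agg.shape)
        averaged.append(agg)
    return averaged
```

With five clients and a round limit of 50, as in the setup above, this aggregation would run once per communication round.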
4.2. Medical Image Classification Results
The objective of this experiment was to evaluate the performance of different data protection strategies in medical image classification tasks, focusing on four key metrics: precision, recall, accuracy, and FPS. This study aimed to validate whether the proposed data obfuscation method based on probability density and information entropy could balance privacy protection with task performance. Comparisons were drawn with traditional methods, such as FHE, DP, Multi-party Computation (MPC), and FL, to provide optimized solutions for medical image analysis tasks.
As shown in Table 3, the performance of different models varied significantly across the four evaluation metrics. FHE demonstrated strong privacy protection capabilities but suffered from high computational complexity, resulting in lower classification performance (precision: 0.82; recall: 0.79; accuracy: 0.80) and a low processing speed (FPS: 23). While DP improved processing efficiency (FPS: 25) by introducing random noise, the added noise compromised the model’s classification performance (precision: 0.84; recall: 0.81; accuracy: 0.83). MPC leveraged data partitioning and distributed computation to enhance processing efficiency (FPS: 30), achieving moderate performance improvements (precision: 0.86; recall: 0.83; accuracy: 0.85). FL, which avoids direct data sharing, achieved higher classification accuracy (precision: 0.88; recall: 0.86; accuracy: 0.87) through global model aggregation, alongside significantly better processing efficiency (FPS: 45). In contrast, the proposed method achieved the best overall performance (precision: 0.93; recall: 0.89; accuracy: 0.91; FPS: 57), indicating that the data obfuscation strategy based on probability density and information entropy strikes a superior balance between privacy protection and task performance. Theoretically, the superior results of the proposed method stem from its innovative design. FHE relies on complex mathematical transformations, ensuring strong security but incurring significant computational costs, reflected in its low FPS value. DP’s approach of introducing random noise provides privacy protection but mathematically diminishes the validity of the data, leading to decreased classification performance. MPC employs secret-sharing protocols to improve task performance to some extent but is limited by high communication overhead in complex scenarios. FL avoids privacy breaches through distributed data modeling, and its global model aggregation strategy significantly enhances classification performance.
However, FL may encounter inconsistencies in model aggregation under highly diverse data distributions. The proposed method models the global distribution characteristics of the data through the probability density extraction module and dynamically adjusts the obfuscation intensity via the information entropy fusion module. This ensures, mathematically, that the obfuscated data maintain global consistency and preserve the core features of the original data. The design avoids FHE’s high computational cost, mitigates DP’s noise-induced interference, and surpasses FL’s communication bottlenecks through a lightweight model architecture. Ultimately, its outstanding mathematical properties enable the proposed method to balance efficiency and accuracy in practical applications, providing new possibilities for medical image classification.
4.3. Financial Prediction Regression Results
This experiment aimed to evaluate the performance of different privacy protection strategies in financial prediction regression tasks, with a focus on analyzing the ability of each model to capture the dynamic characteristics in time-series data and their impact on privacy protection. Unlike medical image classification, time-series data emphasize the sequential and trend relationships between data points, posing higher demands on the balance between predictive accuracy and privacy protection. As shown in Table 4, the proposed method outperformed all other methods across all evaluation metrics (precision: 0.95; recall: 0.91; accuracy: 0.93; FPS: 54), demonstrating its strong adaptability and performance advantages in handling time-dependent data.
Theoretically, the characteristics of time-series data require models to effectively capture trends and changes over time. Traditional methods, such as DP and MPC, exhibit weaker capabilities in preserving such information due to noise interference or communication overhead, which reduce their ability to represent temporal patterns. While FL improves local modeling and global aggregation, it struggles to maintain the integrity of global time-series features in non-independent and identically distributed (Non-IID) data scenarios. The proposed method leverages the probability density extraction module to model the global distribution characteristics of time-series data, effectively capturing trend information. Combined with dynamic obfuscation adjustments through the information entropy fusion module, it ensures that the obfuscated data retain the ability to express critical temporal features. Furthermore, the performance loss term in the fusion loss function directly optimizes the prediction error for time-series data, enabling the model to balance the trade-off between privacy protection and temporal pattern preservation. These designs mathematically ensure the integrity of trend capturing in time-series data, providing a highly efficient and secure solution for financial prediction tasks.
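To make the role of the performance loss term concrete, here is a hypothetical sketch of a fusion loss combining a prediction-error term with a distribution-alignment term on the obfuscated data. The function name, the alignment term, and the weights `alpha`/`beta` are illustrative assumptions; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def fusion_loss(pred, target, original, obfuscated, alpha=1.0, beta=0.5):
    """Illustrative fusion loss: a task (performance) term plus a term
    penalizing drift between the obfuscated and original data's first
    two moments, so obfuscation preserves temporal statistics."""
    performance = np.mean((pred - target) ** 2)  # prediction error term
    alignment = (
        np.mean((obfuscated.mean(axis=0) - original.mean(axis=0)) ** 2)
        + np.mean((obfuscated.std(axis=0) - original.std(axis=0)) ** 2)
    )
    return alpha * performance + beta * alignment
```

Minimizing such a loss jointly pressures the model toward accurate predictions while keeping the obfuscated distribution close to the original, which is the trade-off the section above describes.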
4.4. Impact Analysis and Ablation Study on the Information Entropy Fusion Module
The ablation study on the information entropy fusion module aimed to validate its role in privacy protection and model performance, assessing its impact on the data obfuscation intensity under different entropy control strategies, as shown in Table 5. The experiment set up multiple comparison groups, including removing the information entropy fusion module (No Entropy Fusion), using a fixed obfuscation intensity (Fixed Obfuscation), applying random obfuscation (Random Obfuscation), and employing Shannon entropy as the entropy metric (Shannon Entropy), while comparing these against the full model (Baseline Full Model). This study primarily evaluated six key metrics: reconstruction attack success rate (RASR), Wasserstein distance, classification accuracy, F1 score, mean squared error (MSE), and computation delay. RASR reflects the security of the obfuscated data; the Wasserstein distance measures the distributional similarity between the obfuscated and original data; and accuracy, F1 score, and MSE assess the model’s performance under different obfuscation strategies. The results demonstrate that the full model achieved the best balance across all metrics, indicating that the information entropy fusion module effectively adjusts the data obfuscation intensity, enhancing data security while preserving task utility as much as possible.
From a mathematical perspective, these results primarily stem from the ability of the information entropy fusion module to dynamically adjust the obfuscation intensity based on data uncertainty. The superior performance of the baseline group indicates that global information entropy effectively modulates the level of obfuscation across different regions of the data, achieving an optimal trade-off between privacy protection and data utility. In contrast, removing the information entropy fusion module (No Entropy Fusion) increases the RASR to 24.7%, suggesting that obfuscation effectiveness declines, while the Wasserstein distance increases to 0.87, indicating significant deviation from the original data distribution, which, in turn, degrades classification accuracy. Fixed Obfuscation reduces the RASR but fails to adaptively adjust the obfuscation intensity, leading to insufficient protection of critical data regions and a drop in accuracy to 88.2%. Random Obfuscation performs the worst, exhibiting the highest RASR (27.1%), which suggests that arbitrary perturbations disrupt the data structure, significantly impairing task performance. The Shannon entropy approach yields suboptimal performance—while it improves over No Entropy Fusion, it still lags behind the full model, likely due to the limited generalization ability of a single entropy metric in complex data distributions. Therefore, the experimental results strongly validate the effectiveness of the information entropy fusion module, demonstrating that the dynamically entropy-adjusted obfuscation strategy achieves a superior balance between data security and task utility.
4.5. Ablation Study on the Sensitivity of the Proposed Method to Non-Stationary Distributions in Time-Series Data
This experiment aimed to evaluate the stability of the proposed information entropy fusion-based obfuscation method under non-stationary time-series data conditions (i.e., concept drift), particularly in cases where data distribution undergoes gradual changes, sudden shifts, or periodic variations. The adaptability of the obfuscation strategy and its impact on both privacy protection and task performance were assessed. The experiment utilized financial time-series data and synthetic time-series data, simulating real-world applications through various types of concept drift. A systematic comparison was conducted between the full information entropy fusion model (Baseline), a fixed obfuscation strength (Fixed Obfuscation), no obfuscation (No Obfuscation), and random obfuscation (Random Obfuscation) to analyze the effectiveness of the information entropy fusion module. Different types of concept drift were simulated to examine the adaptability of the obfuscation strategy. Gradual Drift: The data distribution changes progressively over time, such as slow market trends rising or declining. Sudden Drift: A drastic change in data occurs at a specific time point, such as a black swan event in financial markets. Recurring Drift: The data distribution follows a periodic pattern, such as economic cycles or market fluctuations. Mixed Drift: A combination of gradual and sudden drift to simulate complex market dynamics, where long-term trends coexist with short-term disruptions.
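The four drift types above can be simulated along the following lines; the slopes, jump size, period, and noise level are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def simulate_drift(n=1000, kind="gradual", rng=None):
    """Synthetic 1-D time series exhibiting the four concept-drift types
    used in the ablation: gradual, sudden, recurring, and mixed."""
    rng = np.random.default_rng(rng)
    t = np.arange(n)
    noise = rng.normal(0.0, 0.5, n)
    if kind == "gradual":
        mean = 0.002 * t                          # slow long-term trend
    elif kind == "sudden":
        mean = np.where(t < n // 2, 0.0, 3.0)     # abrupt level shift
    elif kind == "recurring":
        mean = np.sin(2 * np.pi * t / 200)        # periodic pattern
    elif kind == "mixed":
        mean = 0.002 * t + np.where(t < n // 2, 0.0, 3.0)
    else:
        raise ValueError(f"unknown drift kind: {kind}")
    return mean + noise
```

Feeding such series through each obfuscation strategy and tracking RASR and task metrics over time reproduces the structure of the comparison reported below.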
The experimental results in Table 6 indicate that the full information entropy fusion model (Baseline) maintained a low reconstruction attack success rate of 11.8% and a high entropy change rate of 0.67 across all concept-drift scenarios, demonstrating that the method effectively adjusted the obfuscation intensity dynamically and preserved strong privacy protection, even when the data distribution shifted. Additionally, the baseline model achieved near-optimal performance in regression tasks (MSE = 0.045; RMSE = 0.212) and classification tasks (accuracy = 91.5%; precision = 0.88; recall = 0.86), indicating that it maintained predictive performance while ensuring privacy. In contrast, the fixed obfuscation strategy (Fixed Obfuscation), due to its lack of adaptation to time-series distribution drift, experienced an increase in RASR to 22.4% and a drop in the entropy change rate to 0.45, reducing privacy protection effectiveness, while its classification and regression performance also degraded (MSE = 0.078; accuracy = 86.2%). The random obfuscation strategy (Random Obfuscation) further exhibited instability, with the highest RASR (27.1%) and a significant drop in task performance (accuracy = 79.6%), indicating that random perturbations failed to maintain the statistical consistency of the data, thereby impairing the model’s learning ability. The no-obfuscation strategy (No Obfuscation) achieved the best task performance (accuracy = 94.3%) but exhibited the highest RASR (35.7%), posing a severe privacy leakage risk.
From a mathematical perspective, the superiority of the baseline model stems from the dynamic adjustment mechanism of the information entropy fusion module, in which the obfuscation strength is adaptively regulated based on the entropy of the time-series data. This ensures that the level of data obfuscation is adjusted in response to changes in the statistical properties of the data. In the gradual-drift scenario, this mechanism continuously adapts to match long-term trend changes. In sudden-drift scenarios, abrupt entropy changes allow for the rapid adjustment of the obfuscation intensity, effectively countering drastic distribution shifts. In recurring-drift scenarios, the entropy metric captures periodic patterns in the data, balancing privacy protection and task performance. In contrast, the fixed obfuscation strategy, which lacks dynamic adaptation, leads to either insufficient or excessive obfuscation under concept-drift conditions. Since it does not consider data-specific characteristics, the random obfuscation strategy results in suboptimal privacy protection and predictive performance. Consequently, the experimental results confirm the effectiveness of the information entropy fusion module in non-stationary time-series environments and demonstrate its ability to dynamically balance privacy protection and task performance.
4.6. Discussion on Throughput
This study comprehensively compares the throughput (FPS) of different data protection strategies across two tasks: medical image classification and financial prediction regression. The experimental results clearly reveal the differences in throughput and their underlying reasons. Throughput directly reflects the operational efficiency of a model and serves as a crucial metric for evaluating the practical applicability of privacy protection techniques. The results show that traditional methods, such as FHE and DP, have significantly lower throughput compared to other methods due to their inherent design characteristics. FHE’s high computational complexity in encryption and decryption processes results in FPS values of only 23 and 25 for medical image classification and financial prediction, respectively, making it unsuitable for real-time tasks. DP achieves a slight improvement in throughput by introducing noise but remains constrained by the computational overhead of noise generation. In contrast, MPC improves computational efficiency through distributed processing, raising throughput to 30 and 35 for the two tasks. However, its communication overhead poses a major limitation in complex scenarios. Federated learning significantly enhances throughput to 45 in both tasks by leveraging local modeling and avoiding direct data transmission. Nevertheless, the proposed method achieves the highest throughput, reaching 57 and 54 for medical image classification and financial prediction, respectively. This improvement is attributed to its lightweight design, including the probability density extraction module and the information entropy fusion module, as well as the optimization of the fusion loss function, which reduces the computational complexity of data obfuscation. This high throughput not only highlights the superiority of the proposed method’s theoretical design but also demonstrates its practical value in scenarios requiring efficient processing of large-scale data.
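The FPS figures discussed above can be measured end-to-end in the usual way; `throughput_fps` below is a generic sketch of such a measurement, not the paper's benchmarking harness.

```python
import time

def throughput_fps(process_batch, batches):
    """Measure end-to-end throughput in frames (samples) per second:
    total samples processed divided by wall-clock time."""
    start = time.perf_counter()
    n = 0
    for batch in batches:
        process_batch(batch)
        n += len(batch)
    elapsed = time.perf_counter() - start
    return n / elapsed
```

In practice, a warm-up pass and several repetitions would be averaged to stabilize the figure, especially on a GPU.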
4.7. Ablation Study on Different Data Obfuscation Methods
This experiment aimed to validate the advantages of the proposed method in medical image classification and financial prediction regression tasks through an ablation study on different data obfuscation strategies. Specifically, the experiment compared obfuscation methods based on gradient information, pure Gaussian noise, and the proposed strategy based on probability density and information entropy, as shown in Figure 5. The goal was to comprehensively analyze the impact of these methods on privacy protection strength, task performance, and operational efficiency.
As shown in Table 7, the gradient-based obfuscation method performed poorly in both tasks, achieving precision, recall, and accuracy of 0.69, 0.64, and 0.67, respectively, for medical image classification, with a throughput of 32 FPS. The corresponding metrics for financial prediction were 0.71, 0.67, and 0.69, with a throughput of 41 FPS. In contrast, the pure Gaussian noise method significantly improved throughput (FPS), reaching 48 and 49 for medical image classification and financial prediction, respectively. However, its disruptive effect on data distribution led to a decline in performance, weakening task-specific characteristics. The proposed method achieved the best results in both tasks, with precision, recall, and accuracy of 0.93, 0.89, and 0.91 for medical image classification and 0.95, 0.91, and 0.93 for financial prediction, while also achieving the highest FPS (57 for medical image classification and 54 for financial prediction). This demonstrates its comprehensive advantages in task performance and operational efficiency.
From a theoretical perspective, the differences in experimental results stem from the varying capabilities of these obfuscation strategies to preserve data features while protecting privacy. The gradient-based obfuscation method achieves privacy protection by introducing gradient perturbations, but its reliance on task model gradient updates makes it susceptible to gradient instability and noise accumulation. This significantly weakens its ability to retain the core features of the original data, particularly in image classification and time-series prediction tasks, resulting in lower task performance. The pure Gaussian noise method relies on random sampling from a Gaussian distribution, which provides substantial privacy protection. However, the global nature of the noise disrupts the alignment between the obfuscated and original data distributions, reducing the model’s ability to capture critical features. The proposed method, on the other hand, leverages the probability density extraction module to model the global distribution of the data, while the information entropy fusion module dynamically adjusts the obfuscation intensity. Mathematically, this ensures that the statistical properties of the obfuscated data align closely with those of the original data, while sensitive information is selectively protected. Additionally, the lightweight architecture of the proposed method effectively reduces computational complexity, further improving throughput. This multi-module collaborative optimization design demonstrates strong adaptability and advantages across both task scenarios, providing a robust theoretical and practical framework for privacy preservation and efficient data analysis.
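To contrast pure Gaussian noise with a density-aware strategy, the sketch below scales noise down in high-density regions so the obfuscated sample stays closer to the original distribution. The histogram density estimate and the scaling rule are illustrative stand-ins for the paper's learned probability density extraction module.

```python
import numpy as np

def density_aware_obfuscation(x, base_sigma=0.5, bins=32, rng=None):
    """Apply lighter perturbation where the data are dense (typical values)
    and heavier perturbation in sparse regions, in contrast to pure
    Gaussian noise, which perturbs uniformly regardless of structure."""
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(x, bins=bins, density=True)
    idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
    density = counts[idx] / counts.max()        # normalized local density
    sigma = base_sigma * (1.0 - 0.9 * density)  # lighter noise where dense
    return x + rng.normal(0.0, 1.0, size=x.shape) * sigma
```

Because typical values are perturbed less, the statistical properties of the obfuscated data track the original distribution more closely than under global Gaussian noise, which is the mechanism the comparison above attributes the performance gap to.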
5. Conclusions
As the demand for data privacy protection continues to grow, balancing privacy preservation and task performance has become a significant challenge in data analysis. Traditional methods such as FHE, DP, MPC, and FL have achieved notable success in privacy protection and data utility. However, these approaches face significant limitations, including high computational complexity, noise interference, and communication overhead. To address these issues, this paper proposes a novel data obfuscation method based on probability density and information entropy, with a design that directly addresses the trade-offs observed in traditional techniques. By integrating a probability density extraction module and an information entropy fusion module, the method models the global distribution of data and dynamically adjusts the obfuscation intensity. Additionally, a fusion loss function is designed to optimize the balance between privacy protection and task performance. Experimental results provide strong evidence for the method’s effectiveness. In medical image classification, the proposed method achieved precision, recall, and accuracy of 0.93, 0.89, and 0.91, respectively, with a throughput of 57 FPS, demonstrating its ability to preserve data utility while ensuring privacy. Compared to FHE (precision: 0.82; throughput: 23 FPS) and DP (precision: 0.84; throughput: 25 FPS), the method achieved clear improvements by reducing computational overhead and noise-induced performance degradation. Similarly, in financial prediction tasks, the proposed method exhibited excellent capability in capturing temporal features, achieving precision, recall, and accuracy of 0.95, 0.91, and 0.93, with a throughput of 54 FPS, far surpassing traditional methods. 
These results highlight not only the superior performance of the proposed modules but also their ability to adapt to different types of data, such as medical images and time-series data, effectively addressing privacy and utility concerns in diverse scenarios.
Furthermore, ablation experiments provided detailed insights into the contributions of individual components, such as the probability density extraction module and the information entropy fusion module, to the overall system performance. Specifically, the experiments confirmed that the dynamic adjustment of the obfuscation intensity significantly enhances privacy protection while maintaining task relevance. Compared with gradient-based obfuscation and pure Gaussian noise methods, the proposed method preserves statistical characteristics more effectively, ensuring that the obfuscated data remain analytically valid for downstream tasks. In conclusion, the integration of probability density modeling and dynamic obfuscation adjustment offers a robust framework for balancing privacy and performance. The superior experimental results, supported by ablation studies and comparative analyses, underscore the effectiveness of this approach in real-world applications. These findings lay a solid foundation for extending the proposed method to other privacy-sensitive domains while further optimizing its efficiency and scalability.