1. Introduction
With the rapid development of information technology, data have become an essential resource in modern society. However, the efficient utilization of data has also introduced serious security risks [1,2,3], particularly in fields such as healthcare, finance, and economics. Data security issues not only concern individual privacy but may also threaten the healthy development of industries and even social stability [4]. In healthcare scenarios, data such as medical images, health records, and diagnostic information are highly sensitive [5,6]. If such information is leaked, it can be misused for identity theft, insurance fraud, and other illegal activities [7]. Furthermore, medical data have significant research value, and unprotected data can be exploited without authorization, leading to intellectual property violations [8]. In banking and financial scenarios, leaked or stolen transaction records, account balances, and other financial data can cause economic losses for individuals and companies and may even trigger a credit crisis [9]. For example, stolen high-frequency trading data can enable market manipulation, destabilizing the entire financial system [10]. In the field of economic data protection, the operational data of enterprises and nations serve as a crucial basis for decision making [11]. If these data are leaked, competitors can exploit them, weakening market competitiveness; at the national level, leaked economic data can become a strategic tool for adversaries, threatening national economic security [12]. Therefore, effectively protecting data during sharing and analysis has become a key societal concern [5]. Data protection technologies not only ensure individual privacy and security but also support the healthy development of industries, further promoting the deep exploration and application of data value [13].
In the realm of data protection technologies, traditional solutions fall mainly into two categories: encryption-based methods and data obfuscation-based methods [14]. Fully Homomorphic Encryption (FHE) [15] is a theoretically complete encryption method that allows computations on encrypted data without decryption [16]. It ensures high data security through mathematical transformations and represents a significant breakthrough in protecting data privacy. However, the computational overhead of FHE is extremely high, limiting its practical applicability [17]. For instance, in real-time medical data analysis, the high latency of FHE can fail to meet clinical diagnosis time requirements. Differential privacy (DP) [18] is a widely applied privacy protection technique that introduces noise into the data to protect individual privacy [19]. A major issue with DP, however, is that the added noise inevitably degrades data utility, reducing the accuracy of downstream analysis [20]. For example, in financial forecasting, DP may obscure key details of stock price fluctuations, reducing the performance of prediction models. In summary, traditional encryption and obfuscation techniques exhibit a clear conflict between data security and system performance: encryption methods offer high security but poor performance, while obfuscation methods achieve better performance at the cost of some security. This conflict is particularly prominent in modern application scenarios, necessitating new techniques that balance both.
In recent years, with the rapid development of deep learning, deep learning-based models have gradually become important tools in data protection [21,22]. Deep learning has demonstrated powerful capabilities in feature extraction, pattern recognition, and complex task optimization, providing new ideas for data security protection [23,24,25]. Amanullah et al. [26] proposed a deep learning- and big data-based framework to enhance the security of IoT devices, particularly their adaptability and response speed under various network attacks, achieving an accuracy of 98.7%, a precision of 97.5%, and a recall of 96.2%. Chen et al. [27] proposed several deep learning-based network security protection methods aimed at enhancing the defense capabilities and response speeds of smart-city systems (e.g., traffic, energy, and healthcare) facing cyber-attacks. Their approach achieved an accuracy of 95.4%, a false positive rate of 2.3%, and an average response time of 0.45 s. However, deep learning model training requires large amounts of labeled data and may not perform efficiently on devices with limited computational resources.
Ferrag et al. [28] introduced a new IoT network security protection method combining federated learning and deep learning, enabling training across multiple IoT nodes without data centralization and thus protecting user privacy. The method achieved a detection accuracy of 97.3%, with training efficiency improved by 20% and communication overhead reduced by 15%. However, its performance may be suboptimal in high-latency network environments, and it still faces computational and communication burdens on resource-constrained devices. Kshirsagar et al. [29] proposed a deep learning-based multimedia security solution encompassing identity authentication, encryption, and information hiding to enhance multimedia data security. Experiments showed an authentication accuracy of 98.2%, an encryption speed of 200 MB/s, and an information-hiding robustness of 95.7%. However, the method demands substantial computational resources and may face performance bottlenecks when processing high-resolution data. Rahman et al. [30] studied the security threat posed by adversarial samples to deep learning models for COVID-19 and proposed protection methods to enhance model robustness. Their experiments showed that the original model achieved an accuracy of 95.6%, which dropped to 68.3% under adversarial attacks. This method also increases computational overhead and may not be practical on resource-limited devices. More broadly, deep learning models can learn the semantic features of data and abstract them into higher-level representations, significantly reducing the risk of direct leakage of the original data [31].
Despite the numerous tools that deep learning offers for data security, its application in privacy protection still faces challenges. For instance, deep learning models themselves may become a vector for privacy leakage (e.g., through model inversion attacks), and the high computational costs of both training and inference still limit their large-scale application in practical scenarios. In response to these challenges, this paper proposes a data obfuscation security framework based on probability density and information entropy. By combining probability distribution feature extraction with dynamic control of information complexity, the framework balances data security and system performance. Compared to existing methods, its innovations include the following aspects:
Probability density extraction module: This module is designed to extract probability distribution characteristics from raw data. Compared to traditional random noise methods, it can more accurately capture data characteristics, thereby reducing the performance loss in downstream tasks.
Information entropy fusion module: This module quantifies the data’s complexity using information entropy and dynamically adjusts the strength of data obfuscation by combining Shannon entropy and conditional entropy. It can effectively protect sensitive information while ensuring the preservation of core data features.
Multi-scenario validation: The framework’s generality and effectiveness are validated in two typical application scenarios, healthcare and finance. Experiments demonstrate that the proposed method outperforms existing approaches in terms of accuracy, throughput, and security.
Balance between performance and security: The proposed method significantly improves system efficiency while achieving efficient obfuscation, reducing the computational overhead of encryption and decryption processes and making it suitable for real-time data analysis scenarios.
In conclusion, the proposed method provides a new approach to data protection across multiple scenarios by balancing data security and performance.
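As a minimal illustration of the idea behind the information entropy fusion module, the sketch below scales a Gaussian perturbation by a histogram-based Shannon entropy estimate, so that more complex (higher-entropy) data tolerate stronger obfuscation. The function names, the histogram estimator, and the scaling rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def shannon_entropy(x, bins=32):
    """Estimate Shannon entropy (in bits) of a 1-D sample via a histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_modulated_obfuscation(x, base_sigma=0.1, bins=32, rng=None):
    """Scale Gaussian perturbation by normalized entropy: low-entropy
    (highly structured) data receive lighter noise, preserving core
    features; high-entropy data tolerate heavier noise."""
    rng = np.random.default_rng(rng)
    h = shannon_entropy(x, bins)
    h_max = np.log2(bins)  # upper bound for a histogram with `bins` cells
    sigma = base_sigma * (h / h_max)
    return x + rng.normal(0.0, sigma, size=x.shape)
```

In the actual framework, the probability density extraction module would replace the simple histogram, and conditional entropy would also enter the adjustment rule.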
4. Results and Discussion
4.1. Implementation Details and Configurations of Baseline Methods
FHE was implemented using the widely recognized HElib library, which supports arithmetic operations on encrypted data. The encryption parameters were set to ensure a balance between computational efficiency and security strength, with a key size of 2048 bits. For operations involving addition and multiplication, the ciphertext modulus was carefully chosen to avoid excessive noise accumulation. During the experiments, encrypted medical and financial data were processed directly without decryption, following the standard FHE workflow. For DP, we employed the Laplace mechanism to ensure ε-differential privacy, where ε was set to 0.5 for a moderate privacy–utility trade-off. Noise was added to each query result based on the sensitivity of the function being evaluated. For medical image classification, pixel intensity values were perturbed; for financial predictions, noise was applied to key features such as stock prices and trading volumes. The resulting data retained privacy guarantees but at the expense of reduced accuracy due to noise interference.
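The Laplace mechanism used for the DP baseline can be sketched as follows; `laplace_mechanism` is an illustrative helper, with ε = 0.5 matching the experimental setting, not the baseline's actual code.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Add Laplace(0, sensitivity / epsilon) noise to a query result,
    providing epsilon-differential privacy for that query."""
    rng = np.random.default_rng(rng)
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# Example: perturb pixel intensities with sensitivity 1 and epsilon 0.5,
# as in the medical image experiments described above.
pixels = np.array([120.0, 64.0, 255.0])
noisy_pixels = laplace_mechanism(pixels, sensitivity=1.0, epsilon=0.5, rng=0)
```

The larger the sensitivity or the smaller ε, the larger the noise scale, which is the source of the accuracy loss discussed in the results.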
Secure Multi-party Computation (SMC; referred to as MPC in the results) was implemented using a custom setup in a distributed environment. Data were partitioned into shares and distributed among three parties, each performing partial computations. The computations were securely aggregated to reconstruct the final results. The communication protocol followed the GMW protocol, ensuring privacy during the intermediate steps. Federated Learning (FL) was implemented with models trained locally on multiple nodes. Each node processed a subset of the data and shared gradient updates with a central server. The Federated Averaging (FedAvg) algorithm was employed to aggregate these updates into a global model. To ensure privacy, we applied differential privacy at the gradient level, introducing noise during transmission. The FL setup used five client nodes and a communication round limit of 50. All baseline methods were tested on the same hardware and software environment to ensure consistency. The experiments were conducted using an NVIDIA A100 GPU with 80 GB of memory, leveraging PyTorch 1.12 for implementation.
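The Federated Averaging step of the FL baseline can be sketched as below. The optional Gaussian noise stands in for the gradient-level differential privacy described above; all parameter choices are illustrative, not the baseline's actual configuration.

```python
import numpy as np

def federated_average(client_updates, client_sizes, noise_sigma=0.0, rng=None):
    """FedAvg: size-weighted mean of per-client parameter lists.
    `client_updates` is a list of clients, each a list of layer arrays.
    `noise_sigma > 0` adds Gaussian noise to the aggregate, loosely
    mimicking gradient-level DP (an illustrative simplification)."""
    rng = np.random.default_rng(rng)
    total = sum(client_sizes)
    averaged = []
    for layer in zip(*client_updates):  # iterate layer-by-layer across clients
        agg = sum(w * (n / total) for w, n in zip(layer, client_sizes))
        if noise_sigma > 0:
            agg = agg + rng.normal(0.0, noise_sigma, size=agg.shape)
        averaged.append(agg)
    return averaged
```

With five clients and a round limit of 50, as in the setup above, this aggregation would run once per communication round.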
4.2. Medical Image Classification Results
The objective of this experiment was to evaluate the performance of different data protection strategies in medical image classification tasks, focusing on four key metrics: precision, recall, accuracy, and FPS. This study aimed to validate whether the proposed data obfuscation method based on probability density and information entropy could balance privacy protection with task performance. Comparisons were drawn with traditional methods, such as FHE, DP, Multi-party Computation (MPC), and FL, to provide optimized solutions for medical image analysis tasks.
As shown in Table 3, the performance of different models varied significantly across the four evaluation metrics. FHE demonstrated strong privacy protection capabilities but suffered from high computational complexity, resulting in lower classification performance (precision: 0.82; recall: 0.79; accuracy: 0.80) and a low processing speed (FPS: 23). While DP improved processing efficiency (FPS: 25) by introducing random noise, the added noise compromised the model’s classification performance (precision: 0.84; recall: 0.81; accuracy: 0.83). MPC leveraged data partitioning and distributed computation to enhance processing efficiency (FPS: 30), achieving moderate performance improvements (precision: 0.86; recall: 0.83; accuracy: 0.85). FL, which avoids direct data sharing, achieved higher classification accuracy (precision: 0.88; recall: 0.86; accuracy: 0.87) through global model aggregation, alongside significantly better processing efficiency (FPS: 45). In contrast, the proposed method achieved the best overall performance (precision: 0.93; recall: 0.89; accuracy: 0.91; FPS: 57), indicating that the data obfuscation strategy based on probability density and information entropy strikes a superior balance between privacy protection and task performance. Theoretically, the superior results of the proposed method stem from its innovative design. FHE relies on complex mathematical transformations, ensuring strong security but incurring significant computational costs, reflected in its low FPS value. DP’s approach of introducing random noise provides privacy protection but mathematically diminishes the validity of the data, leading to decreased classification performance. MPC employs secret-sharing protocols to improve task performance to some extent but is limited by high communication overhead in complex scenarios. FL avoids privacy breaches through distributed data modeling, and its global model aggregation strategy significantly enhances classification performance.
However, FL may encounter inconsistencies in model aggregation under highly diverse data distributions. The proposed method models the global distribution characteristics of the data through the probability density extraction module and dynamically adjusts the obfuscation intensity via the information entropy fusion module. This ensures, mathematically, that the obfuscated data maintain global consistency and preserve the core features of the original data. The design avoids FHE’s high computational cost, mitigates DP’s noise-induced interference, and surpasses FL’s communication bottlenecks through a lightweight model architecture. Ultimately, its outstanding mathematical properties enable the proposed method to balance efficiency and accuracy in practical applications, providing new possibilities for medical image classification.
4.3. Financial Prediction Regression Results
This experiment aimed to evaluate the performance of different privacy protection strategies in financial prediction regression tasks, with a focus on analyzing the ability of each model to capture the dynamic characteristics in time-series data and their impact on privacy protection. Unlike medical image classification, time-series data emphasize the sequential and trend relationships between data points, posing higher demands on the balance between predictive accuracy and privacy protection. As shown in Table 4, the proposed method outperformed all other methods across all evaluation metrics (precision: 0.95; recall: 0.91; accuracy: 0.93; FPS: 54), demonstrating its strong adaptability and performance advantages in handling time-dependent data.
Theoretically, the characteristics of time-series data require models to effectively capture trends and changes over time. Traditional methods, such as DP and MPC, exhibit weaker capabilities in preserving such information due to noise interference or communication overhead, which reduce their ability to represent temporal patterns. While FL improves local modeling and global aggregation, it struggles to maintain the integrity of global time-series features in non-independent and identically distributed (Non-IID) data scenarios. The proposed method leverages the probability density extraction module to model the global distribution characteristics of time-series data, effectively capturing trend information. Combined with dynamic obfuscation adjustments through the information entropy fusion module, it ensures that the obfuscated data retain the ability to express critical temporal features. Furthermore, the performance loss term in the fusion loss function directly optimizes the prediction error for time-series data, enabling the model to balance the trade-off between privacy protection and temporal pattern preservation. These designs mathematically ensure the integrity of trend capturing in time-series data, providing a highly efficient and secure solution for financial prediction tasks.
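To make the role of the performance loss term concrete, here is a hypothetical sketch of a fusion loss combining a prediction-error term with a distribution-alignment term on the obfuscated data. The function name, the alignment term, and the weights `alpha`/`beta` are illustrative assumptions; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def fusion_loss(pred, target, original, obfuscated, alpha=1.0, beta=0.5):
    """Illustrative fusion loss: a task (performance) term plus a term
    penalizing drift between the obfuscated and original data's first
    two moments, so obfuscation preserves temporal statistics."""
    performance = np.mean((pred - target) ** 2)  # prediction error term
    alignment = (
        np.mean((obfuscated.mean(axis=0) - original.mean(axis=0)) ** 2)
        + np.mean((obfuscated.std(axis=0) - original.std(axis=0)) ** 2)
    )
    return alpha * performance + beta * alignment
```

Minimizing such a loss jointly pressures the model toward accurate predictions while keeping the obfuscated distribution close to the original, which is the trade-off the section above describes.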
4.4. Impact Analysis and Ablation Study on the Information Entropy Fusion Module
The ablation study on the information entropy fusion module aimed to validate its role in privacy protection and model performance, assessing its impact on the data obfuscation intensity under different entropy control strategies, as shown in Table 5. The experiment set up multiple comparison groups, including removing the information entropy fusion module (No Entropy Fusion), using a fixed obfuscation intensity (Fixed Obfuscation), applying random obfuscation (Random Obfuscation), and employing Shannon entropy as the entropy metric (Shannon Entropy), while comparing these against the full model (Baseline Full Model). This study primarily evaluated six key metrics: reconstruction attack success rate (RASR), Wasserstein distance, classification accuracy, F1 score, mean squared error (MSE), and computation delay. RASR reflects the security of the obfuscated data; the Wasserstein distance measures the distributional similarity between the obfuscated and original data; and accuracy, F1 score, and MSE assess the model’s performance under different obfuscation strategies. The results demonstrate that the full model achieved the best balance across all metrics, indicating that the information entropy fusion module effectively adjusts the data obfuscation intensity, enhancing data security while preserving task utility as much as possible.
From a mathematical perspective, these results primarily stem from the ability of the information entropy fusion module to dynamically adjust the obfuscation intensity based on data uncertainty. The superior performance of the baseline group indicates that global information entropy effectively modulates the level of obfuscation across different regions of the data, achieving an optimal trade-off between privacy protection and data utility. In contrast, removing the information entropy fusion module (No Entropy Fusion) increases the RASR to 24.7%, suggesting that obfuscation effectiveness declines, while the Wasserstein distance increases to 0.87, indicating significant deviation from the original data distribution, which, in turn, degrades classification accuracy. Fixed Obfuscation reduces the RASR but fails to adaptively adjust the obfuscation intensity, leading to insufficient protection of critical data regions and a drop in accuracy to 88.2%. Random Obfuscation performs the worst, exhibiting the highest RASR (27.1%), which suggests that arbitrary perturbations disrupt the data structure, significantly impairing task performance. The Shannon entropy approach yields suboptimal performance—while it improves over No Entropy Fusion, it still lags behind the full model, likely due to the limited generalization ability of a single entropy metric in complex data distributions. Therefore, the experimental results strongly validate the effectiveness of the information entropy fusion module, demonstrating that the dynamically entropy-adjusted obfuscation strategy achieves a superior balance between data security and task utility.
4.5. Ablation Study on the Sensitivity of the Proposed Method to Non-Stationary Distributions in Time-Series Data
This experiment aimed to evaluate the stability of the proposed information entropy fusion-based obfuscation method under non-stationary time-series data conditions (i.e., concept drift), particularly in cases where data distribution undergoes gradual changes, sudden shifts, or periodic variations. The adaptability of the obfuscation strategy and its impact on both privacy protection and task performance were assessed. The experiment utilized financial time-series data and synthetic time-series data, simulating real-world applications through various types of concept drift. A systematic comparison was conducted between the full information entropy fusion model (Baseline), a fixed obfuscation strength (Fixed Obfuscation), no obfuscation (No Obfuscation), and random obfuscation (Random Obfuscation) to analyze the effectiveness of the information entropy fusion module. Different types of concept drift were simulated to examine the adaptability of the obfuscation strategy. Gradual Drift: The data distribution changes progressively over time, such as slow market trends rising or declining. Sudden Drift: A drastic change in data occurs at a specific time point, such as a black swan event in financial markets. Recurring Drift: The data distribution follows a periodic pattern, such as economic cycles or market fluctuations. Mixed Drift: A combination of gradual and sudden drift to simulate complex market dynamics, where long-term trends coexist with short-term disruptions.
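The four drift types above can be simulated along the following lines; the slopes, jump size, period, and noise level are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def simulate_drift(n=1000, kind="gradual", rng=None):
    """Synthetic 1-D time series exhibiting the four concept-drift types
    used in the ablation: gradual, sudden, recurring, and mixed."""
    rng = np.random.default_rng(rng)
    t = np.arange(n)
    noise = rng.normal(0.0, 0.5, n)
    if kind == "gradual":
        mean = 0.002 * t                          # slow long-term trend
    elif kind == "sudden":
        mean = np.where(t < n // 2, 0.0, 3.0)     # abrupt level shift
    elif kind == "recurring":
        mean = np.sin(2 * np.pi * t / 200)        # periodic pattern
    elif kind == "mixed":
        mean = 0.002 * t + np.where(t < n // 2, 0.0, 3.0)
    else:
        raise ValueError(f"unknown drift kind: {kind}")
    return mean + noise
```

Feeding such series through each obfuscation strategy and tracking RASR and task metrics over time reproduces the structure of the comparison reported below.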
The experimental results in Table 6 indicate that the full information entropy fusion model (Baseline) maintained a low reconstruction attack success rate of 11.8% and a high entropy change rate of 0.67 across all concept-drift scenarios, demonstrating that the method effectively adjusted the obfuscation intensity dynamically and preserved strong privacy protection, even when the data distribution shifted. Additionally, the baseline model achieved near-optimal performance in regression tasks (MSE = 0.045; RMSE = 0.212) and classification tasks (accuracy = 91.5%; precision = 0.88; recall = 0.86), indicating that it maintained predictive performance while ensuring privacy. In contrast, the fixed obfuscation strategy (Fixed Obfuscation), due to its lack of adaptation to time-series distribution drift, experienced an increase in RASR to 22.4% and a drop in the entropy change rate to 0.45, reducing privacy protection effectiveness, while its classification and regression performance also degraded (MSE = 0.078; accuracy = 86.2%). The random obfuscation strategy (Random Obfuscation) further exhibited instability, with the highest RASR (27.1%) and a significant drop in task performance (accuracy = 79.6%), indicating that random perturbations failed to maintain the statistical consistency of the data, thereby impairing the model’s learning ability. The no-obfuscation strategy (No Obfuscation) achieved the best task performance (accuracy = 94.3%) but exhibited the highest RASR (35.7%), posing a severe privacy leakage risk.
From a mathematical perspective, the superiority of the baseline model stems from the dynamic adjustment mechanism of the information entropy fusion module, in which the obfuscation strength is adaptively regulated based on the entropy of the time-series data. This ensures that the level of data obfuscation is adjusted in response to changes in the statistical properties of the data. In the gradual-drift scenario, this mechanism continuously adapts to match long-term trend changes. In sudden-drift scenarios, abrupt entropy changes allow for the rapid adjustment of the obfuscation intensity, effectively countering drastic distribution shifts. In recurring-drift scenarios, the entropy metric captures periodic patterns in the data, balancing privacy protection and task performance. In contrast, the fixed obfuscation strategy, which lacks dynamic adaptation, leads to either insufficient or excessive obfuscation under concept-drift conditions. Since it does not consider data-specific characteristics, the random obfuscation strategy results in suboptimal privacy protection and predictive performance. Consequently, the experimental results confirm the effectiveness of the information entropy fusion module in non-stationary time-series environments and demonstrate its ability to dynamically balance privacy protection and task performance.
4.6. Discussion on Throughput
This study comprehensively compares the throughput (FPS) of different data protection strategies across two tasks: medical image classification and financial prediction regression. The experimental results clearly reveal the differences in throughput and their underlying reasons. Throughput directly reflects the operational efficiency of a model and serves as a crucial metric for evaluating the practical applicability of privacy protection techniques. The results show that traditional methods, such as FHE and DP, have significantly lower throughput compared to other methods due to their inherent design characteristics. FHE’s high computational complexity in encryption and decryption processes results in FPS values of only 23 and 25 for medical image classification and financial prediction, respectively, making it unsuitable for real-time tasks. DP achieves a slight improvement in throughput by introducing noise but remains constrained by the computational overhead of noise generation. In contrast, MPC improves computational efficiency through distributed processing, raising throughput to 30 and 35 for the two tasks. However, its communication overhead poses a major limitation in complex scenarios. Federated learning significantly enhances throughput to 45 in both tasks by leveraging local modeling and avoiding direct data transmission. Nevertheless, the proposed method achieves the highest throughput, reaching 57 and 54 for medical image classification and financial prediction, respectively. This improvement is attributed to its lightweight design, including the probability density extraction module and the information entropy fusion module, as well as the optimization of the fusion loss function, which reduces the computational complexity of data obfuscation. This high throughput not only highlights the superiority of the proposed method’s theoretical design but also demonstrates its practical value in scenarios requiring efficient processing of large-scale data.
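The FPS figures discussed above can be measured end-to-end in the usual way; `throughput_fps` below is a generic sketch of such a measurement, not the paper's benchmarking harness.

```python
import time

def throughput_fps(process_batch, batches):
    """Measure end-to-end throughput in frames (samples) per second:
    total samples processed divided by wall-clock time."""
    start = time.perf_counter()
    n = 0
    for batch in batches:
        process_batch(batch)
        n += len(batch)
    elapsed = time.perf_counter() - start
    return n / elapsed
```

In practice, a warm-up pass and several repetitions would be averaged to stabilize the figure, especially on a GPU.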
4.7. Ablation Study on Different Data Obfuscation Methods
This experiment aimed to validate the advantages of the proposed method in medical image classification and financial prediction regression tasks through an ablation study on different data obfuscation strategies. Specifically, the experiment compared obfuscation methods based on gradient information, pure Gaussian noise, and the proposed strategy based on probability density and information entropy, as shown in Figure 5. The goal was to comprehensively analyze the impact of these methods on privacy protection strength, task performance, and operational efficiency.
As shown in Table 7, the gradient-based obfuscation method performed poorly in both tasks, achieving precision, recall, and accuracy of 0.69, 0.64, and 0.67, respectively, for medical image classification, with a throughput of 32 FPS. The corresponding metrics for financial prediction were 0.71, 0.67, and 0.69, with a throughput of 41 FPS. In contrast, the pure Gaussian noise method significantly improved throughput (FPS), reaching 48 and 49 for medical image classification and financial prediction, respectively. However, its disruptive effect on data distribution led to a decline in performance, weakening task-specific characteristics. The proposed method achieved the best results in both tasks, with precision, recall, and accuracy of 0.93, 0.89, and 0.91 for medical image classification and 0.95, 0.91, and 0.93 for financial prediction, while also achieving the highest FPS (57 for medical image classification and 54 for financial prediction). This demonstrates its comprehensive advantages in task performance and operational efficiency.
From a theoretical perspective, the differences in experimental results stem from the varying capabilities of these obfuscation strategies to preserve data features while protecting privacy. The gradient-based obfuscation method achieves privacy protection by introducing gradient perturbations, but its reliance on task model gradient updates makes it susceptible to gradient instability and noise accumulation. This significantly weakens its ability to retain the core features of the original data, particularly in image classification and time-series prediction tasks, resulting in lower task performance. The pure Gaussian noise method relies on random sampling from a Gaussian distribution, which provides substantial privacy protection. However, the global nature of the noise disrupts the alignment between the obfuscated and original data distributions, reducing the model’s ability to capture critical features. The proposed method, on the other hand, leverages the probability density extraction module to model the global distribution of the data, while the information entropy fusion module dynamically adjusts the obfuscation intensity. Mathematically, this ensures that the statistical properties of the obfuscated data align closely with those of the original data, while sensitive information is selectively protected. Additionally, the lightweight architecture of the proposed method effectively reduces computational complexity, further improving throughput. This multi-module collaborative optimization design demonstrates strong adaptability and advantages across both task scenarios, providing a robust theoretical and practical framework for privacy preservation and efficient data analysis.
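To contrast pure Gaussian noise with a density-aware strategy, the sketch below scales noise down in high-density regions so the obfuscated sample stays closer to the original distribution. The histogram density estimate and the scaling rule are illustrative stand-ins for the paper's learned probability density extraction module.

```python
import numpy as np

def density_aware_obfuscation(x, base_sigma=0.5, bins=32, rng=None):
    """Apply lighter perturbation where the data are dense (typical values)
    and heavier perturbation in sparse regions, in contrast to pure
    Gaussian noise, which perturbs uniformly regardless of structure."""
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(x, bins=bins, density=True)
    idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
    density = counts[idx] / counts.max()        # normalized local density
    sigma = base_sigma * (1.0 - 0.9 * density)  # lighter noise where dense
    return x + rng.normal(0.0, 1.0, size=x.shape) * sigma
```

Because typical values are perturbed less, the statistical properties of the obfuscated data track the original distribution more closely than under global Gaussian noise, which is the mechanism the comparison above attributes the performance gap to.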
5. Conclusions
As the demand for data privacy protection continues to grow, balancing privacy preservation and task performance has become a significant challenge in data analysis. Traditional methods such as FHE, DP, MPC, and FL have achieved notable success in privacy protection and data utility. However, these approaches face significant limitations, including high computational complexity, noise interference, and communication overhead. To address these issues, this paper proposes a novel data obfuscation method based on probability density and information entropy, with a design that directly addresses the trade-offs observed in traditional techniques. By integrating a probability density extraction module and an information entropy fusion module, the method models the global distribution of data and dynamically adjusts the obfuscation intensity. Additionally, a fusion loss function is designed to optimize the balance between privacy protection and task performance. Experimental results provide strong evidence for the method’s effectiveness. In medical image classification, the proposed method achieved precision, recall, and accuracy of 0.93, 0.89, and 0.91, respectively, with a throughput of 57 FPS, demonstrating its ability to preserve data utility while ensuring privacy. Compared to FHE (precision: 0.82; throughput: 23 FPS) and DP (precision: 0.84; throughput: 25 FPS), the method achieved clear improvements by reducing computational overhead and noise-induced performance degradation. Similarly, in financial prediction tasks, the proposed method exhibited excellent capability in capturing temporal features, achieving precision, recall, and accuracy of 0.95, 0.91, and 0.93, with a throughput of 54 FPS, far surpassing traditional methods. 
These results highlight not only the superior performance of the proposed modules but also their ability to adapt to different types of data, such as medical images and time-series data, effectively addressing privacy and utility concerns in diverse scenarios.
Furthermore, ablation experiments provided detailed insights into the contributions of individual components, such as the probability density extraction module and the information entropy fusion module, to the overall system performance. Specifically, the experiments confirmed that the dynamic adjustment of the obfuscation intensity significantly enhances privacy protection while maintaining task relevance. Compared with gradient-based obfuscation and pure Gaussian noise methods, the proposed method preserves statistical characteristics more effectively, ensuring that the obfuscated data remain analytically valid for downstream tasks. In conclusion, the integration of probability density modeling and dynamic obfuscation adjustment offers a robust framework for balancing privacy and performance. The superior experimental results, supported by ablation studies and comparative analyses, underscore the effectiveness of this approach in real-world applications. These findings lay a solid foundation for extending the proposed method to other privacy-sensitive domains while further optimizing its efficiency and scalability.