Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain
Round 1
Reviewer 1 Report
The paper describes a new framework for the secondary sharing of personal data using synthetic data generation. The main problem with the paper is that it completely neglects current research on synthetic data, particularly research with implications for the applicability of the proposed framework. The points below elaborate on this concern from multiple angles:
- The whole approach relies on the assumption that models built on synthetic data (SD) will eventually be useful (after multiple iterations of testing on real data and regenerating SD). However, the authors do not provide any evidence to support the accuracy of models built using their framework.
- It has been shown in prior work that, when running multiple classifiers on SD and on real data, the best classifier (in terms of accuracy) on SD and the best classifier on real data do not always match (particularly for the SDG method chosen for this study: SDV). How will this affect the framework? (See "A Multi-Dimensional Evaluation of Synthetic Data Generators", IEEE Access.)
- Users are allowed to repeat their analysis on another SD if the results are not satisfactory. There are multiple issues with that:
- On what basis is a new SD chosen? Is there a utility measure to guide the choice? If so, please elaborate on and justify the choice of measure. (See "Generation and evaluation of synthetic patient data", BMC Medical Research Methodology.)
- Related to the previous point, what is the stopping criterion? Without a utility measure to guide SD generation this is not clear, and the process may never converge (a minimal sketch of such a utility-guided loop follows this list).
- What is the effect on the scientists of repeating the analysis multiple times (particularly when it is not clear whether prior analyses can be reused, or whether the process will converge)?
- What is the effect of this iterative framework on privacy?
- The authors performed a single experiment to demonstrate the usefulness of their methodology; this is not enough, particularly when no supporting evidence is supplied.
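To make the stopping-criterion and utility-measure concerns concrete, here is a minimal sketch of a utility-guided regeneration loop. It assumes the SDV 1.x single-table API; the mean two-sample Kolmogorov-Smirnov (KS) statistic as the utility measure, the 0.1 threshold, and the round limit are illustrative assumptions, not the authors' design.

```python
# Sketch: utility-guided SD regeneration with an explicit stopping criterion.
# Assumes the SDV 1.x single-table API; threshold, max_rounds, and the utility
# measure (mean two-sample KS statistic) are illustrative choices only.
import pandas as pd
from scipy.stats import ks_2samp
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

def mean_ks(real: pd.DataFrame, synthetic: pd.DataFrame, columns) -> float:
    """Average KS statistic over numeric columns (0 means identical marginals)."""
    return sum(ks_2samp(real[c], synthetic[c]).statistic for c in columns) / len(columns)

def generate_until_useful(real: pd.DataFrame, columns, threshold=0.1, max_rounds=10):
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real)
    for round_no in range(1, max_rounds + 1):
        synthetic = synthesizer.sample(num_rows=len(real))  # draw a fresh SD
        if mean_ks(real, synthetic, columns) <= threshold:  # stopping criterion
            return synthetic, round_no
    raise RuntimeError(f"no SD met the utility threshold after {max_rounds} rounds")
```

Note that even with such a criterion, each additional SD release is an additional use of the real data, which bears on the privacy question raised above.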
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The paper aims to demonstrate a pipeline through which synthetic data generation techniques can be applied to real-world applications, particularly to address issues such as augmenting real data (RD) for training different ML models and preserving data privacy/security. The paper is well presented (except for one aspect, mentioned below) and the results are useful from a real-life application point of view.
Using a test-bed problem (heart rate measurements), with forecasting analysis on SD and remote forecasting analysis on RD, the efficiency and utility of the complete workflow have been demonstrated and validated.
The authors need to include a complete section on how the synthetic data has been generated for this test-bed problem, along with the associated mathematics. There are several methods to generate synthetic (time-series) data; however, the authors have not said much about them before using the SD and performing comparisons between synthetic and real data. A couple of recent articles that could be consulted to address this issue: 1) a white-box model ("An Application of Machine Learning for Plasma Current Quench Studies via Synthetic Data Generation"): https://doi.org/10.1016/j.fusengdes.2021.112578; 2) an ML-based model ("Generative Adversarial Network for Synthetic Time Series Data Generation in Smart Grids"): 10.1109/SmartGridComm.2018.8587464.
The quality of the SD will depend on which method was used to generate it. General statistical measures can establish that the generated synthetic data is valuable only after the authors clearly describe how the SD was generated. Currently, this is the main weakness of the paper; once addressed, it would make the paper a good read.
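As an illustration of the kind of general statistical measures meant here, below is a minimal sketch comparing real and synthetic heart-rate series on both marginal distribution and temporal structure. The column name "heart_rate" and the chosen lags are assumptions for illustration, not details from the paper.

```python
# Sketch: basic statistical checks between real and synthetic heart-rate series.
# The column name "heart_rate" and the lags are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def compare_series(real: pd.Series, synthetic: pd.Series, lags=(1, 5, 10)) -> dict:
    report = {
        "mean_diff": abs(real.mean() - synthetic.mean()),
        "std_diff": abs(real.std() - synthetic.std()),
        "ks_statistic": ks_2samp(real, synthetic).statistic,  # marginal distribution
    }
    for lag in lags:  # temporal structure via autocorrelation at a few lags
        report[f"acf_diff_lag{lag}"] = abs(real.autocorr(lag) - synthetic.autocorr(lag))
    return report

# Usage (hypothetical dataframes rd and sd):
# report = compare_series(rd["heart_rate"], sd["heart_rate"])
```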
In Fig. 1, the authors should explain in the text what "check results" actually means and how it is achieved.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
- This is more of an engineering paper in support of research. Therefore, certain components need to be added to the visuals and discussions.
- A comparison with similar systems in terms of performance and/or results would help researchers a lot.
- SDG in other fields, such as education, which may have similar privacy concerns, is mentioned but not discussed in much depth. Comparison to other fields is always very helpful.
- The architectural model of the tool is discussed. However, the visuals do not follow standard software-architecture notations, which is acceptable. It appears that colors and dashed vs. solid lines are used to convey the system workflow and data flow; a DFD might be a good way to show the data flow. The diagrams also need a legend clarifying what the different colors mean, as well as the dashed vs. solid lines (thick and thin ones).
- A performance analysis of the system on specific hardware and for different data sizes should be included.
- The discussion of reliability focuses mainly on preserving privacy and security. However, the method's reliability could also be verified through statistical analysis, running many different experiments to make sure the model is robust (see the sketch after this list).
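A minimal sketch of such a repeated-experiment robustness check follows. The function run_forecasting_experiment is a hypothetical stand-in for the paper's pipeline, assumed to return a single error metric (e.g., RMSE) per seeded run; the run count and the normality-based interval are illustrative choices.

```python
# Sketch: robustness check by repeating the experiment over many random seeds.
# run_forecasting_experiment is a hypothetical stand-in for the paper's pipeline,
# assumed to return one error metric (e.g., RMSE) per run.
import statistics

def robustness_report(run_forecasting_experiment, n_runs: int = 30) -> dict:
    scores = [run_forecasting_experiment(seed=s) for s in range(n_runs)]
    mean, std = statistics.mean(scores), statistics.stdev(scores)
    # Rough 95% interval under a normality assumption.
    return {"mean": mean, "std": std, "ci95": (mean - 1.96 * std, mean + 1.96 * std)}
```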
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors responded to most of my comments. The issue of the stopping criterion is still problematic.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors have addressed the important issues in the revised version, and the presentation quality has also improved.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf