Recent Advances in Synthetic Data Generation

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (31 December 2022) | Viewed by 37639

Special Issue Editors

Dr. Gorka Epelde Unanue

Guest Editor

1. Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), 20009 Donostia-San Sebastián, Spain
2. Biodonostia Health Research Institute, eHealth Group, Paseo Doctor Begiristain, s/n, 20014 San Sebastián, Spain
Interests: health; software; data; network; data preparation; QoD; synthetic data generation for data security/privacy

Dr. Darryl Charles

Guest Editor

School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, UK
Interests: patient rehabilitation; virtual reality; artificial intelligence; computer games

Special Issue Information

Dear Colleagues,

Scientific and technological advances in recent decades have led to the digitization and increased generation and collection of data describing real-world applications or processes. In addition, machine learning models and artificial intelligence applications built on data have been proven to improve management and decision making about these applications and processes.

Despite the potential of data-based solutions, many issues prevent or delay their development. The most notable are access to data and how well the captured sample represents the real population. Access to real data can be delayed or even prevented for reasons such as privacy, security, and intellectual property, or because the technology needed to capture and prepare data of sufficient quality must first be developed. Sample representativeness is a further critical issue: it relates to class imbalance and to the representation of rare and extreme events, both of which affect machine learning (ML) model performance.

Synthetic data (SD) is described in this context as “any production data applicable to a given situation that are not obtained by direct measurement”. SD has three key use cases: (i) data augmentation, to balance datasets or supplement available data before training an ML model; (ii) privacy preservation, to allow safe and private sharing of sensitive data; and (iii) simulation, to estimate and train systems in situations that have not been observed in reality.

The need for a comprehensive solution to exploit developments in Big Data and AI technology has never been greater, and synthetic data generation (SDG) research has been underway for some time with promising results in various application areas, including healthcare, cybersecurity, industrial processes, and energy consumption. Research has addressed the SDG of different data modalities (written natural language, images, video, tabular data, time series data, etc.) using different technological approaches.

The main objective of this Special Issue is to bring together diverse, novel and impactful research on synthetic data generation, thereby accelerating research in this field and the adoption of these techniques for real-world applications.

Contributions from different application domains, use cases and data modalities are sought by this Special Issue.

Submissions should be of high enough quality for an international journal and should not be submitted or published elsewhere. However, extended versions of conference papers that show significant improvement (more than 30%) can be considered for review in this Special Issue. In addition, we welcome review papers covering the subjects of this Special Issue.

Technical Program Committee Members:

Dr. Debbie Rankin – Ulster University
Dr. Ane Alberdi – Mondragon Unibertsitatea
Dr. Rodrigo Cilla – Vicomtech - BRTA

Dr. Gorka Epelde Unanue
Dr. Darryl Charles
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

Synthetic data generation
Generative adversarial networks
Privacy preserving data
Data augmentation
Artificial intelligence
Healthcare
Imbalanced learning

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)

Research

17 pages, 9026 KiB

Open Access Article

Nonparametric Generation of Synthetic Data Using Copulas

by Juan P. Restrepo, Juan Carlos Rivera, Henry Laniado, Pablo Osorio and Omar A. Becerra

Electronics 2023, 12(7), 1601; https://doi.org/10.3390/electronics12071601 - 29 Mar 2023

Cited by 1 | Viewed by 3474

Abstract

This article presents a novel nonparametric approach to generate synthetic data using copulas, which are functions that explain the dependency structure of the real data. The proposed method addresses several challenges faced by existing synthetic data generation techniques, such as the preservation of complex multivariate structures present in real data. By using all the information from real data and verifying that the generated synthetic data follows the same behavior as the real data under homogeneity tests, our method is a significant improvement over existing techniques. Our method is easy to implement and interpret, making it a valuable tool for solving class imbalance problems in machine learning models, improving the generalization capabilities of deep learning models, and anonymizing information in finance and healthcare domains, among other applications.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
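
As a rough illustration of the copula idea summarized in the abstract above, the sketch below separates the marginals from the dependency structure and recombines them to draw new rows. It is not the authors' nonparametric method: it swaps in a Gaussian copula with empirical marginals, and every function name is hypothetical. It assumes purely numeric columns that are not perfectly correlated.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for both transforms


def normal_scores(column):
    """Map a column to normal scores via its ranks (pseudo-observations)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation between equally long numeric columns."""
    d, n = len(columns), len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [sum(x * x for x in c) ** 0.5 for c in centered]
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(d)
        ]
        for i in range(d)
    ]


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = (matrix[i][i] - s) ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_gaussian_copula(rows, n_samples, rng):
    """Draw synthetic rows: Gaussian copula for dependence, empirical marginals."""
    columns = [list(c) for c in zip(*rows)]
    sorted_cols = [sorted(c) for c in columns]
    lower = cholesky(correlation_matrix([normal_scores(c) for c in columns]))
    d, n = len(columns), len(rows)
    synthetic = []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlated normals z = L @ eps
        z = [sum(lower[i][k] * eps[k] for k in range(i + 1)) for i in range(d)]
        u = [STD_NORMAL.cdf(v) for v in z]
        # Inverse empirical CDF: pick the observed value at quantile u
        synthetic.append(
            [sorted_cols[j][min(int(u[j] * n), n - 1)] for j in range(d)]
        )
    return synthetic
```

Calling `sample_gaussian_copula(real_rows, 300, random.Random(0))` returns 300 rows whose values are drawn from the observed values of each column. A homogeneity test, as the abstract describes, would still be needed to check that the synthetic rows behave like the real ones.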

17 pages, 2135 KiB

Open Access Article

A Novel Fusion Approach Consisting of GAN and State-of-Charge Estimator for Synthetic Battery Operation Data Generation

by Kei Long Wong, Ka Seng Chou, Rita Tse, Su-Kit Tang and Giovanni Pau

Electronics 2023, 12(3), 657; https://doi.org/10.3390/electronics12030657 - 28 Jan 2023

Cited by 7 | Viewed by 3043

Abstract

The recent success of machine learning has accelerated the development of data-driven lithium-ion battery state estimation and prediction. The lack of accessible battery operation data is one of the primary concerns with the data-driven approach. However, research on battery operation data augmentation is rare. When coping with data sparsity, one popular approach is to augment the dataset by producing synthetic data. In this paper, we propose a novel fusion method for synthetic battery operation data generation. It combines a generative adversarial network-based generation module and a state-of-charge estimator. The generation module generates battery operation features, namely the voltage, current, and temperature. The features are then fed into the state-of-charge estimator, which calculates the relevant state of charge. The results of the evaluation reveal that our method can produce synthetic data with distributions similar to the actual dataset and performs well in downstream tasks.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)

15 pages, 868 KiB

Open Access Article

Statistical Validation of Synthetic Data for Lung Cancer Patients Generated by Using Generative Adversarial Networks

by Luis Gonzalez-Abril, Cecilio Angulo, Juan Antonio Ortega and José-Luis Lopez-Guerra

Electronics 2022, 11(20), 3277; https://doi.org/10.3390/electronics11203277 - 12 Oct 2022

Cited by 5 | Viewed by 2715

Abstract

The development of healthcare patient digital twins in combination with machine learning technologies helps doctors in therapeutic prescription and in minimally invasive intervention procedures. The confidentiality of medical records or limited data availability in many health domains are drawbacks that can be overcome with the generation of synthetic data that conform to real data. The use of generative adversarial networks (GAN) for the generation of synthetic data of lung cancer patients has been previously introduced as a tool to solve this problem in the form of anonymized synthetic patients. However, generated synthetic data are mainly validated from the machine learning domain (loss functions) or expert domain (oncologists). In this paper, we propose statistical decision making as a validation tool: Is the model good enough to be used? Does the model pass rigorous hypothesis testing criteria? We show for the case at hand how loss functions and hypothesis validation are not always well aligned.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
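
To make the idea of hypothesis testing as a validation tool concrete, here is a minimal sketch for a single numeric variable. It is an assumption chosen for illustration, not the test procedure used in the paper: a two-sample Kolmogorov-Smirnov statistic with a permutation p-value, where the null hypothesis is that real and synthetic values come from the same distribution. The function names are hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def permutation_p_value(real, synthetic, n_permutations=500, rng=None):
    """Share of random relabelings whose KS statistic reaches the observed one."""
    rng = rng or random.Random()
    observed = ks_statistic(real, synthetic)
    pooled = list(real) + list(synthetic)
    n_real = len(real)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_real], pooled[n_real:]) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never zero
    return (hits + 1) / (n_permutations + 1)
```

A small p-value would suggest that the generator has not reproduced the distribution of that variable, whatever its training loss says. A full validation would repeat this per variable and also examine joint behavior.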

21 pages, 9663 KiB

Open Access Article

The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive, Synthetic Health Record

by Jason Walonoski, Dylan Hall, Karen M. Bates, M. Heath Farris, Joseph Dagher, Matthew E. Downs, Ryan T. Sivek, Ben Wellner, Andrew Gregorowicz, Marc Hadley, Francis X. Campion, Lauren Levine, Kevin Wacome, Geoff Emmer, Aaron Kemmer, Maha Malik, Jonah Hughes, Eldesia Granger and Sybil Russell

Electronics 2022, 11(8), 1199; https://doi.org/10.3390/electronics11081199 - 9 Apr 2022

Cited by 6 | Viewed by 11825

Abstract

The “Coherent Data Set” is a novel synthetic data set that leverages structured data from Synthea™ to create a longitudinal, “coherent” patient-level electronic health record (EHR). Comprised of synthetic patients, the Coherent Data Set is publicly available, reproducible using Synthea™, and free of the privacy risks that arise from using real patient data. The Coherent Data Set provides complex and representative health records that can be leveraged by health IT professionals without the risks associated with de-identified patient data. It includes familial genomes that were created through a simulation of the genetic reproduction process; magnetic resonance imaging (MRI) DICOM files created with a voxel-based computational model; clinical notes in the style of traditional subjective, objective, assessment, and plan notes; and physiological data that leverage existing Systems Biology Markup Language (SBML) models to capture non-linear changes in patient health metrics. HL7 Fast Healthcare Interoperability Resources (FHIR®) links the data together. The models can generate clinically logical health data, but ensuring clinical validity remains a challenge without comparable data to substantiate results. We believe this data set is the first of its kind and a novel contribution to practical health interoperability efforts.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)

10 pages, 1626 KiB

Open Access Article

MaWGAN: A Generative Adversarial Network to Create Synthetic Data from Datasets with Missing Data

by Thomas Poudevigne-Durance, Owen Dafydd Jones and Yipeng Qin

Electronics 2022, 11(6), 837; https://doi.org/10.3390/electronics11060837 - 8 Mar 2022

Cited by 6 | Viewed by 3071

Abstract

The creation of synthetic data is important for a range of applications, for example, to anonymise sensitive datasets or to increase the volume of data in a dataset. When the target dataset has missing data, then it is common to just discard incomplete observations, even though this necessarily means some loss of information. However, when the proportion of missing data is large, discarding incomplete observations may not leave enough data to accurately estimate their joint distribution. Thus, there is a need for data synthesis methods capable of using datasets with missing data, to improve accuracy and, in more extreme cases, to make data synthesis possible. To achieve this, we propose a novel generative adversarial network (GAN) called MaWGAN (for masked Wasserstein GAN), which creates synthetic data directly from datasets with missing values. As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. We introduce a novel methodology for comparing the generator output with the original data that does not require us to discard incomplete observations, based on a modification of the Wasserstein distance and easily implemented using masks generated from the pattern of missing data in the original dataset. Numerical experiments are used to demonstrate the superior performance of MaWGAN compared to (a) discarding incomplete observations before using a GAN, and (b) imputing missing values (using the GAIN algorithm) before using a GAN.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
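
The masking step mentioned in the abstract can be pictured with the small sketch below. It covers only that step, under assumptions: missing values are encoded as `None` or NaN, masked entries are set to zero, and the generator, critic, and modified Wasserstein objective are omitted entirely. The function names are hypothetical and are not taken from the MaWGAN code.

```python
import math


def missing_mask(rows):
    """Return 1.0 where a value is observed and 0.0 where it is None or NaN."""
    return [
        [
            0.0 if v is None or (isinstance(v, float) and math.isnan(v)) else 1.0
            for v in row
        ]
        for row in rows
    ]


def apply_mask(rows, mask):
    """Replace every entry the mask marks as missing with 0.0."""
    return [
        [v if m == 1.0 else 0.0 for v, m in zip(row, mask_row)]
        for row, mask_row in zip(rows, mask)
    ]


def masked_pair(real_batch, generated_batch):
    """Give real and generated batches the same missingness pattern."""
    mask = missing_mask(real_batch)
    return apply_mask(real_batch, mask), apply_mask(generated_batch, mask)
```

Because the generated batch receives the same pattern of gaps as the real batch, a critic comparing the two never sees the missing positions, and no incomplete row has to be discarded.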

17 pages, 12551 KiB

Open Access Feature Paper Article

Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain

by Mikel Hernandez, Gorka Epelde, Andoni Beristain, Roberto Álvarez, Cristina Molina, Xabat Larrea, Ane Alberdi, Michalis Timoleon, Panagiotis Bamidis and Evdokimos Konstantinidis

Electronics 2022, 11(5), 812; https://doi.org/10.3390/electronics11050812 - 4 Mar 2022

Cited by 19 | Viewed by 10326

Abstract

To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of the VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how state-of-the-art synthetic data generation techniques can be applied to real-world applications.

(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
