Data, Volume 9, Issue 12 (December 2024) – 15 articles

Cover Story: The detection of activities of daily living finds application in healthcare monitoring, smart homes, and the energy management of buildings. While many existing datasets of human activities rely on wearable sensors, this study introduces a dataset captured with ambient sensors. These data facilitate use cases in which subjects are unable or unwilling to carry wearable devices. The data were collected in a supervised recording process, where 14 participants performed 25 different activities of daily living individually. Five identical multisensor devices captured the audio, vibration, infrared array data, light color, and environmental measurements, resulting in five multi-modal data channels. The labeled raw recordings are provided in a structured dataset.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open them.
11 pages, 3341 KiB  
Data Descriptor
Advanced Methodology for Emulating Local Operating Conditions in Proton Exchange Membrane Fuel Cells
by Marine Cornet, Arnaud Morin, Jean-Philippe Poirot-Crouvezier and Yann Bultel
Data 2024, 9(12), 152; https://doi.org/10.3390/data9120152 - 20 Dec 2024
Viewed by 698
Abstract
This work focuses on the study of operating heterogeneities on a large MEA’s active surface area in a PEMFC stack. An advanced methodology is developed, aiming at the prediction of local operating conditions such as temperature, relative humidity and species concentration. A physics-based Pseudo-3D model developed under COMSOL Multiphysics allows for the observation of heterogeneities over the entire active surface area. Once predicted, these local operating conditions are experimentally emulated in a differential cell to provide the local polarization curves and electrochemical impedance spectra. Coupling simulation and experiment, thirty-seven local operating conditions are emulated to examine the degree of correlation between local operating conditions and PEMFC performance. Researchers and engineers can use the polarization curves and Electrochemical Impedance Spectroscopy diagrams to fit the variables of an empirical model or to validate the results of a theoretical model.

26 pages, 4793 KiB  
Article
A Framework for Current and New Data Quality Dimensions: An Overview
by Russell Miller, Harvey Whelan, Michael Chrubasik, David Whittaker, Paul Duncan and João Gregório
Data 2024, 9(12), 151; https://doi.org/10.3390/data9120151 - 18 Dec 2024
Viewed by 1313
Abstract
This paper presents a comprehensive exploration of data quality terminology, revealing a significant lack of standardisation in the field. The goal of this work was to conduct a comparative analysis of data quality terminology across different domains and structure it into a hierarchical data model. We propose a novel approach for aggregating disparate data quality terms used to describe the multiple facets of data quality under common umbrella terms with a focus on the ISO 25012 standard. We introduce four additional data quality dimensions: governance, usefulness, quantity, and semantics. These dimensions enhance specificity, complement the framework established by the ISO 25012 standard, and contribute to a broad understanding of data quality aspects. The ISO 25012 standard, a general standard for managing data quality in information systems, offers a foundation for the development of our proposed Data Quality Data Model. This is due to the prevalent nature of digital systems across a multitude of domains. In contrast, frameworks such as ALCOA+, which were originally developed for specific regulated industries, can be applied more broadly but may not always be generalisable. Ultimately, the model we propose aggregates and classifies data quality terminology, facilitating seamless communication about data quality between different domains when collaboration is required to tackle cross-domain projects or challenges. By establishing this hierarchical model, we aim to improve understanding and implementation of data quality practices, thereby addressing critical issues in various domains.

17 pages, 2053 KiB  
Data Descriptor
Genome-Scale DNA Methylome and Transcriptome Profiles of Prostate Cancer Recurrence After Prostatectomy
by Jim Smith, Priyadarshana Ajithkumar, Emma J. Wilkinson, Atreyi Dutta, Sai Shyam Vasantharajan, Angela Yee, Gregory Gimenez, Rathan M. Subramaniam, Michael Lau, Amir D. Zarrabi, Euan J. Rodger and Aniruddha Chatterjee
Data 2024, 9(12), 150; https://doi.org/10.3390/data9120150 - 16 Dec 2024
Viewed by 820
Abstract
Prostate cancer (PCa) is a major health burden worldwide, and despite early treatment, many patients present with biochemical recurrence (BCR) post-treatment, reflected by a rise in prostate-specific antigen (PSA) over a clinical threshold. Novel transcriptomic and epigenomic biomarkers can provide powerful tools for the clinical management of PCa. Here, we provide matched RNA sequencing and array-based genome-wide DNA methylome data of PCa patients (n = 17) with or without evidence of BCR following radical prostatectomy. Formalin-fixed paraffin-embedded (FFPE) tissues were used to generate these data, which included technical replicates to provide further validity of the data. We describe the sample features, experimental design, methods and bioinformatic pipelines for processing these multi-omic data. Importantly, comprehensive clinical, histopathological, and follow-up data for each patient are provided to enable the correlation of transcriptome and methylome features with clinical features. Our data will contribute towards the efforts of developing epigenomic and transcriptomic markers for BCR and also facilitate a deeper understanding of the molecular basis of PCa recurrence.

12 pages, 1297 KiB  
Data Descriptor
Unlocking New Opportunities for Spatial Analysis of Farms’ Income and Business Activities in Italy: The Agricultural Regions in Shapefile Format
by Sara Quaresima, Pasquale Nino, Concetta Cardillo and Arianna Di Paola
Data 2024, 9(12), 149; https://doi.org/10.3390/data9120149 - 13 Dec 2024
Viewed by 643
Abstract
Italy is divided into 773 Agricultural Regions (ARs) based on shared physical and agronomic characteristics. These regions offer a valuable tool for analyzing various geographical, socio-economic, and environmental aspects of agriculture, including the climate. However, the ARs have lacked geospatial data, limiting their analytical potential. This study introduces the “Italian ARs Dataset”, a georeferenced shapefile defining the boundaries of each AR. This dataset facilitates geographical assessments of Italy’s complex agricultural sector. It also unlocks the potential for integrating AR data with other datasets like the Farm Accounting Data Network (FADN) dataset, in Italy represented by the Rete di Informazione Contabile Agricola (RICA), which samples hundreds of thousands of farms annually. To demonstrate the dataset’s utility, a large sample of RICA data encompassing 179 irrigated crops from 2011 to 2021, covering all of Italy, was retrieved. Validation confirmed successful assignment of all ARs present in the RICA sample to the corresponding shapefile. Additionally, to encourage the use of the ARs Dataset with gridded data, different spatial-scale resolutions are tested to identify a suitable threshold. The minimal spatial scale identified is 0.11 degrees, a commonly adopted scale by several climate datasets within the EURO-CORDEX and COPERNICUS programs.
(This article belongs to the Section Spatial Data Science and Digital Earth)

21 pages, 5097 KiB  
Data Descriptor
Teal-WCA: A Climate Services Platform for Planning Solar Photovoltaic and Wind Energy Resources in West and Central Africa in the Context of Climate Change
by Salomon Obahoundje, Arona Diedhiou, Alberto Troccoli, Penny Boorman, Taofic Abdel Fabrice Alabi, Sandrine Anquetin, Louise Crochemore, Wanignon Ferdinand Fassinou, Benoit Hingray, Daouda Koné, Chérif Mamadou and Fatogoma Sorho
Data 2024, 9(12), 148; https://doi.org/10.3390/data9120148 - 10 Dec 2024
Viewed by 997
Abstract
To address the growing electricity demand driven by population growth and economic development while mitigating climate change, West and Central African countries are increasingly prioritizing renewable energy as part of their Nationally Determined Contributions (NDCs). This study evaluates the implications of climate change on renewable energy potential using ten downscaled and bias-adjusted CMIP6 models (CDFt method). Key climate variables (temperature, solar radiation, and wind speed) were analyzed and integrated into the Teal-WCA platform to aid in energy resource planning. Projected temperature increases of 0.5–2.7 °C (2040–2069) and 0.7–5.2 °C (2070–2099) relative to 1985–2014 underscore the need for strategies to manage the rising demand for cooling. Solar radiation reductions (~15 W/m²) may lower photovoltaic (PV) efficiency by 1–8.75%, particularly in high-emission scenarios, requiring a focus on system optimization and diversification. Conversely, wind speeds are expected to increase, especially in coastal regions, enhancing wind power potential by 12–50% across most countries and by 25–100% in coastal nations. These findings highlight the necessity of integrating climate-resilient energy policies that leverage wind energy growth while mitigating challenges posed by reduced solar radiation. By providing a nuanced understanding of the renewable energy potential under changing climatic conditions, this study offers actionable insights for sustainable energy planning in West and Central Africa.

16 pages, 3695 KiB  
Article
Parallel Simplex, an Alternative to Classical Experimentation: A Case Study
by Francisco Zorrilla Briones, Inocente Yuliana Meléndez Pastrana, Manuel Alonso Rodríguez Morachis and José Luís Anaya Carrasco
Data 2024, 9(12), 147; https://doi.org/10.3390/data9120147 - 10 Dec 2024
Viewed by 702
Abstract
Experimentation is a strong methodology that improves and optimizes processes. Nevertheless, in many cases, real-life dynamics of production demands and other restrictions inhibit the use of these methodologies because their use implies stopping production, generating scrap, jeopardizing demand accomplishments, and other problems. Proposed here is an alternative methodology to search for the best process variable levels and optimize the response of the process without the need to stop production. This algorithm is based on the principles of the Variable Simplex developed by Nelder and Mead and the continuous iterative process of EVOPS developed by Box, which was later modified as a simplex by Spendley. It is named parallel simplex because it searches for the best response with three independent simplexes searching for the same response at the same time. The algorithm was designed for three simplexes of two input variables each. The case study documented shows that it is efficient and effective.

23 pages, 7192 KiB  
Article
Data Decomposition Modeling Based on Improved Dung Beetle Optimization Algorithm for Wind Power Prediction
by Jiajian Ke and Tian Chen
Data 2024, 9(12), 146; https://doi.org/10.3390/data9120146 - 9 Dec 2024
Viewed by 714
Abstract
Accurate wind power forecasting is essential for maintaining the stability of a power system and enhancing scheduling efficiency in the power sector. To enhance prediction accuracy, this paper presents a hybrid wind power prediction model that integrates the improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN), the RIME optimization algorithm (RIME), sample entropy (SE), the improved dung beetle optimization (IDBO) algorithm, the bidirectional long short-term memory (BiLSTM) network, and multi-head attention (MHA). In this model, RIME is used to tune the parameters of ICEEMDAN, reducing data decomposition complexity and effectively capturing the original data information. The IDBO algorithm is then used to tune the hyperparameters of the MHA-BiLSTM model. The proposed RIME-ICEEMDAN-IDBO-MHA-BiLSTM model is compared with ten others in ablation experiments to validate its performance. The experimental findings show that the proposed model achieves MAPE values of 5.2%, 6.3%, 8.3%, and 5.8% across four datasets, confirming its superior predictive performance and higher accuracy.
(This article belongs to the Topic Decision-Making and Data Mining for Sustainable Computing)

17 pages, 6026 KiB  
Article
Formalization for Subsequent Computer Processing of Kara Sea Coastline Data
by Daria Bogatova and Stanislav Ogorodov
Data 2024, 9(12), 145; https://doi.org/10.3390/data9120145 - 9 Dec 2024
Cited by 1 | Viewed by 620
Abstract
This study aimed to develop a methodological framework for predicting shoreline dynamics using machine learning techniques, focusing on analyzing generalized data without distinguishing areas with higher or lower retreat rates. Three sites along the southwestern Kara Sea coast were selected for this investigation. The study analyzed key coastal features, including lithology, permafrost, and geomorphology, using a combination of field studies and remote sensing data. Essential datasets were compiled and formatted for computer-based analysis. These datasets included information on permafrost and the geomorphological characteristics of the coastal zone, climatic factors influencing the shoreline, and measurements of bluff top positions and retreat rates over defined time periods. The positions of the bluff tops were determined through a combination of imagery with varying resolutions and field measurements. A novel aspect of the study involved employing geostatistical methods to analyze erosion rates, providing new insights into the shoreline dynamics. The data analysis allowed us to identify coastal areas experiencing the most significant changes. By continually refining neural network models with these datasets, we can improve our understanding of the complex interactions between natural factors and shoreline evolution, ultimately aiding in developing effective coastal management strategies.

18 pages, 11734 KiB  
Data Descriptor
Multi-Modal Dataset of Human Activities of Daily Living with Ambient Audio, Vibration, and Environmental Data
by Thomas Pfitzinger, Marcel Koch, Fabian Schlenke and Hendrik Wöhrle
Data 2024, 9(12), 144; https://doi.org/10.3390/data9120144 - 9 Dec 2024
Viewed by 3287
Abstract
The detection of human activities is an important step in automated systems to understand the context of given situations. It can be useful for applications like healthcare monitoring, smart homes, and energy management systems for buildings. To achieve this, a sufficient data basis is required. The presented dataset contains labeled recordings of 25 different activities of daily living performed individually by 14 participants. The data were captured by five multisensors in supervised sessions in which a participant repeated each activity several times. Flawed recordings were removed, and the different data types were synchronized to provide multi-modal data for each activity instance. Apart from this, the data are presented in raw form, and no further filtering was performed. The dataset comprises ambient audio and vibration, as well as infrared array data, light color and environmental measurements. Overall, 8615 activity instances are included, each captured by the five multisensor devices. These multi-modal and multi-channel data allow various machine learning approaches to the recognition of human activities, for example, federated learning and sensor fusion.

21 pages, 6383 KiB  
Article
A Data Storage, Analysis, and Project Administration Engine (TMFdw) for Small- to Medium-Size Interdisciplinary Ecological Research Programs with Full Raster Data Capabilities
by Paulina Grigusova, Christian Beilschmidt, Maik Dobbermann, Johannes Drönner, Michael Mattig, Pablo Sanchez, Nina Farwig and Jörg Bendix
Data 2024, 9(12), 143; https://doi.org/10.3390/data9120143 - 6 Dec 2024
Viewed by 677
Abstract
Over almost 20 years, a data storage, analysis, and project administration engine (TMFdw) has been continuously developed in a series of consecutive interdisciplinary research projects on functional biodiversity of the southern Andes of Ecuador. Starting as a “working database”, the system now includes program management modules and literature databases, which are all accessible via a web interface. Originally designed to manage data in the ecological Research Unit 816 (SE Ecuador), the open software is now being used in several other environmental research programs, demonstrating its broad applicability. While the system was mainly developed for abiotic and biotic tabular data in the beginning, the new research program demands full capabilities to work with area-wide and high-resolution big models and remote sensing raster data. Thus, a raster engine was recently implemented based on the Geo Engine technology. The great variety of pre-implemented desktop GIS-like analysis options for raster, point, and vector data is an important incentive for researchers to use the system. A second incentive is to implement use cases prioritized by the researchers. As an example, we present machine learning models to generate high-resolution (30 m) microclimate raster layers for the study area in different temporal aggregation levels for the most important variables of air temperature, humidity, precipitation, and solar radiation. The models implemented as use cases outperform similar models developed in other research programs.

20 pages, 12402 KiB  
Article
Nearest-Better Network-Assisted Fitness Landscape Analysis of Contaminant Source Identification in Water Distribution Network
by Yiya Diao, Changhe Li, Sanyou Zeng and Shengxiang Yang
Data 2024, 9(12), 142; https://doi.org/10.3390/data9120142 - 6 Dec 2024
Viewed by 694
Abstract
Contaminant Source Identification in Water Distribution Network (CSWIDN) is critical for ensuring public health, and optimization algorithms are commonly used to solve this complex problem. However, these algorithms are highly sensitive to the problem’s landscape features, which has limited their effectiveness in practice. Despite this, there has been little experimental analysis of the fitness landscape for CSWIDN, particularly given its mixed-encoding nature. This study addresses this gap by conducting a comprehensive fitness landscape analysis of CSWIDN using the Nearest-Better Network (NBN), the only applicable method for mixed-encoding problems. Our analysis reveals for the first time that CSWIDN exhibits landscape features including neutrality, ruggedness, modality, dynamic change, and separability. These findings not only deepen our understanding of the problem’s inherent landscape features but also provide quantitative insights into how these features influence algorithm performance. Additionally, based on these insights, we propose specific algorithm design recommendations that are better suited to the unique challenges of the CSWIDN problem. This work advances the knowledge of CSWIDN optimization by both qualitatively characterizing its landscape and quantitatively linking these features to algorithms’ behaviors.
(This article belongs to the Topic Water and Energy Monitoring and Their Nexus)

6 pages, 1360 KiB  
Data Descriptor
A Dataset of Plant Species Richness in Chinese National Nature Reserves
by Chunjing Wang, Wuxian Yan and Jizhong Wan
Data 2024, 9(12), 141; https://doi.org/10.3390/data9120141 - 30 Nov 2024
Viewed by 834
Abstract
This comprehensive dataset on the number of plant species, genera, and families in 383 national nature reserves in China has been compiled based on the available literature. Heilongjiang Province and the Guangxi Zhuang Autonomous Region have the highest number of nature reserves. Species richness is relatively high in the Jinfoshan, Dabashan, Wenshan, Hupingshan, and Shennongjia Nature Reserves. This dataset provides important baseline information on plant species richness, coupled with genus and family numbers, in Chinese national nature reserves and should help researchers and environmentalists understand the dynamic species changes in various nature reserves. This detailed and reliable information may serve as the foundation for future plant research in Chinese nature reserves and play a positive role in promoting more effective natural protection, biological distribution, and biodiversity conservation in these areas.

21 pages, 6066 KiB  
Article
Algorithm for Trajectory Simplification Based on Multi-Point Construction in Preselected Area and Noise Smoothing Processing
by Simin Huang and Zhiying Yang
Data 2024, 9(12), 140; https://doi.org/10.3390/data9120140 - 29 Nov 2024
Viewed by 664
Abstract
Simplifying trajectory data can improve the efficiency of trajectory data analysis and query and reduce the communication cost and computational overhead of trajectory data. In this paper, a real-time trajectory simplification algorithm (SSFI) based on the spatio-temporal feature information of implicit trajectory points is proposed. The algorithm constructs the preselected area through the error measurement method based on the feature information of implicit trajectory points (IEDs) proposed in this paper, predicts the falling point of trajectory points, and realizes the one-way error-bounded simplified trajectory algorithm. Experiments show that the simplified algorithm improves markedly in three aspects: running speed, compression accuracy, and simplification rate. When the trajectory data scale is large, the performance of the algorithm is much better than that of other line segment simplification algorithms. GPS error cannot be avoided, but smoothing the trajectory with a Kalman filter can effectively eliminate the influence of noise and significantly improve the performance of the simplified algorithm. According to the characteristics of the trajectory data, this paper accurately constructs a mathematical model to describe the motion state of objects, so that the performance of the Kalman filter is better than other filters when smoothing trajectory data. In this paper, the trajectory data smoothing experiment is carried out by adding random Gaussian noise to the trajectory data. The experiment shows that the Kalman filter’s performance under the mathematical model is better than other filters.

32 pages, 969 KiB  
Article
Detective Gadget: Generic Iterative Entity Resolution over Dirty Data
by Marcello Buoncristiano, Giansalvatore Mecca, Donatello Santoro and Enzo Veltri
Data 2024, 9(12), 139; https://doi.org/10.3390/data9120139 - 25 Nov 2024
Viewed by 638
Abstract
In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same entity in the real world, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid integrating different entities or missing matches. However, existing approaches struggle with the challenges posed by rapidly changing data and the presence of dirtiness, which require iterative refinement over time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data-cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets.
(This article belongs to the Section Information Systems and Data Management)

16 pages, 362 KiB  
Article
CARE to Compare: A Real-World Benchmark Dataset for Early Fault Detection in Wind Turbine Data
by Christian Gück, Cyriana M. A. Roelofs and Stefan Faulstich
Data 2024, 9(12), 138; https://doi.org/10.3390/data9120138 - 23 Nov 2024
Viewed by 1288
Abstract
Early fault detection plays a crucial role in the field of predictive maintenance for wind turbines, yet the comparison of different algorithms poses a difficult task because domain-specific public datasets are scarce. Many comparisons of different approaches either use benchmarks composed of data from many different domains, inaccessible data, or one of the few publicly available datasets that lack detailed information about the faults. Moreover, many publications highlight a couple of case studies where fault detection was successful. With this paper, we publish a high-quality dataset that contains data from 36 wind turbines across 3 different wind farms as well as, as far as we know, the most detailed fault information of any public wind turbine dataset. The new dataset contains 89 years’ worth of real-world operating data of wind turbines, distributed across 44 labeled time frames for anomalies that led up to faults, as well as 51 time series representing normal behavior. Additionally, the quality of training data is ensured by turbine-status-based labels for each data point. Furthermore, we propose a new scoring method, called CARE (Coverage, Accuracy, Reliability and Earliness), which takes advantage of the information depth that is present in the dataset to identify good early fault detection models for wind turbines. This score considers the anomaly detection performance, the ability to recognize normal behavior properly, and the capability to raise as few false alarms as possible while simultaneously detecting anomalies early.