Data, Volume 8, Issue 9 (September 2023) – 10 articles

Cover Story: Stimulating emotional states is crucial in therapeutic settings and is often achieved using visual exposure. Significant tools in this area are affective databases, like the Nencki Affective Picture System (NAPS), which stores standardized emotion-evoking images. While the NAPS offers a wealth of realistic stimuli, it also highlights the challenges faced by current affective databases, such as inconsistent semantic descriptions and the need for more semantic integration. This is where the presented Knowledge Graph (KG) dataset plays a pivotal role. KGs provide a structured and detailed representation of information, offering a high-level understanding of the content and context of each emotion-evoking picture, thereby improving the accuracy and efficiency of these databases.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open them.
11 pages, 960 KiB  
Data Descriptor
Potential Range Map Dataset of Indian Birds
by Arpit Deomurari, Ajay Sharma, Dipankar Ghose and Randeep Singh
Data 2023, 8(9), 144; https://doi.org/10.3390/data8090144 - 21 Sep 2023
Viewed by 3328
Abstract
Conservation management relies heavily on accurate species distribution data. However, distributional information for most species is limited to range maps, which may lack the resolution needed to take conservation action or to determine current distribution status. In many cases, distribution maps are difficult to access in data formats suitable for analysis and conservation planning. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at both a national scale and a high spatial resolution of 1 km². Thus, our study provides current distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps for conservation issues. Additionally, by superimposing the distribution maps of different species, we can locate hotspots for bird diversity and align conservation action.

24 pages, 6418 KiB  
Article
A New Odd Beta Prime-Burr X Distribution with Applications to Petroleum Rock Sample Data and COVID-19 Mortality Rate
by Ahmad Abubakar Suleiman, Hanita Daud, Narinderjit Singh Sawaran Singh, Aliyu Ismail Ishaq and Mahmod Othman
Data 2023, 8(9), 143; https://doi.org/10.3390/data8090143 - 19 Sep 2023
Cited by 5 | Viewed by 1889
Abstract
In this article, we introduce a new Burr X distribution, called the OBP-Burr X (OBPBX) distribution, using the odd beta prime generalized (OBP-G) family of distributions. The density function of this model can be symmetric, left-skewed, right-skewed, or reversed-J-shaped, while the hazard function can be monotonically increasing, decreasing, bathtub-shaped, or N-shaped, making the model suitable for skewed data and failure rates. Various statistical properties of the new model are obtained, such as moments, moment-generating function, entropies, quantile function, and limit behavior. The maximum-likelihood-estimation procedure is utilized to determine the parameters of the model. A Monte Carlo simulation study is implemented to ascertain the efficiency of maximum-likelihood estimators. The findings demonstrate the empirical applicability and flexibility of the OBPBX distribution, as shown through the analysis of petroleum rock samples and COVID-19 mortality data, along with its superior performance compared to well-known extended versions of the Burr X distribution. We anticipate that the new distribution will attract a wider readership and provide a vital tool for modeling various phenomena in different domains.

8 pages, 212 KiB  
Communication
Update of Dietary Supplement Label Database Addressing on Coding in Italy
by Giorgia Perelli, Roberta Bernini, Massimo Lucarini and Alessandra Durazzo
Data 2023, 8(9), 142; https://doi.org/10.3390/data8090142 - 13 Sep 2023
Cited by 1 | Viewed by 2103
Abstract
Harmonized composition data for foods and dietary supplements are needed for research and for policy decision making. For a correct assessment of dietary intake, the categorization and the classification of food products and dietary supplements are necessary. In recent decades, the marketing of dietary supplements has increased. A principal feature of a food supplement database is its intrinsic dynamism: formulations change continuously, so the market must be monitored constantly and the database updated regularly. This study presents an update to the Dietary Supplement Label Database in Italy focused on dietary supplements coding. The updated dataset, presented here for the first time, consists of the codes of 216 dietary supplements currently on the market in Italy that have functional foods as their characterizing ingredients, coded using the two most commonly used description and classification systems: LanguaL™ and FoodEx2. This update represents a unique tool and guideline for other compilers and users for applying classification coding systems to dietary supplements. Moreover, this updated dataset represents a valuable resource for several applications such as epidemiological investigations, exposure studies, and dietary assessment.

18 pages, 7583 KiB  
Data Descriptor
Thailand Raw Water Quality Dataset Analysis and Evaluation
by Jaturapith Krohkaew, Pongpon Nilaphruek, Niti Witthayawiroj, Sakchai Uapipatanakul, Yamin Thwe and Padma Nyoman Crisnapati
Data 2023, 8(9), 141; https://doi.org/10.3390/data8090141 - 4 Sep 2023
Cited by 4 | Viewed by 3594
Abstract
Sustainable water quality data are important for understanding historical variability and trends in river regimes, as well as the impact of industrial waste on the health of aquatic ecosystems. Sustainable water management practices heavily depend on reliable and comprehensive data, prompting the need for accurate monitoring and assessment of water quality parameters. This research describes a reconstructed daily water quality dataset that complements rare historical observations for six station points along the Chao Phraya River in Thailand. Internet of Things technology and a Eureka water probe sensor are used to collect and reconstruct the water quality dataset for the period from June 2022 to February 2023, with Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Spatial Conductivity, Acidity/Basicity, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth as the recorded parameters from six different stations. The presented dataset comprises a total of 211,322 data points, which are separated into six CSV files. The dataset is then evaluated using the Long Short-Term Memory (LSTM) algorithm, yielding a Mean Squared Error (MSE) of 0.0012256 and a Root Mean Squared Error (RMSE) of 0.0350080. The proposed dataset provides valuable insights for researchers studying river ecosystems, supporting informed decision-making and sustainable water management practices.
(This article belongs to the Section Information Systems and Data Management)

12 pages, 429 KiB  
Article
Employing Source Code Quality Analytics for Enriching Code Snippets Data
by Thomas Karanikiotis, Themistoklis Diamantopoulos and Andreas Symeonidis
Data 2023, 8(9), 140; https://doi.org/10.3390/data8090140 - 31 Aug 2023
Cited by 2 | Viewed by 2700
Abstract
The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. To this end, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
(This article belongs to the Section Information Systems and Data Management)

31 pages, 3457 KiB  
Data Descriptor
Dataset of Multi-Aspect Integrated Migration Indicators
by Diletta Goglia, Laura Pollacci and Alina Sîrbu
Data 2023, 8(9), 139; https://doi.org/10.3390/data8090139 - 31 Aug 2023
Cited by 1 | Viewed by 2030
Abstract
New branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. New knowledge extracted from these data must be validated using traditional data, which are, however, distributed across different sources and difficult to integrate. In this context, we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration indicators (flows and stocks) and possible migration drivers (cultural, economic, demographic and geographic indicators). It was obtained through the acquisition, transformation and integration of disparate traditional datasets together with social network data from Facebook (Social Connectedness Index). This article describes the process of gathering, embedding and merging traditional and novel variables, resulting in a new multidisciplinary dataset that we believe could significantly contribute to nowcasting and forecasting bilateral migration trends and migration drivers.

17 pages, 4005 KiB  
Article
Using Landsat-5 for Accurate Historical LULC Classification: A Comparison of Machine Learning Models
by Denis Krivoguz, Sergei G. Chernyi, Elena Zinchenko, Artem Silkin and Anton Zinchenko
Data 2023, 8(9), 138; https://doi.org/10.3390/data8090138 - 30 Aug 2023
Cited by 10 | Viewed by 2760
Abstract
This study investigates the application of various machine learning models for land use and land cover (LULC) classification in the Kerch Peninsula. The study utilizes archival field data, cadastral data, and published scientific literature for model training and testing, using Landsat-5 imagery from 1990 as input data. Four machine learning models (deep neural network, Random Forest, support vector machine (SVM), and AdaBoost) are employed, and their hyperparameters are tuned using random search and grid search. Model performance is evaluated through cross-validation and confusion matrices. The deep neural network achieves the highest accuracy (96.2%) and performs well in classifying water, urban lands, open soils, and high vegetation. However, it faces challenges in classifying grasslands, bare lands, and agricultural areas. The Random Forest model achieves an accuracy of 90.5% but struggles to differentiate high vegetation from agricultural lands. The SVM model achieves an accuracy of 86.1%, while the AdaBoost model performs worst, with an accuracy of 58.4%. The novel contributions of this study include the comparison and evaluation of multiple machine learning models for land use classification in the Kerch Peninsula. The deep neural network and Random Forest models outperform SVM and AdaBoost in terms of accuracy. However, the reliance on limited data sources such as cadastral data and scientific articles may introduce errors. Future research should consider incorporating field studies and additional data sources for improved accuracy. This study provides valuable insights for land use classification, facilitating the assessment and management of natural resources in the Kerch Peninsula. The findings contribute to informed decision-making processes and lay the groundwork for further research in the field.

21 pages, 3462 KiB  
Article
A Framework for Evaluating Renewable Energy for Decision-Making Integrating a Hybrid FAHP-TOPSIS Approach: A Case Study in Valle del Cauca, Colombia
by Mateo Barrera-Zapata, Fabian Zuñiga-Cortes and Eduardo Caicedo-Bravo
Data 2023, 8(9), 137; https://doi.org/10.3390/data8090137 - 30 Aug 2023
Cited by 1 | Viewed by 2205
Abstract
At present, the energy landscape of many countries faces transformational challenges driven by sustainable development objectives, supported by the implementation of clean technologies, such as renewable energy sources, to meet the flexibility and diversification needs of the traditional energy mix. However, integrating these technologies requires a thorough study of the context in which they are developed. Furthermore, it is necessary to carry out an analysis from a sustainable approach that quantifies the impact of proposals on multiple objectives established by stakeholders. This article presents a framework for analysis that integrates a method for evaluating the technical feasibility of resources for photovoltaic solar, wind, small hydroelectric power, and biomass generation. These resources are used to construct a set of alternatives, which are evaluated using a hybrid FAHP-TOPSIS approach. FAHP-TOPSIS is used as a comparison technique across a collection of technical, economic, and environmental criteria, ranking the alternatives by their level of trade-off between criteria. The results of a case study in Valle del Cauca (Colombia) offer a wide range of alternatives and indicate a combination of 50% biomass and 50% solar as the best, assisting in decision-making for the correct use of available resources and maximizing the benefits for stakeholders.

15 pages, 5088 KiB  
Data Descriptor
Knowledge Graph Dataset for Semantic Enrichment of Picture Description in NAPS Database
by Marko Horvat, Gordan Gledec, Tomislav Jagušt and Zoran Kalafatić
Data 2023, 8(9), 136; https://doi.org/10.3390/data8090136 - 24 Aug 2023
Cited by 1 | Viewed by 1706
Abstract
This data description introduces a comprehensive knowledge graph (KG) dataset with detailed information about the relevant high-level semantics of visual stimuli used to induce emotional states stored in the Nencki Affective Picture System (NAPS) repository. The dataset contains 6808 systematically manually assigned annotations for 1356 NAPS pictures in 5 categories, linked to WordNet synsets and Suggested Upper Merged Ontology (SUMO) concepts presented in a tabular format. Both knowledge databases provide an extensive and supervised taxonomy glossary suitable for describing picture semantics. The annotation glossary consists of 935 WordNet and 513 SUMO entities. A description of the dataset and the specific processes used to collect, process, review, and publish the dataset as open data are also provided. This dataset is unique in that it captures complex objects, scenes, actions, and the overall context of emotional stimuli with knowledge taxonomies at a high level of quality. It provides a valuable resource for a variety of projects investigating emotion, attention, and related phenomena. In addition, researchers can use this dataset to explore the relationship between emotions and high-level semantics or to develop data-retrieval tools to generate personalized stimuli sequences. The dataset is freely available in common formats (Excel and CSV).

20 pages, 5675 KiB  
Article
Enhancing Small Tabular Clinical Trial Dataset through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP
by Winston Wang and Tun-Wen Pai
Data 2023, 8(9), 135; https://doi.org/10.3390/data8090135 - 23 Aug 2023
Cited by 7 | Viewed by 3718
Abstract
This study addressed the challenge of training generative adversarial networks (GANs) for data augmentation on small tabular clinical trial datasets, which are difficult to train on because of their limited sample sizes. To overcome this obstacle, a hybrid approach is proposed: the synthetic minority oversampling technique (SMOTE) first augments the original data to a more substantial size, which improves the subsequent training of a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), a model known for its state-of-the-art performance and enhanced stability. The ultimate objective of this research was to demonstrate that synthetic tabular data generated by the final WCGAN-GP model under this hybrid approach maintain the structural integrity and statistical representation of the original small dataset. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted access to subject enrollment poses common challenges. Despite the limited data, the findings demonstrate that the hybrid approach successfully generates synthetic data that closely preserve the characteristics of the original small dataset. By harnessing this hybrid approach to generate faithful synthetic data, the potential for enhancing data-driven research in drug clinical trials becomes evident. This includes enabling robust analysis of small datasets, supplementing scarce clinical trial data, facilitating their use in machine learning tasks, and even using the model for anomaly detection to ensure better quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.
(This article belongs to the Topic Machine Learning Techniques Driven Medicine Analysis)