Data, Volume 7, Issue 2 (February 2022) – 14 articles

Cover Story: The ever-increasing amount of data generated from experiments and simulations in the engineering sciences relies more and more on data science applications to generate new knowledge. Comprehensive metadata descriptions and a suitable research data infrastructure are essential prerequisites for these tasks. Experimental tribology, in particular, presents some unique challenges in this regard due to the interdisciplinary nature of the field and the lack of existing standards. In this work, we demonstrate the versatility of the open-source research data infrastructure Kadi4Mat by managing and producing FAIR tribological data. As a showcase example, a tribological experiment is conducted by an experimental group with a focus on comprehensiveness.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • Although papers are published in both HTML and PDF forms, PDF is the official format. To view a paper in PDF form, click the "PDF Full-text" link and open it with the free Adobe Reader.
19 pages, 2685 KiB  
Article
A Mixture Hidden Markov Model to Mine Students’ University Curricula
by Silvia Bacci and Bruno Bertaccini
Data 2022, 7(2), 25; https://doi.org/10.3390/data7020025 - 21 Feb 2022
Cited by 2 | Viewed by 3069
Abstract
In the context of higher education, the wide availability of data gathered by universities for administrative purposes or for recording the evolution of students’ learning processes makes novel data mining techniques particularly useful for tackling critical issues. In Italy, current academic regulations allow students to customize the chronological sequence of the courses they have to attend to obtain the final degree. This leads to a variety of sequences of exams, with an average time taken to obtain the degree that may differ significantly from the time established by law. In this contribution, we propose a mixture hidden Markov model to classify students into groups that are homogeneous in terms of university paths, with the aim of detecting bottlenecks in the academic career and improving students’ performance.
(This article belongs to the Special Issue Education Data Mining)
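
As a rough illustration of the modeling idea, the sketch below clusters synthetic student careers with a mixture of hidden Markov models fitted by a simplified hard-assignment EM loop. The data, cluster count, and use of hmmlearn are assumptions for illustration; the paper fits the mixture by full maximum likelihood rather than this simplified scheme.

```python
# Sketch: clustering students' careers with a mixture of HMMs via
# hard-assignment EM. Data are synthetic (credits earned per semester);
# the paper fits the mixture by full maximum likelihood instead.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# Toy careers: each sequence holds credits earned in successive semesters.
sequences = [rng.normal(25, 8, size=rng.integers(4, 9)).reshape(-1, 1)
             for _ in range(60)]

K = 3  # number of latent student groups
models = [GaussianHMM(n_components=2, random_state=k) for k in range(K)]
labels = rng.integers(0, K, size=len(sequences))  # random initial clusters

for _ in range(10):  # hard EM: alternate refitting and reassignment
    for k, model in enumerate(models):
        member = [s for s, c in zip(sequences, labels) if c == k]
        if len(member) < 2:
            continue
        model.fit(np.concatenate(member), lengths=[len(s) for s in member])
    # Reassign each student to the HMM under which the career is most likely.
    labels = np.array([np.argmax([m.score(s) for m in models])
                       for s in sequences])

print(np.bincount(labels, minlength=K))  # cluster sizes
```
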
11 pages, 799 KiB  
Data Descriptor
British Columbia’s Index of Multiple Deprivation for Community Health Service Areas
by Sharon Relova, Yayuk Joffres, Drona Rasali, Li Rita Zhang, Geoffrey McKee and Naveed Janjua
Data 2022, 7(2), 24; https://doi.org/10.3390/data7020024 - 21 Feb 2022
Cited by 6 | Viewed by 5728
Abstract
Area-based socio-economic indicators, such as the Canadian Index of Multiple Deprivation (CIMD), have been used in equity analyses to inform strategies for improving needs-based, timely, and effective patient care and public health services for communities. The CIMD comprises four dimensions of deprivation: residential instability, economic dependency, ethno-cultural composition, and situational vulnerability. Using the CIMD methodology, the British Columbia Index of Multiple Deprivation (BCIMD) was developed to create indexes at the Community Health Service Area (CHSA) level in British Columbia (BC). BCIMD indexes are reported by quintiles, where quintile 1 represents the least deprived (or least ethno-culturally diverse) areas and quintile 5 the most deprived (or most diverse). The BCIMD can capture the distinctive characteristics of a community: a given CHSA may have a high level of deprivation in one dimension and a low level in another. These data have served as a surveillance tool for monitoring population demography and have informed healthcare decision making by stakeholders in the regional health authorities and governmental agencies. The data have also been linked to health care data, such as COVID-19 case incidence and vaccination coverage, to understand the epidemiology of disease burden through an equity lens.
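
A minimal sketch of the quintile convention described above, binning hypothetical deprivation scores into quintiles 1 (least deprived) through 5 (most deprived) with pandas; the area count and column names are invented for illustration.

```python
# Sketch: assigning deprivation quintiles to areas, mirroring how the
# BCIMD reports each dimension. Scores and column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
chsa = pd.DataFrame({
    "chsa_id": [f"CHSA-{i:03d}" for i in range(200)],  # toy set of areas
    "economic_dependency_score": rng.normal(size=200),
})

# Quintile 1 = least deprived, quintile 5 = most deprived.
chsa["economic_dependency_q"] = pd.qcut(
    chsa["economic_dependency_score"], q=5, labels=[1, 2, 3, 4, 5]
).astype(int)

print(chsa["economic_dependency_q"].value_counts().sort_index())
```
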
13 pages, 12355 KiB  
Data Descriptor
Dataset for the Heat-Up and Heat Transfer towards Single Particles and Synthetic Particle Clusters from Particle-Resolved CFD Simulations
by Mario Pichler, Markus Bösenhofer and Michael Harasek
Data 2022, 7(2), 23; https://doi.org/10.3390/data7020023 - 14 Feb 2022
Cited by 1 | Viewed by 2793
Abstract
Heat transfer to particles is a key aspect of the thermo-chemical conversion of pulverized fuels. These fuels tend to agglomerate in some areas of turbulent flow and to form particle clusters. Heat transfer and drag of such clusters differ significantly from the single-particle approximations commonly used in Euler–Lagrange models. This fact prompted a direct numerical investigation of the heat transfer and drag behavior of synthetic particle clusters consisting of 44 spheres of uniform diameter (60 μm). Particle-resolved computational fluid dynamics simulations were carried out to investigate the heat fluxes, the forces acting upon the particle cluster, and the heat-up times of particle clusters with multiple void fractions (0.477–0.999) and varying relative velocities (0.5–25 m/s). The integral heat fluxes, the exact position of each particle in the cluster, and the total acting force, derived from steady-state simulations, are reported for 85 different cases. The heat-up times of individual particles and of the particle clusters are provided for six cases (three cluster void fractions and two relative velocities each). Furthermore, the heat-up times of single particles with different commonly used representative particle diameters are presented. Depending on the case, the particle Reynolds number, the cluster void fraction, the Nusselt number, and the cluster drag coefficient are included in the secondary data.
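
For context, single-sphere heat transfer is often estimated with the classical Ranz–Marshall correlation, a common baseline for particle-resolved data of this kind. The sketch below evaluates it for the dataset's particle diameter and velocity range, with assumed air-like fluid properties that are not taken from the paper.

```python
# Sketch: Ranz–Marshall single-sphere correlation, a common baseline that
# particle-resolved heat-transfer data are compared against. Fluid
# properties below are air-like assumptions, not values from the paper.
import numpy as np

d_p = 60e-6                          # particle diameter [m], as in the dataset
u_rel = np.array([0.5, 5.0, 25.0])   # relative velocities [m/s]

rho, mu, k_f, cp = 1.2, 1.8e-5, 0.026, 1005.0  # assumed gas properties

Re = rho * u_rel * d_p / mu          # particle Reynolds number
Pr = cp * mu / k_f                   # Prandtl number
Nu = 2.0 + 0.6 * np.sqrt(Re) * Pr ** (1.0 / 3.0)  # Ranz–Marshall
h = Nu * k_f / d_p                   # heat transfer coefficient [W/m^2 K]

for u, re, nu, hh in zip(u_rel, Re, Nu, h):
    print(f"u={u:5.1f} m/s  Re={re:6.2f}  Nu={nu:5.2f}  h={hh:9.1f} W/m2K")
```
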
18 pages, 2416 KiB  
Review
The Comparison of Cybersecurity Datasets
by Ahmed Alshaibi, Mustafa Al-Ani, Abeer Al-Azzawi, Anton Konev and Alexander Shelupanov
Data 2022, 7(2), 22; https://doi.org/10.3390/data7020022 - 29 Jan 2022
Cited by 22 | Viewed by 8809
Abstract
According to a majority of sources, almost all industrial internet of things (IIoT) attacks happen at the data transmission layer. In IIoT, different machine learning (ML) and deep learning (DL) techniques are used for building intrusion detection systems (IDS) and models to detect attacks in any layer of its architecture. In this regard, minimizing attacks is a major objective of cybersecurity, while knowing that they cannot be fully avoided. Fewer people work on resisting attacks and building protection systems than on preparing attacks, so well-reasoned, learning-backed problems must be addressed with appropriate methods alongside quality datasets. The purpose of this paper is to describe the development of the cybersecurity datasets used to train the algorithms that build IDS detection models, and to analyze and summarize the different, well-known internet of things (IoT) attacks. This is carried out by assessing the outlines of various studies presented in the literature and the many problems with IoT threat detection. In a few experiments, hybrid frameworks have shown good performance and high detection rates compared to standalone machine learning methods. The researchers therefore recommend employing hybrid frameworks to identify IoT attacks for the foreseeable future.
19 pages, 1423 KiB  
Article
Development of a Web-Based Prediction System for Students’ Academic Performance
by Dabiah Alboaneen, Modhe Almelihi, Rawan Alsubaie, Raneem Alghamdi, Lama Alshehri and Renad Alharthi
Data 2022, 7(2), 21; https://doi.org/10.3390/data7020021 - 29 Jan 2022
Cited by 22 | Viewed by 8990
Abstract
Educational Data Mining (EDM) is used to extract and discover interesting patterns from educational institution datasets using Machine Learning (ML) algorithms. Much academic information related to students is available, so it is helpful to apply data mining to extract the factors affecting students’ academic performance. In this paper, a web-based system for predicting academic performance and identifying students at risk of failure through academic and demographic factors is developed. An ML model is developed to predict the total score of a course at the early stages. Several ML algorithms are applied, namely: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), and Linear Regression (LR). The model is applied to data on female students of the Computer Science Department at Imam Abdulrahman bin Faisal University (IAU). The dataset contains 842 instances for 168 students. The results showed that the prediction’s Mean Absolute Percentage Error (MAPE) reached 6.34% and that academic factors had a higher impact on students’ academic performance than demographic factors, with the midterm exam score at the top. The developed web-based prediction system is available on an online server and can be used by tutors.
(This article belongs to the Special Issue Education Data Mining)
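
A minimal sketch of the comparison workflow the abstract describes, fitting the five named regressors with scikit-learn and scoring them by MAPE. The features and data are hypothetical, and the paper's preprocessing is omitted.

```python
# Sketch: comparing the regressors named in the paper on a toy score-
# prediction task using MAPE. Feature names and data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(842, 3))   # e.g. midterm, quizzes, labs
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 3, 842)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVM": SVR(),
    "RF": RandomForestRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
    "ANN": MLPRegressor(max_iter=2000, random_state=0),
    "LR": LinearRegression(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mape = 100 * mean_absolute_percentage_error(y_te, pred)
    print(f"{name}: MAPE = {mape:.2f}%")
```
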
14 pages, 431 KiB  
Article
Collaborative Data Use between Private and Public Stakeholders—A Regional Case Study
by Claire Jean-Quartier, Miguel Rey Mazón, Mario Lovrić and Sarah Stryeck
Data 2022, 7(2), 20; https://doi.org/10.3390/data7020020 - 28 Jan 2022
Cited by 9 | Viewed by 5281
Abstract
Research and development are facilitated by sharing knowledge bases, and the innovation process benefits from collaborative efforts that involve the collective utilization of data. Until now, most companies and organizations have produced and collected various types of data and stored them in data silos that still have to be integrated with one another in order to enable knowledge creation. For this to happen, both public and private actors must adopt a flexible approach to achieve the transition needed to break up data silos and create collaborative data sharing between data producers and users. In this paper, we investigate several factors influencing cooperative data usage and explore the challenges posed by participation in cross-organizational data ecosystems by performing an interview study among stakeholders from private and public organizations in the context of the project IDE@S, which aims at fostering cooperation in data science in the Austrian federal state of Styria. We highlight the technological and organizational requirements of data infrastructure, expertise, and practices for collaborative data usage.
(This article belongs to the Special Issue A European Approach to the Establishment of Data Spaces)
4 pages, 168 KiB  
Editorial
Acknowledgment to Reviewers of Data in 2021
by Data Editorial Office
Data 2022, 7(2), 19; https://doi.org/10.3390/data7020019 - 28 Jan 2022
Viewed by 1828
Abstract
Rigorous peer review is the basis of high-quality academic publishing [...]
16 pages, 544 KiB  
Article
An Empirical Study on Data Validation Methods of Delphi and General Consensus
by Puthearath Chan
Data 2022, 7(2), 18; https://doi.org/10.3390/data7020018 - 27 Jan 2022
Cited by 17 | Viewed by 5545
Abstract
Data collection and review are the building blocks of academic research regardless of the discipline. The gathered and reviewed data, however, need to be validated in order to obtain accurate information. The Delphi consensus is known as a method for validating such data. However, several studies have shown that this method is time-consuming and requires a number of rounds to complete. Until now, there has been no clear evidence that validating data by a Delphi consensus is more significant than by a general consensus. In this regard, if the data validation results of the two methods are not significantly different, then just using a general consensus method is sufficient, easier, and less time-consuming. Hence, this study aims to find out whether or not data validation by a Delphi consensus method is more significant than by a general consensus method. This study firstly collected and reviewed data on sustainable building criteria, secondly validated these data by applying each consensus method, and finally compared the two consensus methods. The results showed that seventeen of the valid criteria obtained from the general consensus and reduced by the Delphi consensus were found to be inconsistent for sustainable building assessments in Cambodia. Therefore, this study concludes that using the Delphi consensus method is more significant in validating the gathered and reviewed data. This experiment contributes to the selection and application of consensus methods in validating data, information, or criteria, especially in engineering fields.
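
As an illustration of how consensus can be operationalized, the sketch below applies a common Delphi-style rule (median rating of at least 4 on a five-point scale with an IQR of at most 1) to hypothetical panel ratings; the thresholds and data are illustrative and may differ from the study's actual procedure.

```python
# Sketch: a common Delphi-style consensus rule (median >= 4 on a 5-point
# scale and IQR <= 1). Ratings and thresholds are illustrative only; the
# paper's exact validation procedure may differ.
import numpy as np

rng = np.random.default_rng(7)
criteria = [f"criterion_{i}" for i in range(1, 21)]
ratings = rng.integers(1, 6, size=(len(criteria), 12))  # 12 panelists

for name, row in zip(criteria, ratings):
    median = np.median(row)
    iqr = np.percentile(row, 75) - np.percentile(row, 25)
    verdict = "valid" if median >= 4 and iqr <= 1 else "rejected"
    print(f"{name}: median={median:.1f}, IQR={iqr:.1f} -> {verdict}")
```
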
17 pages, 1337 KiB  
Article
VC-SLAM—A Handcrafted Data Corpus for the Construction of Semantic Models
by Andreas Burgdorf, Alexander Paulus, André Pomp and Tobias Meisen
Data 2022, 7(2), 17; https://doi.org/10.3390/data7020017 - 25 Jan 2022
Cited by 6 | Viewed by 3102
Abstract
Ontology-based data management and knowledge graphs have emerged in recent years as efficient approaches for managing and utilizing diverse and large data sets. In this regard, research on algorithms for automatic semantic labeling and modeling, a prerequisite for both, has made steady progress in the form of new approaches. The range of algorithms varies in the type of information used (data schema, values, or metadata), as well as in the underlying methodology (e.g., the use of different machine learning methods or external knowledge bases). Approaches that have been established over the years, however, still come with various weaknesses. Most approaches are evaluated on a few small data corpora specific to the approach. This reduces comparability and also limits statements about the general applicability and performance of those approaches. Other research areas, such as computer vision or natural language processing, solve this problem by providing unified data corpora for the evaluation of specific algorithms and tasks. In this paper, we present and publish VC-SLAM to lay the necessary foundation for future research. This corpus allows the evaluation and comparison of semantic labeling and modeling approaches across different methodologies, and it is the first corpus that additionally allows textual data documentation to be leveraged for semantic labeling and modeling. Each of the 101 contained data sets consists of labels, data, and metadata, as well as corresponding semantic labels and a semantic model that were manually created by human experts using an ontology built explicitly for the corpus. We provide statistical information about the corpus as well as a critical discussion of its strengths and shortcomings, and we test the corpus with existing methods for labeling and modeling.
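
A minimal sketch of how labeling approaches might be scored against such a corpus, computing per-data-set label accuracy against gold annotations; the data layout and ontology terms are hypothetical, not VC-SLAM's actual schema.

```python
# Sketch: scoring a semantic-labeling approach against gold labels, per
# data set and averaged over a corpus. The layout and label format here
# are hypothetical, not VC-SLAM's actual schema.
from typing import Dict, List

def labeling_accuracy(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of columns whose predicted ontology label matches gold."""
    hits = sum(predicted.get(col) == label for col, label in gold.items())
    return hits / len(gold)

corpus: List[Dict[str, Dict[str, str]]] = [
    {"gold": {"temp": "ex:Temperature", "ts": "ex:Timestamp"},
     "pred": {"temp": "ex:Temperature", "ts": "ex:Date"}},
    {"gold": {"name": "ex:StationName"},
     "pred": {"name": "ex:StationName"}},
]
scores = [labeling_accuracy(d["gold"], d["pred"]) for d in corpus]
print(f"mean accuracy over {len(scores)} data sets: {sum(scores)/len(scores):.2f}")
```
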
28 pages, 4245 KiB  
Article
Regression-Based Approach to Test Missing Data Mechanisms
by Serguei Rouzinov and André Berchtold
Data 2022, 7(2), 16; https://doi.org/10.3390/data7020016 - 25 Jan 2022
Cited by 7 | Viewed by 4641
Abstract
Missing data occur in almost all surveys; in order to handle them correctly, it is essential to know their type. Missing data are generally divided into three types (or generating mechanisms): missing completely at random, missing at random, and missing not at random. The first step in understanding the type of missing data generally consists in testing whether the missing data are missing completely at random or not. Several tests have been developed for that purpose, but they have difficulties when dealing with non-continuous variables and data with a low quantity of missing values. Our approach checks whether the missing data are missing completely at random or missing at random using a regression model and a distribution test, and it can be applied to continuous and categorical data. The simulation results show that our regression-based approach tends to be more sensitive to the quantity and the type of missing data than the commonly used methods.
(This article belongs to the Section Information Systems and Data Management)
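
The general idea behind regression-based missingness tests can be sketched as follows: regress a missingness indicator on observed covariates, and take predictive covariates as evidence against MCAR in favor of MAR. This simplified illustration with statsmodels is not the authors' exact procedure.

```python
# Sketch of the general idea behind regression-based missingness tests:
# regress a missingness indicator on observed covariates; predictive
# covariates argue against MCAR in favor of MAR. This is a simplified
# illustration, not the authors' exact procedure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)                     # fully observed covariate
y = 2 * x + rng.normal(size=n)             # outcome subject to missingness
p_missing = 1 / (1 + np.exp(-(x - 0.5)))   # missingness depends on x -> MAR
r = (rng.uniform(size=n) < p_missing).astype(int)  # 1 = y is missing

model = sm.Logit(r, sm.add_constant(x)).fit(disp=False)
print(model.summary2().tables[1])  # a significant slope on x rejects MCAR
```
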
16 pages, 2662 KiB  
Article
Managing FAIR Tribological Data Using Kadi4Mat
by Nico Brandt, Nikolay T. Garabedian, Ephraim Schoof, Paul J. Schreiber, Philipp Zschumme, Christian Greiner and Michael Selzer
Data 2022, 7(2), 15; https://doi.org/10.3390/data7020015 - 25 Jan 2022
Cited by 6 | Viewed by 4197
Abstract
The ever-increasing amount of data generated from experiments and simulations in the engineering sciences relies more and more on data science applications to generate new knowledge. Comprehensive metadata descriptions and a suitable research data infrastructure are essential prerequisites for these tasks. Experimental tribology, in particular, presents some unique challenges in this regard due to the interdisciplinary nature of the field and the lack of existing standards. In this work, we demonstrate the versatility of the open-source research data infrastructure Kadi4Mat by managing and producing FAIR tribological data. As a showcase example, a tribological experiment is conducted by an experimental group with a focus on comprehensiveness. The result is a FAIR data package containing all produced data as well as machine- and user-readable metadata. The close collaboration between tribologists and software developers shows a practical bottom-up approach and how such infrastructures are an essential part of our FAIR digital future.
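
In the spirit of the FAIR data package described above, the sketch below writes machine-readable metadata alongside raw tribometer files; the keys and values are illustrative and do not reproduce Kadi4Mat's actual record schema or client API.

```python
# Sketch: writing machine-readable metadata next to raw tribometer data,
# in the spirit of the paper's FAIR data package. The keys below are
# illustrative, not Kadi4Mat's actual record schema.
import json
from pathlib import Path

record = {
    "identifier": "tribo-experiment-001",            # hypothetical record
    "title": "Ball-on-disc dry sliding test",
    "license": "CC-BY-4.0",
    "extras": {
        "normal_load_N": 2.0,
        "sliding_speed_mm_s": 5.0,
        "counter_body": "100Cr6 ball, 10 mm diameter",
        "environment": {"temperature_C": 23.0, "humidity_pct": 45.0},
    },
    "files": ["friction_force.csv", "wear_track_profile.csv"],
}

Path("metadata.json").write_text(json.dumps(record, indent=2))
print(json.dumps(record["extras"], indent=2))
```
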
15 pages, 2494 KiB  
Article
Analysing Computer Science Courses over Time
by Renza Campagni, Donatella Merlini and Maria Cecilia Verri
Data 2022, 7(2), 14; https://doi.org/10.3390/data7020014 - 24 Jan 2022
Viewed by 2545
Abstract
In this paper, we consider the courses of a Computer Science degree at an Italian university from 2011 to 2020. For each course, we know the number of exams taken by students during a given calendar year and the corresponding average grade; we also know the average normalized score on the entrance test and the distribution of students according to gender. Using classification and clustering techniques, we analyze different data sets obtained by pre-processing the original data with information about students and their exams, and we highlight which courses show a significant deviation over time from the typical progression of courses in the same teaching year. Finally, we give heat maps showing the order in which exams were taken by graduated students. The paper presents a reproducible methodology that can be applied to any degree course with a similar organization to identify courses that present critical issues over time. A strength of the work is that it considers courses over time as the variables of interest, instead of the more frequently used personal and academic data concerning students.
(This article belongs to the Special Issue Education Data Mining)
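
As a rough sketch of the clustering side of this methodology, the code below groups synthetic courses by their per-year average grades to expose outliers from the typical progression; the data and feature choice are assumptions, and the paper additionally uses classification and heat maps.

```python
# Sketch: clustering courses by their per-year exam statistics to spot
# outliers from the typical progression. The course matrix is synthetic;
# the paper combines clustering with classification and heat maps.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
# Rows: courses; columns: average grade in each calendar year 2011-2020
# (Italian grades range from 18 to 30).
grades = np.clip(rng.normal(25, 2, size=(30, 10)), 18, 30)
grades[5] -= 4  # one course drifting below the others

X = StandardScaler().fit_transform(grades)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster in range(3):
    print(f"cluster {cluster}: courses {np.where(labels == cluster)[0]}")
```
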
27 pages, 4208 KiB  
Data Descriptor
#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
by Gabriel Oliveira dos Santos, Esther Luna Colombini and Sandra Avila
Data 2022, 7(2), 13; https://doi.org/10.3390/data7020013 - 21 Jan 2022
Cited by 5 | Viewed by 4599
Abstract
Automatically describing images using natural sentences is essential to the inclusion of visually impaired people on the Internet. This problem is known as image captioning. There are many datasets in the literature, but most contain only English captions, whereas datasets with captions in other languages are scarce. We introduce #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese. In contrast to popular datasets, #PraCegoVer has only one reference per image, and both the mean and variance of the reference sentence length are significantly high, which makes our dataset challenging due to its linguistic aspects. We carry out a detailed analysis to find the main classes and topics in our data, and we compare #PraCegoVer to the MS COCO dataset in terms of sentence length and word frequency. We hope that the #PraCegoVer dataset encourages more work addressing the automatic generation of descriptions in Portuguese.
(This article belongs to the Section Information Systems and Data Management)
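
A minimal sketch of the kind of sentence-length statistics the abstract mentions, computed on a few toy Portuguese captions; loading the actual #PraCegoVer files is omitted.

```python
# Sketch: the kind of sentence-length comparison reported in the paper,
# here on toy captions. Loading the actual #PraCegoVer files is omitted.
import numpy as np

captions = [
    "Foto de um cachorro correndo na praia ao entardecer.",
    "Pessoa segurando um cartaz em uma manifestação no centro da cidade.",
    "Prato de comida.",
]
lengths = np.array([len(c.split()) for c in captions])
print(f"mean length: {lengths.mean():.1f} words, variance: {lengths.var():.1f}")
```
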
11 pages, 1554 KiB  
Article
Linking and Sharing Technology: Partnerships for Data Innovations for Management of Agricultural Big Data
by Tulsi P. Kharel, Amanda J. Ashworth and Phillip R. Owens
Data 2022, 7(2), 12; https://doi.org/10.3390/data7020012 - 20 Jan 2022
Cited by 8 | Viewed by 4012
Abstract
Combining data into a centralized, searchable, and linked platform will provide agricultural stakeholders and researchers with a data exploration platform for better agricultural decision making, thus fully utilizing existing data and preventing redundant research. Such a data repository requires readiness to share data, knowledge, and skillsets, and to work with Big Data infrastructures. With the adoption of new technologies and increased data collection, agricultural workforces need to update their knowledge, skills, and abilities. The partnerships for data innovation (PDI) effort integrates agricultural data by efficiently capturing them from field, lab, and greenhouse studies using a variety of sensors, tools, and apps, and it provides quick visualization and summary statistics for real-time decision making. This paper aims to evaluate and provide examples of case studies currently using PDI and to use its long-term continental US database (18 locations and 24 years) to test cover crop and grazing effects on soil organic carbon (SOC) storage. The results show that legume and rye (Secale cereale L.) cover crops increased SOC storage by 36% and 50%, respectively, compared with oat (Avena sativa L.) and rye mixtures, and that low and high grazing intensities improved the upper SOC by 69–72% compared with a medium grazing intensity. This was likely due to legumes providing a more favorable substrate for SOC formation and to high-grazing-intensity systems having continuous manure deposition. Overall, PDI can be used to democratize data regionally and nationally and can therefore address large-scale research questions aimed at agricultural grand challenges.
(This article belongs to the Section Information Systems and Data Management)