Data | January 2022 - Browse Articles

22 pages, 6057 KiB

Open Access | Article

An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data

by Mohamed Reda Al-Bana, Marwa Salah Farhan and Nermin Abdelhakim Othman

Data 2022, 7(1), 11; https://doi.org/10.3390/data7010011 - 14 Jan 2022

Cited by 10 | Viewed by 6350

Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns in transactional databases, with uses in prediction, association rules, classification, etc. Apriori is an elementary, iterative FIM algorithm that scans the dataset multiple times to generate frequent itemsets of different cardinalities. Its performance degrades as the data grow because of these repeated scans. Eclat is a scalable version of the Apriori algorithm that uses a vertical layout. The vertical layout has many advantages: it avoids multiple dataset scans and carries the information needed to find each itemset’s support. In a vertical layout, itemset support is obtained by intersecting transaction ids (tidsets/tids) and pruning irrelevant itemsets. However, when the tidsets become too big for memory, the algorithm’s efficiency suffers. In this paper, we introduce SHFIM (Spark-based hybrid frequent itemset mining), a three-phase algorithm that uses both horizontal and vertical layouts and relies on diffsets instead of tidsets, keeping track of the differences between transaction ids rather than the intersections. Moreover, some improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested on the Spark framework, which uses the RDD (resilient distributed dataset) concept and in-memory processing to address the problems of the MapReduce framework. We compared the performance of SHFIM with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results showed that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
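
The difference between tidsets and diffsets described in this abstract can be illustrated with a minimal sketch. It is plain Python rather than Spark, it is not the SHFIM implementation, and the toy transactions and all names (`transactions`, `to_vertical`, and so on) are hypothetical. It only shows that the Eclat-style intersection and the dEclat-style difference give the same support counts.

```python
from typing import Dict, Set

# Hypothetical toy database in horizontal layout: transaction id -> items.
transactions: Dict[int, Set[str]] = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"b", "c"},
    5: {"a", "b", "c"},
}


def to_vertical(db: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    """Convert the horizontal layout into a vertical one: item -> tidset."""
    vertical: Dict[str, Set[int]] = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


tidsets = to_vertical(transactions)

# Eclat style: support({a, b}) is the size of the tidset intersection.
support_ab_eclat = len(tidsets["a"] & tidsets["b"])

# dEclat style: the diffset of {a, b} with respect to the prefix {a} holds
# the transactions that contain a but not b.
diffset_ab = tidsets["a"] - tidsets["b"]
diffset_ac = tidsets["a"] - tidsets["c"]
support_ab_declat = len(tidsets["a"]) - len(diffset_ab)

# One level deeper: d(abc) = d(ac) - d(ab), support(abc) = support(ab) - |d(abc)|.
diffset_abc = diffset_ac - diffset_ab
support_abc = support_ab_declat - len(diffset_abc)

print(support_ab_eclat, support_ab_declat, support_abc)
```

On dense data the diffsets tend to be much smaller than the tidsets they replace, which is the memory saving the abstract refers to.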

(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

20 pages, 604 KiB

Open Access | Article

The Impact of Global Structural Information in Graph Neural Networks Applications

by Davide Buffelli and Fabio Vandin

Data 2022, 7(1), 10; https://doi.org/10.3390/data7010010 - 13 Jan 2022

Cited by 3 | Viewed by 3670

Graph Neural Networks (GNNs) rely on the graph structure to define an aggregation strategy where each node updates its representation by combining information from its neighbours. A known limitation of GNNs is that, as the number of layers increases, information gets smoothed and squashed and node embeddings become indistinguishable, negatively affecting performance. Therefore, practical GNN models employ few layers and only leverage the graph structure in terms of limited, small neighbourhoods around each node. Inevitably, practical GNNs do not capture information depending on the global structure of the graph. While there have been several works studying the limitations and expressivity of GNNs, the question of whether practical applications on graph-structured data require global structural knowledge remains unanswered. In this work, we empirically address this question by giving several GNN models access to global information and observing the impact it has on downstream performance. Our results show that global information can in fact provide significant benefits for common graph-related tasks. We further identify a novel regularization strategy that leads to an average accuracy improvement of more than 5% on all considered tasks.
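
The neighbourhood aggregation and the smoothing effect mentioned in this abstract can be sketched in a few lines. This is a generic mean-aggregation layer without learned weights, not one of the models evaluated in the article. The three-node path graph, the one-dimensional features, and all function names are hypothetical.

```python
from typing import Dict, List

Features = Dict[int, List[float]]
Adjacency = Dict[int, List[int]]


def mean_aggregate(features: Features, adjacency: Adjacency) -> Features:
    """One message-passing layer: average each node with its neighbours."""
    updated: Features = {}
    for node, neighbours in adjacency.items():
        group = [features[node]] + [features[n] for n in neighbours]
        dim = len(features[node])
        updated[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return updated


def run_layers(features: Features, adjacency: Adjacency, num_layers: int) -> Features:
    """Apply the aggregation layer repeatedly."""
    for _ in range(num_layers):
        features = mean_aggregate(features, adjacency)
    return features


def spread(features: Features) -> float:
    """Largest gap between any two nodes in the first embedding dimension."""
    values = [vec[0] for vec in features.values()]
    return max(values) - min(values)


# Hypothetical path graph 0 - 1 - 2 with a single distinguishing feature.
adjacency: Adjacency = {0: [1], 1: [0, 2], 2: [1]}
features: Features = {0: [1.0], 1: [0.0], 2: [0.0]}

shallow = run_layers(features, adjacency, 1)
deep = run_layers(features, adjacency, 50)
print(spread(features), spread(shallow), spread(deep))
```

After one layer the nodes are still clearly separated, while after many layers their embeddings are nearly identical, which is why practical models keep the layer count small.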

(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

10 pages, 889 KiB

Open Access | Editor’s Choice | Data Descriptor

A Repertoire of Virtual-Reality, Occupational Therapy Exercises for Motor Rehabilitation Based on Action Observation

by Emilia Scalona, Doriana De Marco, Maria Chiara Bazzini, Arturo Nuara, Adolfo Zilli, Elisa Taglione, Fabrizio Pasqualetti, Generoso Della Polla, Nicola Francesco Lopomo, Maddalena Fabbri-Destro and Pietro Avanzini

Data 2022, 7(1), 9; https://doi.org/10.3390/data7010009 - 11 Jan 2022

Cited by 2 | Viewed by 3616

There is a growing interest in action observation treatment (AOT), i.e., a rehabilitative procedure combining action observation, motor imagery, and action execution to promote the recovery, maintenance, and acquisition of motor abilities. AOT studies have employed basic upper limb gestures as stimuli but, in principle, the AOT approach can be effectively extended to more complex actions like occupational gestures. Here, we present a repertoire of virtual-reality (VR) stimuli depicting occupational therapy exercises intended for AOT, potentially suitable for occupational safety and injury prevention. We animated a humanoid avatar by fitting the kinematics recorded from a healthy subject performing the exercises. All the stimuli are available via a custom-made graphical user interface, which allows the user to adjust several visualization parameters like the viewpoint, the number of repetitions, and the observed movement’s speed. Beyond providing clinicians with a set of VR stimuli that promote, via AOT, the recovery of goal-oriented occupational gestures, such a repertoire could extend the use of AOT to the field of occupational safety and injury prevention.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

27 pages, 10662 KiB

Open Access | Data Descriptor

TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels

by Muhammad Imran, Umair Qazi and Ferda Ofli

Data 2022, 7(1), 8; https://doi.org/10.3390/data7010008 - 10 Jan 2022

Cited by 27 | Viewed by 5907

As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation, but such data are difficult to obtain. The widespread usage of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates useful for authorities to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to segregate user gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks to understand real-world issues at national and sub-national levels. We believe this multilingual data with broader geographical and longer temporal coverage will be a cornerstone for researchers to study impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.

(This article belongs to the Section Information Systems and Data Management)

42 pages, 6853 KiB

Open Access | Article

Knowledge Management Model for Smart Campus in Indonesia

by Deden Sumirat Hidayat and Dana Indra Sensuse

Data 2022, 7(1), 7; https://doi.org/10.3390/data7010007 - 10 Jan 2022

Cited by 14 | Viewed by 6494

The application of smart campuses (SC), especially at higher education institutions (HEIs) in Indonesia, is very diverse and does not yet follow standards. As a result, SC practice is spread across various areas in an unstructured and uneven manner. Knowledge management (KM) is one of the critical components of SC. However, the use of KM to support SC is less clearly discussed. Most implementations and assumptions still consider the latest IT application as the SC component. As such, this study aims to identify the components of the KM model for SC. This study used a systematic literature review (SLR) technique with PRISMA procedures, an analytic hierarchy process (AHP), and expert interviews. SLR is used to identify the components of the conceptual model, and AHP is used for model priority component analysis. Interviews were used for validation and model development. The results show that KM, IoT, and big data have the highest trends. Governance, people, and smart education have the highest trends. IT is the highest priority component. The KM model for SC has five main layers grouped in phases of the system cycle. This cycle describes the organization’s intellectual ability to adapt in achieving SC indicators. The knowledge cycle at HEIs focuses on education, research, and community service.
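
As a rough illustration of how an analytic hierarchy process turns pairwise comparisons into component priorities, the sketch below uses the row geometric-mean approximation rather than the principal-eigenvector computation. The comparison matrix is hypothetical and is not taken from the study, and the function name `ahp_priorities` is an assumption.

```python
import math
from typing import List


def ahp_priorities(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights with normalized row geometric means."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical reciprocal comparison matrix for three unnamed components:
# entry [i][j] states how strongly component i is preferred over component j.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_priorities(pairwise)
print([round(w, 3) for w in weights])
```

The weights sum to one, and the component with the largest weight is the one a ranking of this kind would report as the highest priority.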

(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

13 pages, 5522 KiB

Open Access | Data Descriptor

Multi-Temporal Surface Water Classification for Four Major Rivers from the Peruvian Amazon

by Margaret Kalacska, J. Pablo Arroyo-Mora, Oliver T. Coomes, Yoshito Takasaki and Christian Abizaid

Data 2022, 7(1), 6; https://doi.org/10.3390/data7010006 - 6 Jan 2022

Cited by 5 | Viewed by 3033

We describe a new minimum extent, persistent surface water classification for reaches of four major rivers in the Peruvian Amazon (i.e., Amazon, Napo, Pastaza, Ucayali). These data were generated by the Peruvian Amazon Rural Livelihoods and Poverty (PARLAP) Project, which aims to better understand the nexus between livelihoods (e.g., fishing, agriculture, forest use, trade), poverty, and conservation in the Peruvian Amazon over a 35,000 km river network. Previous surface water datasets neither adequately capture the temporal changes in the course of the rivers nor discriminate between primary main channel and non-main channel (e.g., oxbow lakes) water. We generated the surface water classifications in Google Earth Engine from Landsat 5 TM, 7 ETM+, and 8 OLI satellite imagery for time periods circa 1989, 2000, and 2015 using a hierarchical logical binary classification predominantly based on a modified Normalized Difference Water Index (mNDWI) and shortwave infrared surface reflectance. We included surface reflectance in the blue band and brightness temperature to minimize misclassification. High accuracies were achieved for all time periods (>90%).
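
For readers unfamiliar with the index, the sketch below computes mNDWI for single pixels, assuming the commonly used definition (green minus shortwave infrared, divided by their sum). The reflectance values and the zero threshold are hypothetical, and the dataset’s actual hierarchical rules, which also use the blue band and brightness temperature, are not reproduced here.

```python
def mndwi(green: float, swir: float) -> float:
    """Modified Normalized Difference Water Index for one pixel.

    Assumes the common definition (green - swir) / (green + swir),
    with both inputs given as surface reflectance.
    """
    denominator = green + swir
    if denominator == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (green - swir) / denominator


def looks_like_water(green: float, swir: float, threshold: float = 0.0) -> bool:
    """Hypothetical single-rule test; the threshold is an assumption."""
    return mndwi(green, swir) > threshold


# Hypothetical reflectances: water absorbs strongly in the shortwave infrared.
water_pixel = mndwi(green=0.08, swir=0.02)
land_pixel = mndwi(green=0.10, swir=0.25)
print(round(water_pixel, 3), round(land_pixel, 3))
```

Open water typically yields positive values and vegetation or soil negative ones, which is why the index is a natural first split in a hierarchical classification.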

(This article belongs to the Section Spatial Data Science and Digital Earth)

18 pages, 556 KiB

Open Access | Data Descriptor

Open Government Data Use in the Brazilian States and Federal District Public Administrations

by Ilka Kawashita, Ana Alice Baptista and Delfina Soares

Data 2022, 7(1), 5; https://doi.org/10.3390/data7010005 - 5 Jan 2022

Cited by 4 | Viewed by 3154

This research investigates whether, why, and how open government data (OGD) is used and reused by Brazilian state and district public administrations. A new online questionnaire was developed and used to collect data from 26 of the 27 federation units between June and July 2021. The resulting dataset was cleaned and anonymized. It provides insight into 158 parameters for the 26 federation units surveyed. This article describes the questionnaire metadata and the methods applied to collect and treat the data. The data file is divided into four sections: respondent profile (identifying the respondent and their workplace), OGD use/consumption, what OGD is used for by public administrations, and why OGD is used by public administrations (benefits, drivers, and barriers to OGD use/reuse). The results describe the state of play of OGD use/reuse in the federation units’ administrations. Therefore, they could be used to inform open data policy and decision-making processes. Furthermore, they could be the starting point for discussing how OGD could better support the digital transformation in the public sector.

(This article belongs to the Section Information Systems and Data Management)

14 pages, 4721 KiB

Open Access | Editor’s Choice | Article

View VULMA: Data Set for Training a Machine-Learning Tool for a Fast Vulnerability Analysis of Existing Buildings

by Angelo Cardellicchio, Sergio Ruggieri, Valeria Leggieri and Giuseppina Uva

Data 2022, 7(1), 4; https://doi.org/10.3390/data7010004 - 31 Dec 2021

Cited by 22 | Viewed by 3958

The paper presents View VULMA, a data set specifically designed for training machine-learning tools for fast vulnerability analysis of existing buildings. Such tools require supervised training via an extensive set of building imagery, for which several typological parameters should be defined, with a proper label assigned to each sample on a per-parameter basis. Defining an adequate training data set therefore plays a key role, and several aspects should be considered, such as data availability, preprocessing, augmentation, and balancing according to the selected labels. In this paper, we highlight all these issues and describe the strategies pursued to build a reliable data set. In particular, we provide a detailed description of both the requirements (e.g., scale and resolution of images, evaluation parameters, and data heterogeneity) and the steps followed to define View VULMA, starting from the data assessment (which reduced the initial sample of about 20,000 images to a subset of about 3,000 pictures), with the goal of training a transfer-learning-based automated tool for fast estimation of the vulnerability of existing buildings from single pictures.

20 pages, 953 KiB

Open Access | Article

News Monitor: A Framework for Exploring News in Real-Time

by Nikolaos Panagiotou, Antonia Saravanou and Dimitrios Gunopulos

Data 2022, 7(1), 3; https://doi.org/10.3390/data7010003 - 27 Dec 2021

Cited by 6 | Viewed by 4591

News articles generated by online media are a major source of information. In this work, we present News Monitor, a framework that automatically collects news articles from a wide variety of online news portals and performs various analysis tasks. The framework initially identifies fresh news (first stories) and clusters articles about the same incidents. For every story, it first extracts all of the corresponding triples and then creates a knowledge base (KB) using open information extraction techniques. This knowledge base is then used to create a summary for the user. News Monitor allows users to use it as a search engine, ask questions in natural language, and receive answers generated by the state-of-the-art BERT framework. In addition, News Monitor crawls the Twitter stream using a dynamic set of “trending” keywords in order to retrieve all messages relevant to the news. The framework is distributed, online, and performs analysis in real time. According to the evaluation results, the fake news detection techniques utilized by News Monitor achieve an F-measure of 82% in the rumor identification task and an accuracy of 92% in the stance detection tasks. The major contribution of this work can be summarized as a novel real-time and scalable architecture that combines various effective techniques under a news analysis framework.

(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

14 pages, 1717 KiB

Open Access | Article

Business Intelligence for IT Governance of a Technology Company

by Vittoria Biagi, Riccardo Patriarca and Giulio Di Gravio

Data 2022, 7(1), 2; https://doi.org/10.3390/data7010002 - 27 Dec 2021

Cited by 5 | Viewed by 4982

Managers are required to make fast, reliable, and fact-based decisions to cope with the dynamics of modern business environments. Data visualization and reporting are thus crucial activities to ensure systematic organizational intelligence, especially for technology companies operating in a fast-moving context. As such, this paper presents case-study research for the definition of a business intelligence (BI) model and related Key Performance Indicators (KPIs) to support risk-related decision making. The study first comprises a literature review on approaches for governance management, which confirms a disconnection between theory and practice. It then progresses to mapping the main business areas and suggesting exemplary KPIs to fill this gap. Finally, it documents the design and usage of a BI dashboard, as emerged from a validation with four managers. This early application shows the advantages of BI for both business operators and governance managers.

(This article belongs to the Section Information Systems and Data Management)

16 pages, 1275 KiB

Open Access | Data Descriptor

Datasets for the Determination of Evaporative Flux from Distilled Water and Saturated Brine Using Bench-Scale Atmospheric Simulators

by Jared Suchan and Shahid Azam

Data 2022, 7(1), 1; https://doi.org/10.3390/data7010001 - 22 Dec 2021

Cited by 1 | Viewed by 2421

Evaporation from fresh water and saline water is critical for estimating the water budget in the Canadian Prairies. Predictive models using empirical field-based data are subject to significant errors and uncertainty. Therefore, highly controlled test conditions and accurately measured experimental data are required to understand the relationship between atmospheric variables at water surfaces. This paper provides a comprehensive dataset generated for the determination of evaporative flux from distilled water and saturated brine using the bench-scale atmospheric simulator (BAS) and the subsequently improved design (BAS2). Analyses of the weather scenarios from atmospheric parameters and of the evaporative flux from the experimental data are provided.

(This article belongs to the Section Spatial Data Science and Digital Earth)

Data, Volume 7, Issue 1 (January 2022) – 11 articles