Big Data Cogn. Comput., Volume 6, Issue 1 (March 2022) – 32 articles

Cover Story (view full-size image): Although social media platforms are a valuable source of information, their content is often affected by spam and disinformation campaigns carried out by automated entities known as social bots. In this paper, we present a new methodology, namely TIMBRE (Time-aware opInion Mining via Bot REmoval), aimed at discovering the polarity of social media users during election campaigns. This methodology is temporally aware and filters out data produced by social bots, thus avoiding heavily biased information. We assessed the effectiveness of TIMBRE by analyzing the online conversation on Twitter during the 2016 US presidential election. The results show how the removal of bots and the use of temporal information allow for the accurate estimation of the voting intentions of legitimate users.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link, and use the free Adobe Reader to open them.
23 pages, 830 KiB  
Article
Service Oriented R-ANN Knowledge Model for Social Internet of Things
by Mohana S. D., S. P. Shiva Prakash and Kirill Krinkin
Big Data Cogn. Comput. 2022, 6(1), 32; https://doi.org/10.3390/bdcc6010032 - 18 Mar 2022
Cited by 7 | Viewed by 3751
Abstract
The worldwide increase in technologies requires adding intelligence to objects, and making an object smart in an environment leads to the Social Internet of Things (SIoT). These social objects are uniquely identifiable and transferable, and they share information user-to-object and object-to-object through interactions in smart environments such as smart homes, smart cities, and many other applications. SIoT faces certain challenges, such as the handling of heterogeneous objects, the selection of the data generated by objects, and missing values in data. Therefore, the discovery and communication of meaningful patterns in data are important for every application, and the analysis of data is essential for making smarter decisions and qualifying the performance of data for various applications. In a smart environment, social networks of intelligent objects increase services and decrease relationships in a reliable and efficient way of sharing resources and services. Hence, this work proposes a feature selection method based on proposed semantic rules and establishes relationships to classify services using a relationship artificial neural network (R-ANN). R-ANN models an inversely proportional relationship among objects based on certain rules and conditions between objects and between users and objects. It provides a service-oriented knowledge model for decision making in the proposed R-ANN model, which delivers services to users. Compared to the existing model, the proposed R-ANN provides an accuracy of 89.62% for various services, namely weather, air quality, parking, light status, and people presence, in the SIoT environment. Full article

22 pages, 696 KiB  
Article
Factors Influencing Citizens’ Intention to Use Open Government Data—A Case Study of Pakistan
by Muhammad Mahboob Khurshid, Nor Hidayati Zakaria, Muhammad Irfanullah Arfeen, Ammar Rashid, Safi Ullah Nasir and Hafiz Muhammad Faisal Shehzad
Big Data Cogn. Comput. 2022, 6(1), 31; https://doi.org/10.3390/bdcc6010031 - 17 Mar 2022
Cited by 24 | Viewed by 5102
Abstract
Open government data (OGD) has gained much attention worldwide; however, there is still an increasing demand for exploring research from the perspective of its adoption and diffusion. Policymakers expect that OGD will be used on a large scale by the public, which will result in a range of benefits, such as faith and trust in governments, innovation and development, and participatory governance. However, not much is known about which factors influence citizens’ intention to use OGD. Therefore, this research aims at empirically investigating the factors that influence citizens’ intention to use OGD in a developing country using information systems theory. Improved knowledge and understanding of the influencing factors can assist policymakers in determining which policy initiatives they can take to increase the intention to widely use OGD. Upon conducting a survey and performing the analysis, the findings reveal that perceived usefulness, social approval, and enjoyment positively influence intention, whereas voluntariness of use negatively influences OGD use. Further, perceived usefulness is significantly affected by perceived ease of use, and OGD use is significantly affected by OGD use intention. However, surprisingly, the intention to use OGD is not significantly affected by perceived ease of use. Policymakers are therefore advised to consider these significant factors when seeking to increase the intention to use OGD. Full article

17 pages, 2573 KiB  
Article
Big Data Management in Drug–Drug Interaction: A Modern Deep Learning Approach for Smart Healthcare
by Muhammad Salman, Hafiz Suliman Munawar, Khalid Latif, Muhammad Waseem Akram, Sara Imran Khan and Fahim Ullah
Big Data Cogn. Comput. 2022, 6(1), 30; https://doi.org/10.3390/bdcc6010030 - 9 Mar 2022
Cited by 10 | Viewed by 6626
Abstract
The detection and classification of drug–drug interactions (DDI) from existing data are of high importance because recent reports show that DDIs are among the major causes of hospital-acquired conditions and readmissions and are also necessary for smart healthcare. Therefore, to avoid adverse drug interactions, it is necessary to have an up-to-date knowledge of DDIs. This knowledge could be extracted by applying text-processing techniques to the medical literature published in the form of ‘Big Data’ because, whenever a drug interaction is investigated, it is typically reported and published in healthcare and clinical pharmacology journals. However, it is crucial to automate the extraction of the interactions taking place between drugs because the medical literature is being published in immense volumes, and it is impossible for healthcare professionals to read and collect all of the investigated DDI reports from these Big Data. To avoid this time-consuming procedure, the Information Extraction (IE) and Relationship Extraction (RE) techniques that have been studied in depth in Natural Language Processing (NLP) could be very promising. Since 2011, a lot of research has been reported in this particular area, and there are many approaches that have been implemented that can also be applied to biomedical texts to extract DDI-related information. A benchmark corpus is also publicly available for the advancement of DDI extraction tasks. The current state-of-the-art implementations for extracting DDIs from biomedical texts have employed Support Vector Machines (SVM) or other machine learning methods that work on manually defined features, which might be the cause of the low precision and recall achieved in this domain so far. Modern deep learning techniques have also been applied for the automatic extraction of DDIs from the scientific literature and have proven to be very promising for the advancement of DDI extraction tasks. As such, it is pertinent to investigate deep learning techniques for the extraction and classification of DDIs in order for them to be used in the smart healthcare domain. We propose a deep neural network-based method (SEV-DDI: Severity-Drug–Drug Interaction) with further integrated units/layers to achieve higher precision and accuracy. After successfully outperforming other methods in the DDI classification task, we moved a step further and utilized the method in a sentiment analysis task to investigate the severity of an interaction. The ability to determine the severity of a DDI will be very helpful for clinical decision support systems in making more accurate and informed decisions, ensuring the safety of the patients. Full article
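As a point of reference for the feature-based baselines mentioned in the abstract above, here is a minimal Python sketch of an SVM over TF-IDF features deciding whether a sentence describes a drug–drug interaction. The sentences, labels, and DRUG_A/DRUG_B placeholders are invented for illustration; the DDI benchmark corpus and the paper's SEV-DDI network are not reproduced here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy candidate sentences with the drug pair masked as DRUG_A / DRUG_B
sentences = [
    "Concomitant use of DRUG_A and DRUG_B increases the risk of bleeding.",
    "DRUG_A pharmacokinetics were not affected by DRUG_B.",
    "DRUG_A potentiates the sedative effect of DRUG_B.",
    "No interaction was observed between DRUG_A and DRUG_B.",
]
labels = [1, 0, 1, 0]  # 1 = interaction described, 0 = no interaction

# manually engineered surface features (word/bigram TF-IDF) + linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["DRUG_A markedly increases plasma levels of DRUG_B."]))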

29 pages, 2381 KiB  
Article
Radiology Imaging Scans for Early Diagnosis of Kidney Tumors: A Review of Data Analytics-Based Machine Learning and Deep Learning Approaches
by Maha Gharaibeh, Dalia Alzu’bi, Malak Abdullah, Ismail Hmeidi, Mohammad Rustom Al Nasar, Laith Abualigah and Amir H. Gandomi
Big Data Cogn. Comput. 2022, 6(1), 29; https://doi.org/10.3390/bdcc6010029 - 8 Mar 2022
Cited by 39 | Viewed by 13722
Abstract
Plenty of disease types exist in world communities that can be explained by humans’ lifestyles or the economic, social, genetic, and other factors of the country of residence. Recently, most research has focused on studying common diseases in the population to reduce death risks, take the best procedure for treatment, and enhance the healthcare level of the communities. Kidney disease is one of the common diseases that affect our societies. In particular, Kidney Tumors (KT) are the 10th most prevalent tumor for men and women worldwide. Overall, the lifetime likelihood of developing a kidney tumor is about 1 in 466 (2.02 percent) for males and around 1 in 80 (1.03 percent) for females. Still, more research is needed on new, early, and innovative diagnostic methods and on finding an appropriate treatment method for KT. Compared to the tedious and time-consuming traditional diagnosis, automatic detection algorithms of machine learning can save diagnosis time, improve test accuracy, and reduce costs. Previous studies have shown that deep learning can play a role in dealing with complex tasks, including the diagnosis, segmentation, and classification of Kidney Tumors, one of the most malignant tumors. The goals of this review article on deep learning in radiology imaging are to summarize what has already been accomplished, determine the techniques used by researchers in previous years to diagnose Kidney Tumors through medical imaging, and identify some promising future avenues, whether in terms of applications or technological developments, as well as to identify common problems, describe ways to expand the dataset, summarize the knowledge and best practices, and determine remaining challenges and future directions. Full article

19 pages, 14228 KiB  
Article
Comparison of Object Detection in Head-Mounted and Desktop Displays for Congruent and Incongruent Environments
by René Reinhard, Erinchan Telatar and Shah Rukh Humayoun
Big Data Cogn. Comput. 2022, 6(1), 28; https://doi.org/10.3390/bdcc6010028 - 7 Mar 2022
Cited by 3 | Viewed by 3431
Abstract
Virtual reality technologies, including head-mounted displays (HMD), can provide benefits to psychological research by combining high degrees of experimental control with improved ecological validity. This is due to the strong feeling of being in the displayed environment (presence) experienced by VR users. As of yet, it is not fully explored how using HMDs impacts basic perceptual tasks, such as object perception. In traditional display setups, the congruency between the background environment and the object category has been shown to impact response times in object perception tasks. In this study, we investigated whether this well-established effect is comparable when using desktop and HMD devices. In the study, 21 participants used both desktop and HMD setups to perform an object identification task, and their subjective presence while experiencing two distinct virtual environments (a beach and a home environment) was subsequently evaluated. Participants were quicker to identify objects in the HMD condition, independent of object-environment congruency, while congruency effects were not impacted. Furthermore, participants reported significantly higher presence in the HMD condition. Full article
(This article belongs to the Special Issue Virtual Reality, Augmented Reality, and Human-Computer Interaction)

42 pages, 679 KiB  
Article
Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0
by Anna Kirkpatrick, Chidozie Onyeze, David Kartchner, Stephen Allegri, Davi Nakajima An, Kevin McCoy, Evie Davalbhakta and Cassie S. Mitchell
Big Data Cogn. Comput. 2022, 6(1), 27; https://doi.org/10.3390/bdcc6010027 - 1 Mar 2022
Cited by 8 | Viewed by 4858
Abstract
Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is comprehensive open-source software for significantly faster, more effective, and user-friendly automated biomedical LBD. An example case is performed to rank relationships between Alzheimer’s disease and metabolic co-morbidities. Full article
(This article belongs to the Special Issue Graph-Based Data Mining and Social Network Analysis)
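A toy Python sketch, under strong simplifying assumptions, of the "meet in the middle" idea behind HeteSim-style metapath relatedness used in SemNet: two concepts are scored by the cosine similarity of their walk distributions over shared midpoint nodes along a metapath such as Disease–Gene–Disease. The miniature disease–gene associations below are invented and are not SemNet or PubMed data.

import numpy as np

# toy typed edges: Disease -> Gene associations (entries invented)
disease_gene = {
    "alzheimers": ["APOE", "INS", "PSEN1"],
    "diabetes":   ["INS", "TCF7L2"],
    "asthma":     ["IL13"],
}

def reach_distribution(disease):
    """Uniform one-step walk distribution over genes linked to a disease."""
    genes = disease_gene[disease]
    return {g: 1.0 / len(genes) for g in genes}

def relatedness(d1, d2):
    """Cosine similarity of the two midpoint (gene) distributions,
    i.e. a HeteSim-style score for the metapath Disease->Gene<-Disease."""
    p, q = reach_distribution(d1), reach_distribution(d2)
    keys = sorted(set(p) | set(q))
    v1 = np.array([p.get(k, 0.0) for k in keys])
    v2 = np.array([q.get(k, 0.0) for k in keys])
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(relatedness("alzheimers", "diabetes"))  # shared gene -> nonzero score
print(relatedness("alzheimers", "asthma"))    # no shared genes -> 0.0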

28 pages, 1211 KiB  
Article
A Combined System Metrics Approach to Cloud Service Reliability Using Artificial Intelligence
by Tek Raj Chhetri, Chinmaya Kumar Dehury, Artjom Lind, Satish Narayana Srirama and Anna Fensel
Big Data Cogn. Comput. 2022, 6(1), 26; https://doi.org/10.3390/bdcc6010026 - 1 Mar 2022
Cited by 4 | Viewed by 4710
Abstract
Identifying and anticipating potential failures in the cloud is an effective method for increasing cloud reliability and proactive failure management. Many studies have been conducted to predict potential failure, but none have combined SMART (self-monitoring, analysis, and reporting technology) hard drive metrics with other system metrics, such as central processing unit (CPU) utilisation. Therefore, we propose a combined system metrics approach for failure prediction based on artificial intelligence to improve reliability. We tested data from over 100 cloud servers with four artificial intelligence algorithms (random forest, gradient boosting, long short-term memory, and gated recurrent unit) and also performed a correlation analysis. Our correlation analysis sheds light on the relationships that exist between system metrics and failure, and the experimental results demonstrate the advantages of combining system metrics, outperforming the state-of-the-art. Full article
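A minimal Python sketch, using scikit-learn, of the combined-metrics idea described above: SMART drive attributes and system-level metrics are concatenated into one feature matrix before training a failure classifier (here a random forest). The column names, synthetic data, and labeling rule are assumptions for illustration only.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "smart_5_reallocated": rng.poisson(2, n),       # SMART attribute 5
    "smart_197_pending":   rng.poisson(1, n),       # SMART attribute 197
    "cpu_util":            rng.uniform(0, 100, n),  # system metric
    "mem_util":            rng.uniform(0, 100, n),  # system metric
})
# synthetic label: failures correlate with reallocated sectors + high CPU load
df["failure"] = ((df.smart_5_reallocated > 3) & (df.cpu_util > 70)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="failure"), df["failure"], test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))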

24 pages, 1832 KiB  
Review
Big Data in Criteria Selection and Identification in Managing Flood Disaster Events Based on Macro Domain PESTEL Analysis: Case Study of Malaysia Adaptation Index
by Mohammad Fikry Abdullah, Zurina Zainol, Siaw Yin Thian, Noor Hisham Ab Ghani, Azman Mat Jusoh, Mohd Zaki Mat Amin and Nur Aiza Mohamad
Big Data Cogn. Comput. 2022, 6(1), 25; https://doi.org/10.3390/bdcc6010025 - 1 Mar 2022
Cited by 7 | Viewed by 7981
Abstract
The impact of Big Data (BD) creates challenges in selecting relevant and significant data to be used as criteria to facilitate flood management plans. Studies on macro domain criteria expand the criteria selection, which is important for assessment in allowing a comprehensive understanding of the current situation, readiness, preparation, resources, and others for decision assessment and disaster event planning. This study aims to facilitate criteria identification and selection from a macro domain perspective in improving flood management planning. The objectives of this study are (a) to explore and identify potential and possible criteria to be incorporated in the current flood management plan from the macro domain perspective; (b) to understand the type of flood measures and decision goals implemented to facilitate flood management planning decisions; and (c) to examine a possible structured mechanism for criteria selection based on the decision analysis technique. Based on a systematic literature review and a thematic analysis using the PESTEL framework, the findings identify and cluster domains and their criteria to be considered and applied in future flood management plans. The critical review of flood measures and decision goals would potentially equip stakeholders and policymakers for better decision making based on a disaster management plan. The decision analysis technique, as a structured mechanism, would significantly improve criteria identification and selection for comprehensive and collective decisions. The findings from this study could further improve Malaysia Adaptation Index (MAIN) criteria identification and selection, which could serve as a complementary and supporting reference in flood disaster management. A proposed framework from this study can be used as guidance in dealing with and optimising the criteria based on the challenges and the current application of Big Data and criteria in managing disaster events. Full article

22 pages, 628 KiB  
Article
Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis
by Jogeswar Tripathy, Rasmita Dash, Binod Kumar Pattanayak, Sambit Kumar Mishra, Tapas Kumar Mishra and Deepak Puthal
Big Data Cogn. Comput. 2022, 6(1), 24; https://doi.org/10.3390/bdcc6010024 - 23 Feb 2022
Cited by 15 | Viewed by 3933
Abstract
In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. These datasets are characterized by huge space due to a high number of features, out of which only a few are significant for analysis. Thus, significant feature extraction is crucial. There are various techniques available for feature selection; among them, the filter techniques are significant in this community, as they can be used with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve the performance of the model. Furthermore, the application of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. Thus, to avoid these issues, in this research a combination of feature reduction (CFR) approach is considered, designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced at different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used to rank all reduction combinations and identify the superior filter combination among them. Full article
(This article belongs to the Special Issue Data, Structure, and Information in Artificial Intelligence)
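A compact sketch of the final TOPSIS ranking step described above, assuming a small decision matrix whose rows are candidate filter pipelines and whose columns are benefit criteria; the criteria, weights, and scores are illustrative, not taken from the paper.

import numpy as np

scores = np.array([            # rows: candidate pipelines, cols: [accuracy, f1]
    [0.91, 0.89],
    [0.88, 0.90],
    [0.93, 0.85],
])
weights = np.array([0.5, 0.5])                    # equal criterion weights (assumption)

norm = scores / np.linalg.norm(scores, axis=0)    # vector normalisation per criterion
weighted = norm * weights
ideal_best = weighted.max(axis=0)                 # benefit criteria only
ideal_worst = weighted.min(axis=0)

d_best = np.linalg.norm(weighted - ideal_best, axis=1)
d_worst = np.linalg.norm(weighted - ideal_worst, axis=1)
closeness = d_worst / (d_best + d_worst)          # higher = closer to ideal solution

print(np.argsort(-closeness))                     # pipeline indices ranked best to worst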

27 pages, 1296 KiB  
Article
A Framework for Content-Based Search in Large Music Collections
by Tiange Zhu, Raphaël Fournier-S’niehotta, Philippe Rigaux and Nicolas Travers
Big Data Cogn. Comput. 2022, 6(1), 23; https://doi.org/10.3390/bdcc6010023 - 23 Feb 2022
Cited by 6 | Viewed by 4432
Abstract
We address the problem of scalable content-based search in large collections of music documents. Music content is highly complex and versatile and presents multiple facets that can be considered independently or in combination. Moreover, music documents can be digitally encoded in many ways. We propose a general framework for building a scalable search engine, based on (i) a music description language that represents music content independently from a specific encoding, (ii) an extendible list of feature-extraction functions, and (iii) indexing, searching, and ranking procedures designed to be integrated into the standard architecture of a text-oriented search engine. As a proof of concept, we also detail an actual implementation of the framework for searching in large collections of XML-encoded music scores, based on the popular ElasticSearch system. It is released as open source on GitHub and is available as a ready-to-use Docker image for communities that manage large collections of digitized music documents. Full article
(This article belongs to the Special Issue Big Music Data)
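A hedged sketch of the framework's core indexing idea: reduce each score to encoding-independent features (here, melodic-interval n-grams) and store them in an inverted index of the kind a text-oriented search engine handles natively. The two miniature "scores", the query, and the feature choice are invented for illustration, not the paper's description language.

from collections import defaultdict

# two miniature "scores" as invented MIDI pitch sequences
scores = {
    "bwv_846":  [60, 64, 67, 72, 67, 64],
    "folk_001": [62, 64, 65, 67, 65, 64],
}

def interval_ngrams(pitches, n=3):
    """Encoding-agnostic feature: n-grams of melodic intervals (transposition-invariant)."""
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    return {" ".join(map(str, intervals[i:i + n]))
            for i in range(len(intervals) - n + 1)}

index = defaultdict(set)                    # n-gram -> documents containing it
for doc_id, pitches in scores.items():
    for gram in interval_ngrams(pitches):
        index[gram].add(doc_id)

query = interval_ngrams([72, 76, 79, 84])   # same contour as bwv_846, transposed up
hits = set()
for gram in query:
    hits |= index.get(gram, set())
print(hits)                                 # finds bwv_846 despite the transposition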

15 pages, 347 KiB  
Article
LFA: A Lévy Walk and Firefly-Based Search Algorithm: Application to Multi-Target Search and Multi-Robot Foraging
by Ouarda Zedadra, Antonio Guerrieri and Hamid Seridi
Big Data Cogn. Comput. 2022, 6(1), 22; https://doi.org/10.3390/bdcc6010022 - 21 Feb 2022
Cited by 7 | Viewed by 3721
Abstract
In the literature, several exploration algorithms have been proposed so far. Among these, Lévy walk is commonly used since it has been proven more efficient than simple random-walk exploration. It is beneficial when targets are sparsely distributed in the search space. However, due to its super-diffusive behavior, some tuning is needed to improve its performance, specifically when targets are clustered. The firefly algorithm is a swarm intelligence-based algorithm useful for intensive search, but its exploration rate is very limited. An efficient and reliable search could be attained by combining the two algorithms, since the first allows exploration of the space and the second encourages its exploitation. In this paper, we propose a swarm intelligence-based search algorithm called the Lévy walk and Firefly-based Algorithm (LFA), which is a hybridization of the two aforementioned algorithms. The algorithm is applied to Multi-Target Search and Multi-Robot Foraging. Numerical experiments to test the performance are conducted on the robotic simulator ARGoS. A comparison with the original firefly algorithm demonstrates the effectiveness of our contribution. Full article
(This article belongs to the Special Issue Big Data and Cognitive Computing: 5th Anniversary Feature Papers)
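A small Python sketch of the Lévy-walk half of the hybrid: step lengths drawn from a heavy-tailed power-law distribution with uniformly random headings, which yields many short moves punctuated by rare long relocations. The exponent and step bounds are illustrative assumptions, not the paper's settings.

import numpy as np

def levy_walk(n_steps, alpha=1.5, min_step=1.0, seed=0):
    """2D Lévy walk: heavy-tailed step lengths, uniformly random headings."""
    rng = np.random.default_rng(seed)
    pos = np.zeros(2)
    path = [pos.copy()]
    for _ in range(n_steps):
        step = min_step * (1.0 + rng.pareto(alpha))   # power-law step length
        theta = rng.uniform(0.0, 2.0 * np.pi)         # random direction
        pos = pos + step * np.array([np.cos(theta), np.sin(theta)])
        path.append(pos.copy())
    return np.array(path)

trajectory = levy_walk(1000)
print(trajectory[-1])   # final position: occasional long jumps dominate the spread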

17 pages, 659 KiB  
Article
Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study
by Amna Dridi, Mohamed Medhat Gaber, Raja Muhammad Atif Azad and Jagdev Bhogal
Big Data Cogn. Comput. 2022, 6(1), 21; https://doi.org/10.3390/bdcc6010021 - 21 Feb 2022
Cited by 1 | Viewed by 4515
Abstract
The study of the dynamics or the progress of science has been widely explored with descriptive and statistical analyses. This study has also attracted several computational approaches, labelled together as the Computational History of Science, especially with the rise of data science and the development of increasingly powerful computers. Among these approaches, some works have studied dynamism in scientific literature by employing text analysis techniques that rely on topic models to study the dynamics of research topics. Unlike topic models, which do not delve deeper into the content of scientific publications, this paper uses, for the first time, temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural-based computational history approach that reports the stability of the k-nearest neighbors of scientific keywords over time; the stability indicates whether the keywords are acquiring new neighborhoods due to the evolution of the scientific literature. To evaluate how Vec2Dynamics models such relationships in the domain of Machine Learning (ML), we constructed scientific corpora from the papers published in the Neural Information Processing Systems (NIPS, now abbreviated NeurIPS) conference between 1987 and 2016. The descriptive analysis that we performed in this paper verifies the efficacy of our proposed approach. In fact, we found a generally strong consistency between the obtained results and the Machine Learning timeline. Full article
(This article belongs to the Special Issue Machine Learning for Dependable Edge Computing Systems and Services)
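A minimal sketch of the neighborhood-stability measure described above: given word vectors trained separately on two time slices, a keyword's k nearest neighbors in each slice are compared with a Jaccard overlap. The vocabulary and the random stand-in vectors are assumptions for illustration; the paper trains real temporal embeddings on NIPS/NeurIPS text.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["svm", "kernel", "deep", "cnn", "boosting", "bayesian"]
emb_slice_a = {w: rng.normal(size=50) for w in vocab}   # embeddings for period A
emb_slice_b = {w: rng.normal(size=50) for w in vocab}   # embeddings for period B

def knn(word, emb, k=3):
    """k nearest neighbours of `word` by cosine similarity within one slice."""
    v = emb[word]
    sims = {w: (emb[w] @ v) / (np.linalg.norm(emb[w]) * np.linalg.norm(v))
            for w in emb if w != word}
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def stability(word, k=3):
    """Jaccard overlap of the word's neighbourhoods across the two slices."""
    a, b = knn(word, emb_slice_a, k), knn(word, emb_slice_b, k)
    return len(a & b) / len(a | b)

print(stability("svm"))   # 1.0 = unchanged neighbourhood, 0.0 = fully changed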

17 pages, 3098 KiB  
Article
Person Re-Identification via Pyramid Multipart Features and Multi-Attention Framework
by Randa Mohamed Bayoumi, Elsayed E. Hemayed, Mohammad Ehab Ragab and Magda B. Fayek
Big Data Cogn. Comput. 2022, 6(1), 20; https://doi.org/10.3390/bdcc6010020 - 9 Feb 2022
Cited by 4 | Viewed by 3354
Abstract
Video-based person re-identification has become quite attractive due to its importance in many vision surveillance problems. It is a challenging topic due to the inter/intra changes, occlusion, and pose variations involved. In this paper, we propose a pyramid-attentive framework that relies on multi-part features and multiple attention mechanisms to aggregate multi-level features and learn attention-based representations of persons through various aspects. Self-attention is used to strengthen the most discriminative features in the spatial and channel domains and hence capture robust global information. We propose the use of part-relation attention between different multi-granularities of feature representation to focus on learning appropriate local features. Temporal attention is used to aggregate temporal features. We integrate the most robust features in the global and multi-level views to build an effective convolutional neural network (CNN) model. The proposed model outperforms the previous state-of-the-art models on three datasets. Notably, the proposed model achieves 98.9% top-1 accuracy (a relative improvement of 2.7% over GRL) and 99.3% mAP on PRID2011, and 92.8% top-1 accuracy (a relative improvement of 2.4% over GRL) on iLIDS-VID. We also explore the generalization ability of our model on a cross dataset. Full article
(This article belongs to the Topic Machine and Deep Learning)

15 pages, 3985 KiB  
Article
The Next-Generation NIDS Platform: Cloud-Based Snort NIDS Using Containers and Big Data
by Ferry Astika Saputra, Muhammad Salman, Jauari Akhmad Nur Hasim, Isbat Uzzin Nadhori and Kalamullah Ramli
Big Data Cogn. Comput. 2022, 6(1), 19; https://doi.org/10.3390/bdcc6010019 - 7 Feb 2022
Cited by 2 | Viewed by 6628
Abstract
Snort is a well-known, signature-based network intrusion detection system (NIDS). The Snort sensor must be placed within the same physical network, and the defense centers in the typical NIDS architecture offer limited network coverage, especially for remote networks with restricted bandwidth and network policy. Additionally, the growing number of sensor instances, followed by a quick increase in log data volume, has caused the present system to face big data challenges. This research paper proposes a novel design for a cloud-based Snort NIDS using containers and implementing big data in the defense center to overcome these problems. Our design consists of Docker as the sensor’s platform, Apache Kafka as the distributed messaging system, and big data technology orchestrated on lambda architecture. We conducted experiments to measure sensor deployment, optimum message delivery from the sensors to the defense center, aggregation speed, and efficiency in the data-processing performance of the defense center. We successfully developed a cloud-based Snort NIDS and found the optimum method for message delivery from the sensor to the defense center. We also succeeded in developing the dashboard and attack maps to display the attack statistics and visualize the attacks. Ours is the first design reported to implement a big data architecture, namely the lambda architecture, as the defense center and to utilize rapid deployment of Snort NIDS using Docker technology as the network security monitoring platform. Full article
(This article belongs to the Special Issue Cyber Security in Big Data Era)
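A hedged sketch of the sensor-to-defense-center path described above: tail a Snort JSON alert log and publish each alert to a Kafka topic. The broker address, log path, topic name, and the choice of the kafka-python client are assumptions for illustration; the paper's actual deployment details may differ.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                       # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("/var/log/snort/alert_json.txt") as log:        # assumed Snort alert log path
    for line in log:
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue                                      # skip partial or malformed lines
        producer.send("snort-alerts", alert)              # assumed topic name

producer.flush()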

27 pages, 8416 KiB  
Review
Big Data in Construction: Current Applications and Future Opportunities
by Hafiz Suliman Munawar, Fahim Ullah, Siddra Qayyum and Danish Shahzad
Big Data Cogn. Comput. 2022, 6(1), 18; https://doi.org/10.3390/bdcc6010018 - 6 Feb 2022
Cited by 44 | Viewed by 19475
Abstract
Big data have become an integral part of various research fields due to the rapid advancements in the digital technologies available for dealing with data. The construction industry is no exception and has seen a spike in the data being generated due to the introduction of various digital disruptive technologies. However, despite the availability of data and the introduction of such technologies, the construction industry is lagging in harnessing big data. This paper critically explores literature published since 2010 to identify the data trends and how the construction industry can benefit from big data. The presence of tools such as computer-aided drawing (CAD) and building information modelling (BIM) provides a great opportunity for researchers in the construction industry to further improve how infrastructure can be developed, monitored, or improved in the future. The gaps in the existing research data have been explored, and a detailed analysis was carried out to identify the different ways in which big data analysis and storage work in relevance to the construction industry. Big data engineering (BDE) and statistics are among the most crucial steps for integrating big data technology in construction. The results of this study suggest that while the existing research studies have set the stage for improving big data research, the integration of the associated digital technologies into the construction industry is not very clear. Among the future opportunities, construction safety, site management, heritage conservation, and project waste minimization and quality improvement are key areas for big data research. Full article

29 pages, 1272 KiB  
Review
Big Data Analytics in Supply Chain Management: A Systematic Literature Review and Research Directions
by In Lee and George Mangalaraj
Big Data Cogn. Comput. 2022, 6(1), 17; https://doi.org/10.3390/bdcc6010017 - 1 Feb 2022
Cited by 66 | Viewed by 31280
Abstract
Big data analytics has been successfully used for various business functions, such as accounting, marketing, supply chain, and operations. Currently, along with the recent development in machine learning and computing infrastructure, big data analytics in the supply chain are surging in importance. In light of the great interest and evolving nature of big data analytics in supply chains, this study conducts a systematic review of existing studies in big data analytics. This study presents a framework of a systematic literature review from interdisciplinary perspectives. From the organizational perspective, this study examines the theoretical foundations and research models that explain the sustainability and performances achieved through the use of big data analytics. Then, from the technical perspective, this study analyzes types of big data analytics, techniques, algorithms, and features developed for enhanced supply chain functions. Finally, this study identifies the research gap and suggests future research directions. Full article
(This article belongs to the Special Issue Big Data and Cognitive Computing: 5th Anniversary Feature Papers)

22 pages, 2156 KiB  
Brief Report
A Dataset for Emotion Recognition Using Virtual Reality and EEG (DER-VREEG): Emotional State Classification Using Low-Cost Wearable VR-EEG Headsets
by Nazmi Sofian Suhaimi, James Mountstephens and Jason Teo
Big Data Cogn. Comput. 2022, 6(1), 16; https://doi.org/10.3390/bdcc6010016 - 28 Jan 2022
Cited by 45 | Viewed by 9462
Abstract
Emotions are viewed as an important aspect of human interactions and conversations, and they allow effective and logical decision making. Emotion recognition uses low-cost wearable electroencephalography (EEG) headsets to collect brainwave signals and interprets these signals to provide information on a person’s mental state. With the implementation of virtual reality environments in different applications, the gap between human and computer interaction, as well as the understanding process, would shorten, providing an immediate response to an individual’s mental health. This study aims to use a virtual reality (VR) headset to induce four classes of emotions (happy, scared, calm, and bored), to collect brainwave samples using a low-cost wearable EEG headset, and to run popular classifiers to compare the most feasible ones for this particular setup. Firstly, we attempt to build an immersive VR database that is accessible to the public and that can potentially assist with emotion recognition studies using virtual reality stimuli. Secondly, we use a low-cost wearable EEG headset that is both compact and small and can be attached to the scalp without any hindrance, allowing participants the freedom of movement to view their surroundings inside the immersive VR stimulus. Finally, we evaluate the emotion recognition system by using popular machine learning algorithms and compare them for both intra-subject and inter-subject classification. The results obtained here show that the prediction model for the four-class emotion classification performed well, including in the more challenging inter-subject classification, with the support vector machine (SVM Class Weight kernel) obtaining 85.01% classification accuracy. This shows that using fewer electrode channels, together with proper parameter tuning and feature selection, still yields good classification performance. Full article
(This article belongs to the Special Issue Virtual Reality, Augmented Reality, and Human-Computer Interaction)
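A minimal sketch of the classification step, assuming band-power style EEG features and an SVM with balanced class weights for the four emotion classes; the feature matrix and labels below are random placeholders, not the DER-VREEG data, and the kernel and preprocessing are illustrative choices.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # e.g. band powers per channel (placeholder)
y = rng.integers(0, 4, size=200)                 # happy / scared / calm / bored

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", class_weight="balanced"))
print(cross_val_score(clf, X, y, cv=5).mean())   # chance level is ~0.25 on random data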

16 pages, 1815 KiB  
Article
Google Street View Images as Predictors of Patient Health Outcomes, 2017–2019
by Quynh C. Nguyen, Tom Belnap, Pallavi Dwivedi, Amir Hossein Nazem Deligani, Abhinav Kumar, Dapeng Li, Ross Whitaker, Jessica Keralis, Heran Mane, Xiaohe Yue, Thu T. Nguyen, Tolga Tasdizen and Kim D. Brunisholz
Big Data Cogn. Comput. 2022, 6(1), 15; https://doi.org/10.3390/bdcc6010015 - 27 Jan 2022
Cited by 14 | Viewed by 6050
Abstract
Collecting neighborhood data can be both time- and resource-intensive, especially across broad geographies. In this study, we leveraged 1.4 million publicly available Google Street View (GSV) images from Utah to construct indicators of the neighborhood built environment and evaluate their associations with the 2017–2019 health outcomes of approximately one-third of the population living in Utah. The use of electronic medical records allows for the assessment of associations between neighborhood characteristics and individual-level health outcomes while controlling for predisposing factors, which distinguishes this study from previous GSV studies that were ecological in nature. Among 938,085 adult patients, we found that individuals living in communities in the highest tertiles of green streets and non-single-family homes have 10–27% lower diabetes, uncontrolled diabetes, hypertension, and obesity, but higher substance use disorders, controlling for age, White race, Hispanic ethnicity, religion, marital status, health insurance, and area deprivation index. Conversely, the presence of visible utility wires overhead was associated with 5–10% more diabetes, uncontrolled diabetes, hypertension, obesity, and substance use disorders. Our study found that non-single-family homes and green streets were related to a lower prevalence of chronic conditions, while visible utility wires and single-lane roads were connected with a higher burden of chronic conditions. These contextual characteristics can better help healthcare organizations understand the drivers of their patients’ health by further considering patients’ residential environments, which present both risks and resources. Full article
(This article belongs to the Special Issue Machine and Deep Learning in Computer Vision Applications)

2 pages, 173 KiB  
Editorial
Acknowledgment to Reviewers of BDCC in 2021
by BDCC Editorial Office
Big Data Cogn. Comput. 2022, 6(1), 14; https://doi.org/10.3390/bdcc6010014 - 27 Jan 2022
Viewed by 2515
Abstract
Rigorous peer-reviews are the basis of high-quality academic publishing [...] Full article
29 pages, 6332 KiB  
Article
Fuzzy Neural Network Expert System with an Improved Gini Index Random Forest-Based Feature Importance Measure Algorithm for Early Diagnosis of Breast Cancer in Saudi Arabia
by Ebrahem A. Algehyne, Muhammad Lawan Jibril, Naseh A. Algehainy, Osama Abdulaziz Alamri and Abdullah K. Alzahrani
Big Data Cogn. Comput. 2022, 6(1), 13; https://doi.org/10.3390/bdcc6010013 - 27 Jan 2022
Cited by 45 | Viewed by 6423
Abstract
Breast cancer is one of the common malignancies among females in Saudi Arabia and has been ranked as the most prevalent and the second most deadly disease in the country. However, the clinical diagnosis process of any disease, such as breast cancer, coronary artery disease, diabetes, or COVID-19, is often associated with uncertainty due to the complexity and fuzziness of the process. In this work, a fuzzy neural network expert system with an improved Gini index random forest-based feature importance measure algorithm for early diagnosis of breast cancer in Saudi Arabia was proposed to address the uncertainty and ambiguity associated with the diagnosis of breast cancer, as well as the heavy burden on the overlay of the network nodes of the fuzzy neural network system that often results from insignificant features used to predict or diagnose the disease. The improved Gini index random forest-based feature importance measure algorithm was used to select the five fittest features of the Wisconsin Diagnostic Breast Cancer database out of the 32 features of the dataset. The logistic regression, support vector machine, k-nearest neighbor, random forest, and Gaussian naïve Bayes learning algorithms were used to develop two sets of classification models: models with all 32 features and models with the 5 fittest features. The two sets of classification models were evaluated, and the results of the evaluation were compared. The comparison shows that the models with the selected fittest features outperformed their counterparts with full features in terms of accuracy, sensitivity, and specificity. Therefore, a fuzzy neural network-based expert system was developed with the five selected fittest features, and the system achieved 99.33% accuracy, 99.41% sensitivity, and 99.24% specificity. Moreover, compared against previous works that applied fuzzy neural networks or other artificial intelligence techniques to the same dataset for the diagnosis of breast cancer, the system developed in this work stands as the best in terms of accuracy, sensitivity, and specificity. A z test was also conducted, and the result shows that the accuracy achieved by the system for the early diagnosis of breast cancer is significant. Full article
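A brief sketch of the feature-screening step, assuming scikit-learn's bundled Wisconsin diagnostic data as a stand-in: features are ranked by a random forest's Gini-based impurity importance and the top five are kept for the downstream classifier. The paper's improved importance measure and exact feature set are not reproduced here.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()                  # Wisconsin diagnostic data, 30 numeric features
X, y, names = data.data, data.target, data.feature_names

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
top5 = np.argsort(forest.feature_importances_)[::-1][:5]   # Gini-based importance ranking
print([names[i] for i in top5])              # five highest-importance features
X_reduced = X[:, top5]                       # reduced input for the downstream expert system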

17 pages, 337 KiB  
Review
Scalable Extended Reality: A Future Research Agenda
by Vera Marie Memmesheimer and Achim Ebert
Big Data Cogn. Comput. 2022, 6(1), 12; https://doi.org/10.3390/bdcc6010012 - 26 Jan 2022
Cited by 10 | Viewed by 5368
Abstract
Extensive research has outlined the potential of augmented, mixed, and virtual reality applications. However, little attention has been paid to scalability enhancements fostering practical adoption. In this paper, we introduce the concept of scalable extended reality (XRS), i.e., spaces scaling between different displays and degrees of virtuality that can be entered by multiple, possibly distributed users. The development of such XRS spaces concerns several research fields. To provide bidirectional interaction and maintain consistency with the real environment, virtual reconstructions of physical scenes need to be segmented semantically and adapted dynamically. Moreover, scalable interaction techniques for selection, manipulation, and navigation as well as a world-stabilized rendering of 2D annotations in 3D space are needed to let users intuitively switch between handheld and head-mounted displays. Collaborative settings should further integrate access control and awareness cues indicating the collaborators’ locations and actions. While many of these topics were investigated by previous research, very few have considered their integration to enhance scalability. Addressing this gap, we review related previous research, list current barriers to the development of XRS spaces, and highlight dependencies between them. Full article
(This article belongs to the Special Issue Virtual Reality, Augmented Reality, and Human-Computer Interaction)

21 pages, 30151 KiB  
Article
Context-Aware Explainable Recommendation Based on Domain Knowledge Graph
by Muzamil Hussain Syed, Tran Quoc Bao Huy and Sun-Tae Chung
Big Data Cogn. Comput. 2022, 6(1), 11; https://doi.org/10.3390/bdcc6010011 - 20 Jan 2022
Cited by 16 | Viewed by 6879
Abstract
With the rapid growth of internet data, knowledge graphs (KGs) are considered an efficient form of knowledge representation that captures the semantics of web objects. In recent years, reasoning over KGs for various artificial intelligence tasks has received a great deal of research interest. Providing recommendations based on users’ natural language queries is an equally difficult undertaking. In this paper, we propose a novel, context-aware recommender system, based on a domain KG, to respond to user-defined natural queries. The proposed recommender system consists of three stages. First, we generate incomplete triples from user queries, which are then segmented using logical conjunction (∧) and disjunction (∨) operations. Then, we generate candidates by utilizing a KGE-based framework (Query2Box) for reasoning over the segmented logical triples with logical operators such as ∧ and ∨; finally, the generated candidates are re-ranked using a neural collaborative filtering (NCF) model that exploits contextual (auxiliary) information from GraphSAGE embeddings. Our approach proves to be simple, yet efficient, at providing explainable recommendations on users’ queries, while leveraging user-item contextual information. Furthermore, our framework has shown to be capable of handling complex logical queries by transforming them into a disjunctive normal form (DNF) of simple queries. In this work, we focus on the restaurant domain as an application domain and use the Yelp dataset to evaluate the system. Experiments demonstrate that the proposed recommender system generalizes well on candidate generation from logical queries and effectively re-ranks those candidates, compared to the matrix factorization model. Full article
(This article belongs to the Special Issue Semantic Web Technology and Recommender Systems)

16 pages, 407 KiB  
Article
Extraction of the Relations among Significant Pharmacological Entities in Russian-Language Reviews of Internet Users on Medications
by Alexander Sboev, Anton Selivanov, Ivan Moloshnikov, Roman Rybka, Artem Gryaznov, Sanna Sboeva and Gleb Rylkov
Big Data Cogn. Comput. 2022, 6(1), 10; https://doi.org/10.3390/bdcc6010010 - 17 Jan 2022
Cited by 5 | Viewed by 3948
Abstract
Nowadays, the analysis of digital media aimed at predicting society’s reaction to particular events and processes is a task of great significance. Internet sources contain a large amount of meaningful information for a set of domains, such as marketing, author profiling, social situation analysis, healthcare, etc. In the case of healthcare, this information is useful for pharmacovigilance purposes, including the re-profiling of medications. The analysis of the mentioned sources requires the development of automatic natural language processing methods. These methods, in turn, require text datasets with complex annotation, including information about named entities and the relations between them. As the analysis of the relevant literature shows, there is a scarcity of datasets in the Russian language with annotated entity relations, and none have existed so far in the medical domain. This paper presents the first such Russian-language textual corpus in the medical domain, in which entities carry context labels within a single text so that related entities share a common context, making the corpus suitable for relation extraction. Our second contribution is a method for the automated extraction of entity relations in Russian-language texts using the XLM-RoBERTa language model preliminarily trained on Russian drug review texts. A comparison with other machine learning methods is performed to estimate the efficiency of the proposed method. The method yields state-of-the-art accuracy in extracting the following relationship types: ADR–Drugname, Drugname–Diseasename, Drugname–SourceInfoDrug, and Diseasename–Indication. As shown on the presented subcorpus from the Russian Drug Review Corpus, the developed method achieves a mean F1-score of 80.4% (estimated with cross-validation, averaged over the four relationship types). This result is 3.6% higher than that of the existing language model RuBERT, and 21.77% higher than that of basic ML classifiers. Full article
(This article belongs to the Special Issue Knowledge Modelling and Learning through Cognitive Networks)
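A hedged sketch of the model backbone named above: XLM-RoBERTa loaded as a sequence classifier over an entity-pair context, with the number of labels set to the four relation types. The checkpoint name and the example sentence are illustrative assumptions; the paper's pretraining on drug-review texts and its exact input encoding are not reproduced here.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=4)          # e.g. the four relation types

# invented review fragment ("after taking drug X a severe headache appeared")
text = "После приёма препарата X появилась сильная головная боль."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))                  # relation-type probabilities (untrained head)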

13 pages, 4985 KiB  
Article
An Efficient Multi-Scale Anchor Box Approach to Detect Partial Faces from a Video Sequence
by Dweepna Garg, Priyanka Jain, Ketan Kotecha, Parth Goel and Vijayakumar Varadarajan
Big Data Cogn. Comput. 2022, 6(1), 9; https://doi.org/10.3390/bdcc6010009 - 11 Jan 2022
Cited by 8 | Viewed by 4188
Abstract
In recent years, face detection has achieved considerable attention in the field of computer vision using traditional machine learning techniques and deep learning techniques. Deep learning is used to build the most recent and powerful face detection algorithms. However, partial face detection has yet to achieve remarkable performance. Partial faces are occluded due to hair, hats, glasses, hands, mobile phones, and side-angle-captured images, and fewer facial features can be identified from such images. In this paper, we present a deep convolutional neural network face detection method using an anchor box selection strategy. We limited the number of anchor boxes and scales and chose only those relevant to the face shape. The proposed model was trained and tested on a popular and challenging face detection benchmark dataset, i.e., the Face Detection Dataset and Benchmark (FDDB), and can also detect partially covered faces with better accuracy and precision. Extensive experiments were performed, with evaluation metrics including accuracy, precision, recall, F1 score, inference time, and FPS. The results show that the proposed model is able to detect faces in images, including occluded features, more precisely than other state-of-the-art approaches, achieving 94.8% accuracy and 98.7% precision on the FDDB dataset at 21 frames per second (FPS). Full article
(This article belongs to the Special Issue Big Data and Internet of Things)
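A small sketch of generating a reduced anchor set of the kind described above: a few scales and near-square aspect ratios tiled over a feature-map grid. The grid size, stride, scales, and ratios are illustrative assumptions, not the paper's configuration.

import numpy as np

def make_anchors(grid_size=8, stride=32, scales=(32, 64, 128), ratios=(1.0, 1.3)):
    """Return anchors as (cx, cy, w, h) in input-image pixels."""
    anchors = []
    for gy in range(grid_size):
        for gx in range(grid_size):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride   # cell centre
            for s in scales:
                for r in ratios:                # faces are roughly square to slightly tall
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors().shape)   # (8*8*3*2, 4) = (384, 4) anchors in total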

17 pages, 2339 KiB  
Article
An Empirical Comparison of Portuguese and Multilingual BERT Models for Auto-Classification of NCM Codes in International Trade
by Roberta Rodrigues de Lima, Anita M. R. Fernandes, James Roberto Bombasar, Bruno Alves da Silva, Paul Crocker and Valderi Reis Quietinho Leithardt
Big Data Cogn. Comput. 2022, 6(1), 8; https://doi.org/10.3390/bdcc6010008 - 10 Jan 2022
Cited by 5 | Viewed by 4531
Abstract
Classification problems are common activities in many different domains, and supervised learning algorithms have shown great promise in these areas. The classification of goods in international trade in Brazil represents a real challenge due to the complexity involved in assigning the correct category codes to a good, especially considering the tax penalties and legal implications of a misclassification. This work focuses on the training process of a classifier based on bidirectional encoder representations from transformers (BERT) for the tax classification of goods with MCN codes, which constitute the official classification system for import and export products in Brazil. In particular, this article presents results from using a specific Portuguese-language-pretrained BERT model, as well as results from using a multilingual-pretrained BERT model. Experimental results show that the Portuguese model had slightly better performance than the multilingual model, achieving an MCC of 0.8491, and confirm that the classifiers could be used to improve specialists’ performance in the classification of goods. Full article
15 pages, 1426 KiB  
Article
Infusing Autopoietic and Cognitive Behaviors into Digital Automata to Improve Their Sentience, Resilience, and Intelligence
by Rao Mikkilineni
Big Data Cogn. Comput. 2022, 6(1), 7; https://doi.org/10.3390/bdcc6010007 - 10 Jan 2022
Cited by 7 | Viewed by 4757
Abstract
All living beings use autopoiesis and cognition to manage their “life” processes from birth through death. Autopoiesis enables them to use the specification in their genomes to instantiate themselves using matter and energy transformations. They reproduce, replicate, and manage their stability. Cognition allows them to process information into knowledge and use it to manage the interactions between the various constituent parts within the system and the system’s interaction with its environment. Currently, various attempts are underway to make modern computers mimic the resilience and intelligence of living beings using symbolic and sub-symbolic computing. We discuss here the limitations of classical computer science for implementing autopoietic and cognitive behaviors in digital machines. We propose a new architecture applying the general theory of information (GTI), paving the path for digital automata to mimic living organisms by exhibiting autopoietic and cognitive behaviors. The new science, based on GTI, asserts that information is a fundamental constituent of the physical world and that living beings convert information into knowledge using physical structures that use matter and energy. Our proposal uses tools derived from GTI to provide a common knowledge representation spanning existing symbolic and sub-symbolic computing structures to implement autopoietic and cognitive behaviors. Full article
(This article belongs to the Special Issue Data, Structure, and Information in Artificial Intelligence)
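As a very rough, hedged illustration of the separation of concerns the abstract argues for, the toy Python sketch below pairs a conventional worker with an “autopoietic” manager that re-instantiates it from its specification when it fails and a “cognitive” manager that turns observations into decisions. The class names and behaviors are hypothetical and drastically simplified; this is not the GTI-based architecture itself.

    # Toy sketch (hypothetical, heavily simplified): a worker plus an autopoietic
    # manager that rebuilds it from its specification on failure, and a cognitive
    # manager that converts observations into decisions.
    import random

    class Worker:
        def __init__(self, spec):
            self.spec = dict(spec)        # the "genome": how to rebuild itself
            self.healthy = True

        def run(self):
            if random.random() < 0.2:     # simulated fault
                self.healthy = False
            return "ok" if self.healthy else "failed"

    class AutopoieticManager:
        """Maintains the worker's existence and stability."""
        def heal(self, worker):
            if not worker.healthy:
                worker.__init__(worker.spec)   # re-instantiate from the specification
                return "regenerated"
            return "stable"

    class CognitiveManager:
        """Turns observations (information) into decisions (knowledge in use)."""
        def decide(self, observation):
            return {"ok": "continue", "failed": "trigger-healing"}[observation]

    worker = Worker({"task": "service-request-processing"})
    auto, cog = AutopoieticManager(), CognitiveManager()
    for _ in range(5):
        if cog.decide(worker.run()) == "trigger-healing":
            print(auto.heal(worker))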
16 pages, 9862 KiB  
Article
On Developing Generic Models for Predicting Student Outcomes in Educational Data Mining
by Gomathy Ramaswami, Teo Susnjak and Anuradha Mathrani
Big Data Cogn. Comput. 2022, 6(1), 6; https://doi.org/10.3390/bdcc6010006 - 7 Jan 2022
Cited by 23 | Viewed by 5124
Abstract
Poor academic performance of students is a concern in the educational sector, especially if it leads to students being unable to meet minimum course requirements. With timely prediction of students’ performance, however, educators can detect at-risk students, thereby enabling early interventions that support these students in overcoming their learning difficulties. The majority of studies have taken the approach of developing individual prediction models that target a single course. These models are tailored to the specific attributes of each course amongst a very diverse set of possibilities. While this approach can yield accurate models in some instances, it has limitations: overfitting can take place when course data are scarce or when new courses are devised, and maintaining a large suite of per-course models is a significant overhead. This issue can be tackled by developing a generic, course-agnostic predictive model that captures more abstract patterns and can operate across all courses, irrespective of their differences. This study demonstrates how such a generic predictive model can be developed to identify at-risk students across a wide variety of courses. Experiments were conducted using a range of algorithms, with the generic model achieving effective predictive accuracy. The findings showed that the CatBoost algorithm performed best on our dataset across the F-measure, ROC (receiver operating characteristic) curve and AUC scores; it is therefore an excellent candidate algorithm for this domain, given its ability to seamlessly handle categorical and missing data, which are frequent in educational datasets. Full article
(This article belongs to the Special Issue Educational Data Mining and Technology)
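A minimal sketch of the kind of pipeline the abstract describes, assuming a tabular dataset of per-student features with a binary at-risk label: CatBoost consumes the categorical columns directly and the model is scored with F1 and ROC AUC. The file name, column names, and hyperparameters are illustrative assumptions, not the study's actual setup.

    # Hedged sketch: train a course-agnostic at-risk classifier with CatBoost
    # and evaluate it with F1 and ROC AUC. Columns and parameters are assumed.
    import pandas as pd
    from catboost import CatBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, roc_auc_score

    df = pd.read_csv("student_activity.csv")               # assumed input file
    cat_features = ["course_id", "gender", "enrol_mode"]   # assumed categorical columns
    X, y = df.drop(columns=["at_risk"]), df["at_risk"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=42)
    model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.1,
                               verbose=False)
    # CatBoost ingests categorical columns directly and handles missing
    # numeric values natively, which is why it suits educational datasets.
    model.fit(X_tr, y_tr, cat_features=cat_features)

    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print("F1 :", f1_score(y_te, pred))
    print("AUC:", roc_auc_score(y_te, proba))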
26 pages, 1044 KiB  
Article
A Hierarchical Hadoop Framework to Process Geo-Distributed Big Data
by Giuseppe Di Modica and Orazio Tomarchio
Big Data Cogn. Comput. 2022, 6(1), 5; https://doi.org/10.3390/bdcc6010005 - 6 Jan 2022
Cited by 3 | Viewed by 4173
Abstract
In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights from such Big Data. State-of-the-art parallel computing techniques such as MapReduce guarantee high performance in scenarios where the involved computing nodes are equally sized, clustered via broadband network links, and co-located with the data. Unfortunately, these techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are spread across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that are aware of the constraints imposed by those scenarios (such as the imbalance of the nodes’ computing power and of the interconnecting links) and enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper. Full article
(This article belongs to the Special Issue Big Data Analytics and Cloud Data Management)
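To give a feel for the kind of smart fragmentation argued for above, the short Python sketch below splits a dataset's blocks across data centers in proportion to an assumed per-site capacity score, so that faster sites receive more work. The capacity figures and the simple proportional rule are illustrative assumptions, not the scheduler actually implemented in the framework.

    # Illustrative sketch (not the paper's scheduler): partition N data blocks
    # across geo-distributed data centers in proportion to their relative
    # computing capacity, so the slowest site no longer dictates the makespan.
    def fragment(num_blocks, capacities):
        """capacities: dict mapping site -> relative computing power (arbitrary units)."""
        total = sum(capacities.values())
        shares = {site: int(num_blocks * cap / total)
                  for site, cap in capacities.items()}
        # hand any blocks lost to rounding to the most capable sites
        leftover = num_blocks - sum(shares.values())
        for site in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
            shares[site] += 1
        return shares

    # assumed capacities for three data centers
    print(fragment(1000, {"dc-eu": 8, "dc-us": 4, "dc-asia": 2}))
    # -> {'dc-eu': 572, 'dc-us': 286, 'dc-asia': 142}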
16 pages, 6652 KiB  
Article
Analyzing COVID-19 Medical Papers Using Artificial Intelligence: Insights for Researchers and Medical Professionals
by Dmitry Soshnikov, Tatiana Petrova, Vickie Soshnikova and Andrey Grunin
Big Data Cogn. Comput. 2022, 6(1), 4; https://doi.org/10.3390/bdcc6010004 - 5 Jan 2022
Cited by 1 | Viewed by 3968
Abstract
Since the beginning of the COVID-19 pandemic almost two years ago, there have been more than 700,000 scientific papers published on the subject. An individual researcher cannot possibly get acquainted with such a huge text corpus and, therefore, some help from artificial intelligence (AI) is highly needed. We propose the AI-based tool to help researchers navigate the medical papers collections in a meaningful way and extract some knowledge from scientific COVID-19 papers. The main idea of our approach is to get as much semi-structured information from text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and Text Analytics for Health service, then store the data into NoSQL database for further fast processing and insights generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Application of NLP and text-based emotion detection (TBED) methods to COVID-19 text corpus allows us to gain insights on important issues of diagnosis and treatment (such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus, etc.). Full article
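A minimal sketch of the extraction pipeline described above: run biomedical NER over paper abstracts and store the recognized entities in a NoSQL (MongoDB) collection for later querying. The checkpoint path stands in for a PubMedBERT model fine-tuned for NER, and the connection string, database names, and sample record are assumptions.

    # Hedged sketch: biomedical NER over abstracts, results stored in MongoDB.
    from transformers import pipeline
    from pymongo import MongoClient

    ner = pipeline("token-classification",
                   model="path/to/pubmedbert-finetuned-ner",  # hypothetical checkpoint
                   aggregation_strategy="simple")

    papers = [{"id": "cord-0001",
               "abstract": "Dexamethasone reduced mortality in ventilated patients."}]

    client = MongoClient("mongodb://localhost:27017")          # assumed connection
    collection = client["covid_papers"]["entities"]

    for paper in papers:
        entities = [{"text": e["word"],
                     "label": e["entity_group"],
                     "score": float(e["score"])}
                    for e in ner(paper["abstract"])]
        # one document per paper; downstream queries aggregate entities over time
        collection.insert_one({"paper_id": paper["id"], "entities": entities})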
16 pages, 1393 KiB  
Article
Analyzing Political Polarization on Social Media by Deleting Bot Spamming
by Riccardo Cantini, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio
Big Data Cogn. Comput. 2022, 6(1), 3; https://doi.org/10.3390/bdcc6010003 - 4 Jan 2022
Cited by 12 | Viewed by 7333
Abstract
Social media platforms are part of everyday life, allowing the interconnection of people around the world in large discussion groups relating to every topic, including important social or political issues. Therefore, social media have become a valuable source of information-rich data, commonly referred to as Social Big Data, effectively exploitable to study the behavior of people, their opinions, moods, interests and activities. However, these powerful communication platforms can also be used to manipulate conversations, polluting online content and altering the popularity of users through spamming activities and misinformation spreading. Recent studies have shown the use on social media of automatic entities, known as social bots, which appear as legitimate users by imitating human behavior with the aim of influencing discussions of any kind, including political issues. In this paper we present a new methodology, namely TIMBRE (Time-aware opInion Mining via Bot REmoval), aimed at discovering the polarity of social media users during election campaigns characterized by the rivalry of political factions. This methodology is temporally aware and relies on a keyword-based classification of posts and users. Moreover, it recognizes and filters out data produced by social media bots, which aim to alter public opinion about political candidates, thus avoiding heavily biased information. The proposed methodology has been applied to a case study that analyzes the polarization of a large number of Twitter users during the 2016 US presidential election. The achieved results show the benefits brought by both removing bots and taking temporal aspects into account in the forecasting process, revealing the high accuracy and effectiveness of the proposed approach. Finally, we investigated how the presence of social bots may affect political discussion by studying the 2016 US presidential election. Specifically, we analyzed the main differences between human and artificial political support, also estimating the influence of social bots on legitimate users. Full article
(This article belongs to the Special Issue Big Data and Cognitive Computing: 5th Anniversary Feature Papers)
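A hedged sketch of the two key steps the abstract describes, not the actual TIMBRE code: first drop posts authored by accounts flagged as bots, then assign each remaining post a faction via simple keyword matching. The keyword lists, the bot-score field, and its threshold are illustrative assumptions.

    # Illustrative sketch: bot filtering followed by keyword-based polarity
    # classification of tweets. Keywords and the bot-score cutoff are assumed.
    FACTION_KEYWORDS = {
        "candidate_A": {"#votea", "teama"},   # assumed keyword sets
        "candidate_B": {"#voteb", "teamb"},
    }
    BOT_SCORE_THRESHOLD = 0.8                 # assumed cutoff on a bot-likelihood score

    def is_bot(user):
        return user.get("bot_score", 0.0) >= BOT_SCORE_THRESHOLD

    def classify(tweet_text):
        words = set(tweet_text.lower().split())
        scores = {faction: len(words & kw)
                  for faction, kw in FACTION_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None   # None = no clear polarity

    tweets = [
        {"text": "#VoteA all the way", "user": {"bot_score": 0.10}},
        {"text": "spam spam #VoteB",   "user": {"bot_score": 0.95}},  # removed as a bot
    ]
    polarity = [classify(t["text"]) for t in tweets if not is_bot(t["user"])]
    print(polarity)   # ['candidate_A']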