Topic Editors

Prof. Dr. S. Ejaz Ahmed
Department of Mathematics and Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
Prof. Dr. Shuangge Steven Ma
Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
Prof. Dr. Peter X.K. Song
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA

Complex Data Analytics and Computing with Real-World Applications

Abstract submission deadline: closed (22 September 2022)
Manuscript submission deadline: closed (22 November 2022)
Viewed by: 63545

Topic Information

Dear Colleagues,

In today’s data-centric world, buzzwords abound in digital and print media. We encounter data in every walk of life, and the information they contain can be used to improve the well-being of society, business, health, and medicine. This presents substantial opportunities and challenges for analytically and objectively minded researchers. Extracting meaningful information from data is not an easy task, and the rapid growth in the size and scope of datasets across many disciplines has created the need for innovative strategies for analyzing and visualizing such data.

The rise of ‘Big Data’ will deepen our understanding of complex structured and unstructured data. Comprehensive analysis of Big Data in real-world applications calls for analytically rigorous methods. Various methods have been developed to accommodate complex high-dimensional features and to examine data patterns, variable relationships, prediction, and recommendation, and the supporting statistical theory has been developed in parallel.

The analysis of Big Data in real-world applications has drawn much attention from researchers worldwide. This Topic aims to provide a platform for the deep discussion of novel analytic methods developed for Big Data in applied areas. Both applied and theoretical contributions to these areas will be showcased.

The contributions to this Topic will present new and original research in analytic methods and novel applications. Contributions may take an applied or theoretical perspective and address a range of problems, with particular emphasis on data analytics and statistical methodology. Manuscripts summarizing the most recent state of the art on these topics are also welcome.

Prof. Dr. S. Ejaz Ahmed
Prof. Dr. Shuangge Steven Ma
Prof. Dr. Peter X.K. Song
Topic Editors

Keywords

  • algorithm
  • data science
  • statistical learning
  • machine learning
  • novel applications
  • new conceptual frameworks of estimation
  • large-scale structured data
  • unstructured data

Participating Journals

Journal Name | Impact Factor | CiteScore | Launched Year | First Decision (median) | APC
Entropy | 2.1 | 4.9 | 1999 | 22.4 days | CHF 2600
Algorithms | 1.8 | 4.1 | 2008 | 15 days | CHF 1600
Big Data and Cognitive Computing | 3.7 | 7.1 | 2017 | 18 days | CHF 1800
Data | 2.2 | 4.3 | 2016 | 27.7 days | CHF 1600

Preprints.org is a multidisciplinary platform providing a preprint service dedicated to sharing your research from the start and empowering your research journey.

MDPI Topics cooperates with Preprints.org and has built a direct connection between MDPI journals and Preprints.org. Authors are encouraged to take advantage of the following benefits by posting a preprint at Preprints.org prior to publication:

  1. Immediately share your ideas ahead of publication and establish your research priority;
  2. Protect your idea with a time-stamped preprint article;
  3. Enhance the exposure and impact of your research;
  4. Receive feedback from your peers in advance;
  5. Have it indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (17 papers)

13 pages, 428 KiB  
Article
ImputeGAN: Generative Adversarial Network for Multivariate Time Series Imputation
by Rui Qin and Yong Wang
Entropy 2023, 25(1), 137; https://doi.org/10.3390/e25010137 - 10 Jan 2023
Cited by 10 | Viewed by 5419
Abstract
Since missing values in multivariate time series data are inevitable, many researchers have come up with methods to deal with the missing data. These include case deletion methods, statistics-based imputation methods, and machine learning-based imputation methods. However, these methods cannot handle temporal information, or the complementation results are unstable. We propose a model based on generative adversarial networks (GANs) and an iterative strategy based on the gradient of the complementary results to solve these problems. This ensures the generalizability of the model and the reasonableness of the complementation results. We conducted experiments on three large-scale datasets and compared the proposed model with traditional complementation methods. The experimental results show that ImputeGAN outperforms traditional complementation methods in terms of accuracy of complementation.
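As a rough sketch of how a GAN-based imputer of this kind can operate at inference time, the following PyTorch snippet freezes a generator and optimizes a latent code by gradient descent so that the generated series matches the observed entries, then fills in the missing ones. The generator here is an untrained stand-in and all shapes are illustrative; this is not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in generator mapping a latent vector to a (T x D) series.
# In a GAN-based imputer the generator would first be trained on complete
# sequences; an untrained module is used here purely to show the loop.
T, D, Z = 48, 4, 16
generator = nn.Sequential(nn.Linear(Z, 64), nn.ReLU(), nn.Linear(64, T * D))

def impute(x, mask, n_iter=200, lr=1e-2):
    """Fill missing entries of x (mask == 0) by optimizing a latent code so
    the generated series matches x on the observed entries (mask == 1)."""
    z = torch.randn(Z, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        g = generator(z).view(T, D)
        loss = ((g - x) ** 2 * mask).sum() / mask.sum()  # masked reconstruction
        loss.backward()
        opt.step()
    with torch.no_grad():
        g = generator(z).view(T, D)
    return torch.where(mask.bool(), x, g)  # keep observed, fill missing

x = torch.randn(T, D)
mask = (torch.rand(T, D) > 0.2).float()  # roughly 20% missing
completed = impute(x * mask, mask)
```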

19 pages, 4093 KiB  
Article
Unsupervised Anomaly Detection for Intermittent Sequences Based on Multi-Granularity Abnormal Pattern Mining
by Lilin Fan, Jiahu Zhang, Wentao Mao and Fukang Cao
Entropy 2023, 25(1), 123; https://doi.org/10.3390/e25010123 - 7 Jan 2023
Cited by 3 | Viewed by 2116
Abstract
In the actual maintenance of manufacturing enterprises, abnormal changes in after-sale parts demand data often make the inventory strategies unreasonable. Due to the intermittent and small-scale characteristics of demand sequences, it is difficult to accurately identify the anomalies in such sequences using current anomaly detection algorithms. To solve this problem, this paper proposes an unsupervised anomaly detection method for intermittent time series. First, a new abnormal fluctuation similarity matrix is built by calculating the squared coefficient of variation and the maximum information coefficient at the macroscopic granularity. The abnormal fluctuation sequences can then be adaptively screened by using agglomerative hierarchical clustering. Second, the demand change feature and interval feature of the abnormal sequences are constructed and fed into a support vector data description model to perform hypersphere training, and abnormal point locations are then detected in an unsupervised manner at the micro-granularity level within the abnormal sequences. Comparative experiments were carried out on the actual demand data of after-sale parts of two large manufacturing enterprises. The results show that, compared with current representative anomaly detection methods, the proposed approach can effectively identify the abnormal fluctuation positions in small-sample intermittent sequences and obtain better detection results.
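To illustrate the macro-granularity screening step, here is a minimal Python sketch that computes the squared coefficient of variation for toy intermittent sequences and screens abnormal ones with agglomerative clustering. It replaces the paper's CV²/MIC similarity matrix with plain summary features, so it only sketches the idea.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy intermittent demand sequences: mostly zeros with occasional spikes.
seqs = rng.poisson(0.3, size=(40, 52)).astype(float)
seqs[:5] *= 8  # inject a few abnormally fluctuating series

def cv2(s):
    """Squared coefficient of variation of the non-zero demand sizes,
    a standard dispersion measure for intermittent series."""
    nz = s[s > 0]
    if len(nz) < 2:
        return 0.0
    return (nz.std() / nz.mean()) ** 2

# Macro-granularity screening: cluster series by simple fluctuation
# features; the cluster with the larger mean CV^2 is flagged abnormal.
feats = np.array([[cv2(s), s.mean(), s.std()] for s in seqs])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(feats)
abn = int(feats[labels == 1, 0].mean() > feats[labels == 0, 0].mean())
print("abnormal series indices:", np.where(labels == abn)[0])
```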

7 pages, 1006 KiB  
Data Descriptor
Numerical and Experimental Data of the Implementation of Logic Gates in an Erbium-Doped Fiber Laser (EDFL)
by Samuel Mardoqueo Afanador Delgado, José Luis Echenausía Monroy, Guillermo Huerta Cuellar, Juan Hugo García López and Rider Jaimes Reátegui
Data 2023, 8(1), 7; https://doi.org/10.3390/data8010007 - 26 Dec 2022
Cited by 1 | Viewed by 1814
Abstract
In this article, the methods for obtaining time series from an erbium-doped fiber laser (EDFL) and its numerical simulation are described. In addition, the nature of the obtained files, the meaning of the changing file names, and the ways of accessing these files are described in detail. The response of the laser emission is controlled by the intensity of a digital signal added to the modulation, which allows for various logical operations. The numerical results are in good agreement with experimental observations. The authors provide all of the time series from an experimental implementation where various logic gates are obtained.

16 pages, 675 KiB  
Article
University Academic Performance Development Prediction Based on TDA
by Daohua Yu, Xin Zhou, Yu Pan, Zhendong Niu, Xu Yuan and Huafei Sun
Entropy 2023, 25(1), 24; https://doi.org/10.3390/e25010024 - 23 Dec 2022
Cited by 4 | Viewed by 2393
Abstract
With the rapid development of higher education, the evaluation of the academic growth potential of universities has received extensive attention from scholars and educational administrators. Although the number of papers on university academic evaluation is increasing, few scholars have conducted research on the changing trend of university academic performance. Because traditional statistical methods and deep learning techniques have proven incapable of handling short time series data well, this paper proposes to adopt topological data analysis (TDA) to extract specified features from short time series data and then construct a model for predicting the trend of university academic performance. The performance of the proposed method is evaluated by experiments on a real-world university academic performance dataset. By comparing the prediction results given by the Markov chain as well as SVM on the original data and on TDA statistics, respectively, we demonstrate that the data generated by TDA methods can help construct very discriminative models and have a great advantage over the traditional models. In addition, this paper gives the prediction results as a reference, which provides a new perspective for the development evaluation of the academic performance of colleges and universities.
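A minimal Python sketch of the pipeline's flavor, assuming the ripser package for persistence computation: each short series is turned into a point cloud via a sliding-window (Takens) embedding, fixed-length statistics of the persistence diagrams serve as features, and an SVM is trained on them. The embedding parameters, features, and labels are illustrative, not the paper's exact statistics.

```python
import numpy as np
from ripser import ripser  # pip install ripser
from sklearn.svm import SVC

def takens_embedding(series, dim=3, tau=1):
    """Sliding-window (Takens) embedding of a short 1-D series into R^dim."""
    n = len(series) - (dim - 1) * tau
    return np.array([series[i:i + (dim - 1) * tau + 1:tau] for i in range(n)])

def persistence_stats(series):
    """Summary statistics of the H0/H1 persistence diagrams, used as
    fixed-length features for a conventional classifier."""
    dgms = ripser(takens_embedding(series))['dgms']
    feats = []
    for dgm in dgms:
        finite = np.isfinite(dgm[:, 1]) if len(dgm) else np.array([], bool)
        life = dgm[finite, 1] - dgm[finite, 0]
        feats += [life.sum(), life.max() if len(life) else 0.0, float(len(life))]
    return feats

rng = np.random.default_rng(1)
up = [np.cumsum(rng.normal(0.5, 1, 12)) for _ in range(30)]    # "improving"
down = [np.cumsum(rng.normal(-0.5, 1, 12)) for _ in range(30)]  # "declining"
X = np.array([persistence_stats(s) for s in up + down])
y = np.array([1] * 30 + [0] * 30)
clf = SVC().fit(X, y)
print("training accuracy:", clf.score(X, y))
```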

17 pages, 3257 KiB  
Article
Tr-Predictor: An Ensemble Transfer Learning Model for Small-Sample Cloud Workload Prediction
by Chunhong Liu, Jie Jiao, Weili Li, Jingxiong Wang and Junna Zhang
Entropy 2022, 24(12), 1770; https://doi.org/10.3390/e24121770 - 3 Dec 2022
Cited by 5 | Viewed by 1986
Abstract
Accurate workload prediction plays a key role in intelligent scheduling decisions on cloud platforms. There are massive amounts of short workload sequences in cloud platforms, and the small amount of data and the presence of outliers make accurate workload sequence prediction a challenge. For the above issues, this paper proposes an ensemble learning method based on sample weight transfer and long short-term memory (LSTM), termed Tr-Predictor. Specifically, a selection method for similar sequences combining time warp edit distance (TWED) and transfer entropy (TE) is proposed to select a source-domain dataset with higher similarity to the target workload sequence. Then, we upgrade the base learner of the two-stage TrAdaBoost.R2 ensemble model to an LSTM to enhance the ability of the ensemble model to extract sequence features. To optimize the weight adjustment strategy, we adopt a two-stage weight adjustment strategy and select the best weight for the learner according to the sample error and model error. Finally, the above process determines the parameters of the target model, which is used to predict the short-task sequences. In the experimental validation, we arbitrarily select nine sets of short-workload data from the Google dataset and three sets of short-workload data from the Alibaba cluster to verify the prediction effectiveness of the proposed algorithm. The experimental results show that, compared with commonly used cloud workload prediction methods, Tr-Predictor has higher prediction accuracy on small-sample workloads. The prediction indicators of the ablation experiments show the performance gain of each part of the proposed method.
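The core of the sample-weight transfer can be sketched as follows, using the standard TrAdaBoost-style update from the transfer learning literature as an approximation of the paper's two-stage strategy (constants and names are illustrative): source samples the learner fits poorly are down-weighted, while poorly fit target samples are up-weighted.

```python
import numpy as np

def tradaboost_r2_update(err_src, err_tgt, w_src, w_tgt, n_rounds=20):
    """One TrAdaBoost-style weight update (simplified sketch). err_* are
    per-sample losses normalized to [0, 1]; w_* are current sample weights."""
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    e_t = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)   # weighted target error
    beta_tgt = e_t / (1.0 - e_t)                    # AdaBoost.R2 rule (assumes e_t < 0.5)
    w_src = w_src * beta_src ** err_src             # shrink dissimilar source samples
    w_tgt = w_tgt * beta_tgt ** (-err_tgt)          # grow hard target samples
    total = w_src.sum() + w_tgt.sum()
    return w_src / total, w_tgt / total

rng = np.random.default_rng(0)
w_src, w_tgt = np.ones(100) / 120, np.ones(20) / 120
w_src, w_tgt = tradaboost_r2_update(rng.random(100), rng.random(20), w_src, w_tgt)
print(w_src.sum() + w_tgt.sum())  # weights stay normalized
```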

23 pages, 9872 KiB  
Article
A Linguistic Analysis of News Coverage of E-Healthcare in China with a Heterogeneous Graphical Model
by Mengque Liu, Xinyan Fan and Shuangge Ma
Entropy 2022, 24(11), 1557; https://doi.org/10.3390/e24111557 - 29 Oct 2022
Viewed by 1586
Abstract
E-healthcare has been envisaged as a major component of the infrastructure of modern healthcare and has been developing rapidly in China. For healthcare, news media can play an important role in raising public interest in and utilization of a particular service, and in complicating (and, perhaps, clouding) debate on public health policy issues. We conducted a linguistic analysis of news reports from January 2015 to June 2021 related to E-healthcare in mainland China, using a heterogeneous graphical modeling approach. This approach can simultaneously cluster the datasets and estimate the conditional dependence relationships of keywords. It was found that there were eight phases of media coverage. The focuses and main topics of media coverage were extracted based on network hubs and module detection. The temporal patterns of media reports were found to be mostly consistent with the policy trend. Specifically, in the policy embryonic period (2015–2016), two phases were obtained; industry management was the main topic, and policy and regulation were the focuses of media coverage. In the policy development period (2017–2019), four phases were discovered, and all four main topics, namely industry development, health care, financial market, and industry management, were present. In 2017 Q3–2017 Q4, the major focuses of media coverage included social security, healthcare and reform, and others; in 2018 Q1, industry regulation and finance became the focuses. In the policy outbreak period (2020–), two phases were discovered; financial market and industry management were the main topics, and medical insurance and healthcare for the elderly became the focuses. This analysis can offer insights into how the media responds to public policy for E-healthcare, which can be valuable for the government, public health practitioners, health care industry investors, and others.
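The per-phase network estimation step can be approximated with an off-the-shelf sparse Gaussian graphical model; a minimal scikit-learn sketch on toy keyword data follows. The heterogeneous model in the paper additionally clusters reports into phases, which this sketch omits, and the data here are simulated.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Toy keyword-frequency matrix: rows are news reports in one phase,
# columns are standardized keyword counts.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
X[:, 1] += 0.8 * X[:, 0]          # induce dependence between keywords 0 and 1
X = (X - X.mean(0)) / X.std(0)

model = GraphicalLassoCV().fit(X)
prec = model.precision_
# Nonzero off-diagonal precision entries = conditional dependence edges.
edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
         if abs(prec[i, j]) > 1e-4]
print("estimated keyword edges:", edges)
```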

12 pages, 1754 KiB  
Article
Assessing the Accuracy of Google Trends for Predicting Presidential Elections: The Case of Chile, 2006–2021
by Francisco Vergara-Perucich
Data 2022, 7(11), 143; https://doi.org/10.3390/data7110143 - 27 Oct 2022
Cited by 3 | Viewed by 2223
Abstract
This article presents the results of reviewing the predictive capacity of Google Trends for national elections in Chile. The electoral results of the elections between Michelle Bachelet and Sebastián Piñera in 2006, Sebastián Piñera and Eduardo Frei in 2010, Michelle Bachelet and Evelyn Matthei in 2013, Sebastián Piñera and Alejandro Guillier in 2017, and Gabriel Boric and José Antonio Kast in 2021 were reviewed. The time series analyzed were organized on the basis of relative searches between the candidacies, assisted by R software, mainly with the gtrendsR and forecast libraries. With the series constructed, forecasts were made using the Auto Regressive Integrated Moving Average (ARIMA) technique to check the weight of one presidential option over the other. The ARIMA analyses were performed on three arrangements of the data: the linear series, the series transformed by moving average, and the series transformed by the Hodrick–Prescott filter. The results indicate that the method offers optimal predictive ability.
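The paper works in R with gtrendsR and forecast; a minimal Python equivalent using statsmodels on a simulated relative-search series could look like the following, mirroring the three data arrangements (raw, moving average, Hodrick–Prescott trend). Orders and the smoothing parameter are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.filters.hp_filter import hpfilter

# Simulated weekly share of relative searches for candidate A vs. candidate B.
rng = np.random.default_rng(1)
share = pd.Series(0.5 + np.cumsum(rng.normal(0, 0.01, 104)).clip(-0.3, 0.3))

# Arrangement 1: ARIMA on the raw (linear) series.
fc_raw = ARIMA(share, order=(1, 1, 1)).fit().forecast(steps=4)

# Arrangement 2: ARIMA on a moving-average-smoothed series.
smoothed = share.rolling(4).mean().dropna().reset_index(drop=True)
fc_ma = ARIMA(smoothed, order=(1, 1, 1)).fit().forecast(steps=4)

# Arrangement 3: ARIMA on the Hodrick-Prescott trend component.
cycle, trend = hpfilter(share, lamb=1600)
fc_hp = ARIMA(trend, order=(1, 1, 1)).fit().forecast(steps=4)

# A forecast share persistently above 0.5 points to candidate A.
print(fc_raw.mean(), fc_ma.mean(), fc_hp.mean())
```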

21 pages, 6061 KiB  
Article
Improving Real Estate Rental Estimations with Visual Data
by Ilia Azizi and Iegor Rudnytskyi
Big Data Cogn. Comput. 2022, 6(3), 96; https://doi.org/10.3390/bdcc6030096 - 9 Sep 2022
Cited by 4 | Viewed by 3610
Abstract
Multi-modal data are widely available for online real estate listings. Announcements can contain various forms of data, including visual data and unstructured textual descriptions. Nonetheless, many traditional real estate pricing models rely solely on well-structured tabular features. This work investigates whether it is possible to improve the performance of the pricing model using additional unstructured data, namely images of the property and satellite images. We compare four models based on the type of input data they use: (1) tabular data only, (2) tabular data and property images, (3) tabular data and satellite images, and (4) tabular data and a combination of property and satellite images. In a supervised context, the branches of dedicated neural networks for each data type are fused (concatenated) to predict log rental prices. The novel dataset devised for the study (SRED) consists of 11,105 flat rentals advertised over the internet in Switzerland. The results reveal that using all three sources of data generally outperforms machine learning models built on only tabular information. The findings pave the way for further research on integrating other non-structured inputs, for instance, the textual descriptions of properties.
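A minimal sketch of the fused architecture in Keras: one branch per modality, concatenated before a regression head for log rental price. Layer sizes and input shapes are illustrative placeholders, not the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Tabular branch: a small dense encoder for structured features.
tab_in = layers.Input(shape=(20,), name="tabular")
tab = layers.Dense(64, activation="relu")(tab_in)

# Property-image branch: a tiny CNN stands in for a real image encoder.
img_in = layers.Input(shape=(128, 128, 3), name="property_image")
img = layers.Conv2D(16, 3, activation="relu")(img_in)
img = layers.GlobalAveragePooling2D()(img)

# Satellite-image branch: same structure, separate weights.
sat_in = layers.Input(shape=(128, 128, 3), name="satellite_image")
sat = layers.Conv2D(16, 3, activation="relu")(sat_in)
sat = layers.GlobalAveragePooling2D()(sat)

# Fusion by concatenation, then a regression head for log rental price.
fused = layers.Concatenate()([tab, img, sat])
out = layers.Dense(1, name="log_rent")(layers.Dense(64, activation="relu")(fused))

model = Model([tab_in, img_in, sat_in], out)
model.compile(optimizer="adam", loss="mse")
model.summary()
```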

10 pages, 3948 KiB  
Data Descriptor
Description and Use of Three-Dimensional Numerical Phantoms of Cardiac Computed Tomography Images
by Miguel Vera, Antonio Bravo and Rubén Medina
Data 2022, 7(8), 115; https://doi.org/10.3390/data7080115 - 16 Aug 2022
Cited by 3 | Viewed by 2088
Abstract
The World Health Organization identifies heart disease as the leading cause of death. These diseases can be detected using several imaging modalities, especially cardiac computed tomography (CT), whose images have imperfections associated with noise and certain artifacts. To minimize the impact of these imperfections on the quality of the CT images, several researchers have developed digital image processing techniques (DPIT), whose quality is evaluated considering several metrics and databases (DB), both real and simulated. This article describes the processes that made it possible to generate and utilize six three-dimensional synthetic cardiac DBs, or voxel-based numerical phantoms. An exhaustive analysis of the most relevant features of images of the left ventricle, belonging to a real CT DB of the human heart, was performed. These features are recreated in the synthetic DBs, generating a reference phantom or ground truth free of imperfections (DB1) and five phantoms in which Poisson noise (DB2), the stair-step artifact (DB3), the streak artifact (DB4), both artifacts (DB5), and all imperfections (DB6) are incorporated. These DBs can be used to determine the performance of DPIT aimed at decreasing the effect of these imperfections on the quality of cardiac images.
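The DB1/DB2 relationship can be illustrated in a few lines of NumPy: start from a clean synthetic volume and degrade it with Poisson (photon-counting) noise, whose variance equals its mean. Intensities and geometry below are invented for illustration, not the paper's phantom.

```python
import numpy as np

rng = np.random.default_rng(4)

# DB1-style ground truth: a clean synthetic volume with a bright
# ellipsoidal "ventricle" embedded in a darker background.
z, y, x = np.mgrid[0:64, 0:64, 0:64]
truth = np.where(((z - 32)**2 + (y - 32)**2 / 2 + (x - 32)**2) < 300, 200.0, 50.0)

# DB2-style degradation: Poisson noise. The scale factor controls the
# noise level because Poisson variance equals the mean count.
scale = 0.5
noisy = rng.poisson(truth * scale) / scale

print("ground-truth mean:", truth.mean(), "noisy mean:", noisy.mean())
```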

38 pages, 15102 KiB  
Article
Go Wild for a While? A Bibliometric Analysis of Two Themes in Tourism Demand Forecasting from 1980 to 2021: Current Status and Development
by Yuruixian Zhang, Wei Chong Choo, Yuhanis Abdul Aziz, Choy Leong Yee and Jen Sim Ho
Data 2022, 7(8), 108; https://doi.org/10.3390/data7080108 - 31 Jul 2022
Cited by 3 | Viewed by 2953
Abstract
Despite the fact that the concept of forecasting has emerged in the realm of tourism, studies delving into this sector have yet to provide a comprehensive overview of the evolution of tourism forecasting visualization. This research presents an analysis of the current state of the art in tourism demand forecasting (TDF) and combined tourism demand forecasting (CTDF) systems. Based on the Web of Science Core Collection database, this study built a framework for bibliometric analysis of these fields in three distinct phases (1980–2021). Furthermore, the VOSviewer analysis software was employed to yield a clearer picture of the current status of and developments in tourism forecasting research. Descriptive analysis and comprehensive knowledge network mappings using approaches such as co-citation analysis and cooperation networking were employed to identify trending research topics; the most important countries/regions, institutions, publications, and articles; and the most influential researchers. The results demonstrate that scientific output pertaining to TDF exceeds the output pertaining to CTDF, with a substantial and exponential increase in both over recent years. In addition, the results indicate that tourism forecasting research has become increasingly diversified, with numerous combined methods presented. Furthermore, the most influential papers and authors were evaluated based on their citations, publications, network position, and relevance. The contemporary themes were also analyzed, and obstacles to the expansion of the literature were identified. This is the first study on these two topics to demonstrate the ways in which bibliometric visualization can assist researchers in gaining perspective in the tourism forecasting field by effectively communicating key findings, facilitating data exploration, and providing valuable data for future research.

18 pages, 1953 KiB  
Article
An Influence Maximization Algorithm for Dynamic Social Networks Based on Effective Links
by Baojun Fu, Jianpei Zhang, Hongna Bai, Yuting Yang and Yu He
Entropy 2022, 24(7), 904; https://doi.org/10.3390/e24070904 - 30 Jun 2022
Cited by 5 | Viewed by 2383
Abstract
Connections between users in social networks persist for a certain period of time, and the resulting static network structure provides the basic conditions for many kinds of research, especially for discovering customer groups that can generate great influence, which is important for product promotion, epidemic prevention and control, public opinion supervision, and more. However, conventional influence maximization ignores the timeliness of interaction behaviors among users, so the selected target users cannot diffuse information well, and the time complexity of relying on greedy rules to handle the influence maximization problem is high. Therefore, this paper analyzes the influence of the interactions between nodes in dynamic social networks on information dissemination, extends the classical independent cascade model to a dynamic social network dissemination model based on effective links, and proposes a two-stage influence maximization algorithm (Outdegree Effective Link, OEL) based on node degree and effective links to enhance the efficiency of problem solving. To verify the effectiveness of the algorithm, five typical influence maximization methods are compared and analyzed on four real data sets. The results show that the OEL algorithm performs well in terms of propagation range and running time.
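A toy Python sketch of the two ingredients: an independent cascade restricted to "effective" (recent enough) links, and a two-stage seed selection that first ranks nodes by out-degree over effective links and then scores candidates by simulated spread. The edge data, window, and activation probability are illustrative.

```python
import random
from collections import defaultdict

# Timestamped edges (u, v, t): a link is "effective" for diffusion only if
# it is recent enough relative to the current time (window W).
edges = [(0, 1, 1), (1, 2, 1), (1, 3, 2), (3, 4, 3), (2, 4, 5), (0, 2, 5)]
W, P = 2, 0.4  # effectiveness window, activation probability

def spread(seeds, t_now):
    """One independent-cascade simulation restricted to effective links."""
    adj = defaultdict(list)
    for u, v, t in edges:
        if t_now - t <= W:          # keep only still-effective links
            adj[u].append(v)
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in active and random.random() < P:
                active.add(v)
                frontier.append(v)
    return len(active)

# Stage 1: rank candidates by out-degree over effective links.
out_deg = defaultdict(int)
for u, v, t in edges:
    if 5 - t <= W:
        out_deg[u] += 1
cands = sorted(out_deg, key=out_deg.get, reverse=True)[:3]
# Stage 2: keep the candidate with the best average simulated spread.
best = max(cands, key=lambda s: sum(spread([s], 5) for _ in range(200)))
print("selected seed:", best)
```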

15 pages, 1455 KiB  
Article
Entropy-Weight-Method-Based Integrated Models for Short-Term Intersection Traffic Flow Prediction
by Wenrui Qu, Jinhong Li, Wenting Song, Xiaoran Li, Yue Zhao, Hanlin Dong, Yanfei Wang, Qun Zhao and Yi Qi
Entropy 2022, 24(7), 849; https://doi.org/10.3390/e24070849 - 21 Jun 2022
Cited by 24 | Viewed by 3406
Abstract
Three different types of entropy weight methods (EWMs), i.e., EWM-A, EWM-B, and EWM-C, have been used in previous studies for integrating prediction models. These three methods use very different ideas for determining the weights of individual models for integration. To evaluate the performance of these three EWMs, this study applied them to developing integrated short-term traffic flow prediction models for signalized intersections. First, two individual models, i.e., a k-nearest neighbors (KNN)-algorithm-based model and a neural-network-based model (Elman), were developed as the individual models to be integrated using EWMs. These two models were selected because they have been widely used for traffic flow prediction and have been shown to achieve good performance. After that, three integrated models were developed by using the three different types of EWMs. The performances of the three integrated models, as well as the individual KNN and Elman models, were compared. We found that the traffic flow predicted with the EWM-C model is the most accurate for most of the days. Based on the model evaluation results, the advantages of using the EWM-C method were deliberated and the problems with the EWM-A and EWM-B methods were discussed.
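For reference, the classic entropy weight computation looks as follows; the three EWM variants compared in the paper differ in how they define the inputs and apply the weights, so this sketch shows only the general recipe on toy scores.

```python
import numpy as np

def entropy_weights(perf):
    """Classic entropy weight method: rows are evaluation periods, columns
    are candidate models, entries are benefit-type performance scores
    (e.g., inverted prediction errors). Columns whose scores vary more
    across periods carry more information and receive larger weights."""
    p = perf / perf.sum(axis=0, keepdims=True)            # normalize per column
    n = perf.shape[0]
    with np.errstate(divide="ignore", invalid="ignore"):
        e = -np.nansum(p * np.log(p), axis=0) / np.log(n)  # column entropies
    d = 1.0 - e                                            # divergence degree
    return d / d.sum()

# Toy scores for KNN and Elman over five periods (higher = better).
scores = np.array([[0.8, 0.6], [0.7, 0.9], [0.9, 0.5], [0.6, 0.7], [0.8, 0.6]])
w = entropy_weights(scores)
pred_knn, pred_elman = 102.0, 96.0
print("weights:", w, "integrated flow:", w @ [pred_knn, pred_elman])
```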

21 pages, 6986 KiB  
Article
Unsupervised Few Shot Key Frame Extraction for Cow Teat Videos
by Youshan Zhang, Matthias Wieland and Parminder S. Basran
Data 2022, 7(5), 68; https://doi.org/10.3390/data7050068 - 23 May 2022
Cited by 2 | Viewed by 3125
Abstract
A novel method of monitoring the health of dairy cows in large-scale dairy farms is proposed via image-based analysis of cows on rotary-based milking platforms, where deep learning is used to classify the extent of teat-end hyperkeratosis. The videos can be analyzed to segment the teats for feature analysis, which can then be used to assess the risk of infections and other diseases. This analysis can be performed more efficiently by using the key frames of each cow as they pass through the image frame. Extracting key frames from these videos would greatly simplify this analysis, but there are several challenges. First, data collection in the farm setting is harsh, resulting in unpredictable temporal key frame positions; empty, obfuscated, or shifted images of the cows’ teats; frequently empty stalls due to challenges with herding cows into the parlor; and regular interruptions and reversals in the direction of the parlor. Second, supervised learning requires expensive and time-consuming human annotation of key frames, which is impractical in large commercial dairy farms housing thousands of cows. Unsupervised learning methods rely on large frame differences and often suffer from low performance. In this paper, we propose a novel unsupervised few-shot learning model which extracts key frames from large (∼21,000 frames) video streams. Using a simple L1 distance metric that combines both image and deep features between each unlabeled frame and a few (32) labeled key frames, a key frame selection mechanism, and a quality check process, key frames can be extracted with sufficient accuracy (F score 63.6%) and timeliness (<10 min per 21,000 frames) for the demands of commercial dairy farm settings.
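The selection mechanism can be sketched in a few NumPy lines: compute the L1 distance from every unlabeled frame's feature vector to its nearest labeled key frame and keep the closest frames as candidates. Feature sizes and the threshold are illustrative, and the paper applies an additional quality check after this step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy features: rows are video frames, columns are concatenated image and
# deep features (sizes illustrative). A few labeled key frames act as the
# few-shot references.
frames = rng.normal(size=(2000, 128))
key_refs = rng.normal(size=(32, 128))

# L1 distance from each unlabeled frame to its nearest labeled key frame.
d = np.min([np.abs(frames - r).sum(axis=1) for r in key_refs], axis=0)

# Frames below a data-driven threshold become key-frame candidates.
candidates = np.where(d <= np.percentile(d, 5))[0]
print(len(candidates), "candidate key frames")
```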

19 pages, 3173 KiB  
Article
A Comparative Study of MongoDB and Document-Based MySQL for Big Data Application Data Management
by Cornelia A. Győrödi, Diana V. Dumşe-Burescu, Doina R. Zmaranda and Robert Ş. Győrödi
Big Data Cogn. Comput. 2022, 6(2), 49; https://doi.org/10.3390/bdcc6020049 - 5 May 2022
Cited by 12 | Viewed by 14181
Abstract
In the context of the heavy demands of Big Data, software developers have also begun to consider NoSQL data storage solutions. One of the important criteria when choosing a NoSQL database for an application is its performance in terms of speed of data accessing and processing, including response times to the most important CRUD operations (CREATE, READ, UPDATE, DELETE). In this paper, the behavior of two of the major document-based NoSQL databases, MongoDB and document-based MySQL, was analyzed in terms of the complexity and performance of CRUD operations, especially query operations. The main objective of the paper is to make a comparative analysis of the impact that each specific database has on application performance when handling CRUD requests. To perform this analysis, a case-study application was developed using the two document-based databases, MongoDB and MySQL; it aims to model and streamline the activity of service providers that use a lot of data. The results obtained demonstrate the performance of both databases for different volumes of data; based on these, a detailed analysis and several conclusions are presented to support a decision in choosing an appropriate solution for use in a big-data application.
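A minimal timing harness for the MongoDB side of such a comparison, using pymongo, might look like this; the connection string, database, collection, and document schema are placeholders, and the document-store MySQL side would follow the same pattern with its own driver.

```python
import time
from pymongo import MongoClient

# Placeholder connection and collection names for a local benchmark.
coll = MongoClient("mongodb://localhost:27017")["benchdb"]["orders"]
coll.drop()

def timed(label, fn):
    """Run fn once and print its wall-clock time."""
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")

N = 10_000
timed("CREATE", lambda: coll.insert_many(
    [{"order_id": i, "amount": i % 100, "status": "open"} for i in range(N)]))
timed("READ", lambda: list(coll.find({"amount": {"$gt": 50}})))
timed("UPDATE", lambda: coll.update_many(
    {"status": "open"}, {"$set": {"status": "closed"}}))
timed("DELETE", lambda: coll.delete_many({}))
```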

19 pages, 2456 KiB  
Article
A New Ontology-Based Method for Arabic Sentiment Analysis
by Safaa M. Khabour, Qasem A. Al-Radaideh and Dheya Mustafa
Big Data Cogn. Comput. 2022, 6(2), 48; https://doi.org/10.3390/bdcc6020048 - 29 Apr 2022
Cited by 11 | Viewed by 4597
Abstract
Arabic sentiment analysis is a process that aims to extract the subjective opinions of different users about different subjects, since these opinions and sentiments are used to recognize their perspectives and judgments in a particular domain. Few research studies have addressed semantic-oriented approaches for Arabic sentiment analysis based on domain ontologies and feature importance. In this paper, we built a semantic orientation approach for calculating overall polarity from Arabic subjective texts based on a built domain ontology and an available sentiment lexicon. We used the ontology concepts to extract and weight the semantic domain features by considering their levels in the ontology tree and their frequencies in the dataset, in order to compute the overall polarity of a given textual review based on the importance of each domain feature. For evaluation, an Arabic dataset from the hotel domain was selected to build the domain ontology and to test the proposed approach. The overall accuracy and F-measure reached 79.20% and 78.75%, respectively. The results showed that the approach outperformed other semantic orientation approaches, making it an appealing approach for Arabic sentiment analysis.
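The weighting idea can be sketched directly: a feature's weight grows with its depth in the ontology tree and its corpus frequency, and a review's polarity is the weighted average of its features' lexicon scores. All depths, frequencies, and scores below are invented placeholders, not the paper's hotel ontology or lexicon.

```python
# Placeholder ontology depths (level in the tree) and corpus frequencies.
depth = {"hotel": 1, "room": 2, "bed": 3, "staff": 2}
freq = {"hotel": 120, "room": 300, "bed": 80, "staff": 150}

def feature_weight(f):
    """Deeper (more specific) and more frequent features weigh more."""
    total = sum(freq.values())
    return depth[f] * freq[f] / total

def review_polarity(feature_sentiments):
    """feature_sentiments: {feature: lexicon polarity in [-1, 1]}.
    Overall polarity is the weighted average over mentioned features."""
    num = sum(feature_weight(f) * s for f, s in feature_sentiments.items())
    den = sum(feature_weight(f) for f in feature_sentiments)
    return num / den if den else 0.0

print(review_polarity({"room": 0.8, "staff": -0.4}))  # > 0 => positive review
```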

16 pages, 4582 KiB  
Article
An Image Encryption Algorithm Based on Discrete-Time Alternating Quantum Walk and Advanced Encryption Standard
by Guangzhe Liu, Wei Li, Xingkui Fan, Zhuang Li, Yuxuan Wang and Hongyang Ma
Entropy 2022, 24(5), 608; https://doi.org/10.3390/e24050608 - 27 Apr 2022
Cited by 22 | Viewed by 2804
Abstract
This paper proposes an image encryption scheme based on a discrete-time alternating quantum walk (AQW) and the Advanced Encryption Standard (AES). We use quantum properties to improve the AES algorithm, employing a keystream generator related to the AQW parameters to generate a probability distribution matrix. Some singular values of the matrix are extracted as the key to the AES algorithm, and the Rcon of the AES algorithm is replaced with elements of the probability distribution matrix. Then, the mapping rules of the S-box and ShiftRow transformations in the AES algorithm are scrambled according to the ascending order of the elements of a cloned probability distribution matrix. The algorithm uses an XOR operation between the probability distribution matrix and the plaintext to complete the preprocessing and uses the modified AES algorithm to complete the encryption process. The scheme is verified by simulation, covering pixel correlation, histograms, differential attacks, noise attacks, information entropy, key sensitivity, and key space. The results demonstrate a remarkable encryption effect. Compared with other improved AES algorithms, this algorithm has the advantages of the original AES algorithm and improves the ability to resist correlation attacks.
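To give a feel for how a walk-derived distribution can seed a keystream, here is a simplified stand-in: a one-dimensional discrete-time Hadamard walk (the paper uses a two-dimensional alternating walk) whose final probability distribution is quantized into bytes and XORed with the pixels, mirroring the preprocessing step. Step count and quantization are illustrative.

```python
import numpy as np

def walk_distribution(steps):
    """Probability distribution of a 1-D discrete-time Hadamard walk,
    used here as a simplified stand-in for the paper's 2-D AQW."""
    n = 2 * steps + 1
    amp = np.zeros((n, 2), dtype=complex)         # amplitudes[position, coin]
    amp[steps] = [1 / np.sqrt(2), 1j / np.sqrt(2)]  # balanced initial state
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for _ in range(steps):
        amp = amp @ H.T                            # coin operation
        shifted = np.zeros_like(amp)
        shifted[:-1, 0] = amp[1:, 0]               # coin 0 steps left
        shifted[1:, 1] = amp[:-1, 1]               # coin 1 steps right
        amp = shifted
    return (np.abs(amp) ** 2).sum(axis=1)

# Quantize the distribution into keystream bytes and XOR with the pixels.
dist = walk_distribution(steps=64)
keystream = (dist * 1e8).astype(np.uint64) % 256
img = np.random.randint(0, 256, size=129, dtype=np.uint64)  # flattened pixels
cipher = img ^ keystream
assert np.all(cipher ^ keystream == img)  # XOR preprocessing is invertible
print(cipher[:8])
```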

23 pages, 1539 KiB  
Article
A Non-Uniform Continuous Cellular Automata for Analyzing and Predicting the Spreading Patterns of COVID-19
by Puspa Eosina, Aniati Murni Arymurthy and Adila Alfa Krisnadhi
Big Data Cogn. Comput. 2022, 6(2), 46; https://doi.org/10.3390/bdcc6020046 - 24 Apr 2022
Cited by 4 | Viewed by 3994
Abstract
During the COVID-19 outbreak, modeling the spread of infectious diseases became a challenging research topic due to the rapid spread and high mortality rate of the disease. The main objective of a standard epidemiological model is to estimate, via mathematical modeling, the numbers of people infected with, susceptible to, and recovered from the illness. Such a model does not capture how the disease transmits between neighboring regions through interaction. A more general framework such as Cellular Automata (CA) is required to accommodate more complex spatial interactions within the epidemiological model. The critical issue in modeling the spread of diseases is how to reduce the prediction error. This research aims to formulate the influence of neighborhood interactions on the spreading pattern of COVID-19 using a neighborhood frame model in a Cellular Automata (CA) approach and to obtain a predictive model for COVID-19 spread with reduced error. We propose a non-uniform continuous CA (N-CCA) as our contribution, to demonstrate the influence of interactions on the spread of COVID-19. The model succeeds in demonstrating the influence of interactions between regions on COVID-19 spread, as represented by the coefficients obtained from multiple regression models. The coefficient obtained represents the population’s behavior when interacting with its neighborhood in a cell and influences the number of cases that occur the next day. The N-CCA model is evaluated by the root mean square error (RMSE) of the difference between predicted and real case counts per cell in each region. This study demonstrates that the approach improves prediction accuracy 14 days into the future using data points from the past 42 days, compared to a baseline model.
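A toy NumPy sketch of a non-uniform continuous CA step: each cell's next-day value is a regression over its own state and its von Neumann neighbors, with cell-specific coefficients. The coefficients here are random placeholders, whereas the paper fits them per cell by multiple regression on observed case data.

```python
import numpy as np

rng = np.random.default_rng(6)
H, W = 10, 10
cases = rng.poisson(5, size=(H, W)).astype(float)
# Cell-specific ("non-uniform") regression coefficients, placeholders here.
coef_self = rng.uniform(0.8, 1.1, size=(H, W))
coef_nbr = rng.uniform(0.0, 0.05, size=(H, W))

def step(grid):
    """One CA update: regression over own state plus neighbor interactions."""
    padded = np.pad(grid, 1)
    nbr_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
               padded[1:-1, :-2] + padded[1:-1, 2:])   # von Neumann neighbors
    return np.maximum(coef_self * grid + coef_nbr * nbr_sum, 0.0)

forecast = cases
for _ in range(14):                  # predict 14 days ahead
    forecast = step(forecast)
# Evaluation would compare against held-out real cases per cell via RMSE.
rmse = np.sqrt(np.mean((forecast - cases) ** 2))
print("toy RMSE:", rmse)
```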
