Topic Editors

Center for Applied Intelligent Systems Research, Halmstad University, SE-301 18 Halmstad, Sweden
Department of Computer Science, University of Porto, R. Campo Alegre, 1021/1055, 4169-007 Porto, Portugal
Institute of Applied Computer Science, Jagiellonian University, 31-007 Krakow, Poland

Data Science and Knowledge Discovery

Abstract submission deadline: closed (31 December 2022)
Manuscript submission deadline: closed (31 March 2023)
Viewed by: 92,623

Topic Information

Dear Colleagues,

Data Science and Knowledge Discovery, often using Artificial Intelligence (AI), Machine Learning (ML) including Deep Learning (DL), and Data Mining (DM), are among the most exciting and rapidly growing research fields today. In recent years, they have been successfully used to solve practical problems in virtually every domain, such as engineering, healthcare, manufacturing, energy, transportation, education, and finance.

In this era of Big Data, considerable research focuses on designing efficient Data Science methods. Nonetheless, practical applications of Knowledge Discovery face several challenges, such as dealing with data that are too small or too large, missing and uncertain data, and highly multidimensional data, as well as the need for interpretable models that can provide trustworthy evidence and explanations for the predictions they make. There is therefore a need for methods that sift through large amounts of streaming data and extract useful high-level knowledge from them, with little or no human supervision. Other important research problems include learning and generalizing well from fewer training examples, efficient data and knowledge representation schemes, knowledge transfer between tasks and domains, and learning to adapt to varying contexts.

We invite authors from academia, industry, the public sector, and beyond to contribute high-quality papers to this Topic, including, but not limited to, novel methodological developments, experimental and comparative studies, surveys, application-oriented results, and advances in the fields of Data Science, Knowledge Discovery, Artificial Intelligence, and Machine Learning. Submitted papers can address any application related to the participating journals. This Topic welcomes technical, experimental, and methodological papers, whether theoretical, solving real-world problems, or proposing practically relevant systems, as well as general applications.

Prof. Dr. Sławomir Nowaczyk
Dr. Rita P. Ribeiro
Prof. Dr. Grzegorz Nalepa
Topic Editors

Keywords

  • data science
  • knowledge discovery
  • artificial intelligence
  • machine learning
  • deep learning
  • data mining
  • big data
  • active learning
  • explainable AI
  • neural networks
  • AI in healthcare
  • information retrieval
  • natural language processing
  • recommender systems
  • signal processing

Participating Journals

Journal Name                 Impact Factor   CiteScore   Launched   First Decision (median)   APC
Applied Sciences (applsci)        2.5           5.3        2011          17.8 days          CHF 2400
Electronics (electronics)         2.6           5.3        2012          16.8 days          CHF 2400
Information (information)         2.4           6.9        2010          14.9 days          CHF 1600
Mathematics (mathematics)         2.3           4.0        2013          17.1 days          CHF 2600
Sensors (sensors)                 3.4           7.3        2001          16.8 days          CHF 2600

Preprints.org is a multidisciplinary platform providing a preprint service dedicated to sharing your research from the start and empowering your research journey.

MDPI Topics is cooperating with Preprints.org and has built a direct connection between MDPI journals and Preprints.org. Authors are encouraged to enjoy the benefits by posting a preprint at Preprints.org prior to publication:

  1. Immediately share your ideas ahead of publication and establish your research priority;
  2. Protect your idea from being stolen with this time-stamped preprint article;
  3. Enhance the exposure and impact of your research;
  4. Receive feedback from your peers in advance;
  5. Have it indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (37 papers)

16 pages, 941 KiB  
Article
Continual Pre-Training of Language Models for Concept Prerequisite Learning with Graph Neural Networks
by Xin Tang, Kunjia Liu, Hao Xu, Weidong Xiao and Zhen Tan
Mathematics 2023, 11(12), 2780; https://doi.org/10.3390/math11122780 - 20 Jun 2023
Cited by 1 | Viewed by 2097
Abstract
Prerequisite chains are crucial to acquiring new knowledge efficiently. Many studies have been devoted to automatically identifying the prerequisite relationships between concepts from educational data. Though effective to some extent, these methods have neglected two key factors: most works have failed to utilize domain-related knowledge to enhance pre-trained language models, thus making the textual representation of concepts less effective; they also ignore the fusion of semantic information and structural information formed by existing prerequisites. We propose a two-stage concept prerequisite learning model (TCPL), to integrate the above factors. In the first stage, we designed two continual pre-training tasks for domain-adaptive and task-specific enhancement, to obtain better textual representation. In the second stage, to leverage the complementary effects of the semantic and structural information, we optimized the encoder of the resource–concept graph and the pre-trained language model simultaneously, with hinge loss as an auxiliary training objective. Extensive experiments conducted on three public datasets demonstrated the effectiveness of the proposed approach. Our proposed model improved by 7.9%, 6.7%, 5.6%, and 8.4% on ACC, F1, AP, and AUC on average, compared to the state-of-the-art methods. Full article
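
To make the hinge-loss objective concrete, here is a minimal sketch (not the authors' code; all names, dimensions, and layers are illustrative) of scoring a concept pair by fusing text and graph embeddings and training positive prerequisite pairs to outscore negatives by a margin:

    import torch
    import torch.nn as nn

    class PairScorer(nn.Module):
        """Scores a (concept_a, concept_b) prerequisite pair by fusing a text
        embedding (e.g., from a continually pre-trained LM) with a graph
        embedding (e.g., from a GNN over the resource-concept graph)."""
        def __init__(self, text_dim=768, graph_dim=128, hidden=256):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(2 * (text_dim + graph_dim), hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, text_a, graph_a, text_b, graph_b):
            pair = torch.cat([text_a, graph_a, text_b, graph_b], dim=-1)
            return self.fuse(pair).squeeze(-1)

    def hinge_loss(pos_scores, neg_scores, margin=1.0):
        # True prerequisite pairs should outscore corrupted pairs by a margin.
        return torch.clamp(margin - pos_scores + neg_scores, min=0).mean()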

17 pages, 2087 KiB  
Article
SFCA: A Scalable Formal Concepts Driven Architecture for Multi-Field Knowledge Graph Completion
by Xiaochun Sun, Chenmou Wu and Shuqun Yang
Appl. Sci. 2023, 13(11), 6851; https://doi.org/10.3390/app13116851 - 5 Jun 2023
Viewed by 1580
Abstract
With the proliferation of Knowledge Graphs (KGs), knowledge graph completion (KGC) has attracted much attention. Previous KGC methods focus on extracting shallow structural information from KGs, or combine it with external knowledge, especially commonsense concepts (generally, commonsense concepts are the basic concepts in related fields required for various tasks and academic research; in the general domain, for example, “Country” can be considered a commonsense concept owned by “China”), to predict missing links. However, the technology for extracting commonsense concepts from limited databases is immature, and the scarce commonsense databases are also bound to specific verticals (commonsense concepts vary greatly across verticals, which are the small fields subdivided vertically under a large field). Furthermore, most existing KGC models refine their performance on public KGs, making them inapplicable to real-world KGs. To address these limitations, we propose a novel Scalable Formal Concept-driven Architecture (SFCA) that automatically encodes factual triples into formal concepts, a superior structural feature that supplies rich information to knowledge graph embedding (KGE). Specifically, we first generate dense formal concepts, then yield a handful of entity-related formal concepts by sampling, and delimit the appropriate candidate entity range via the filtered formal concepts to improve KGC inference. Compared with commonsense concepts, KGC benefits from the more valuable information carried by formal concepts, and our self-supervised extraction method can be applied to any KG. Comprehensive experiments on five public datasets demonstrate the effectiveness and scalability of SFCA. In addition, the proposed architecture also achieves state-of-the-art performance on an industry dataset. This method provides a new idea for the promotion and application of knowledge graphs in downstream AI tasks in general and industrial fields. Full article

37 pages, 2773 KiB  
Review
Data Science Methods and Tools for Industry 4.0: A Systematic Literature Review and Taxonomy
by Helder Moreira Arruda, Rodrigo Simon Bavaresco, Rafael Kunst, Elvis Fernandes Bugs, Giovani Cheuiche Pesenti and Jorge Luis Victória Barbosa
Sensors 2023, 23(11), 5010; https://doi.org/10.3390/s23115010 - 23 May 2023
Cited by 7 | Viewed by 4051
Abstract
The Fourth Industrial Revolution, also named Industry 4.0, is leveraging several modern computing fields. Industry 4.0 comprises automated tasks in manufacturing facilities, which generate massive quantities of data through sensors. These data contribute to the interpretation of industrial operations in favor of managerial and technical decision-making. Data science supports this interpretation through extensive technological artifacts, particularly data processing methods and software tools. In this regard, the present article offers a systematic literature review of the methods and tools employed in distinct industrial segments, considering an investigation of different time series levels and data quality. The systematic methodology began by filtering 10,456 articles from five academic databases, of which 103 were selected for the corpus. The study then answered three general, two focused, and two statistical research questions to shape the findings. As a result, this research identified 16 industrial segments, 168 data science methods, and 95 software tools explored by studies in the literature. Furthermore, the research highlighted the use of diverse neural network subvariations and missing details in data composition. Finally, the article organizes these results into a taxonomy to synthesize a state-of-the-art representation and visualization, supporting future research in the field. Full article

18 pages, 2246 KiB  
Article
Feature-Alignment-Based Cross-Platform Question Answering Expert Recommendation
by Bin Tang, Qinqin Gao, Xin Cui, Qinglong Peng and Xu Yu
Mathematics 2023, 11(9), 2174; https://doi.org/10.3390/math11092174 - 5 May 2023
Cited by 3 | Viewed by 1437
Abstract
Community question answering (CQA), with its flexible user interaction characteristics, is gradually becoming a new knowledge-sharing platform that allows people to acquire knowledge and share experiences. The number of questions is rapidly increasing with the open registration of communities and the massive influx of users, which makes it impossible to match many questions to suitable question answering experts (referred to as experts) in a timely manner. Therefore, expert recommendation in CQA is of great importance. Existing expert recommendation algorithms only use data from a single platform, which is not ideal for new CQA platforms with sparse historical interactions and small numbers of questions and users. Considering that many mature CQA platforms (source platforms) have rich historical interaction data and large numbers of questions and experts, this paper fully mines that information and transfers it to new platforms with sparse data (target platforms), which can effectively alleviate the data sparsity problem. However, the feature composition of questions and experts differs across platforms, so data from the source platform cannot be directly used for training on the target platform. Therefore, this paper proposes feature-alignment-based cross-platform question answering expert recommendation (FA-CPQAER), which aligns expert and question features while transferring data. First, we use a rating predictor composed of a BP network for expert recommendation within each domain, and then match the features of questions and experts across the two domains by similarity calculation, so that information from the source platform can assist expert recommendation in the target platform. Meanwhile, we train a stacked denoising autoencoder (SDAE) in both domains, which maps user and question features to the same dimension and aligns the data distributions. Extensive experiments are conducted on two real CQA datasets, Toutiao and Zhihu, and the results show that, compared to other advanced expert recommendation algorithms, our method achieves better results on the evaluation metrics of MAE, RMSE, Accuracy, and Recall, which fully demonstrates its effectiveness in solving the data sparsity problem in expert recommendation. Full article
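
As an illustration of the alignment step, the following is a hedged sketch of a single denoising-autoencoder layer that maps platform-specific features to a shared latent dimension; the dimensions and noise level are assumptions, not the paper's settings:

    import torch
    import torch.nn as nn

    class DenoisingAutoencoder(nn.Module):
        """One SDAE layer: corrupt the input with Gaussian noise, encode it to
        a shared latent dimension, and reconstruct the clean input."""
        def __init__(self, in_dim, latent_dim, noise_std=0.1):
            super().__init__()
            self.noise_std = noise_std
            self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
            self.decoder = nn.Linear(latent_dim, in_dim)

        def forward(self, x):
            corrupted = x + self.noise_std * torch.randn_like(x)
            z = self.encoder(corrupted)          # shared-dimension code
            return self.decoder(z), z

    # One autoencoder per platform, sharing latent_dim so that source and
    # target features land in comparable spaces.
    sdae_src = DenoisingAutoencoder(in_dim=300, latent_dim=64)
    sdae_tgt = DenoisingAutoencoder(in_dim=120, latent_dim=64)
    x = torch.randn(32, 300)                     # toy source-platform batch
    recon, z = sdae_src(x)
    loss = nn.MSELoss()(recon, x)                # reconstruction objective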

21 pages, 2799 KiB  
Article
Fuzzy Discretization on the Multinomial Naïve Bayes Method for Modeling Multiclass Classification of Corn Plant Diseases and Pests
by Yulia Resti, Chandra Irsan, Adinda Neardiaty, Choirunnisa Annabila and Irsyadi Yani
Mathematics 2023, 11(8), 1761; https://doi.org/10.3390/math11081761 - 7 Apr 2023
Cited by 7 | Viewed by 1903
Abstract
As an agricultural commodity, corn functions as food, animal feed, and industrial raw material. Diseases and pests therefore pose a major challenge to corn production. Modeling the classification of corn plant diseases and pests from digital images is essential for developing an information technology-based early detection system. Such early detection technology helps lower farmers’ losses, and a detection system based on digital images is also cost-effective. This paper aims to model the classification of corn plant diseases and pests from digital images by implementing fuzzy discretization. Discretization is an essential technique for improving knowledge extraction from continuous data, and it is required by methods that can only process discrete data. Fuzzy discretization allows classes to have overlapping intervals, so it can handle information that is vague or unclear. We developed hypotheses and showed that different combinations of membership functions in fuzzy discretization affect classification performance. An empirical assessment using Monte Carlo resampling was carried out to assess the generalizability of the best classification model among all proposed models. The best model is determined based on the number of metrics on which it achieves the highest value, with particular weight given to the F-score and Kappa, which are multiclass measures. The combination of digital image preprocessing and classification methods also affects the performance of the classification model. We hope this work provides an overview for experts building early detection systems for corn plant diseases and pests using classification models based on fuzzy discretization. Full article
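
For intuition, below is a minimal sketch of fuzzy discretization with triangular membership functions (one common choice; the paper compares several membership-function combinations). Each continuous value is spread over overlapping bins, and the resulting soft memberships can act as fractional counts for a multinomial naive Bayes model:

    import numpy as np

    def triangular_memberships(x, centers):
        """Fuzzy-discretize values in x into len(centers) overlapping bins
        using triangular membership functions; each row sums to 1."""
        x = np.asarray(x, dtype=float)
        c = np.asarray(centers, dtype=float)
        member = np.zeros((len(x), len(c)))
        for j in range(len(c)):
            left = c[j - 1] if j > 0 else c[0] - (c[1] - c[0])
            right = c[j + 1] if j < len(c) - 1 else c[-1] + (c[-1] - c[-2])
            member[:, j] = np.clip(np.minimum((x - left) / (c[j] - left),
                                              (right - x) / (right - c[j])), 0, 1)
        return member / (member.sum(axis=1, keepdims=True) + 1e-12)

    # Overlapping soft bins instead of hard cut points:
    print(triangular_memberships([0.2, 1.4, 2.9], centers=[0.0, 1.5, 3.0]))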

26 pages, 1650 KiB  
Article
Data Analysis for Information Discovery
by Alberto Amato and Vincenzo Di Lecce
Appl. Sci. 2023, 13(6), 3481; https://doi.org/10.3390/app13063481 - 9 Mar 2023
Cited by 1 | Viewed by 1438
Abstract
Artificial intelligence applications are becoming increasingly popular and are producing better results in many areas of research. The quality of the results depends on the quantity of data and its information content. In recent years, the amount of data available has increased significantly, but more data does not always mean more information and therefore better results. The aim of this work is to evaluate the effects of a new data preprocessing method for machine learning. This method was designed for sparse matrix approximation and is called semi-pivoted QR approximation (SPQR). To the best of our knowledge, it has never been applied to data preprocessing for machine learning algorithms. The method works as a feature selection algorithm, and in this work we evaluate its effects on the performance of an unsupervised clustering algorithm. The obtained results are compared to those obtained using principal component analysis (PCA) as the preprocessing algorithm. The two methods were applied to various publicly available datasets. The results show that the SPQR algorithm can achieve results comparable to those of PCA without introducing any transformation of the original dataset. Full article
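
A rough sketch of the comparison, using SciPy's standard column-pivoted QR as a stand-in for the semi-pivoted variant (the actual SPQR algorithm chooses pivots differently): QR pivoting selects original columns, whereas PCA projects onto transformed components:

    import numpy as np
    from scipy.linalg import qr
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 30))     # toy data: samples x features
    k = 5

    # Column-pivoted QR ranks columns by how much new variation each adds;
    # keeping the first k pivots selects k original, untransformed features.
    _, _, pivots = qr(X, mode='economic', pivoting=True)
    X_qr = X[:, pivots[:k]]

    # PCA, by contrast, replaces the features with k linear combinations.
    X_pca = PCA(n_components=k).fit_transform(X)
    print(X_qr.shape, X_pca.shape)         # (200, 5) (200, 5)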

17 pages, 647 KiB  
Article
Sufficient Networks for Computing Support of Graph Patterns
by Natalia Vanetik
Information 2023, 14(3), 143; https://doi.org/10.3390/info14030143 - 21 Feb 2023
Viewed by 1426
Abstract
Graph mining is the process of extracting and analyzing patterns from graph data. A graph is a data structure consisting of a set of nodes and a set of edges that connect these nodes; graphs are often used to represent real-world entities and the relationships between them. In a graph database, the importance of a pattern (also known as its support) must be quantified using a counting function called a support measure. This function must adhere to several constraints, such as antimonotonicity, which forbids a pattern from having greater support than its sub-patterns. These constraints make defining and computing support measures highly non-trivial and computationally expensive. In this paper, I use the previously discovered relationship between support measures in graph databases and flows in networks of subgraph appearances to simplify the computation of support measures. I show that the network of pattern instances can be pruned to contain only particular kinds of patterns and prove that any legitimate support measure in graph databases can adopt this strategy. Experimental evaluation demonstrates that the reduction in network size is significant when the suggested method is utilized. Full article
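
As a concrete example of an antimonotonic support measure (the well-known minimum-image-based support, not the paper's network-flow construction), consider:

    def mni_support(embeddings):
        """Minimum-image-based support: given all embeddings of a pattern
        (each a dict mapping pattern node -> database-graph node), take, for
        every pattern node, the number of distinct graph nodes it maps to,
        and return the minimum. This measure is antimonotonic: a pattern can
        never have larger support than its sub-patterns."""
        if not embeddings:
            return 0
        images = {v: set() for v in embeddings[0]}
        for emb in embeddings:
            for v, node in emb.items():
                images[v].add(node)
        return min(len(s) for s in images.values())

    # Two embeddings of an edge pattern (a)-(b) into a database graph:
    print(mni_support([{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]))  # -> 1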

16 pages, 5204 KiB  
Article
Multidimensional Domain Knowledge Framework for Poet Profiling
by Ai Zhou, Yijia Zhang and Mingyu Lu
Electronics 2023, 12(3), 656; https://doi.org/10.3390/electronics12030656 - 28 Jan 2023
Viewed by 1642
Abstract
Authorship profiling is a subtask of authorship identification. This task can be regarded as an analysis of personal writing styles, which has been widely investigated. However, no previous studies have attempted to analyze the authorship of classical Chinese poetry. First, we provide an approach to evaluate the popularity of poets, and we also establish a public corpus containing the top 20 most popular poets in the Tang Dynasty for authorship profiling. Then, a novel poetry authorship profiling framework named multidimensional domain knowledge poet profiling (M-DKPP) is proposed, combining the knowledge of authorship attribution and the text’s stylistic features with domain knowledge described by experts in traditional poetry studies. A case study for Li Bai is used to prove the validity and applicability of our framework. Finally, the performance of M-DKPP framework is evaluated with four poem datasets. On all datasets, the proposed framework outperforms several baseline approaches for authorship attribution. Full article
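
For readers unfamiliar with authorship attribution baselines, here is a hedged sketch of a common stylometric pipeline (character n-grams plus a linear SVM); the corpus strings are placeholders, and this is not the M-DKPP framework itself:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy corpus: poems (placeholder strings) labeled by poet.
    poems = ["moon over the river tower", "wine cup under bright moon",
             "frontier winds and war drums", "autumn geese over the pass"]
    poets = ["Li Bai", "Li Bai", "Du Fu", "Du Fu"]

    # Character n-grams are a standard stylometric feature for Chinese text,
    # since they require no word segmentation.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        LinearSVC(),
    )
    model.fit(poems, poets)
    print(model.predict(["moon and wine by the river"]))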

20 pages, 1082 KiB  
Article
Self-Attentive Subset Learning over a Set-Based Preference in Recommendation
by Kunjia Liu, Yifan Chen, Jiuyang Tang, Hongbin Huang and Lihua Liu
Appl. Sci. 2023, 13(3), 1683; https://doi.org/10.3390/app13031683 - 28 Jan 2023
Cited by 1 | Viewed by 1352
Abstract
Recommender systems that learn user preference from item-level feedback (provided to individual items) have been extensively studied. Considering the risk of privacy exposure, learning from set-level feedback (provided to sets of items) has been demonstrated to be a better option, since set-level feedback reveals user preferences while, to some extent, hiding private details. Since only set-level feedback is available as a supervision signal, different methods have been investigated to build connections between set-based and item-based preferences. However, they overlook the complexity of user behavior in real-world applications. Instead, we observe that users’ set-level preferences can be better modeled based on a subset of items in the original set. To this end, we propose to tackle the problem of identifying subsets from sets of items for set-based preference learning. We propose a policy network that explicitly learns a personalized subset selection strategy for each user. Given the complex correlation between items in the set-rating process, we introduce a self-attention module to ensure all set members are considered in the subset selection process. Furthermore, we introduce the Gumbel softmax to avoid the vanishing gradients caused by binary selection during model learning. Finally, the selected items are aggregated using user-specific personalized positional weights. Empirical evaluation on real-world datasets verifies the superiority of the proposed model over the state-of-the-art. Full article
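
A minimal sketch of the Gumbel-softmax subset selection idea (illustrative only; the paper's scorer is attention-based, and the sampling details may differ):

    import torch
    import torch.nn.functional as F

    def select_subset(item_emb, scorer, k, tau=0.5):
        """Differentiably pick k items from a set.
        item_emb: (n, d) embeddings of the items in the rated set.
        scorer:   module mapping (n, d) -> (n,) selection logits."""
        logits = scorer(item_emb)                   # (n,)
        picks = []
        for _ in range(k):
            # hard=True yields a one-hot sample; gradients flow through the
            # straight-through Gumbel-softmax estimator.
            onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
            picks.append(onehot @ item_emb)         # (d,) selected item
        return torch.stack(picks)                   # (k, d)

    scorer = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Flatten(0))
    subset = select_subset(torch.randn(10, 16), scorer, k=3)
    print(subset.shape)                             # torch.Size([3, 16])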

21 pages, 2048 KiB  
Article
Estimation of Critical Collapse Solutions to Black Holes with Nonlinear Statistical Models
by Ehsan Hatefi and Armin Hatefi
Mathematics 2022, 10(23), 4537; https://doi.org/10.3390/math10234537 - 30 Nov 2022
Cited by 5 | Viewed by 1688
Abstract
The self-similar gravitational collapse solutions of the Einstein-axion–dilaton system have already been discovered. These solutions are invariant under the combination of spacetime dilation with internal SL(2, R) transformations. We apply nonlinear statistical models to estimate the functions that appear in the physics of black holes of the axion–dilaton system in four dimensions. These statistical models include parametric polynomial regression, nonparametric kernel regression, and semi-parametric local polynomial regression models. Through various numerical studies, we obtain accurate numerical and closed-form, continuously differentiable estimates for the functions appearing in the metric and the equations of motion. Full article
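
As a pointer to one of the three model classes used, a minimal Nadaraya-Watson kernel regression estimator (a standard nonparametric kernel regressor, with toy data in place of the collapse functions) looks like this:

    import numpy as np

    def nw_kernel_regression(x_train, y_train, x_eval, bandwidth=0.1):
        """Nadaraya-Watson estimator with a Gaussian kernel: a weighted
        average of observed responses, weighted by distance to x_eval."""
        d = (x_eval[:, None] - x_train[None, :]) / bandwidth
        w = np.exp(-0.5 * d**2)
        return (w @ y_train) / w.sum(axis=1)

    # Toy example: recover a smooth function from noisy samples.
    x = np.linspace(0, 1, 200)
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).standard_normal(200)
    y_hat = nw_kernel_regression(x, y, x, bandwidth=0.05)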

19 pages, 712 KiB  
Article
Hybrid Random Forest Survival Model to Predict Customer Membership Dropout
by Pedro Sobreiro, José Garcia-Alonso, Domingos Martinho and Javier Berrocal
Electronics 2022, 11(20), 3328; https://doi.org/10.3390/electronics11203328 - 15 Oct 2022
Viewed by 2215
Abstract
Dropout prediction is a problem that must be addressed in various organizations, as retaining customers is generally more profitable than attracting new ones. Existing approaches treat the problem with a dependent variable representing dropout or non-dropout, without considering the dynamic perspective that dropout risk changes over time. To solve this problem, we explore the use of random survival forests combined with clusters, in order to evaluate whether prediction performance improves. Model performance was determined using the concordance probability, the Brier score, and the prediction error, considering 5200 customers of a health club. Our results show that prediction performance in the survival models increased substantially in the models using clusters compared with those without, with a statistically significant difference between the models. The hybrid approach improved the accuracy of the survival model, providing support for developing countermeasures that consider the period in which dropout is likely to occur. Full article
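
A hedged sketch of the hybrid idea on synthetic data, assuming the scikit-survival and scikit-learn APIs (this is not the authors' implementation; the cluster-model coupling is simplified here to a cluster covariate):

    import numpy as np
    from sklearn.cluster import KMeans
    from sksurv.ensemble import RandomSurvivalForest
    from sksurv.util import Surv

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 6))             # toy customer features
    time = rng.exponential(24, 500)               # months until dropout
    event = rng.random(500) < 0.7                 # True = observed dropout
    y = Surv.from_arrays(event=event, time=time)

    # Hybrid step: add a cluster label as an extra covariate, then fit a
    # random survival forest on the augmented features.
    clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    X_aug = np.column_stack([X, clusters])
    rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X_aug, y)
    print(rsf.score(X_aug, y))                    # Harrell's concordance index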

19 pages, 2453 KiB  
Article
Personalized Search Using User Preferences on Social Media
by Kyoungsoo Bok, Jinwoo Song, Jongtae Lim and Jaesoo Yoo
Electronics 2022, 11(19), 3049; https://doi.org/10.3390/electronics11193049 - 24 Sep 2022
Cited by 2 | Viewed by 2909
Abstract
In contrast to traditional web search, personalized search provides search results that take into account the user’s preferences. However, existing personalized search methods are limited in providing search results appropriate to an individual’s preferences, because they do not consider the user’s recent preferences or the preferences of other users. In this paper, we propose a new search method that considers the user’s recent preferences and similar users’ preferences based on social media analysis. Since users express personal opinions on social media, their preferences can be grasped by analyzing their social media activity records. The proposed method collects user social activity records and determines keywords of interest using TF-IDF. Since user preferences change continuously over time, we assign time weights to keywords of interest, giving higher values to the most recent user preferences. Because considering only a single user’s preferences in personalized search can produce narrow results, we identify users with similar preferences to extend the search results provided to the user. The proposed method delivers personalized search results that reflect social characteristics by applying a ranking algorithm that considers similar users’ preferences as well as the user’s own. Various performance evaluations show that the proposed personalized search method outperforms the existing methods. Full article
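
To illustrate the time-weighting step, here is a minimal sketch (toy posts and an assumed half-life, not the paper's weighting scheme) of exponentially decayed TF-IDF keyword profiling:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    posts = ["hiking boots trail review", "new camera lens photos",
             "mountain trail photos", "trail running shoes"]
    ages_days = np.array([300, 120, 30, 2])          # age of each post

    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(posts).toarray()

    # Exponential time decay: recent posts contribute more to the profile.
    half_life = 60.0
    w = 0.5 ** (ages_days / half_life)
    profile = (w[:, None] * tfidf).sum(axis=0)

    terms = np.array(vec.get_feature_names_out())
    print(terms[np.argsort(profile)[::-1][:3]])      # top keywords of interest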

21 pages, 7905 KiB  
Article
Ontology-Based Linked Data to Support Decision-Making within Universities
by Ghadeer Ashour, Ahmed Al-Dubai, Imed Romdhani and Daniyal Alghazzawi
Mathematics 2022, 10(17), 3148; https://doi.org/10.3390/math10173148 - 2 Sep 2022
Cited by 3 | Viewed by 1736
Abstract
In recent years, educational institutions have worked hard to automate their operations using trending technologies that have proven successful in supporting decision-making processes. Many of the decisions in educational institutions rely on rating the academic research profiles of their staff. An enormous amount of scholarly data is produced continuously by online libraries, containing data about publications, citations, and research activities. This kind of data can improve the accuracy of academic decisions if linked with the local data of universities. In this study, the linked data technique is applied to generate a link between university semantic data and a scientific knowledge graph, to enrich the local data and improve academic decisions. As a proof of concept, a case study was conducted to allocate the best academic staff member to teach a course according to their profile, including research records. Furthermore, the resulting data are available for reuse in the future for different purposes in the academic domain. Finally, we compared the results of this link with previous work, as evidence of the accuracy of leveraging this technology to improve decisions within universities. Full article

14 pages, 537 KiB  
Article
Discriminating Pattern Mining for Diagnosing Reading Disorders
by Fabio Fassetti and Ilaria Fassetti
Appl. Sci. 2022, 12(15), 7540; https://doi.org/10.3390/app12157540 - 27 Jul 2022
Viewed by 1358
Abstract
Tachistoscopes are devices that display a word for several seconds and ask the user to write down the word. They have been widely employed to increase recognition speed, to increase reading comprehension and, especially, to identify reading difficulties and disabilities. Once the therapist is provided with the patients’ answers, a challenging problem is analyzing the strings to identify common patterns in the erroneous ones that could raise suspicion of related disabilities. In this direction, this work presents a machine learning technique aimed at mining exceptional string patterns, designed precisely to tackle the above-mentioned problem. The technique is based on non-negative matrix factorization (NMF) and exploits as features the structure of the words in terms of the letters composing them. To the best of our knowledge, this is the first attempt to mine tachistoscope answers to discover intrinsic peculiarities of the words possibly involved in reading disabilities. From the technical point of view, we present a novel variant of NMF methods with the additional goal of discriminating between sets. The technique has been evaluated in a real case study with the help of an Italian speech therapy center that collaborated on this work. Full article

29 pages, 31681 KiB  
Article
A Study on the Geometric and Kinematic Descriptors of Trajectories in the Classification of Ship Types
by Yashar Tavakoli, Lourdes Peña-Castillo and Amilcar Soares
Sensors 2022, 22(15), 5588; https://doi.org/10.3390/s22155588 - 26 Jul 2022
Cited by 3 | Viewed by 2302
Abstract
The classification of ships based on their trajectory descriptors is a common practice that is helpful in various contexts, such as maritime security and traffic management. For the most part, the descriptors are either geometric, which capture the shape of a ship’s trajectory, or kinematic, which capture the motion properties of a ship’s movement. Understanding the implications of the type of descriptor that is used in classification is important for feature engineering and model interpretation. However, this matter has not yet been deeply studied. This article contributes to feature engineering within this field by introducing proper similarity measures between the descriptors and defining sound benchmark classifiers, based on which we compared the predictive performance of geometric and kinematic descriptors. The performance profiles of geometric and kinematic descriptors, along with several standard tools in interpretable machine learning, helped us provide an account of how different ships differ in movement. Our results indicated that the predictive performance of geometric and kinematic descriptors varied greatly, depending on the classification problem at hand. We also showed that the movement of certain ship classes solely differed geometrically while some other classes differed kinematically and that this difference could be formulated in simple terms. On the other hand, the movement characteristics of some other ship classes could not be delineated along these lines and were more complicated to express. Finally, this study verified the conjecture that the geometric–kinematic taxonomy could be further developed as a tool for more accessible feature selection. Full article
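
For concreteness, here is a toy sketch of the geometric/kinematic split for a 2-D trajectory (the descriptor choices are generic examples, not the paper's exact feature set):

    import numpy as np

    def trajectory_descriptors(xy, t):
        """Toy geometric vs. kinematic descriptors for a 2-D trajectory.
        xy: (n, 2) positions; t: (n,) timestamps in seconds."""
        seg = np.diff(xy, axis=0)
        seg_len = np.linalg.norm(seg, axis=1)
        speed = seg_len / np.diff(t)
        heading = np.arctan2(seg[:, 1], seg[:, 0])
        turn = np.abs(np.diff(np.unwrap(heading)))
        return {
            # geometric: shape only, invariant to traversal speed
            "straightness": np.linalg.norm(xy[-1] - xy[0]) / seg_len.sum(),
            "total_turning": turn.sum(),
            # kinematic: motion properties
            "mean_speed": speed.mean(),
            "speed_std": speed.std(),
        }

    t = np.linspace(0, 100, 50)
    xy = np.column_stack([np.cos(t / 20), np.sin(t / 20)])
    print(trajectory_descriptors(xy, t))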

15 pages, 2080 KiB  
Article
Time-Weighted Community Search Based on Interest
by Jing Liu and Yong Zhong
Appl. Sci. 2022, 12(14), 7077; https://doi.org/10.3390/app12147077 - 13 Jul 2022
Cited by 1 | Viewed by 1495
Abstract
Community search aims to provide users with personalized community query services. It is a prerequisite for various recommendation systems and has received widespread attention from academia and industry. The existing literature has established various community search models and algorithms from different dimensions of social networks. Unfortunately, they only judge the representative attributes of users according to the frequency of attribute keywords, completely ignoring the temporal characteristics of keywords. It is clear that a user’s interest changes over time, so it is essential to select users’ representative attributes in combination with time. Therefore, we propose a time-weighted community search model (TWC) based on user interests which fully considers the impact of time on user interests. TWC reduces the number of query parameters as much as possible and improves the usability of the model. We design the time-weighted decay function of the attribute. We then extract the user’s time-weighted representative attributes to express the user’s short-term interests more clearly in the query window. In addition, we propose a new attribute similarity scoring function and a community scoring function. To solve the TWC problem, we design and implement the Local Extend algorithm and the Shrink algorithm. Finally, we conduct extensive experiments on a real dataset to verify the superiority of the TWC model and the efficiency of the proposed algorithm. Full article

14 pages, 396 KiB  
Article
A Survey of Big Data Archives in Time-Domain Astronomy
by Manoj Poudel, Rashmi P. Sarode, Yutaka Watanobe, Maxim Mozgovoy and Subhash Bhalla
Appl. Sci. 2022, 12(12), 6202; https://doi.org/10.3390/app12126202 - 18 Jun 2022
Cited by 6 | Viewed by 2667
Abstract
The rise of big data has resulted in the proliferation of numerous heterogeneous data stores. Even though multiple models are used for integrating these data, combining such huge amounts of data into a single model remains challenging. Database management archives need to manage huge volumes of data without any particular structure, coming from unconnected and unrelated sources. These data are growing in size and thus demand special attention. The speed at which these data grow, as well as the varied data types stored in scientific archives, poses further challenges. Astronomy, too, is increasingly becoming a science based on extensive processing of assorted data, which are now stored in domain-specific archives. Many astronomical studies produce large-scale archives of data, which are then published as data repositories. These mainly consist of images and text without any structure, in addition to data with some structure, such as relations with key values. When the archives are published as remote data repositories, organizing the data against their increasing diversity and meeting users’ information demands is challenging. To address this problem, polystore systems present a new model of data integration and have been proposed to access unrelated data repositories using a single uniform query language. This article highlights the polystore system for integrating large-scale heterogeneous data in the astronomy domain. Full article

22 pages, 3870 KiB  
Article
Improved Boundary Support Vector Clustering with Self-Adaption Support
by Huina Li, Yuan Ping, Bin Hao, Chun Guo and Yujian Liu
Electronics 2022, 11(12), 1854; https://doi.org/10.3390/electronics11121854 - 11 Jun 2022
Cited by 2 | Viewed by 1476
Abstract
Concerning the good description of arbitrarily shaped clusters, collecting accurate support vectors (SVs) is critical yet resource-consuming for support vector clustering (SVC). Even though SVs can be extracted from the boundaries for efficiency, boundary patterns with too much noise and inappropriate parameter settings, such as the kernel width, also confuse the connectivity analysis. Thus, we propose an improved boundary SVC (IBSVC) with self-adaption support for reasonable boundaries and comfortable parameters. The first self-adaption is in the movable edge selection (MES). By introducing a divide-and-conquer strategy with the k-means++ support, it collects local, informative, and reasonable edges for the minimal hypersphere construction while rejecting pseudo-borders and outliers. Rather than the execution of model learning with repetitive training and evaluation, we fuse the second self-adaption with the flexible parameter selection (FPS) for direct model construction. FPS automatically selects the kernel width to meet a conformity constraint, which is defined by measuring the difference between the data description drawn by the model and the actual pattern. Finally, IBSVC adopts a convex decomposition-based strategy to finish cluster checking and labeling even though there is no prior knowledge of the cluster number. Theoretical analysis and experimental results confirm that IBSVC can discover clusters with high computational efficiency and applicability. Full article

13 pages, 2178 KiB  
Article
An Efficient AdaBoost Algorithm with the Multiple Thresholds Classification
by Yi Ding, Hongyang Zhu, Ruyun Chen and Ronghui Li
Appl. Sci. 2022, 12(12), 5872; https://doi.org/10.3390/app12125872 - 9 Jun 2022
Cited by 37 | Viewed by 4796
Abstract
Adaptive boosting (AdaBoost) is a prominent example of an ensemble learning algorithm that combines weak classifiers into strong classifiers through weighted majority voting rules. AdaBoost’s weak classifier, with threshold classification, tries to find the best threshold in one of the data dimensions, dividing the data into two categories, −1 and +1. However, in some cases this weak learning algorithm is not accurate enough, showing poor generalization performance and a tendency to over-fit. To address these challenges, we first propose a new weak learning algorithm that classifies examples based on multiple thresholds, rather than only one, to improve its accuracy. Second, we modify the weight allocation scheme of the weak learning algorithm within AdaBoost to use the potential values of other dimensions in the classification process, and we provide a theoretical justification of its generality. Finally, comparative experiments between the two algorithms on 18 UCI datasets show that our improved AdaBoost algorithm achieves better generalization on the test set during the training iterations. Full article
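
For reference, here is a compact sketch of the baseline being improved: AdaBoost with a single-threshold decision stump (the paper's contribution replaces this weak learner with a multi-threshold variant and a modified weight allocation scheme):

    import numpy as np

    def fit_stump(X, y, w):
        """Single-threshold weak learner: choose the (feature, threshold,
        polarity) with the lowest weighted error."""
        best_err, best = np.inf, None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= thr, -sign, sign)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thr, sign)
        return best_err, best

    def adaboost(X, y, rounds=10):
        """Classic AdaBoost; labels y must be in {-1, +1}."""
        w = np.full(len(y), 1.0 / len(y))
        ensemble = []
        for _ in range(rounds):
            err, (j, thr, sign) = fit_stump(X, y, w)
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            pred = np.where(X[:, j] <= thr, -sign, sign)
            w *= np.exp(-alpha * y * pred)   # upweight misclassified points
            w /= w.sum()
            ensemble.append((alpha, j, thr, sign))
        return ensemble

    def predict(ensemble, X):
        votes = sum(a * np.where(X[:, j] <= t, -s, s) for a, j, t, s in ensemble)
        return np.sign(votes)                # weighted majority vote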

18 pages, 5691 KiB  
Article
ASAD: Adaptive Seasonality Anomaly Detection Algorithm under Intricate KPI Profiles
by Hao Wang, Yuanyuan Zhang, Yijia Liu, Fenglin Liu, Hanyang Zhang, Bin Xing, Minghai Xing, Qiong Wu and Liangyin Chen
Appl. Sci. 2022, 12(12), 5855; https://doi.org/10.3390/app12125855 - 8 Jun 2022
Viewed by 2129
Abstract
Anomaly detection is the foundation of intelligent operation and maintenance (O&M), and detection objects are evaluated by key performance indicators (KPIs). For almost all computer O&M systems, KPIs are usually the machine-level operating data. Moreover, these high-frequency KPIs show a non-Gaussian distribution and are hard to model, i.e., they are intricate KPI profiles. However, existing anomaly detection techniques are incapable of adapting to intricate KPI profiles. In order to enhance the performance under intricate KPI profiles, this study presents a seasonal adaptive KPI anomaly detection algorithm ASAD (Adaptive Seasonality Anomaly Detection). We also propose a new eBeats clustering algorithm and calendar-based correlation method to further reduce the detection time and error. Through experimental tests, our ASAD algorithm has the best overall performance compared to other KPI anomaly detection methods. Full article

24 pages, 11755 KiB  
Article
A Novel Unified Data Modeling Method for Equipment Lifecycle Integrated Logistics Support
by Xuemiao Cui, Jiping Lu and Yafeng Han
Sensors 2022, 22(11), 4265; https://doi.org/10.3390/s22114265 - 3 Jun 2022
Viewed by 2920
Abstract
Integrated logistics support (ILS) is of great significance for maintaining equipment operational capability in the whole lifecycle. Numerous segments and complex product objects exist in the process of equipment ILS, which gives ILS data multi-source, heterogeneous, and multidimensional characteristics. The present ILS data cannot satisfy the demand for efficient utilization. Therefore, the unified modeling of ILS data is extremely urgent and significant. In this paper, a unified data modeling method is proposed to solve the consistent and comprehensive expression problem of ILS data. Firstly, a four-tier unified data modeling framework is constructed based on the analysis of ILS data characteristics. Secondly, the Core unified data model, Domain unified data model, and Instantiated unified data model are built successively. Then, the expressions of ILS data in the three dimensions of time, product, and activity are analyzed. Thirdly, the Lifecycle ILS unified data model is constructed, and the multidimensional information retrieval methods are discussed. Based on these, different systems in the equipment ILS process can share a set of data models and provide ILS designers with relevant data through different views. Finally, the practical ILS data models are constructed based on the developed unified data modeling software prototype, which verifies the feasibility of the proposed method. Full article

18 pages, 1181 KiB  
Article
A Generalized Family of Exponentiated Composite Distributions
by Bowen Liu and Malwane M. A. Ananda
Mathematics 2022, 10(11), 1895; https://doi.org/10.3390/math10111895 - 1 Jun 2022
Cited by 3 | Viewed by 2299
Abstract
In this paper, we propose a new family of distributions obtained by exponentiating the random variables associated with the probability density functions of composite distributions. We also derive some mathematical properties of this new family, including the moments and the limited moments. Two special models in this family are discussed in detail. Three real datasets were chosen to assess the performance of these two special exponentiated-composite models. When fitted to these three datasets, the two exponentiated-composite distributions demonstrated significantly better performance compared to the original composite distributions. Full article
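
For orientation, the standard exponentiation construction (a hedged sketch; the paper's family may differ in its details) takes a composite cdf F with density f and a parameter \eta > 0 and sets

    G(x) = \left[F(x)\right]^{\eta}, \qquad
    g(x) = \frac{d}{dx}\,G(x) = \eta\,\left[F(x)\right]^{\eta-1} f(x),

so that \eta = 1 recovers the original composite distribution.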

28 pages, 10531 KiB  
Article
Heterogeneous Ensemble Deep Learning Model for Enhanced Arabic Sentiment Analysis
by Hager Saleh, Sherif Mostafa, Abdullah Alharbi, Shaker El-Sappagh and Tamim Alkhalifah
Sensors 2022, 22(10), 3707; https://doi.org/10.3390/s22103707 - 12 May 2022
Cited by 44 | Viewed by 4390
Abstract
Sentiment analysis was nominated as a hot research topic a decade ago for its increasing importance in analyzing people’s opinions extracted from social media platforms. Although the Arabic language has a significant share of the content shared across social media platforms, sentiment analysis of Arabic content is still limited due to its complex morphological structures and variety of dialects. Traditional machine learning and deep neural algorithms have been used in a variety of studies to predict sentiment. There is therefore a need to change current mechanisms to increase the accuracy of sentiment analysis prediction. This paper proposes an optimized heterogeneous stacking ensemble model for enhancing the performance of Arabic sentiment analysis. The proposed model combines three different pre-trained Deep Learning (DL) models: a Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and a Gated Recurrent Unit (GRU), in conjunction with three meta-learners: Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM), in order to enhance the model’s performance in predicting Arabic sentiment. The performance of the proposed model with RNN, LSTM, GRU, and five standard ML techniques: Decision Tree (DT), LR, K-Nearest Neighbor (KNN), RF, and Naive Bayes (NB), is compared using three benchmark Arabic datasets. The parameters of the ML and DL models are optimized using grid search and KerasTuner, respectively. Accuracy, precision, recall, and F1-score are used to evaluate the performance of the models and validate the results. The results show that the proposed ensemble model achieved the best performance on each dataset compared with the other models. Full article
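
As a schematic of heterogeneous stacking (using quick scikit-learn classifiers in place of the paper's RNN/LSTM/GRU base models and grid-searched meta-learners):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=20, random_state=0)

    # Heterogeneous stacking: diverse base learners, one meta-learner trained
    # on their out-of-fold predictions.
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("knn", KNeighborsClassifier()),
                    ("svm", SVC(probability=True, random_state=0))],
        final_estimator=LogisticRegression(),
        cv=5,
    )
    print(stack.fit(X[:300], y[:300]).score(X[300:], y[300:]))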

22 pages, 2100 KiB  
Article
Attention-Based Transformer-BiGRU for Question Classification
by Dongfang Han, Turdi Tohti and Askar Hamdulla
Information 2022, 13(5), 214; https://doi.org/10.3390/info13050214 - 20 Apr 2022
Cited by 4 | Viewed by 4268
Abstract
A question answering (QA) system is a research direction in the field of artificial intelligence and natural language processing (NLP) that has attracted much attention and has broad development prospects. As one of the main components of a QA system, question classification plays a key role in the accuracy of the entire QA task. Therefore, both traditional machine learning methods and today’s deep learning methods are widely used and deeply studied in question classification tasks. This paper mainly introduces our work on two aspects of Chinese question classification. The first is an answer-driven method for building a richer Chinese question classification dataset, addressing the small scale of existing experimental datasets; this has reference value for dataset expansion, especially for the construction of low-resource language datasets. The second is a deep learning model for question classification with a Transformer + Bi-GRU + Attention structure. The Transformer has strong learning and encoding ability, but it adopts a fixed encoding length, which divides long text into multiple segments that are encoded separately, with no interaction between segments. Here, we achieve information interaction between segments through the Bi-GRU, so as to improve the encoding of long sentences. We add the Attention mechanism to highlight the key semantics in questions that contain answers. The experimental results show that the model proposed in this paper significantly improves the accuracy of question classification. Full article
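
A condensed PyTorch sketch of the described pipeline (the layer sizes, heads, and depths are illustrative, not the paper's configuration):

    import torch
    import torch.nn as nn

    class TransformerBiGRUAttention(nn.Module):
        """Sketch of the Transformer + Bi-GRU + Attention structure."""
        def __init__(self, vocab, d_model=128, n_classes=10):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True,
                                bidirectional=True)   # links segment contexts
            self.attn = nn.Linear(d_model, 1)         # highlights key tokens
            self.cls = nn.Linear(d_model, n_classes)

        def forward(self, tokens):                    # tokens: (batch, seq)
            h = self.encoder(self.embed(tokens))      # (batch, seq, d_model)
            h, _ = self.bigru(h)                      # (batch, seq, d_model)
            a = torch.softmax(self.attn(h), dim=1)    # (batch, seq, 1)
            pooled = (a * h).sum(dim=1)               # attention pooling
            return self.cls(pooled)

    logits = TransformerBiGRUAttention(vocab=5000)(torch.randint(0, 5000, (8, 32)))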

11 pages, 415 KiB  
Article
Measurement and Analysis of High Frequency Asset Volatility Based on Functional Data Analysis
by Zhenjie Liang, Futian Weng, Yuanting Ma, Yan Xu, Miao Zhu and Cai Yang
Mathematics 2022, 10(7), 1140; https://doi.org/10.3390/math10071140 - 1 Apr 2022
Cited by 3 | Viewed by 2203
Abstract
Information and communication technology has enabled the collection of high-frequency financial asset time series data. However, the high spatial and temporal resolution of these data makes it challenging to compare the characteristic patterns of financial assets and identify risk. To address this challenge, a method for calculating realized volatility based on functional data analysis (FDA) is proposed. A time–price functional curve is constructed by the functional data analysis method, and the realized volatility is calculated as the curvature integral of this curve. This method can effectively eliminate the interference of market microstructure noise: it not only allows the capital asset price to be decomposed into a continuous term and a noise term by asymptotic convergence, but also decouples the noise from the discrete time series. Additionally, it can obtain the value of volatility at any given time, without concern for the problems of correlations between repeated samples, mixed frequencies, and unequal sampling intervals, and it relaxes the structural constraints and distributional assumptions of data acquisition. To demonstrate our method, we analyze a per-second-level financial asset dataset. Additionally, a sensitivity analysis on the selection of non-equally spaced samples is conducted, and we further add noise to confirm the robustness of our method and discuss its implications in practice, especially its usefulness for finer-grained analysis of financial market volatility and for understanding rapidly changing market dynamics. Full article
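
A crude illustration of the curvature-integral idea on simulated prices (the paper's functional-curve construction and noise treatment are more careful than this simple spline fit):

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 1000)                  # one trading day, rescaled
    price = 100 + np.cumsum(0.02 * rng.standard_normal(1000))

    # Fit a smooth time-price functional curve; smoothing suppresses
    # microstructure noise before differentiation.
    f = UnivariateSpline(t, price, k=4, s=len(t) * 0.01)
    f1, f2 = f.derivative(1)(t), f.derivative(2)(t)

    # Curvature of the curve (t, f(t)), integrated over the day as a
    # volatility-style functional of the price path.
    kappa = np.abs(f2) / (1 + f1**2) ** 1.5
    realized_vol_fda = np.trapz(kappa, t)
    print(realized_vol_fda)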
(This article belongs to the Topic Data Science and Knowledge Discovery)
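The curvature-integral idea can be illustrated with a short sketch; the spline smoothing below and the simulated price path are assumptions standing in for the paper's functional-curve construction, not its exact estimator.

```python
# Illustrative sketch: smooth a time-price series into a functional curve
# and integrate its curvature over the trading window.
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)                         # per-second stamps
price = 100 + np.cumsum(rng.normal(0.0, 0.02, t.size))  # noisy price path

# Smoothing into a functional curve is what suppresses microstructure noise.
spline = UnivariateSpline(t, price, k=4, s=t.size * 0.02**2)
p1 = spline.derivative(1)(t)
p2 = spline.derivative(2)(t)

# Curvature of the graph (t, p(t)), integrated as a volatility proxy.
kappa = np.abs(p2) / (1.0 + p1**2) ** 1.5
print(f"curvature-integral volatility: {trapezoid(kappa, t):.4f}")
```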
18 pages, 3014 KiB  
Article
Improved Optical Flow Estimation Method for Deepfake Videos
by Ali Bou Nassif, Qassim Nasir, Manar Abu Talib and Omar Mohamed Gouda
Sensors 2022, 22(7), 2500; https://doi.org/10.3390/s22072500 - 24 Mar 2022
Cited by 4 | Viewed by 4324
Abstract
Creating deepfake multimedia, and especially deepfake videos, has become much easier due to the availability of deepfake tools and the virtually unlimited number of face images found online. Research and industry communities have dedicated time and resources to developing detection methods that expose these fake videos. Although detection methods have advanced over the past few years, synthesis methods have also progressed, allowing the production of deepfake videos that are increasingly difficult to distinguish from real ones. This paper proposes an improved optical flow estimation-based method to detect and expose the discrepancies between video frames. Augmentation and modification are explored to improve the system’s overall accuracy. Furthermore, the system is trained on graphics processing units (GPUs) and tensor processing units (TPUs) to explore the effects and benefits of each type of hardware in deepfake detection. TPUs were found to have shorter training times than GPUs. VGG-16 is the best-performing backbone for the system, achieving around 82.0% detection accuracy when trained on GPUs and 71.34% when trained on TPUs. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
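A minimal sketch of the optical-flow front end: dense Farneback flow between consecutive frames, whose magnitude and angle maps could be stacked as input to a CNN backbone such as VGG-16. Frame loading and the backbone are omitted, and this is not the authors' exact pipeline.

```python
# Sketch: dense optical flow between two frames as a detection feature.
import cv2
import numpy as np

def flow_field(prev_bgr, next_bgr):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Dense Farneback optical flow: (H, W, 2) displacement vectors.
    return cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

frame_a = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
frame_b = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
flow = flow_field(frame_a, frame_b)
# Magnitude/angle can be stacked into an image-like tensor for the backbone.
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape, float(mag.mean()))
```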
17 pages, 12296 KiB  
Article
A Multi-Entity Knowledge Joint Extraction Method of Communication Equipment Faults for Industrial IoT
by Kun Liang, Baoxian Zhou, Yiying Zhang, Yeshen He, Xiaoyan Guo and Bo Zhang
Electronics 2022, 11(7), 979; https://doi.org/10.3390/electronics11070979 - 22 Mar 2022
Cited by 8 | Viewed by 1936
Abstract
The Industrial Internet of Things (IIoT) deploys massive numbers of communication devices for information collection and process control; once these devices fail, the operation of the industrial system is seriously affected. This paper proposes a new method for multi-entity knowledge joint extraction (MEKJE) of IIoT communication equipment faults. The method constructs a tightly coupled multi-task model of fault entity and relationship extraction, using word embedding and bidirectional semantic capture to generate computable text vectors. At the same time, a multi-entity segmentation method is proposed that uses noise filtering to distinguish multiple fault relationships within a single corpus. We constructed a dataset of communication failures in the power IIoT and conducted experiments. The results show that the method performs best in tests on the Faulty Text and CLUENER datasets. In particular, the model achieves an F1 value of 78.6% in relationship extraction for multiple entities, with a significant improvement of 5–8% in accuracy and recall, enabling effective mapping and accurate extraction of fault knowledge. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
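A highly simplified sketch of a tightly coupled joint-extraction model: one shared encoder feeds both an entity-tagging head and a relation head, so the two tasks are trained together. All names and dimensions (`JointExtractor`, `n_tags`, `n_rels`) are hypothetical, and the paper's noise-filtering segmentation is not reproduced here.

```python
# Hypothetical sketch of joint entity + relation extraction.
import torch
import torch.nn as nn

class JointExtractor(nn.Module):
    def __init__(self, vocab=8000, dim=128, n_tags=9, n_rels=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.tag_head = nn.Linear(2 * dim, n_tags)   # BIO entity labels
        self.rel_head = nn.Linear(4 * dim, n_rels)   # pairwise relations

    def forward(self, tokens, head_idx, tail_idx):
        h, _ = self.encoder(self.embed(tokens))      # shared representation
        tags = self.tag_head(h)                      # per-token entity tags
        pair = torch.cat([h[:, head_idx], h[:, tail_idx]], dim=-1)
        return tags, self.rel_head(pair)             # relation for one pair

m = JointExtractor()
tags, rel = m(torch.randint(0, 8000, (1, 20)), head_idx=2, tail_idx=10)
print(tags.shape, rel.shape)  # (1, 20, 9) (1, 6)
```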
13 pages, 1163 KiB  
Article
Online Anomaly Detection for Smartphone-Based Multivariate Behavioral Time Series Data
by Gang Liu and Jukka-Pekka Onnela
Sensors 2022, 22(6), 2110; https://doi.org/10.3390/s22062110 - 9 Mar 2022
Cited by 3 | Viewed by 2506
Abstract
Smartphones can be used to collect granular behavioral data unobtrusively, over long time periods, in real-world settings. To detect aberrant behaviors in large volumes of passively collected smartphone data, we propose an online anomaly detection method using Hotelling’s T-squared test. The test statistic in our method is a weighted average, with more weight on the between-individual component when the amount of data available for an individual is limited and more weight on the within-individual component when the data are adequate. The algorithm takes only O(1) time per update, and its memory usage is fixed after a pre-specified number of updates. The performance of the proposed method, in terms of accuracy, sensitivity, and specificity, is consistently better than or equal to that of the offline method it builds upon, depending on the sample size of the individual data. Future applications of our method include early detection of surgical complications during recovery and possible prevention of relapse in patients with serious mental illness. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
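The weighted online statistic might look like the following sketch, with Welford-style O(1) running updates of the individual's mean and scatter; the specific warm-up weighting between the population (between-individual) and personal (within-individual) components is an assumption, not the paper's exact scheme.

```python
# Sketch of an online Hotelling's T-squared test with O(1) updates.
import numpy as np
from scipy import stats

class OnlineT2:
    def __init__(self, dim, pop_mean, pop_cov, warmup=30):
        self.n, self.warmup = 0, warmup
        self.mean = np.zeros(dim)
        self.m2 = np.zeros((dim, dim))        # running scatter matrix
        self.pop_mean, self.pop_cov = pop_mean, pop_cov

    def update(self, x):
        # Welford-style update: O(1) work and fixed memory per new sample.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += np.outer(d, x - self.mean)

    def t2(self, x):
        # Blend between-individual and within-individual statistics.
        w = min(self.n / self.warmup, 1.0)
        mean = w * self.mean + (1 - w) * self.pop_mean
        cov = w * self.m2 / max(self.n - 1, 1) + (1 - w) * self.pop_cov
        d = x - mean
        return float(d @ np.linalg.solve(cov + 1e-6 * np.eye(d.size), d))

rng = np.random.default_rng(1)
det = OnlineT2(3, pop_mean=np.zeros(3), pop_cov=np.eye(3))
for _ in range(100):
    det.update(rng.normal(size=3))
print("T2 of an outlier:", det.t2(np.array([5.0, 5.0, 5.0])))
print("99% chi-square reference threshold:", stats.chi2.ppf(0.99, df=3))
```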
23 pages, 808 KiB  
Article
Processing Analytical Queries over Polystore System for a Large Astronomy Data Repository
by Manoj Poudel, Rashmi P. Sarode, Yutaka Watanobe, Maxim Mozgovoy and Subhash Bhalla
Appl. Sci. 2022, 12(5), 2663; https://doi.org/10.3390/app12052663 - 4 Mar 2022
Cited by 2 | Viewed by 2583
Abstract
There are extremely large heterogeneous databases in the astronomical data domain, and they keep increasing in size. The data types vary from images of astronomical objects to unstructured texts, relations, and key-values, and many astronomical data repositories manage such data. The Zwicky Transient Facility (ZTF) is one such repository, holding a large amount of varied data. Handling different types of data in a single database can raise performance and efficiency issues. In this study, we propose a web-based query system built around a polystore database architecture as a solution for the growing size of data in the astronomical domain. The proposed system unifies querying over multiple datasets directly, eliminating the effort of translating complex queries and simplifying the work of users in the astronomical domain. We study models of data integration, analyze them, and incorporate them into a system that manages linked open data provided by the astronomical domain. The proposed system is scalable, and its model can be applied to other systems to manage heterogeneous data efficiently. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
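A toy sketch of the polystore idea: a single query interface routes sub-queries to heterogeneous backends, here SQLite for relational data and a plain dict as a key-value store. The class, data, and identifiers are hypothetical illustrations, not the ZTF system's API.

```python
# Toy polystore: one query entry point, multiple storage backends.
import sqlite3

class Polystore:
    def __init__(self):
        self.sql = sqlite3.connect(":memory:")           # relational store
        self.sql.execute("CREATE TABLE objects(id TEXT, ra REAL, dec REAL)")
        self.kv = {}                                     # key-value store

    def query(self, kind, payload):
        if kind == "sql":
            return self.sql.execute(payload).fetchall()
        if kind == "kv":
            return self.kv.get(payload)
        raise ValueError(f"no backend for {kind!r}")

store = Polystore()
store.sql.execute("INSERT INTO objects VALUES ('ZTF18aaaaaaa', 150.1, 2.2)")
store.kv["ZTF18aaaaaaa/lightcurve"] = [21.3, 21.1, 20.9]
print(store.query("sql", "SELECT * FROM objects"))
print(store.query("kv", "ZTF18aaaaaaa/lightcurve"))
```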
10 pages, 591 KiB  
Article
Effective Transfer Learning with Label-Based Discriminative Feature Learning
by Gyunyeop Kim and Sangwoo Kang
Sensors 2022, 22(5), 2025; https://doi.org/10.3390/s22052025 - 4 Mar 2022
Cited by 3 | Viewed by 1990
Abstract
The performance of natural language processing with transfer learning has improved by applying pre-trained language models to downstream tasks using large amounts of general data. However, because the data used in pre-training are irrelevant to the downstream tasks, the models learn general features rather than features specific to those tasks. In this paper, a novel learning method is proposed so that the embeddings of pre-trained models learn task-specific features. The proposed method learns the label features of downstream tasks through contrastive learning using label embeddings and sampled data pairs. To demonstrate its performance, we conducted experiments on sentence classification datasets and evaluated, through PCA and clustering of the embeddings, whether the features of the downstream tasks had been learned. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
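A generic contrastive formulation of label-feature learning might look like this sketch: each sentence embedding is pulled toward the embedding of its own label and pushed away from the others. This is a common formulation assumed for illustration, not necessarily the paper's exact loss.

```python
# Sketch: contrastive loss between sentence embeddings and label embeddings.
import torch
import torch.nn.functional as F

def label_contrastive_loss(sent_emb, label_emb, labels, tau=0.1):
    # sent_emb: (B, D); label_emb: (C, D); labels: (B,) class indices.
    sims = F.cosine_similarity(sent_emb.unsqueeze(1),
                               label_emb.unsqueeze(0), dim=-1) / tau
    # Cross-entropy over label similarities: own label up, others down.
    return F.cross_entropy(sims, labels)

sent = torch.randn(8, 64, requires_grad=True)
lab = torch.randn(4, 64)
loss = label_contrastive_loss(sent, lab, torch.randint(0, 4, (8,)))
loss.backward()
print(float(loss))
```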
11 pages, 2532 KiB  
Article
A New Setup for the Measurement of Total Organic Carbon in Ultrapure Water Systems
by Sara H. Schäfer, Katharina van Dyk, Johannes Warmer, Torsten C. Schmidt and Peter Kaul
Sensors 2022, 22(5), 2004; https://doi.org/10.3390/s22052004 - 4 Mar 2022
Cited by 3 | Viewed by 2771
Abstract
With the increasing demand for ultrapure water in the pharmaceutical and semiconductor industries, the need for precise measuring instruments for these applications is also growing. One critical parameter of water quality is the amount of total organic carbon (TOC). This work presents a system that exploits the increased oxidation power of the UV/O3 advanced oxidation process (AOP) for TOC measurement, combined with significant miniaturization compared to the state of the art. The miniaturization is achieved by using polymer-electrolyte membrane (PEM) electrolysis cells for ozone generation in combination with UV-LEDs for irradiation of the measuring solution, as both components are significantly smaller than standard equipment. The measuring principle is conductivity measurement after oxidation, and measurements were carried out in the range between 10 and 1000 ppb TOC. The suitability of the system is demonstrated by the oxidation, via ozonation combined with UV irradiation, of defined concentrations of isopropyl alcohol (IPA). Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
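The conductivity-based measuring principle can be illustrated with a toy calibration sketch: the conductivity increase after oxidation is mapped to TOC via a fitted calibration curve. The numbers below are invented for illustration, not measurements from this work.

```python
# Toy calibration: conductivity rise after UV/O3 oxidation -> TOC estimate.
import numpy as np

toc_ppb = np.array([10, 50, 100, 250, 500, 1000])            # standards
delta_cond = np.array([0.02, 0.09, 0.19, 0.48, 0.95, 1.92])  # µS/cm rise

# Linear calibration fitted to the (invented) standard measurements.
slope, intercept = np.polyfit(delta_cond, toc_ppb, 1)
measured_rise = 0.40                                         # sample reading
print(f"estimated TOC: {slope * measured_rise + intercept:.0f} ppb")
```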
14 pages, 18906 KiB  
Article
Sand Dust Images Enhancement Based on Red and Blue Channels
by Fei Shi, Zhenhong Jia, Huicheng Lai, Sensen Song and Junnan Wang
Sensors 2022, 22(5), 1918; https://doi.org/10.3390/s22051918 - 1 Mar 2022
Cited by 9 | Viewed by 3186
Abstract
The scattering and absorption of light degrades images captured in sandstorm scenes, making them vulnerable to color casting, low contrast, and loss of detail, which results in poor visual quality. In such circumstances, traditional image restoration methods cannot fully restore images, owing to persistent color casting and poor estimation of the scene transmission map and atmospheric light. To effectively correct color casting and enhance visibility in such sand dust images, we propose an enhancement algorithm based on the red and blue channels, consisting of two modules: a red channel-based correction function (RCC) to correct color casting and blue channel-based dust particle removal (BDPR) to remove sand dust particles. After a dust image is processed by these two modules, a clear and visible image is produced. The experimental results were analyzed qualitatively and quantitatively, showing that the method significantly improves image quality in sandstorm weather and outperforms state-of-the-art restoration algorithms. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
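As a rough stand-in for the RCC/BDPR modules, the following gray-world-style sketch rescales the blue and green channels toward the red channel's mean to reduce the typical yellow cast of dust images. It is a heuristic illustration under that assumption, not the paper's algorithm.

```python
# Heuristic sketch: red-channel-referenced color cast correction.
import numpy as np

def correct_color_cast(img_bgr):
    out = img_bgr.astype(np.float64)
    r_mean = out[..., 2].mean()              # OpenCV-style BGR: red is idx 2
    for c in (0, 1):                         # blue and green channels
        out[..., c] *= r_mean / max(out[..., c].mean(), 1e-6)
    return np.clip(out, 0, 255).astype(np.uint8)

dusty = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
print(correct_color_cast(dusty).shape)       # (120, 160, 3)
```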
27 pages, 97700 KiB  
Article
Geochemical Association Rules of Elements Mined Using Clustered Events of Spatial Autocorrelation: A Case Study in the Chahanwusu River Area, Qinghai Province, China
by Baoyi Zhang, Zhengwen Jiang, Yiru Chen, Nanwei Cheng, Umair Khan and Jiqiu Deng
Appl. Sci. 2022, 12(4), 2247; https://doi.org/10.3390/app12042247 - 21 Feb 2022
Cited by 5 | Viewed by 2325
Abstract
The spatial distribution of elements can be regarded as a numerical field of concentration values with continuous spatial coverage. An active area of research is to discover geologically meaningful relationships among elements from their spatial distribution. To this end, we propose an association rule mining method based on clustered events of spatial autocorrelation and apply it to the polymetallic deposits of the Chahanwusu River area, Qinghai Province, China. The elemental data for stream sediments were first clustered into HH (high–high), LL (low–low), HL (high–low), and LH (low–high) groups using a local Moran’s I cluster map (LMIC). Then, the Apriori algorithm was used to mine association rules among the elements in these clusters. More than 86% of the mined rule points are located within 1000 m of faults and near known ore occurrences, and they occur in the upper reaches of streams and in catchment areas. In addition, we found that the Middle Triassic granodiorite is enriched in sulfophile elements, e.g., Zn, Ag, and Cd, and that the Early Permian granite quartz diorite (P1γδο) coexists with Cu and associated elements. The proposed algorithm is therefore an effective method for mining coexistence patterns of elements and provides insight into their enrichment mechanisms. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
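The LMIC clustering step can be sketched as follows: local Moran's I at each site is the product of that site's standardized concentration and its spatially lagged (neighborhood-averaged) value, and the sign pattern of the two assigns the HH/LL/HL/LH labels. The data and binary neighbor weights below are toy assumptions.

```python
# Sketch: local Moran's I and HH/LL/HL/LH labeling for one element.
import numpy as np

def local_morans_i(values, weights):
    z = (values - values.mean()) / values.std()
    w = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)
    lag = w @ z                              # row-standardized spatial lag
    return z * lag, z, lag

rng = np.random.default_rng(2)
conc = rng.lognormal(size=6)                 # e.g., Zn at six sample sites
wmat = np.ones((6, 6)) - np.eye(6)           # toy: all sites are neighbors
I, z, lag = local_morans_i(conc, wmat)
labels = ["HH" if a > 0 and b > 0 else "LL" if a < 0 and b < 0
          else "HL" if a > 0 else "LH" for a, b in zip(z, lag)]
print(np.round(I, 2), labels)
```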
12 pages, 340 KiB  
Article
Study on Resistant Hierarchical Fuzzy Neural Networks
by Fengyu Gao, Jer-Guang Hsieh, Ying-Sheng Kuo and Jyh-Horng Jeng
Electronics 2022, 11(4), 598; https://doi.org/10.3390/electronics11040598 - 15 Feb 2022
Cited by 2 | Viewed by 2083
Abstract
Novel resistant hierarchical fuzzy neural networks are proposed in this study, and their deep learning problems are investigated. These fuzzy neural networks can be used to model complex controlled plants and can also serve as fuzzy controllers. In general, real-world data are contaminated by outliers, which may have undesirable or unpredictable influences on the final learning machines. The correlations between the target and each of the predictors are used to partition the input variables into groups, so that each group forms the input of a fuzzy system at one level of the hierarchical fuzzy neural network. To enhance the resistance of the learning machines, we use the least trimmed squared error as the cost function. To test resistance to the adverse effects of outliers, we add noise at the output node from three types of distributions: normal, Laplace, and uniform. Real-world datasets are used to compare the performance of the proposed resistant hierarchical fuzzy neural networks, resistant densely connected artificial neural networks, and densely connected artificial neural networks without noise. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
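The least trimmed squared error cost can be written in a few lines: only the h smallest squared residuals enter the loss, so gross outliers contribute nothing to the gradient. The trimming fraction below is illustrative.

```python
# Sketch: least trimmed squared error as a resistant training loss.
import torch

def least_trimmed_squared_loss(pred, target, trim_frac=0.2):
    sq = (pred - target) ** 2
    h = int((1.0 - trim_frac) * sq.numel())   # number of residuals kept
    kept, _ = torch.sort(sq.flatten())        # ascending squared residuals
    return kept[:h].mean()

pred = torch.randn(100, requires_grad=True)
target = torch.randn(100)
target[:5] += 50.0                            # inject gross outliers
loss = least_trimmed_squared_loss(pred, target)
loss.backward()                               # trimmed outliers: zero gradient
print(float(loss))
```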
26 pages, 1951 KiB  
Article
A Multi-Technique Approach to Exploring the Main Influences of Information Exchange Monitoring Tolerance
by Daniel Homocianu
Electronics 2022, 11(4), 528; https://doi.org/10.3390/electronics11040528 - 10 Feb 2022
Viewed by 1968
Abstract
The privacy and security of online transactions and information exchange have always been a critical issue in e-commerce. However, there is a certain level of tolerance (a share of 36%) of so-called governments’ rights to monitor electronic mail messages and other information exchange, according to the answers of respondents from 51 countries in the latest wave (2017–2020) of the World Values Survey. Consequently, the purpose of this study is to discover the most significant influences associated with this type of tolerance, and even causal relationships. The variables were selected and analyzed in many rounds (Adaptive Boosting, LASSO, mixed-effects modeling, and different regressions) with the aid of a private cloud. The results confirmed most hypotheses regarding the overwhelming role of trust, acceptance of public surveillance, and attitudes indicating conscientiousness, altruistic behavior, and acceptance of gender discrimination, in models with good-to-excellent classification accuracy. A generated prediction nomogram included the ten most resilient influences; another contained only five of these ten, which acted more as determinants, resisting reverse-causality checks. In addition, some sociodemographic controls indicated significant variables related to the highest education level attained, settlement size, and marital status. The paper’s novelty rests on many robust techniques supporting randomly and nonrandomly cross-validated, fully reproducible results based on a large amount and variety of evidence. The findings also represent a step forward in research on privacy and security issues in e-commerce. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
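The multi-round variable screening can be illustrated with a sketch combining L1-penalized logistic regression and AdaBoost on synthetic data; the actual survey variables and tuning are not shown, so this is not the study's exact pipeline.

```python
# Sketch: two screening rounds - LASSO selection and boosting importances.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))                   # candidate predictors
y = (X[:, 0] - X[:, 3] + rng.normal(size=500) > 0).astype(int)

# L1 regularization zeroes out weak predictors; boosting ranks the rest.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print("LASSO-selected columns:", np.flatnonzero(lasso.coef_[0]))
print("top AdaBoost columns:", np.argsort(ada.feature_importances_)[-5:])
```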
14 pages, 1167 KiB  
Article
Group Testing with Consideration of the Dilution Effect
by Haoran Jiang, Hongshik Ahn and Xiaolin Li
Mathematics 2022, 10(3), 497; https://doi.org/10.3390/math10030497 - 3 Feb 2022
Cited by 3 | Viewed by 2099
Abstract
We propose a method of group testing that takes dilution effects into consideration. We estimate the dilution effect from massively collected RT-PCR threshold cycle data and incorporate it into the optimization of group tests. The new constraint helps find a robust solution of a nonlinear equation. The proposed framework has the flexibility to incorporate geographic and demographic information. We conduct a Monte Carlo simulation to compare different group testing approaches under the estimated dilution effect. The study suggests that increased group size significantly worsens the false negative rate when the infection rate is relatively low. Group tests with optimal pool sizes improve sensitivity over group tests with a fixed pool size. Based on our simulation study, we recommend single group testing with optimal group sizes. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
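A minimal Monte Carlo sketch of pooled testing under dilution: pool sensitivity is assumed to decay with the fraction of negatives in the pool, so the false negative rate grows with group size at low prevalence. The decay law here is invented for illustration, not the RT-PCR model estimated in the paper.

```python
# Monte Carlo sketch: dilution-dependent false negatives in pooled testing.
import numpy as np

rng = np.random.default_rng(4)

def pooled_fnr(group_size, prevalence=0.01, trials=200_000):
    infected = rng.random((trials, group_size)) < prevalence
    n_pos = infected.sum(axis=1)
    # Assumed dilution: sensitivity shrinks as positives are outnumbered.
    sens = np.where(n_pos > 0, 0.99 * (n_pos / group_size) ** 0.1, 0.0)
    detected = rng.random(trials) < sens
    has_pos = n_pos > 0
    return 1.0 - detected[has_pos].mean()

for g in (5, 10, 20, 40):
    print(f"group size {g:2d}: false negative rate {pooled_fnr(g):.3f}")
```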
24 pages, 638 KiB  
Article
Knowledge Source Rankings for Semi-Supervised Topic Modeling
by Justin Wood, Corey Arnold and Wei Wang
Information 2022, 13(2), 57; https://doi.org/10.3390/info13020057 - 24 Jan 2022
Cited by 1 | Viewed by 2947
Abstract
Recent work suggests that knowledge sources can be added to the topic modeling process to label topics and improve topic discovery. The knowledge sources typically consist of a collection of human-constructed articles, each describing a topic (an article-topic) for an entire domain. However, these semisupervised topic models assume a corpus contains topics from only a subset of a domain; therefore, during inference, the model must consider which article-topics were theoretically used to generate the corpus. Since knowledge sources tend to be quite large, the many article-topics considered slow down the inference process. The increase in execution time is significant, with knowledge source inputs larger than 10³ article-topics becoming infeasible for topic modeling. To increase the applicability of semisupervised topic models, approaches are needed to speed up the overall execution time. This paper presents a way of ranking knowledge source topics to meet this goal. Our approach uses a knowledge source ranking, based on the PageRank algorithm, to determine the importance of each article-topic. By applying this ranking technique we can eliminate low-scoring article-topics before inference, speeding up the overall process. Remarkably, the ranking can also improve perplexity and interpretability. Results show our approach outperforms baseline methods and significantly aids semisupervised topic models. In our evaluation, knowledge source rankings yield a 44% increase in topic retrieval F-score, a 42.6% increase in inter-inference topic elimination, a 64% improvement in perplexity, a 30% increase in token assignment accuracy, a 20% increase in topic composition interpretability, and a 5% increase in document assignment interpretability over baseline methods. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
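The ranking step can be sketched with PageRank over a toy article-topic link graph; only the top-ranked article-topics would then be kept for inference. The graph below is invented for illustration.

```python
# Sketch: PageRank-based pruning of article-topics before inference.
import networkx as nx

# Toy link graph; an edge means "article A links to article B".
links = [("Sports", "Football"), ("Football", "World Cup"),
         ("Football", "Sports"), ("World Cup", "Sports"),
         ("Politics", "Elections"), ("Elections", "Politics")]
g = nx.DiGraph(links)

scores = nx.pagerank(g, alpha=0.85)
kept = sorted(scores, key=scores.get, reverse=True)[:3]
print("article-topics kept for inference:", kept)
```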