Intelligent Data Mining, Analysis and Modeling Based on Machine Learning

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 September 2024) | Viewed by 13847

Special Issue Editors


Guest Editor
College of Information Sciences and Technology, Beijing University of Chemical Technology, Beijing 100029, China
Interests: spatio-temporal big data analysis; artificial intelligence; deep learning; geographic information science

Guest Editor
School of Computer Science, Beijing University of Technology, Beijing 100124, China
Interests: spatio-temporal data analysis and positioning algorithms; geosocial data mining; information retrieval

Guest Editor
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
Interests: information security; machine learning

Special Issue Information

Dear Colleagues,

In the realm of data-driven exploration, algorithms seamlessly intertwine with the digital landscape. Our focus converges at the forefront of Intelligent Data Mining, Analysis, and Modeling. This theme delves into the profound integration of machine learning techniques with the domains of data excavation, analysis, and model construction. Within this sphere, we embrace and surmount novel challenges. This Special Issue aims to present pioneering ideas and experimental outcomes in the domain of machine learning-based data mining, spanning from design, services, and theory to practical applications. It serves as a platform for the unveiling of breakthrough concepts and empirical discoveries, encompassing foundational theories to real-world implementations. Join us in exploring the transformative potential of machine learning within the realm of intelligent data exploration.

This Special Issue will publish high-quality and original research papers in the overlapping fields of:

  • Artificial intelligence;
  • Machine learning and deep learning;
  • Computational and data science;
  • Data integration and preprocessing;
  • Modeling methods and techniques;
  • Big data applications and algorithms;
  • Physics-informed neural network;
  • Spatiotemporal big data.

Dr. Danhuai Guo
Dr. Zhi Cai
Dr. Yuping Lai
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence
  • machine learning and deep learning
  • computational and data science
  • data integration and preprocessing
  • modeling methods and techniques
  • big data applications and algorithms
  • physics-informed neural network
  • spatiotemporal big data

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)


Research

20 pages, 3646 KiB  
Article
Applying Deep Generative Neural Networks to Data Augmentation for Consumer Survey Data with a Small Sample Size
by Shinya Watanuki, Katsue Edo and Toshihiko Miura
Appl. Sci. 2024, 14(19), 9030; https://doi.org/10.3390/app14199030 - 6 Oct 2024
Viewed by 1037
Abstract
Questionnaire consumer survey research is primarily used for marketing research. To obtain credible results, collecting responses from numerous participants is necessary. However, two crucial challenges prevent marketers from conducting large-sample-size surveys. The first is cost, as organizations with limited marketing budgets struggle to gather sufficient data. The second involves rare population groups, where it is difficult to obtain representative samples. Furthermore, the increasing awareness of privacy and security concerns has made it challenging to ask sensitive and personal questions, further complicating respondent recruitment. To address these challenges, we augmented small-sized data with synthesized data generated using deep generative neural networks (DGNNs). The synthesized data from three types of DGNNs (CTGAN, TVAE, and CopulaGAN) were based on seed data. For validation, 11 datasets were prepared: real data (original and seed), synthesized data (CTGAN, TVAE, and CopulaGAN), and augmented data (original + CTGAN, original + TVAE, original + CopulaGAN, seed + CTGAN, seed + TVAE, and seed + CopulaGAN). The large-sample-sized data, termed “original data”, served as the benchmark, whereas the small-sample-sized data acted as the foundation for synthesizing additional data. These datasets were evaluated using machine learning algorithms, particularly focusing on classification tasks. In conclusion, augmenting and synthesizing consumer survey data have shown potential in enhancing predictive performance, irrespective of the dataset’s size. Nonetheless, the challenge remains to minimize discrepancies between the original data and other datasets concerning the values and orders of feature importance. Although the efficacy of all three approaches should be improved in future work, CopulaGAN more accurately grasps the dependencies between the variables in tabular data compared with the other two DGNNs. The results provide cues for augmenting data with dependencies between variables in various fields.

28 pages, 9299 KiB  
Article
Orthogonal Matrix-Autoencoder-Based Encoding Method for Unordered Multi-Categorical Variables with Application to Neural Network Target Prediction Problems
by Yiying Wang, Jinghua Li, Boxin Yang, Dening Song and Lei Zhou
Appl. Sci. 2024, 14(17), 7466; https://doi.org/10.3390/app14177466 - 23 Aug 2024
Viewed by 464
Abstract
Neural network models, such as BP, LSTM, etc., support only numerical inputs, so categorical variables must be preprocessed and converted into numerical data. For unordered multi-categorical variables, existing encoding methods may cause a dimensionality explosion and may also introduce additional order misrepresentation and distance bias in neural network computation. To solve these problems, this paper proposes O-AE, an encoding method for unordered multi-categorical variables that encodes with an orthogonal matrix and then performs representation learning and dimensionality reduction via an autoencoder. Bayesian optimization is used for hyperparameter optimization of the autoencoder. Finally, seven experiments were designed with the basic O-AE, O-AE with Bayesian optimization of the autoencoder hyperparameters (O-AE-b), and other encoding methods to encode unordered multi-categorical variables in five datasets, and the encoded data were input into a BP neural network to carry out target prediction experiments. The results show that the experiments using O-AE and O-AE-b have better prediction results, indicating that the proposed method is feasible and applicable and can be an optional method for the data processing of unordered multi-categorical variables.

16 pages, 608 KiB  
Article
A Spectral Clustering Algorithm for Non-Linear Graph Embedding in Information Networks
by Li Ni, Peng Manman and Wu Qiang
Appl. Sci. 2024, 14(11), 4946; https://doi.org/10.3390/app14114946 - 6 Jun 2024
Cited by 2 | Viewed by 872
Abstract
With the development of network technology, information networks have become one of the most important means for people to understand society. As the scale of information networks expands, the construction of network graphs and high-dimensional feature representation will become major factors affecting the performance of spectral clustering algorithms. To address this issue, in this paper, we propose a spectral clustering algorithm based on similarity graphs and non-linear deep embedding, named SEG_SC. This algorithm introduces a new spectral clustering model that explores the underlying structure of graphs through sparse similarity graphs and deep graph representation learning, thereby enhancing graph clustering performance. Experimental analysis with multiple types of real datasets shows that this model outperforms several advanced benchmark algorithms and clusters well on medium- to large-scale information networks.

14 pages, 7845 KiB  
Article
Cross-Domain Person Re-Identification Based on Feature Fusion Invariance
by Yushi Zhang, Heping Song and Jiawei Wei
Appl. Sci. 2024, 14(11), 4644; https://doi.org/10.3390/app14114644 - 28 May 2024
Cited by 1 | Viewed by 758
Abstract
Cross-domain person re-identification is a technique for identifying the same individual across different cameras or environments. It must overcome the challenges posed by scene variations, which are a primary difficulty in person re-identification and a bottleneck for its practical applications. In this paper, we learn an invariance model of cross-domain feature fusion from a labeled source domain and an unlabeled target domain. First, our method learns the global and local fusion features of a person in the source domain by means of supervised learning that uses only person identity labels and no component labels, and obtains the fusion features of the person in the source and target domains by means of unsupervised learning. Based on person fusion features, this paper introduces feature memory to store the fused target features and designs a cross-domain invariance loss function to improve cross-domain adaptability. Finally, this paper carries out cross-domain person re-identification verification experiments between the Market-1501 and DukeMTMC-reID datasets; the experimental results show that the proposed method achieves significant performance improvement in cross-domain person re-identification.

18 pages, 16581 KiB  
Article
Lightweight Infrared and Visible Image Fusion Based on Nested Connections and Res2Net
by Yi Peng, Xinyue Tu and Qingqing Yang
Appl. Sci. 2024, 14(11), 4589; https://doi.org/10.3390/app14114589 - 27 May 2024
Viewed by 853
Abstract
Image fusion is a pivotal image-processing technology designed to merge multiple images from various sensors or imaging modalities into a single composite image. This process enhances and extracts the information contained across the images, resulting in a final image that is more informative and of superior quality. This paper introduces a novel method for infrared and visible image fusion, utilizing nested connections and frequency-domain decomposition techniques to effectively solve the problem of lost image detail features. By incorporating depthwise separable convolution, the method reduces the computational complexity and model size, thereby increasing computational efficiency. A multi-scale residual fusion network, R2FN (Res2Net Fusion Network), has been designed to replace traditional manually designed fusion strategies, enabling the network to better preserve detail information in the image while improving the quality of the fused image. Moreover, a new loss function is proposed, which is aimed at enhancing important feature information while preserving more significant features. Experimental results on public datasets indicate that the method not only retains the detail information of visible-light images but also highlights the significant features of infrared images while maintaining a minimal number of parameters.
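
The reduction in model size that the abstract attributes to depthwise separable convolution can be illustrated with a simple parameter count. The sketch below is not taken from the paper: the function names and the example layer sizes (64 input channels, 128 output channels, 3 × 3 kernel) are assumptions chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = standard_conv_params(64, 128, 3)          # 73728
separable = depthwise_separable_params(64, 128, 3)   # 8768
print(standard, separable, round(standard / separable, 1))
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of saving the abstract refers to; the actual savings in the paper depend on its architecture.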

17 pages, 1335 KiB  
Article
Link Prediction Based on Data Augmentation and Metric Learning Knowledge Graph Embedding
by Lijuan Duan, Shengwen Han, Wei Jiang, Meng He and Yuanhua Qiao
Appl. Sci. 2024, 14(8), 3412; https://doi.org/10.3390/app14083412 - 18 Apr 2024
Viewed by 1022
Abstract
A knowledge graph is a repository that represents a vast amount of information in the form of triplets. When a knowledge graph completion model is trained, the graph contains only positive examples, which makes reliable link prediction difficult, especially in the setting of complex relations. At the same time, current techniques that rely on distance models encapsulate entities within Euclidean space, limiting their ability to depict nuanced relationships and failing to capture their semantic importance. This research offers a strategy based on Gibbs sampling and connection embedding to improve the model’s competency in handling link prediction within complex relationships. Gibbs sampling is initially used to obtain high-quality negative samples. Following that, the triplet entities are mapped onto a hyperplane defined by the connection. Through metric learning, this process produces complex relationship embeddings imbued with semantic meaning. Finally, the method’s effectiveness is demonstrated on three link prediction benchmark datasets: FB15k-237, WN11RR, and FB15k.

20 pages, 1590 KiB  
Article
Query Optimization in Distributed Database Based on Improved Artificial Bee Colony Algorithm
by Yan Du, Zhi Cai and Zhiming Ding
Appl. Sci. 2024, 14(2), 846; https://doi.org/10.3390/app14020846 - 19 Jan 2024
Viewed by 2396
Abstract
Query optimization, which aims to find the query execution plan with minimum cost, is one of the key factors affecting the performance of database systems. In distributed database systems in particular, multiple copies of the data are stored in different data nodes, which dramatically increases the number of feasible query execution plans for a query statement. Because of the increasing volume of stored data, the cluster size of distributed databases also increases, resulting in poor performance of current query optimization algorithms. In this case, a dynamic perturbation-based artificial bee colony algorithm is proposed to solve the query optimization problem in distributed database systems. The improved artificial bee colony algorithm improves the global search capability by combining the selection, crossover, and mutation operators of the genetic algorithm to overcome the problem of falling into the local optimal solution easily. At the same time, the dynamic perturbation factor is introduced so that the algorithm parameters can be dynamically varied along with the process of iteration as well as the convergence degree of the whole population to improve the convergence efficiency of the algorithm. Finally, comparative experiments were conducted to assess the average execution cost of Top-k query plans generated by the algorithms and the convergence speed of the algorithms for query statements in six different dimension sets. The results demonstrate that the Top-k query plans generated by the proposed method have a lower execution cost and that the method converges faster, which can effectively improve the query efficiency. However, this method requires more execution time.

13 pages, 729 KiB  
Article
Hybrid Clustering Algorithm Based on Improved Density Peak Clustering
by Limin Guo, Weijia Qin, Zhi Cai and Xing Su
Appl. Sci. 2024, 14(2), 715; https://doi.org/10.3390/app14020715 - 15 Jan 2024
Cited by 2 | Viewed by 1809
Abstract
In the era of big data, unsupervised learning algorithms such as clustering are particularly prominent. In recent years, there have been significant advancements in clustering algorithm research. The Clustering by Density Peaks algorithm is known as Clustering by Fast Search and Find of Density Peaks (density peak clustering, DPC). This clustering algorithm, proposed in Science in 2014, automatically finds cluster centers. It is simple, efficient, does not require iterative computation, and is suitable for large-scale and high-dimensional data. However, DPC and most of its refinements have several drawbacks. The method primarily considers the overall structure of the data, often resulting in the oversight of many clusters. The choice of truncation distance affects the calculation of local density values, and varying dataset sizes may necessitate different computational methods, impacting the quality of clustering results. In addition, the initial assignment of labels can cause a ‘chain reaction’, i.e., if one data point is incorrectly labeled, it may lead to more subsequent data points being incorrectly labeled. In this paper, we propose an improved density peak clustering method, DPC-MS, which uses the mean-shift algorithm to find local density extremes, making the accuracy of the algorithm independent of the truncation distance parameter dc. After finding the local density extreme points, the allocation strategy of the DPC algorithm is employed to assign the remaining points to appropriate local density extreme points, forming the final clusters. The robustness of this method in handling uncertain dataset sizes adds some application value. Several experiments were conducted on synthetic and real datasets to evaluate the performance of the proposed method. The results show that the proposed method outperforms some of the more recent methods in most cases.
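
For readers unfamiliar with the baseline, the following is a minimal sketch of plain density peak clustering as summarized in the abstract: a local density computed with a truncation distance, followed by assignment of each remaining point to the cluster of its nearest denser neighbour. It is not the DPC-MS method proposed in the paper (there is no mean-shift step), and the function name `density_peaks`, the parameter names, and the rule of picking the `n_clusters` points with the largest density-times-distance score are assumptions made for illustration.

```python
import numpy as np


def density_peaks(points, d_c, n_clusters):
    """Plain density peak clustering sketch (baseline DPC, not DPC-MS)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density: number of other points closer than the truncation distance d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from densest to sparsest (stable sort keeps ties in index order).
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)          # distance to the nearest point visited earlier
    parent = np.full(n, -1)      # index of that nearest denser point
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        parent[i] = j
    # Score each point; the densest point is always kept as a center.
    gamma = rho * delta
    gamma[order[0]] = np.inf
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two well-separated toy squares (hypothetical data).
toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centers = density_peaks(toy, d_c=1.5, n_clusters=2)
print(labels)
```

Because labels propagate from each point to its sparser neighbours, a single wrong assignment can spread, which is the ‘chain reaction’ the abstract mentions; the dependence of `rho` on `d_c` is the sensitivity that DPC-MS is described as removing.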

22 pages, 23761 KiB  
Article
Robust Ranking Kernel Support Vector Machine via Manifold Regularized Matrix Factorization for Multi-Label Classification
by Heping Song, Yiming Zhou, Ebenezer Quayson, Qian Zhu and Xiangjun Shen
Appl. Sci. 2024, 14(2), 638; https://doi.org/10.3390/app14020638 - 11 Jan 2024
Viewed by 987
Abstract
Multi-label classification has been extensively researched and utilized for several decades. However, the performance of these methods is highly susceptible to the presence of noisy data samples, resulting in a significant decrease in accuracy when noise levels are high. To address this issue, we propose a robust ranking support vector machine (Rank-SVM) method that incorporates manifold regularized matrix factorization. Unlike traditional Rank-SVM methods, our approach integrates feature selection and multi-label learning into a unified framework. Within this framework, we employ matrix factorization to learn a low-rank robust subspace within the input space, thereby enhancing the robustness of data representation in high-noise conditions. Additionally, we incorporate manifold structure regularization into the framework to preserve manifold relationships among low-rank samples, which further improves the robustness of the low-rank representation. Leveraging this robust low-rank representation, we extract resilient low-rank features and employ them to construct a more effective classifier. Finally, the proposed framework is extended to derive a kernelized ranking approach for the creation of nonlinear multi-label classifiers. To effectively solve this non-convex kernelized method, we employ the augmented Lagrangian multiplier (ALM) and alternating direction method of multipliers (ADMM) techniques to obtain the optimal solution. Experimental evaluations conducted on various datasets demonstrate that our framework achieves superior classification results and significantly enhances performance in high-noise scenarios.

29 pages, 7450 KiB  
Article
Efficient Diagnosis of Autism Spectrum Disorder Using Optimized Machine Learning Models Based on Structural MRI
by Reem Ahmed Bahathiq, Haneen Banjar, Salma Kammoun Jarraya, Ahmed K. Bamaga and Rahaf Almoallim
Appl. Sci. 2024, 14(2), 473; https://doi.org/10.3390/app14020473 - 5 Jan 2024
Cited by 2 | Viewed by 2095
Abstract
Autism spectrum disorder (ASD) affects approximately 1.4% of the population and imposes significant social and economic burdens. Because its etiology is unknown, effective diagnosis is challenging. Advancements in structural magnetic resonance imaging (sMRI) allow for the objective assessment of ASD by examining structural brain changes. Recently, machine learning (ML)-based diagnostic systems have emerged to expedite and enhance the diagnostic process. However, the expected success in ASD has not yet been achieved. This study evaluates and compares the performance of seven optimized ML models to identify sMRI-based biomarkers for early and accurate detection of ASD in children aged 5 to 10 years. The effect of using hyperparameter tuning and feature selection techniques is investigated using two public datasets from the Autism Brain Imaging Data Exchange initiative. Furthermore, these models are tested on a local Saudi dataset to verify their generalizability. The integration of the grey wolf optimizer with a support vector machine achieved the best performance with an average accuracy of 71% (with further improvement to 71% after adding personal features) using 10-fold cross-validation. The optimized models identified relevant biomarkers for diagnosis, lending credence to their generalizability and advancing scientific understanding of neurological changes in ASD.