SBNNR: Small-Size Bat-Optimized KNN Regression

Seyghaly, Rasool; Garcia, Jordi; Masip-Bruin, Xavi; Kuljanin, Jovana

doi:10.3390/fi16110422

¹ Advanced Network Architectures Laboratory (CRAAX), Universitat Politècnica de Catalunya (UPC) BarcelonaTECH, 08800 Vilanova, Spain

² Aeronautical Division, Universitat Politècnica de Catalunya BarcelonaTECH, 08034 Barcelona, Spain

* Authors to whom correspondence should be addressed.

Future Internet 2024, 16(11), 422; https://doi.org/10.3390/fi16110422

Submission received: 26 September 2024 / Revised: 5 November 2024 / Accepted: 12 November 2024 / Published: 14 November 2024

(This article belongs to the Special Issue Deep Learning Techniques Addressing Data Scarcity)


Abstract

Small datasets are frequent in some scientific fields. Such datasets are usually created due to the difficulty or cost of producing laboratory and experimental data. On the other hand, researchers are interested in using machine learning methods to analyze this scale of data. For this reason, in some cases, low-performance, overfitting models are developed for small-scale data. As a result, it appears necessary to develop methods for dealing with this type of data. In this research, we provide a new and innovative framework for regression problems with a small sample size. The base of our proposed method is the K-nearest neighbors (KNN) algorithm. For feature selection, instance selection, and hyperparameter tuning, we use the bat optimization algorithm (BA). Generative Adversarial Networks (GANs) are employed to generate synthetic data, effectively addressing the challenges associated with data sparsity. Concurrently, Deep Neural Networks (DNNs), as a deep learning approach, are utilized for feature extraction from both synthetic and real datasets. This hybrid framework integrates KNN, DNN, and GAN as foundational components and is optimized in multiple aspects (features, instances, and hyperparameters) using BA. The outcomes exhibit an enhancement of up to 5% in the coefficient of determination ($R^{2}$ score) using the proposed method compared to the standard KNN method optimized through grid search.

Keywords: regression; K-nearest neighbor; bat algorithm; instance selection; feature selection

1. Introduction

Machine learning has rapidly become a ubiquitous tool across various scientific domains, offering significant capabilities for data analysis and modeling. Whenever experimental, laboratory, or observational datasets are available, machine learning is frequently the first choice for data-driven exploration and insight generation.

The landscape of real-world data is marked by its diversity, encompassing a wide spectrum of feature types and varying data volumes. While some fields benefit from plentiful data, many others suffer from limited datasets due to the challenges of experimental data collection. In fact, it is not uncommon to encounter datasets with fewer than 100 samples, prompting researchers to explore machine learning approaches customized for small sample sizes. This “small data” domain spans various fields, such as biomedicine, material design, petroleum engineering, biodiesel research, and numerous other disciplines [1,2].

Developing models under the constraints of small datasets poses unique challenges, primarily related to data distribution and gaps. In datasets with sufficient samples, data points are typically distributed across a continuous feature space. However, smaller datasets often present sporadic or unevenly distributed data points [3], which can lead to overfitting and low prediction accuracy [4].

Addressing these challenges requires specialized frameworks and algorithms designed to enhance model generality and accuracy, particularly for regression tasks, where the synthesis of artificial data points within a feature space must be coupled with appropriate target values [5]. In this research, we focus on regression problems constrained by small data sizes, typically comprising around 500 samples or fewer. We aim to develop a framework based on the KNN algorithm that will enhance the robustness and reliability of models when dealing with limited data. For feature selection, instance selection, and hyperparameter tuning, we applied the Bat Algorithm (BA) to optimize these components. To address data sparsity challenges, we employed Generative Adversarial Networks (GANs) to generate synthetic data, enhancing the dataset’s diversity. Additionally, DNNs were utilized as a deep learning approach for feature extraction from both synthetic and real datasets, ensuring a comprehensive representation of features.

The KNN algorithm was chosen as our base model due to its effectiveness in handling small datasets without requiring strict parametric assumptions. KNN’s non-parametric nature allows it to directly adapt to the inherent structure of data, making it particularly valuable for limited samples, where traditional assumptions about data distribution could lead to biased models. Furthermore, KNN’s simplicity supports straightforward optimization strategies for instance and feature selection, both of which are critical in small-data contexts. By combining KNN with synthetic data generation through GANs and feature extraction using DNNs, we further enhance its capacity to capture underlying patterns, reduce overfitting, and improve generalizability. This hybrid approach takes advantage of KNN’s simplicity while addressing small-data limitations, creating a robust and reliable framework for real-world applications.

In other words, we selected KNN over other common algorithms because of its simplicity and interpretability, which make it well-suited for small datasets by allowing predictions based directly on similarity without requiring a complex model structure. For example, Support Vector Regression (SVR) may require the selection of a kernel function and specific assumptions about data distribution, which can introduce bias in small datasets where patterns are less predictable. On the other hand, while algorithms such as Random Forest (RF) also seem to be a good choice, they are often prone to overfitting with limited data due to their complexity, requiring more samples to generalize effectively. However, future works could adopt RF in the same framework to capture more complex relationships as synthetic data generation is refined, as RF might handle certain non-linear patterns that KNN may overlook.

Given the need of our approach for a strong and efficient optimization method, the BA is well-suited for this task due to its versatile and efficient exploration–exploitation strategy inspired by echolocation. Unlike the Genetic Algorithm (GA), which requires substantial computational resources for crossover and mutation operations, the BA efficiently converges to high-quality solutions with fewer evaluations. Similarly, while Simulated Annealing (SA) is robust in escaping local minima, its reliance on a slow cooling schedule can hinder rapid convergence, making it less efficient for small datasets.

Our primary contribution is a robust framework for small-data regression using KNN. We achieve holistic optimization by integrating feature selection, instance selection, and hyperparameter tuning. The novel application of the bat algorithm to optimize the KNN weight function enhances both accuracy and efficiency. Our framework, which was rigorously evaluated on real-world datasets, demonstrates superior performance compared to traditional methods, marking a significant advancement in small-data machine learning.

The remainder of this paper is organized as follows: Section 2 presents an overview of similar research studies. Section 3 introduces the literature underlying the concepts used in this study. Section 4 discusses the proposed method, with experimental results on real-world datasets described in Section 5 using common evaluation metrics. Finally, conclusions and recommendations are provided in Section 7.

2. Related Works

As mentioned earlier, working with small data while avoiding both overfitting and underfitting is difficult.

Some preprocessing techniques implemented for this purpose are based, in part, on Zadeh’s [6] fuzzy theory. Huang [7] proposed the information diffusion principle, which broadens the distribution of artificial data to fill the gaps that result from insufficient data. Huang and Moraga [8] built on the normal diffusion function [7]: their method derives pairs of artificial data from the experimental data so that the data are distributed more evenly, which is the foundation of the Diffusion Neural Network.

Shaikhina and Khovanova [9] addressed validation issues and fluctuations in regression NNs trained on small datasets by proposing a method combining multiple runs and surrogate data analysis. Their approach was benchmarked against state-of-the-art ensemble NNs, examining the impact of dataset size on NN performance. Applied to predict the compressive strength (CS) of the femoral trabecular bone in severe osteoarthritis patients, their NN model achieved a standard error of 0.85 MPa and an accuracy of 98.3%, surpassing an ensemble NN model by 11%. When tested on porous concrete data, the framework demonstrated strong generalizability, achieving 86.5% accuracy on 300 samples from only 56 training samples, comparable to a model trained on 1030 samples.

In addition, Zhang and Ling [10] provided a method to conquer small-sized datasets by adding a rough estimate of an ingredient’s properties in the features. The construction of learning models utilizes the rough estimation distribution within this feature space. In many circumstances, using rough estimations can improve the accuracy of models.

Chapelle et al. [11] stated that the difference between the empirical covariance matrix and its expected value is critical for small-sample-size regression. Considering this disparity, they arrived at a model selection algorithm that behaves similarly to the best in class. A deeper analysis of the distribution of the eigenvalues of a covariance matrix will be used to improve the SEB method in future research. This is commonly used in machine learning to determine the number of centers in an RBF network.

KNN is one of the models that show good accuracy for small-size datasets because of its nature that does not require any specific training phase. Accordingly, a key idea for small data that is also considered in this study is to improve this model. Therefore, in the following, we deal with improvements made to KNN.

For instance selection in regression tasks, Rodríguez-Fdez et al. [12] proposed a class-conditional method (CCISR). It is an extension of the KNN classifier’s instance selection method. This method was examined in 12 real-world cases, in which the most critical samples were kept and a significant reduction ratio was obtained. However, this method takes more memory and time to run, so it usually runs out of resources in real-world situations. Additionally, Guillén et al. [13] came up with a new way to choose which examples to use in time-series prediction. Although this method works well with artificial data, it must be evaluated with real-world datasets.

It is difficult to obtain fine accuracy with a single learning algorithm, whereas ensemble methods (using several base models) may achieve better results [14]. Arnaiz-González et al. [15] proposed the ensemble idea of combining instance selection algorithms for regression problems for this purpose. With regard to prediction error and reduced subset size, the ensemble algorithms outperformed the original instance selection algorithms.

The authors of [16] proposed DISKR, an instance selection method for KNN regression that emphasizes efficiency. First, the algorithm eliminates outlier instances (data points). Second, it sorts the remaining instances by the difference between their actual output and the output estimated from their neighbors. Finally, DISKR removes, one at a time, the instances that have the least effect on the regressor.

Numerous evolutionary algorithms have also been used to select features and instances. In the bankruptcy prediction problem, Ahn and Kim [17] used a genetics-based technique to solve feature selection and instance selection optimally and simultaneously. However, the dataset they used contained 2670 data points. Although the dataset size in this research is not small, it is relevant to our study in terms of the use of evolutionary algorithms for feature selection and instance selection. Likewise, Ros et al. [18] offered a hybrid strategy based on a genetic algorithm in which the problems of instance and feature selection are addressed as one optimization problem.

Ho et al. [19] also used genetics-based methods for this subject. They created an intelligent genetic algorithm (IGA) that can handle selection tasks for both features and instances at the same time by suggesting a specific orthogonal cross operator. It was demonstrated in this work that IGA outperforms the solution proposed by Kuncheva and Jain [20].

Pedrycz and Ahmad [21], Aydogan et al. [22], Rattá et al. [23], and Das et al. [24] also tested GAs in a variety of domains to solve the instance selection and feature selection tasks distinctively.

For data augmentation in cases of extremely small datasets, an advanced neural network technique that is particularly noteworthy is the generative adversarial network (GAN)-based approach for data augmentation proposed by Xu et al. [25]. Their method (CTGAN) proves highly effective for small datasets, where traditional methods often fall short, by generating high-quality synthetic data and outperforming Bayesian approaches. The authors highlighted CTGAN’s ability to handle the complexities of tabular data—such as imbalanced categorical columns and multi-modal continuous values—making it particularly well-suited to small datasets. Benchmarking results further showcase its robust performance across various small datasets, affirming CTGAN as a valuable tool for synthetic data generation in data-scarce scenarios.

Izonin et al. [26] also presented an advanced neural network-based approach to improve prediction accuracy on small and extremely small biomedical datasets. Building on an input-doubling technique to increase data size without creating new samples, the method employs ensemble averaging principles using a single non-linear AI model for prediction. Key contributions include a tailored data augmentation process, mathematical performance metrics for robust assessment, and the elimination of traditional training requirements, making it ideal for minimizing overfitting risks in limited-data contexts. Tests on two biomedical datasets showed high accuracy and significant error reduction over conventional methods, highlighting the model’s potential to uncover novel insights for medical diagnostics and treatments.

The Binary Bat Algorithm (BBA) proposed in [27] is a binary version of the BA geared toward feature selection. The bat algorithm (BA) has also been used in some improvements of the KNN algorithm, such as that proposed by Saleem et al. [28], but only for feature selection. However, to the best of our knowledge, no existing work optimizes all KNN configurations simultaneously with an evolutionary algorithm, specifically the bat algorithm. In addition, data dimensions are generally not considered for optimization.

Jeong et al. [29] proposed a novel approach that combines out-of-distribution (OOD) data and transfer learning (TL) to enhance predictive modeling in medical scenarios with limited data, focusing on acute respiratory failure (ARF) due to pesticide poisoning. Key contributions include the pioneering application of OOD and TL techniques, which are typically used in image processing, to electronic health records (EHRs), resulting in improved model weight initialization and performance. This approach outperforms traditional multi-layer perceptron (MLP) models, showing reduced bias and narrower confidence intervals and achieving better AUROC, AUPRC, MSE, and $R^{2}$ metrics. The study also emphasized the efficacy of TL in improving model reliability for limited-data contexts and identified avenues for further exploration in TL applications for diverse medical settings.

Conrad et al. [30] conducted a benchmark study assessing the performance of four AutoML frameworks on regression tasks within small tabular datasets in material design. Their approach benchmarks AutoML against traditional data analysis methods across twelve materials engineering datasets, emphasizing AutoML’s effectiveness in small-data scenarios. Key contributions include demonstrating AutoML’s superior performance and robustness, particularly through nested cross-validation (NCV) for enhanced reliability on limited data. The study also provided scripts to support wider accessibility and application of these methods.

Key results show that AutoML frameworks generally surpass traditional methods, with Auto-sklearn excelling overall, especially in shorter run times, and MLjar showing robustness across tasks. Multi-output approaches have shown potential, though AutoML frameworks face challenges in handling them due to task-specific limitations. These findings encourage broader AutoML adoption in materials science, highlighting the importance of data sampling for reliability in small-dataset scenarios.

Existing methods aimed at improving the KNN algorithm and similar algorithms have individually addressed feature selection, hyperparameter tuning, and instance selection as separate concerns. Our approach, on the other hand, seeks to tackle these concerns by formulating them in a single optimization problem, with the goal of improving the KNN algorithm, especially for regression tasks involving small-sized datasets.

3. Preliminaries

3.1. Bat Algorithm

The bat algorithm (BA) is a metaheuristic optimization technique inspired by the echolocation behavior of bats [31]. Bats navigate and hunt by emitting sound waves, with the echoes bouncing back from objects aiding in prey location. The bat algorithm emulates this echolocation process to tackle intricate optimization problems. By replicating how bats modulate their pulse rates and loudness in response to prey proximity, the algorithm adeptly harmonizes exploration and exploitation within the optimization domain. This flexibility allows the BA (Algorithm 1) to be a versatile solution for resolving diverse optimization obstacles [32].

Algorithm 1 Bat Optimization Algorithm (BA)

Input: objective function $f(x)$ and search space of the optimization problem
Initialize bat population (positions $x_{i}$, velocities $v_{i}$, and pulse frequencies $f_{i}$)
Set maximum iterations $T$, pulse rates $r_{i}$, and loudness values $A_{i}$
for $t = 1$ to $T$ do
    for each bat $b_{i}$ do
        Update bat’s position and velocity using Equations (1) and (2)
        Generate a random number $rand$ uniformly in $[0, 1]$
        if $rand > r_{i}$ then
            Select a solution from the list of best solutions
            Produce a solution in the neighborhood of the best solution
        end if
        if $rand < A_{i}$ and $f(x_{i}) < f(GlobalBest)$ then
            Accept the new solution
            Increase $r_{i}$ and decrease $A_{i}$
        end if
    end for
end for
Update $GlobalBest$ by sorting bats

To solve a specific optimization task, the initial phase involves setting up the bat frequency $f_{i}$, velocity $v_{i}$, and position $x_{i}$, in addition to the loudness $A_{i}$ and pulse rates $r_{i}$. Subsequently, the bat’s position and velocity undergo iterative updates based on Equations (1) and (2) until the predetermined threshold of iterations is reached.

$$f_{i} = f_{\min} + (f_{\max} - f_{\min}) \beta \tag{1}$$

$$x_{j}^{i}(t) = x_{j}^{i}(t - 1) + v_{j}^{i}(t) \tag{2}$$

Here, $f_{\max}$ signifies the maximum frequency, $f_{\min}$ represents the minimum frequency, and $\beta$ is a value selected at random in the range of 0 to 1. In Equation (2), $x_{j}^{i}$ and $v_{j}^{i}$ denote the $j$-th dimension of the position ($x_{i}$) and velocity ($v_{i}$) of bat $i$. Equation (3) is employed to integrate the notion of random walks, whereas Equations (4) and (5) govern the updates for loudness and pulse rates.

$$x_{new} = x_{old} + \epsilon A(t) \tag{3}$$

$$A_{i}(t + 1) = \alpha A_{i}(t) \tag{4}$$

$$r_{i}(t + 1) = r_{i}(0) \left[1 - e^{-\gamma t}\right] \tag{5}$$

Here, $A(t)$ denotes the mean loudness at a given time $t$, and $\epsilon \in [0, 1]$ governs the direction and intensity of the random walk. In Equations (4) and (5), $\alpha$ and $\gamma$ are constants. The pulse emission rate at $t = 0$, $r_{i}(0)$, takes a value $r \in [0, 1]$, and the loudness at $t = 0$ is determined through a stochastic selection process at the algorithm’s initiation [33].
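The update rules in Equations (1) to (5) can be put together in a short Python sketch. It is an illustration under stated assumptions, not the implementation used in this study: the velocity update is the commonly cited form from the standard bat algorithm (it is not written out in the text above), the random-walk factor is drawn from $[-1, 1]$ so that the walk can move in both directions, a candidate is accepted when it improves the bat's own fitness, and the function name `bat_algorithm`, the parameter defaults, and the `sphere` test objective are all hypothetical.

```python
import math
import random


def bat_algorithm(objective, dim, lower, upper, n_bats=20, n_iter=200,
                  f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimize `objective` over the box [lower, upper]^dim (illustrative sketch)."""
    rng = random.Random(seed)

    def clip(value):
        return max(lower, min(upper, value))

    # Initialization: positions x_i, velocities v_i, loudness A_i, and r_i(0).
    x = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    loudness = [rng.uniform(1.0, 2.0) for _ in range(n_bats)]
    r0 = [rng.random() for _ in range(n_bats)]
    rate = [0.0] * n_bats  # Eq. (5) evaluates to 0 at t = 0
    fitness = [objective(p) for p in x]
    best_idx = min(range(n_bats), key=fitness.__getitem__)
    best, best_fit = x[best_idx][:], fitness[best_idx]

    for t in range(1, n_iter + 1):
        mean_loudness = sum(loudness) / n_bats  # A(t) in Eq. (3)
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()  # Eq. (1)
            # Assumed velocity update (standard BA form, not stated in the text).
            v[i] = [v[i][j] + (x[i][j] - best[j]) * freq for j in range(dim)]
            cand = [clip(x[i][j] + v[i][j]) for j in range(dim)]  # Eq. (2)
            if rng.random() > rate[i]:
                # Local random walk around the best solution, Eq. (3).
                cand = [clip(best[j] + rng.uniform(-1.0, 1.0) * mean_loudness)
                        for j in range(dim)]
            cand_fit = objective(cand)
            if rng.random() < loudness[i] and cand_fit < fitness[i]:
                x[i], fitness[i] = cand, cand_fit  # accept the new solution
                loudness[i] *= alpha  # Eq. (4)
                rate[i] = r0[i] * (1.0 - math.exp(-gamma * t))  # Eq. (5)
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit


def sphere(p):
    """Hypothetical test objective with its minimum of 0 at the origin."""
    return sum(c * c for c in p)


best, best_fit = bat_algorithm(sphere, dim=2, lower=-5.0, upper=5.0)
print(best, best_fit)
```

Because the pulse rate starts at zero, early iterations are dominated by random walks around the current best solution; as bats accept improvements, their loudness falls and their pulse rate rises, which gradually shifts the search from exploration toward exploitation.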

3.2. K-Nearest Neighbors

K-Nearest Neighbors (KNN) is an intuitive and non-parametric machine learning technique that excels in both simplicity and effectiveness. In KNN regression, the primary objective is to predict the value of a query data point based on the values of its nearest neighbors within the training dataset [34].

As illustrated in Algorithm 2, the procedure is as follows: For a specified query point, the algorithm identifies its “K” nearest neighbors from the training dataset. These neighbors are selected utilizing a distance metric, commonly employing Euclidean distance, although alternative distance metrics may be suitable based on the specific problem requirements.

Algorithm 2 K-nearest neighbors (KNN) Regression

procedure KNN_Regression(X_train, y_train, X_query, K)
    for each x_query in X_query do
        distances ← []
        for each (x_i, y_i) in zip(X_train, y_train) do
            distance ← Euclidean_Distance(x_query, x_i)
            distances.append((y_i, distance))
        end for
        sorted_distances ← Sort_By_Distance(distances)
        nearest_neighbors ← sorted_distances[:K]
        predicted_value ← Average(nearest_neighbors)
        Output(predicted_value)
    end for
end procedure
procedure Euclidean_Distance(x1, x2)
    Return $\sqrt{\sum_{i} (x1_{i} - x2_{i})^{2}}$ over each feature pair $(x1_{i}, x2_{i})$ in zip(x1, x2)
end procedure
procedure Sort_By_Distance(distances)
    Return sorted(distances, key = lambda x: x[1])
end procedure
procedure Average(nearest_neighbors)
    total ← 0
    for each (y_neighbor, _) in nearest_neighbors do
        total ← total + y_neighbor
    end for
    Return total / length(nearest_neighbors)
end procedure

Upon identifying the nearest neighbors, their known output values are utilized to approximate the output for the query point. In regression tasks, this generally entails computing the mean or weighted average of the output values from the neighboring data points. This approach is predicated on the notion that analogous data points generally yield comparable output values.

KNN is distinguished by its absence of prior assumptions regarding the data distribution or the functional relationship between features and target values. Rather, it utilizes the intrinsic structure of the data, rendering it an especially advantageous instrument for tasks involving complex or obscure data relationships. The mathematics of the model is briefly outlined below.

Consider a training dataset $S = \{(x_{1}, y_{1}), \dots, (x_{n}, y_{n})\}$, where each $x_{i}$ is a single input vector and $y_{i}$ is the corresponding output value. For a test (unseen) instance $x$, $y_{i}(x)$ is the output value ($y_{i}$) of the $i$-th nearest neighbor to $x$. In regression tasks, the prediction is generally made using Equation (6):

$$\hat{y} = \frac{1}{k} \sum_{i = 1}^{k} y_{i}(x) \tag{6}$$

where $\hat{y}$ is the final prediction.
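As a small worked instance of Equation (6), using made-up numbers that are not taken from the datasets in this study, suppose $k = 3$ and the three nearest neighbors of a query $x$ have output values 2.0, 2.6, and 3.2:

```latex
\hat{y} = \frac{1}{3} \left( y_{1}(x) + y_{2}(x) + y_{3}(x) \right)
        = \frac{2.0 + 2.6 + 3.2}{3}
        = \frac{7.8}{3}
        = 2.6
```

Each neighbor contributes equally here; the distance to the query only determines which instances are counted among the $k$ neighbors.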

4. Proposed Method

We aim to optimize the KNN algorithm for small-data regression problems, where datasets are limited to around 500 samples. Challenges include overfitting, instance and feature selection, and balancing model complexity with accuracy, which are hard to address due to data sparsity. By using the bat algorithm (BA) to optimize hyperparameters, feature selection, and instance selection, we seek to enhance model reliability and generalization. Our goal is to provide a solution that improves predictive accuracy and performance robustness across scientific and industrial domains with limited data.

For instance, in scenarios characterized by sparse and irregularly distributed data points, our method offers valuable benefits. By utilizing the bat algorithm to optimize KNN regression, we improve the precision and dependability of predictive models in spite of the difficulties presented by limited data availability. Practitioners can use this to obtain valuable insights and make well-informed decisions in areas like environmental monitoring, anomaly detection, and financial forecasting, even when there is a lack of data or when the data are not evenly distributed.

As introduced above, the bat algorithm (BA) is a metaphor-based metaheuristic (global optimization) algorithm that is derived from bat echolocation behavior. In this research, we used the bat algorithm to make optimal decisions with respect to the hyperparameters, features, and instances used in KNN regression for small datasets.

Instance selection and feature selection are crucial because they directly influence the performance and efficiency of the KNN algorithm. By carefully selecting the most relevant instances, we minimize the risk of overfitting and ensure that the model is trained on the most representative data points. Similarly, feature selection helps identify the most informative features to improve the model’s accuracy and interpretability. In the context of small-data regression problems, these selections become even more critical, as they enhance the algorithm’s ability to generalize from limited data, avoiding the pitfalls of noise and irrelevant information that can skew predictions. Alongside instance and feature selection, a GAN is used to generate synthetic data to address data sparsity, while a DNN is employed to extract features from both the original and synthetic datasets. The parameters of the GAN and DNN are also optimized using the BA. Through optimized instance and feature selection, feature extraction, and the generation of synthetic data, KNN can achieve better performance, ensuring robust and reliable predictions despite the constraints of small datasets.

4.1. KNN Hyperparameters

There are three essential hyperparameters for KNN:

K: The number of neighbors that should be used for KNN queries. It must be an integer between 1 and the number of selected instances.
The second hyperparameter determines whether neighbors are given uniform weights for prediction or are given weighted coefficients for prediction (commonly the inverse of their distance to the query point). In this case, neighbors who are closer to a query point have a stronger influence than neighbors who are farther away. In this research, we define a specific weight function that is optimized by the BA.
The third important hyperparameter in KNN is the method used to find the nearest neighbors. Brute force, ball-tree, and KD-tree are standard possible algorithms to do this. This work considers the brute force algorithm because the method focuses on small-size data, and it is essential and possible to consider all selected instances.
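The three hyperparameters above can be illustrated with a brute-force KNN regressor whose weight function has a tunable parameter. The specific weight function optimized by the BA is not given in this section, so the sketch below assumes a hypothetical inverse-distance form, $w = 1 / (d + \varepsilon)^{p}$, where the exponent `power` stands in for whatever parameter the optimizer would tune; `weighted_knn_predict`, `power`, and `eps` are illustrative names only.

```python
import math


def weighted_knn_predict(X_train, y_train, x_query, k, power=1.0, eps=1e-9):
    """Brute-force KNN regression with assumed weights w = 1 / (d + eps) ** power."""
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of selected instances")
    # Brute force: measure the distance from the query to every training instance.
    pairs = sorted(
        ((math.dist(x_i, x_query), y_i) for x_i, y_i in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )[:k]
    # power = 0 gives uniform weights; larger values favor closer neighbors.
    weights = [1.0 / (d + eps) ** power for d, _ in pairs]
    return sum(w * y for w, (_, y) in zip(weights, pairs)) / sum(weights)


X_train = [[0.0], [1.0], [3.0]]
y_train = [0.0, 1.0, 3.0]
uniform = weighted_knn_predict(X_train, y_train, [0.5], k=3, power=0.0)
weighted = weighted_knn_predict(X_train, y_train, [0.5], k=3, power=2.0)
print(uniform, weighted)
```

With `power=0.0` the prediction is the plain mean of the three outputs, while `power=2.0` pulls it toward the two instances closest to the query, which is the behavior described for distance-weighted neighbors.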

4.2. Instance Selection

Instance selection plays a pivotal role in the development of more robust and generalizable models, particularly in instance-based methods like KNN. Its fundamental objective is the removal of redundant data, a process of paramount importance when dealing with small datasets. In such limited datasets, the efficient curation of instances becomes crucial, as individual data points can wield disproportionate influence, thereby significantly impacting both the accuracy and generality of the resulting model.

In essence, instance selection serves as a strategic filter, sieving through the data landscape to identify the most informative and representative examples. This selective process not only streamlines computational efficiency but also guards against the adverse effects of overfitting. Small datasets are particularly susceptible to overfitting, where models may excessively adapt to idiosyncrasies within individual data points, leading to poor generalization on unseen data.

By carefully curating a subset of instances, instance selection aims to strike a delicate balance between retaining essential information and eliminating redundancy. In doing so, it empowers models like KNN to better discern meaningful patterns, reduce sensitivity to noise, and enhance predictive performance, all while navigating the challenging terrain of data scarcity.

Deciding on the best combination of instances leads to a search of a $2^{m}$ space, where $m$ is the total quantity of training data. Therefore, using evolutionary algorithms such as the bat algorithm can be an excellent way to search in this space.
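To make the $2^{m}$ search space concrete, an instance subset can be written as a binary mask of length $m$ and scored by how well KNN predicts held-out points when only the selected instances are kept. The sketch below is a minimal illustration with toy data; the validation-MSE fitness and the helper names `knn_mean` and `instance_mask_error` are assumptions, not the fitness function defined in this study.

```python
import math


def knn_mean(X, y, query, k):
    """Uniform-weight KNN prediction: mean output of the k nearest instances."""
    nearest = sorted(zip(X, y), key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(target for _, target in nearest) / k


def instance_mask_error(mask, X_train, y_train, X_val, y_val, k):
    """Validation MSE of KNN that uses only the instances whose mask bit is set."""
    X_sel = [x for x, keep in zip(X_train, mask) if keep]
    y_sel = [t for t, keep in zip(y_train, mask) if keep]
    if len(X_sel) < k:
        return math.inf  # infeasible: fewer selected instances than neighbors
    errors = [(knn_mean(X_sel, y_sel, q, k) - t) ** 2 for q, t in zip(X_val, y_val)]
    return sum(errors) / len(errors)


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 50.0, 3.0]  # the third instance is a deliberate outlier
X_val, y_val = [[1.4], [2.6]], [1.4, 2.6]

n_subsets = 2 ** len(X_train)  # size of the instance-selection search space
full = instance_mask_error([1, 1, 1, 1], X_train, y_train, X_val, y_val, k=2)
reduced = instance_mask_error([1, 1, 0, 1], X_train, y_train, X_val, y_val, k=2)
print(n_subsets, full, reduced)
```

Dropping the outlier lowers the validation error sharply in this toy case. With only four instances all 16 masks could be enumerated, but the count doubles with every additional instance, which is why a metaheuristic search is used instead.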

4.3. Feature Selection

The fundamental objective of feature selection is to identify and retain the most informative and relevant subset of features, a task of paramount importance in the context of small data.

In essence, feature selection serves as a strategic filter, sifting through the multitude of available features to pinpoint those that contribute significantly to the model’s predictive power. This selective process offers several advantages, including improved computational efficiency and a reduction in the risk of overfitting, which is particularly pertinent in scenarios where data are limited.

Small datasets are particularly prone to overfitting, a scenario in which models excessively conform to the noise or peculiarities within the data. Feature selection aims to delicately curate the feature set for modeling, aiming to find an equilibrium between preserving crucial information and eliminating redundancy. This process enhances the model’s ability to unveil significant patterns and ensures resilience in dealing with limited data.

The process of feature selection involves exploring a vast search space of potential feature combinations, which can be a computationally challenging task, especially in high-dimensional datasets. To navigate this space effectively and identify the optimal subset of features, various techniques are employed. Evolutionary algorithms similar to those used in instance selection, such as the bat algorithm, offer a promising avenue for efficiently exploring the combinatorial possibilities and arriving at feature subsets that enhance model accuracy and generalization in the context of small data.

For feature selection, we have a search problem in the space of $2^{d}$, where $d$ is the count of features in the dataset.
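Since a continuous optimizer such as the BA moves through real-valued positions, a feature subset (and likewise an instance subset and the value of K) has to be decoded from a position vector. The encoding actually used in this study is not described in this section, so the following is only one hypothetical scheme: a position in $[0, 1]^{m + d + 1}$ is thresholded at 0.5 to obtain an instance mask and a feature mask, and its last component is mapped to K. The function name `decode_position` and the threshold are assumptions.

```python
def decode_position(position, n_instances, n_features, threshold=0.5):
    """Split a position in [0, 1]^(m + d + 1) into instance mask, feature mask, and K."""
    if len(position) != n_instances + n_features + 1:
        raise ValueError("position must have m + d + 1 components")
    instance_mask = [p > threshold for p in position[:n_instances]]
    feature_part = position[n_instances:n_instances + n_features]
    feature_mask = [p > threshold for p in feature_part]
    # Keep at least one instance and one feature so that KNN stays well defined.
    if not any(instance_mask):
        strongest = max(range(n_instances), key=lambda i: position[i])
        instance_mask[strongest] = True
    if not any(feature_mask):
        strongest = max(range(n_features), key=lambda j: feature_part[j])
        feature_mask[strongest] = True
    # Map the last component to an integer K between 1 and the number of kept instances.
    n_kept = sum(instance_mask)
    k = 1 + min(n_kept - 1, int(position[-1] * n_kept))
    return instance_mask, feature_mask, k


instance_mask, feature_mask, k = decode_position(
    [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.5], n_instances=4, n_features=2
)
print(instance_mask, feature_mask, k)
```

Bounding K by the number of kept instances mirrors the constraint stated for the first KNN hyperparameter, and decoding all three parts from one vector lets a single optimizer run search the instance, feature, and hyperparameter choices together.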

4.4. Deep Neural Networks (DNNs) for Feature Extraction

Deep neural networks (DNNs) are fundamental to contemporary machine learning, especially for tasks that require the processing of high-dimensional data. Our methodology utilizes DNNs to extract features from the dataset, facilitating the recognition of intricate patterns and relationships that may not be readily observable. Through the utilization of numerous interconnected neuron layers, DNNs can autonomously acquire hierarchical representations of input data. This process improves the quality of features employed in subsequent analyses, facilitating more robust performance in classification and regression tasks. The capacity of DNNs to capture complex features markedly enhances our model’s predictive precision.

We selected DNNs over other possible options because of their ability to capture complex, non-linear patterns in data, which is essential for accurate feature extraction in small-dataset scenarios. Initial assessments indicated that PCA, while useful, may introduce unwanted linearity, potentially oversimplifying data representations. DNNs, by contrast, maintain non-linear characteristics, making them better suited to retain intricate data relationships.

Moreover, regularization techniques like dropout, batch normalization, and L2 regularization can be applied within DNNs to reduce sensitivity to noise, enhancing the model’s robustness. These regularizations ensure that the framework can be extended to future advancements without requiring vast datasets. This adaptability, combined with synthetic data augmentation via GANs, makes DNNs a versatile and resilient choice for our small-data framework.

4.5. Generative Adversarial Networks (GANs) for Synthetic Data Generation

Generative adversarial networks (GANs) represent a powerful framework for generating synthetic data that resemble real data distributions. In our methodology, GANs are utilized to address data sparsity by creating realistic synthetic samples that augment our existing dataset. This is particularly important in scenarios where acquiring additional labeled data is challenging or costly. By employing a dual network structure consisting of a generator and a discriminator, GANs iteratively improve their performance through adversarial training. The generator creates synthetic samples, while the discriminator evaluates their authenticity. This dynamic process not only enriches our dataset but also helps in reducing overfitting and enhancing the generalization of our models.

GANs play a crucial role in enhancing model performance, especially when handling small datasets. GANs address data scarcity by generating synthetic data that replicate the underlying distribution of the original data. In this process, GANs use a generator network to create synthetic samples and a discriminator network to evaluate their authenticity. The generator aims to produce samples ($G(z)$) that closely resemble real data, where z is a random noise vector sampled from a prior distribution (e.g., Gaussian). The discriminator ($D(x)$) distinguishes between real data points (x) and generated samples ($G(z)$). The networks are trained in an adversarial setup, iterating until the generator produces realistic data that the discriminator cannot reliably distinguish from the real data.

This adversarial setup effectively allows GANs to generate synthetic data that preserve complex, non-linear relationships within the original data distribution. The optimization objective for GAN training can be expressed as [35]

\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z)))]

(7)

where $p_{\text{data}}(x)$ is the data distribution and $p_{z}(z)$ is the prior distribution of the noise vector. Through this minimax game, the generator learns to produce data that follow the intricate patterns and variations within the original dataset, resulting in high-quality synthetic samples that enhance model robustness and generalization.
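As a small numerical illustration of Equation (7), the sketch below estimates the value function from lists of discriminator outputs. The function name and the sample values are hypothetical; no network is trained here, only the objective is evaluated.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the objective in Equation (7).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated samples well.
confident = gan_value(d_real=[0.9, 0.8, 0.95], d_fake=[0.1, 0.2, 0.05])
# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
print(confident, fooled)
```

When the discriminator outputs 0.5 for every sample, the value equals $-\log 4$, which is the value reached at the theoretical equilibrium of the minimax game.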

GANs are also beneficial for conditional generation, where the generation process is guided by specific data attributes or class labels. Conditional GANs (cGANs) introduce a condition (y) to both the generator and discriminator, enabling the generation of samples for under-represented categories within the dataset. This is achieved by modifying the objective function to incorporate the conditional variable (y):

\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x | y)] + \mathbb{E}_{z \sim p_{z}(z)} [\log (1 - D(G(z | y)))]

(8)

By generating targeted synthetic data, cGANs ensure that even rare classes are adequately represented, improving model performance on imbalanced data and enhancing predictive power.

In summary, GANs facilitate model training on limited datasets by expanding the effective data size and preserving complex data relationships. This approach reduces overfitting and increases model reliability, making GANs a powerful tool for improving machine learning outcomes in data-scarce domains.

4.6. Proposed SBNNR Method

The proposed SBNNR (Small-size Bat-Optimized Nearest Neighbor Regression) is explained in Algorithm 3. In this algorithm, the length of position vectors is set to $n = m + 2d + 9$. All components are doubles in the range of $(0, 1)$:

The first m components are used to select instances.
Components $x_{i}^{m + 1}, \dots, x_{i}^{m + d}$ are used to select features.
$x_{i}^{m + d + 1}, \dots, x_{i}^{m + 2 d}$ are coefficients of the weight function that is used in making final estimations.
$δ = x_{i}^{m + 2 d + 1}$ is the instance selection threshold.
$μ = x_{i}^{m + 2 d + 2}$ is used in determining k in the KNN algorithm.
GAN settings:

$\text{GAN}: \begin{cases} \text{LearningRate}_{\text{GAN}} = x_{i}^{m + 2d + 3} \\ \text{BatchSize}_{\text{GAN}} = 2^{[9 \times x_{i}^{m + 2d + 4}]} \\ \text{Layers}_{\text{GAN}} = 1 + [x_{i}^{m + 2d + 5} \times 10] \end{cases}$

(9)
DNN settings:

$\text{DNN}: \begin{cases} \text{LearningRate}_{\text{DNN}} = x_{i}^{m + 2d + 6} \\ \text{BatchSize}_{\text{DNN}} = 2^{[9 \times x_{i}^{m + 2d + 7}]} \\ \text{Layers}_{\text{DNN}} = 1 + [x_{i}^{m + 2d + 8} \times 10] \\ \text{DropoutRate}_{\text{DNN}} = x_{i}^{m + 2d + 9} \times 0.5 \end{cases}$

(10)
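The mapping in Equations (9) and (10) can be sketched as a small decoding function. This is an illustration under stated assumptions: the square brackets are read as the floor operation, the function and key names are hypothetical, and the 1-based component indices of the equations are shifted to Python's 0-based indexing.

```python
import math

def decode_settings(position, m, d):
    """Decode GAN and DNN settings from the last seven position components.

    Component m + 2d + c (1-based) is stored at position[m + 2 * d + c - 1].
    ASSUMPTION: the square brackets in Equations (9) and (10) mean floor.
    """
    base = m + 2 * d

    def comp(c):
        return position[base + c - 1]

    gan = {
        "learning_rate": comp(3),
        "batch_size": 2 ** math.floor(9 * comp(4)),
        "layers": 1 + math.floor(comp(5) * 10),
    }
    dnn = {
        "learning_rate": comp(6),
        "batch_size": 2 ** math.floor(9 * comp(7)),
        "layers": 1 + math.floor(comp(8) * 10),
        "dropout_rate": comp(9) * 0.5,
    }
    return gan, dnn

m, d = 3, 2
# The first m + 2d + 2 components are placeholders; the last seven are settings.
position = [0.5] * (m + 2 * d + 2) + [0.01, 0.5, 0.25, 0.02, 0.7, 0.95, 0.4]
gan, dnn = decode_settings(position, m, d)
print(gan)
print(dnn)
```

Under this reading, batch sizes are powers of two between 1 and 256, and the number of layers lies between 1 and 10 for components strictly inside (0, 1).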

All of these components or decision variables are used to “translate” the position of a bat into a KNN predictor, as mentioned in line 12 of Algorithm 3. The “translate” routine is defined in the pseudo-code of Algorithm 4.

In other words, in this routine, a typical KNN estimator is created with the selected features and instances, k is determined by $μ$, and the weight function is the one shown in line 14 of Algorithm 4.
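A weighted KNN estimator of this kind can be sketched in a few lines. The sketch assumes one positive weight per training instance, renormalized over the k nearest neighbors, and Euclidean distance; the exact way the weight function enters the prediction in SBNNR may differ, and all names are illustrative.

```python
import math

def knn_predict(query, X, y, weights, k):
    """Predict a target as the weighted mean of the k nearest neighbours.

    ASSUMPTION: `weights` holds one positive weight per training instance and
    is renormalised over the k neighbours; distance is Euclidean.
    """
    order = sorted(range(len(X)), key=lambda i: math.dist(query, X[i]))
    nearest = order[:k]
    total = sum(weights[i] for i in nearest)
    return sum(weights[i] * y[i] for i in nearest) / total

X = [[0.0], [1.0], [2.0], [10.0]]
y = [0.0, 1.0, 2.0, 10.0]
weights = [0.2, 0.3, 0.4, 0.1]
prediction = knn_predict([1.2], X, y, weights, k=2)
print(prediction)
```

With equal weights this reduces to the plain KNN average, so the learned weights only shift the prediction toward the neighbors the optimizer considers more reliable.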

Another building block of the proposed method is the objective function, denoted by $Performance(x)$ in Equation (12). The objective function is based on the $R^{2}$ score (Equation (11)), which is one of the most widely used criteria for evaluating regression models. It indicates how well the trends of the estimated results track the trends of the observed data [36].

Algorithm 3 SBNNR Algorithm

1: Input: Training Dataset S with m data points and d input features.
2: Initialize dimension of bat positions $n = m + 2d + 9$
3: Initialize population of bats with position $x_{i}$, velocity $v_{i}$, and pulse frequency $f_{i}$
4: Set initial values for maximum iterations T, pulse rates $r_{i}$, and loudness $A_{i}$
5: for t in $\{1, 2, \dots, T\}$ do
6: for each bat $b_{i}$ do
7: Utilize Equations (1) and (2) to generate new solutions.
8: if $rand > r_{i}$ then
9: Choose a solution amongst the best solutions list.
10: Create a local solution centered on optimal solution.
11: end if
12: Translate position $x_{i}$ to a predictor $KNN_{i}$
13: Generate a random number $rand$ uniformly in $[0, 1]$
14: if $rand < A_{i}$ and $Performance(KNN_{i}) < Performance(Global_{Best})$ then
15: Mark new solutions as accepted.
16: Increase $r_{i}$ and decrease $A_{i}$.
17: end if
18: end for
19: end for
20: Output: the model with best fitness
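The overall loop of Algorithm 3 can be illustrated with a generic bat-algorithm sketch on a toy objective. Several points are assumptions rather than details from the paper: Equations (1) and (2) are taken to be the usual frequency, velocity, and position updates, written here so that bats are pulled toward the current best; a candidate is accepted when it improves the bat's own fitness, which is a common implementation variant; the objective is maximized; and the parameters `alpha`, `gamma`, `r_max`, `f_min`, and `f_max` are illustrative defaults. In SBNNR the fitness of a position would instead be the Performance of the translated predictor.

```python
import math
import random

def bat_optimize(fitness, n, population=10, iterations=50, f_min=0.0,
                 f_max=1.0, alpha=0.9, gamma=0.9, r_max=0.5, seed=0):
    """Maximise `fitness` over [0, 1]^n with a basic bat-algorithm loop."""
    rng = random.Random(seed)

    def clip(value):
        return min(1.0, max(0.0, value))

    pos = [[rng.random() for _ in range(n)] for _ in range(population)]
    vel = [[0.0] * n for _ in range(population)]
    fit = [fitness(p) for p in pos]
    loud = [1.0] * population    # loudness A_i
    rate = [0.0] * population    # pulse rate r_i
    best = max(range(population), key=lambda i: fit[i])
    best_pos, best_fit = pos[best][:], fit[best]

    for t in range(1, iterations + 1):
        for i in range(population):
            # Frequency, velocity and position update (assumed standard form).
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (b - x) * freq
                      for v, x, b in zip(vel[i], pos[i], best_pos)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best solution so far.
                step = 0.1 * sum(loud) / population
                cand = [clip(b + step * rng.uniform(-1.0, 1.0))
                        for b in best_pos]
            cand_fit = fitness(cand)
            if rng.random() < loud[i] and cand_fit > fit[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha                                 # decrease A_i
                rate[i] = r_max * (1.0 - math.exp(-gamma * t))   # increase r_i
            if cand_fit > best_fit:
                best_pos, best_fit = cand[:], cand_fit
    return best_pos, best_fit

def toy_fitness(p):
    """Toy objective with its maximum (zero) at 0.3 in every coordinate."""
    return -sum((v - 0.3) ** 2 for v in p)

best_pos, best_fit = bat_optimize(toy_fitness, n=3)
print(best_pos, best_fit)
```

Positions are clipped to the unit interval so that every candidate remains a valid encoding of the kind described for the SBNNR position vector.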

Algorithm 4 Translate routine

1: Input: Training Dataset S and position vector x.
2: $δ = x_{(m + 2d + 1)}$ and $μ = x_{(m + 2d + 2)}$
3: $S_{selected} = \emptyset$ and $F_{selected} = \emptyset$
4: for $j \in \{1, \dots, m\}$ do
5: if $x_{j} > δ$ then
6: Add $S_{j}$ to $S_{selected}$.
7: end if
8: end for
9: for $j \in \{m + 1, \dots, m + d\}$ do
10: if $x_{j} > δ$ then
11: Add the $j^{th}$ feature to $F_{selected}$.
12: end if
13: end for
14: Define the weight function ($ω : S_{selected} \to (0, 1)$) for the j-th selected instance as:
$ω(j) = \frac{x_{(m + d + j)}}{\sum_{i \in S_{selected}} x_{(m + d + i)}}$
15: $k = μ \times |S_{selected}|$
16: Develop GAN and DNN models to generate synthetic data and extract features based on $S_{selected}$, $F_{selected}$, and Equations (9) and (10).
17: $D^{'} =$ Dataset after applying GAN and DNN.
18: Create a KNN predictor using k, $D^{'}$, and $ω$
19: Output: KNN Model
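The selection part of the translate routine (steps 2 to 15) can be sketched as follows. The sketch uses 0-based indexing, applies the same threshold to instances and features as the pseudo-code does, and assumes that k is rounded to an integer and kept at 1 or more, which the pseudo-code does not specify. The GAN and DNN steps are omitted, and the function name is hypothetical.

```python
def translate_selection(position, m, d):
    """Decode instance selection, feature selection and k from a position.

    ASSUMPTIONS: k is rounded to the nearest integer and is at least 1; the
    threshold delta is shared by instances and features.
    """
    delta = position[m + 2 * d]       # component m + 2d + 1 (1-based)
    mu = position[m + 2 * d + 1]      # component m + 2d + 2 (1-based)
    instances = [j for j in range(m) if position[j] > delta]
    features = [j - m for j in range(m, m + d) if position[j] > delta]
    k = max(1, round(mu * len(instances)))
    return instances, features, k

m, d = 4, 2
position = [0.9, 0.1, 0.7, 0.6,   # instance components
            0.8, 0.3,             # feature components
            0.5, 0.5,             # weight coefficients (unused here)
            0.5, 0.67]            # delta and mu
instances, features, k = translate_selection(position, m, d)
print(instances, features, k)
```

A practical implementation would also need to handle the case in which no instance or no feature passes the threshold, for example by assigning such positions a very low fitness.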

R^{2} = 1 - \frac{\sum (y_{i} - x_{i})^{2}}{\sum (x_{i} - \bar{x})^{2}}

(11)

Performance(x) = \frac{1}{5} \sum_{i = 1}^{5} R_{i}^{2} - \sqrt{\frac{1}{5} \sum_{i = 1}^{5} (R_{i}^{2} - \bar{R^{2}})^{2}}

(12)

Equation (12) shows the performance objective function used to optimize models based on the K-fold method. In this equation, $R_{i}^{2}$ denotes the R-squared value for the i-th fold. R-squared (Equation (11)) measures how well a model’s predictions $y_{i}$ match the observed values $x_{i}$, whose mean is $\bar{x}$; a value of 1 indicates a perfect fit, and a value of 0 indicates a fit no better than always predicting the mean. The term $\sqrt{\frac{1}{5} \sum_{i = 1}^{5} (R_{i}^{2} - \bar{R^{2}})^{2}}$ is the standard deviation of the R-squared values across the folds, in which $\bar{R^{2}}$ stands for the mean of the R-squared values across the five folds.
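Equations (11) and (12) translate directly into code. The sketch below is a plain-Python illustration with hypothetical function names; it uses the population standard deviation (division by the number of folds), as written in Equation (12).

```python
import math

def r2_score(observed, predicted):
    """Coefficient of determination as in Equation (11)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def performance(fold_scores):
    """Mean of the fold R^2 values minus their population standard deviation."""
    mean = sum(fold_scores) / len(fold_scores)
    variance = sum((s - mean) ** 2 for s in fold_scores) / len(fold_scores)
    return mean - math.sqrt(variance)

stable = performance([0.9, 0.9, 0.9, 0.9, 0.9])
unstable = performance([0.8, 1.0, 0.9, 0.9, 0.9])
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(stable, unstable)
```

Both score lists have the same mean, but the second is penalized for its spread across folds, which is how the objective favors models that perform consistently.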

In essence, a higher value of this objective function reflects models with improved accuracy and generalizability. To ensure the reliability and robustness of our predictive models, we employed a five-fold cross-validation (CV) technique. This method divides the dataset into five equal subsets, or “folds”. In each iteration, four folds are used to train the model, while the remaining fold is held out for testing. This process is repeated across all folds to obtain a comprehensive assessment of model performance.
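The five-fold procedure described above can be sketched without any external library. The shuffling step, the seed, and the function name are assumptions made for illustration; when the number of samples is not a multiple of five, fold sizes differ by at most one.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test

splits = list(five_fold_indices(23))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so the five resulting $R^{2}$ values together cover the whole dataset.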

5. Experimental Results

This section provides a detailed analysis of the proposed method’s performance across various datasets. It begins by evaluating the method against a KNN algorithm optimized with grid search, presenting comparisons based on key metrics such as $R^{2}$, RMSE, and MAE. Performance is measured in both training and test phases to assess generalizability. Following this, the method is benchmarked against recent state-of-the-art approaches, underscoring improvements in accuracy and robustness. This multi-faceted evaluation highlights the advantages of the proposed method for small-data regression tasks.

5.1. Comparison with Grid Search-Optimized KNN

In order to test our proposed method, several real-world datasets were selected, which are listed in Table 1.

To evaluate the proposed method, the datasets in Table 1 were modeled both with the proposed method and with a simple KNN algorithm tuned by grid search. Table 2 shows the $R^{2}$ scores and RMSE error rates for the test subset in these methods.

In the test phase, the $R^{2}$ scores of all three tasks are higher with the proposed method than with KNN with grid search. In the training phase, all $R^{2}$ scores obtained using the proposed method are lower than those obtained by KNN with grid search. Taken together, these results indicate that the proposed method produces models that generalize better. This is confirmed in the last column of the table, where two of the three tasks have lower RMSE values using the proposed method, while the RMSE for the remaining task is equal to that of KNN with grid search.

In addition, as an example, a comparison of predicted values and expected values on the servo dataset is shown graphically in Figure 1 and Figure 2. In these figures, the red dots are the predicted values in the test phase, and the blue dots are the predicted values in the training phase, which are comparable to the expected values (green line).

When we compare the results from the two models, several important observations come to light. In the grid-optimized KNN model, we observe that nearly all data points in the training phase are touched, covering a wide range of the feature space. In contrast, the proposed method demonstrates a more focused distribution of training points, clustering closely around the expected value line.

However, the most significant distinction arises when we examine the behavior of the models on unseen test data. In the proposed model, none of the test points deviates abnormally from its expected value. Conversely, in the grid search model, a subset of test points exhibits undesirable deviations from the expected values, indicating suboptimal generalization.

5.2. Comparison with the State of the Art

To gain deeper insights, we applied our proposed method to two datasets used in a recent study on small datasets by Izonin et al. [26] and compared the results. This comparison allows for a direct evaluation of our approach’s performance against established benchmarks in similar contexts. Table 3 shows the specifications of the datasets for our comparative analysis.

Table 4 and Table 5, along with Figure 3 and Figure 4, provide a comprehensive analysis of the SBNNR method’s performance compared to previous state-of-the-art methods for small-dataset regression. Table 4 presents the coefficient of determination ($R^{2}$) values for the SBNNR approach in comparison to methods developed by Izonin et al. [26] across two specific datasets: BONE, which involves predicting trabecular bone strength, and FAT, which estimates body fat percentage in women. For the BONE dataset, the proposed SBNNR method achieved a total $R^{2}$ score of 0.9285, which is a notable improvement over the score of 0.77 of the approach proposed by Izonin et al. [26]. This higher $R^{2}$ indicates that the SBNNR framework better captures the variance within the BONE dataset, highlighting its improved accuracy and enhanced ability to generalize to new data points. Similarly, for the FAT dataset, SBNNR achieved a total $R^{2}$ of 0.9835, compared to the value of 0.76 achieved by the method proposed by Izonin et al. [26], showing that SBNNR provides a closer fit to the dataset’s actual distribution, further reinforcing its effectiveness in small-data contexts.

The comparison presented in Table 5 focuses on error metrics—Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE)—to assess prediction reliability across the same datasets. For the BONE dataset, SBNNR demonstrated substantial improvements, with an RMSE of 1.907, an MAE of 1.218, and an MAPE of 0.184, outperforming Izonin et al.’s corresponding values of 3.15, 2.30, and 0.20. These lower error rates suggest that SBNNR reduces prediction errors effectively, making it more robust and accurate for small datasets that often suffer from data irregularities. Similarly, the results on the FAT dataset underscore SBNNR’s advantages, with RMSE, MAE, and MAPE values of 0.635, 0.389, and 0.032, respectively. These metrics are considerably lower than those reported for the method proposed by Izonin et al. [26], underscoring SBNNR’s superior reliability in producing precise predictions, even when handling limited data points.
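For reference, the three error metrics can be computed as in the sketch below. The function names and sample values are illustrative, and MAPE is returned as a fraction rather than a percentage, which is an assumption based on the magnitude of the values reported above.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percentage error as a fraction; observed must be non-zero."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return sum(ratios) / len(ratios)

observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(rmse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

RMSE weights large deviations more heavily than MAE, while MAPE expresses the error relative to the magnitude of each observed value, so the three metrics can rank models differently.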

Figure 3 visually illustrates the alignment between predicted and actual values for the BONE dataset using the SBNNR method. In this figure, red points represent predictions on the test dataset, while blue points show predictions on the training dataset. The green line marks the actual values, providing a benchmark for comparison. Both the red and blue points closely cluster around this line, indicating that SBNNR can effectively capture the dataset’s underlying trends and generalize well across unseen data without overfitting. This close alignment is consistent with the high $R^{2}$ score and low error rates reported in Table 4 and Table 5, suggesting that the model successfully balances accuracy with generalizability.

Similarly, Figure 4 depicts the alignment of predicted versus actual values for the FAT dataset under the SBNNR method. The red and blue points remain closely grouped around the green line, showing that SBNNR consistently predicts values that adhere closely to the true data trends. This clustering of predicted values near the actual value line reflects the method’s robustness, affirming its suitability for small-data regression tasks by effectively preventing excessive variance in predictions. Together with the observed low error rates and high $R^{2}$ score, this alignment indicates that SBNNR can manage data limitations without compromising on accuracy.

Table 6 presents the cross-validation results for the $R^{2}$ values of the BONE and FAT datasets. This table summarizes the performance stability of the small-size bat-optimized nearest neighbor regression (SBNNR) model across multiple cross-validation folds. For the BONE dataset, the mean cross-validation $R^{2}$ score is reported as 0.9694, with a standard deviation of 0.01465, indicating high predictive accuracy and minimal variance across folds, suggesting strong generalizability. The FAT dataset shows an even higher mean cross-validation $R^{2}$ score of 0.9915, with a lower standard deviation of 0.00882, reinforcing the model’s robustness and precision, especially in small-data contexts.

These results highlight the efficacy of SBNNR in capturing underlying trends within limited datasets. The lower standard deviations imply consistent performance across the different cross-validation splits, which is crucial for applications where data availability is restricted and reliability across samples is critical.

5.3. Computational Cost

The computational complexity of the SBNNR (small-size bat-optimized KNN regression) method can be broken down as follows:

K-Nearest Neighbors (KNN): For a dataset with n instances and d features, a naive KNN has a complexity of $O (n \cdot d)$ for each query point. In SBNNR, instance and feature selection reduce n and d, so if $n^{'}$ and $d^{'}$ represent the reduced instances and features after selection, the adjusted KNN complexity per query is $O (n^{'} \cdot d^{'})$ .
Bat Algorithm (BA): The BA runs iterative optimizations, with each iteration evaluating potential solutions. For T iterations and a population size of P, the BA has a complexity of $O (T \cdot P \cdot n^{'} \cdot d^{'})$ , where $n^{'}$ and $d^{'}$ are the reduced dimensions for instance and feature sets.
Generative Adversarial Networks (GANs): GAN training involves both a generator and a discriminator, each with a training complexity of $O (n^{'} \cdot e \cdot g)$ , where e is the number of epochs and g is the number of layers in the network. As the GAN generates synthetic data, it impacts both training time and complexity based on these parameters.
Deep Neural Networks (DNNs): DNN feature extraction requires forward passes and backpropagation for each layer and sample, with complexity expressed as $O (n^{'} \cdot h \cdot l)$ , where h is the number of hidden units per layer and l is the number of layers.

The combined complexity, considering the components in sequence, can be approximated as follows:

O (T \cdot P \cdot n^{'} \cdot d^{'}) + O (n^{'} \cdot e \cdot g) + O (n^{'} \cdot h \cdot l)

(13)

In practice, SBNNR’s BA optimization step dominates the complexity, especially when T and P are large. The approach is designed to maintain feasible computational demands by minimizing $n^{'}$ and $d^{'}$ through instance and feature selection.
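A quick arithmetic check of Equation (13) shows why the BA term tends to dominate. All numbers below are hypothetical values chosen only for illustration, not measurements from the experiments, and the function name is an assumption.

```python
def sbnnr_cost_terms(T, P, n_sel, d_sel, e, g, h, l):
    """Return the three terms of Equation (13) as raw operation counts."""
    return {
        "bat_algorithm": T * P * n_sel * d_sel,
        "gan": n_sel * e * g,
        "dnn": n_sel * h * l,
    }

# Hypothetical setting: 100 iterations, 30 bats, 80 selected instances,
# 6 selected features, 200 GAN epochs, 4 GAN layers, 32 hidden units, 3 layers.
terms = sbnnr_cost_terms(T=100, P=30, n_sel=80, d_sel=6, e=200, g=4, h=32, l=3)
print(terms)
```

In this hypothetical setting the BA term is more than an order of magnitude larger than the other two, and all three terms scale linearly with the number of selected instances.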

Table 7 shows the execution times for the SBNNR method across different datasets, detailing both overall execution and final training times. While computational complexity provides insight into the scalability and demands of the proposed approach, practical execution time is a more relevant metric in real-world scenarios. Execution time captures the effects of hardware, implementation efficiencies, and any preprocessing overheads, which are critical for understanding deployment feasibility.

The SBNNR framework, implemented in Python 3.10, leverages optimized libraries and routines that help manage the inherent complexity of the BA, GANs, and DNNs. Python 3.10’s enhancements, including pattern matching and improved runtime performance, contribute to the efficiency of this implementation. Consequently, the actual execution times observed in Table 7 offer a more practical perspective, helping developers anticipate resource requirements and identify potential bottlenecks.

By balancing theoretical complexity with observed runtime performance in Python 3.10, the framework ensures robust modeling for small datasets while remaining feasible for deployment on standard computational setups.

6. Practical Implications

The practical implications of this study highlight the utility of the SBNNR framework, designed specifically to address the unique challenges posed by small datasets in scientific and industrial domains. The framework, grounded in KNN optimized through the bat algorithm, integrates synthetic data generation via GANs and feature extraction through DNNs. This approach effectively overcomes data scarcity issues and enhances prediction accuracy, making it applicable across various fields where data collection is often limited.

One significant application is in biomedical research and diagnostics. Data constraints in this field are common due to the high costs and time requirements associated with clinical data collection. The SBNNR method offers a way to develop predictive diagnostics tools by making more accurate inferences from small datasets. This capability is particularly valuable for early disease diagnosis, personalized treatment, and improving patient outcomes by supporting clinicians with predictive insights based on limited data. By allowing for reliable predictions in medical settings, the SBNNR framework aids in advancing healthcare innovation and improving the quality of patient care.

In environmental monitoring, logistical challenges often limit data collection, especially in remote or resource-constrained areas. SBNNR provides a means for researchers to make reliable predictions based on environmental factors like pollutant levels, climate patterns, or species populations, even with sparse datasets. By enabling more accurate models under data-scarce conditions, the framework supports decision making in critical areas such as climate action, pollution control, and biodiversity conservation, contributing valuable insights for sustainable environmental management.

The SBNNR framework is also applicable in materials science and chemical engineering, where experimental data can be costly and time-consuming to produce. By using synthetic data augmentation and optimized feature selection, SBNNR enhances the accuracy of predictions related to material properties, reaction outcomes, and other key parameters in these fields. This capability facilitates advances in sustainable material development, pharmaceuticals, and chemical processes by providing accurate modeling with limited experimental data, thereby accelerating innovation and reducing development costs.

In manufacturing and industrial operations, predictive maintenance and quality control are essential for optimizing production efficiency and ensuring product quality. However, data collection in these contexts can be constrained to specific production conditions or machinery parameters. SBNNR’s ability to make accurate predictions on small datasets can help industries forecast equipment failures, optimize maintenance schedules, and maintain high standards in quality control. By improving operational decision making, this framework can contribute to reductions in operational costs, enhancements in resource utilization, and the minimization of downtime in industrial settings.

Overall, this study provides a versatile and robust solution for small-data modeling, a common challenge across diverse fields. By enhancing the generalizability and accuracy of predictive models, SBNNR enables more effective, data-driven decision making, even when data are limited. This adaptability positions SBNNR as a valuable tool for the advancement of scientific research, industrial applications, and environmental sustainability, offering significant potential to impact a wide range of data-scarce domains.

7. Conclusions

This study underscores the significance and effectiveness of the small-size bat-optimized nearest neighbor regression (SBNNR) framework, designed specifically to enhance prediction accuracy in small-dataset environments. Built on KNN optimized by the bat algorithm, the SBNNR approach demonstrates improved model generalizability and robustness, as evidenced by gains in the $R^{2}$ score across multiple real-world datasets. This improvement, observed in diverse fields with data limitations such as medical diagnostics, environmental monitoring, and materials science, affirms the method’s adaptability and high performance. The framework achieves up to a 5% enhancement in $R^{2}$ score when compared to traditional KNN optimized with grid search, highlighting its utility in scenarios where small datasets hinder model training and accuracy.

This study paves the way for future research, particularly in the exploration of ensemble methods to further enhance model robustness. Integrating random forest (RF) into the SBNNR framework presents a promising avenue, especially as synthetic data generation continues to advance. While KNN has shown substantial improvement through bat algorithm optimization, RF’s ability to capture more complex, non-linear relationships may complement KNN by improving predictive accuracy and resilience against small-data limitations. Additionally, leveraging RF in conjunction with feature selection, instance selection, and hyperparameter optimization could further elevate performance, especially in complex domains where small datasets exhibit high variance. Future work may also focus on optimizing the GAN and DNN components to better support data augmentation and feature extraction, further bolstering the SBNNR framework’s utility across fields that rely on small datasets.

Author Contributions

Conceptualization, R.S.; methodology, R.S.; software, R.S.; validation, R.S.; formal analysis, R.S.; investigation, R.S.; resources, J.G. and X.M.-B.; data curation, R.S.; writing—original draft preparation, R.S.; writing—review and editing, J.G., X.M.-B. and J.K.; visualization, R.S.; supervision, J.G. and X.M.-B.; project administration, J.G. and X.M.-B.; funding acquisition, R.S., J.G. and X.M.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Spanish Ministry of Science and Innovation under grant PID2021-124463OB-I00, funded by MCIN/AEI/10.13039/501100011033 and by ERDF “A way of making Europe”, by the European Union’s Horizon Europe under the HE ICOS project, Grant Agreement no. 101070177, and by the Catalan Government under contract 2021 SGR 00326. The corresponding author R.S. gratefully acknowledges the Universitat Politècnica de Catalunya and Banco Santander for the financial support of his predoctoral grant FPI-UPC 2021.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fu, F.; Alagumalai, A.; Mahian, O. Sustainable biodiesel production from waste cooking oil: ANN modeling and environmental factor assessment. Sustain. Energy Technol. Assessments 2021, 46, 101265.
2. He, M.; Zhang, L. Machine learning and symbolic regression investigation on stability of MXene materials. Comput. Mater. Sci. 2021, 196, 110578.
3. Li, D.C.; Lin, L.S.; Chen, C.C.; Yu, W.H. Using virtual samples to improve learning performance for small datasets with multimodal distributions. Soft Comput. 2019, 23, 11883–11900.
4. Li, D.C.; Wen, I.H. A genetic algorithm-based virtual sample generation technique to improve small data set learning. Neurocomputing 2014, 143, 222–230.
5. Sutojo, T.; Syukur, A.; Rustad, S.; Shidik, G.F.; Santoso, H.A.; Purwanto, P.; Muljono, M. Investigating the Impact of Synthetic Data Distribution on the Performance of Regression Models to Overcome Small Dataset Problems. In Proceedings of the 2020 International Seminar on Application for Technology of Information and Communication (iSemantic), IEEE, Semarang, Indonesia, 19–20 September 2020; pp. 125–130.
6. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
7. Huang, C. Information diffusion techniques and small-sample problem. Int. J. Inf. Technol. Decis. Mak. 2002, 1, 229–249.
8. Huang, C.; Moraga, C. A diffusion-neural-network for learning from small samples. Int. J. Approx. Reason. 2004, 35, 137–161.
9. Shaikhina, T.; Khovanova, N.A. Handling limited datasets with neural networks in medical applications: A small-data approach. Artif. Intell. Med. 2017, 75, 51–63.
10. Zhang, Y.; Ling, C. A strategy to apply machine learning to small datasets in materials science. NPJ Comput. Mater. 2018, 4, 1–8.
11. Chapelle, O.; Vapnik, V.; Bengio, Y. Model selection for small sample regression. Mach. Learn. 2002, 48, 9–23.
12. Rodríguez-Fdez, I.; Mucientes, M.; Bugarín, A. An instance selection algorithm for regression and its application in variance reduction. In Proceedings of the 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, Hyderabad, India, 7–10 July 2013; pp. 1–8.
13. Guillén, A.; Herrera, L.J.; Rubio, G.; Pomares, H.; Lendasse, A.; Rojas, I. New method for instance or prototype selection using mutual information in time series prediction. Neurocomputing 2010, 73, 2030–2038.
14. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012.
15. Arnaiz-González, Á.; Blachnik, M.; Kordos, M.; García-Osorio, C. Fusion of instance selection methods in regression tasks. Inf. Fusion 2016, 30, 69–79.
16. Song, Y.; Liang, J.; Lu, J.; Zhao, X. An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 2017, 251, 26–34.
17. Ahn, H.; Kim, K.J. Bankruptcy prediction modeling with hybrid case-based reasoning and genetic algorithms approach. Appl. Soft Comput. 2009, 9, 599–607.
18. Ros, F.; Guillaume, S.; Pintore, M.; Chrétien, J.R. Hybrid genetic algorithm for dual selection. Pattern Anal. Appl. 2008, 11, 179–198.
19. Ho, S.Y.; Liu, C.C.; Liu, S. Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm. Pattern Recognit. Lett. 2002, 23, 1495–1503.
20. Kuncheva, L.I.; Jain, L.C. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognit. Lett. 1999, 20, 1149–1156.
21. Pedrycz, W.; Ahmad, S.S. Evolutionary feature selection via structure retention. Expert Syst. Appl. 2012, 39, 11801–11807.
22. Aydogan, E.K.; Karaoglan, I.; Pardalos, P.M. hGA: Hybrid genetic algorithm in fuzzy rule-based classification systems for high-dimensional problems. Appl. Soft Comput. 2012, 12, 800–806.
23. Rattá, G.; Vega, J.; Murari, A.; Castro, P.; Johnson, M.F.; JET Contributors. Improved feature selection based on genetic algorithms for real time disruption prediction on JET. Fusion Eng. Des. 2012, 87, 1670–1678.
24. Das, N.; Sarkar, R.; Basu, S.; Kundu, M.; Nasipuri, M.; Basu, D.K. A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application. Appl. Soft Comput. 2012, 12, 1592–1606.
25. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional gan. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11.
26. Izonin, I.; Tkachenko, R.; Berezsky, O.; Krak, I.; Kováč, M.; Fedorchuk, M. Improvement of the ANN-Based Prediction Technology for Extremely Small Biomedical Data Analysis. Technologies 2024, 12, 112.
27. Nakamura, R.Y.; Pereira, L.A.; Costa, K.A.; Rodrigues, D.; Papa, J.P.; Yang, X.S. BBA: A binary bat algorithm for feature selection. In Proceedings of the 2012 25th SIBGRAPI Conference on Graphics, Patterns and Images, IEEE, São Paulo, Brazil, 24–27 September 2012; pp. 291–297.
28. Saleem, N.; Zafar, K.; Sabzwari, A.F. Enhanced feature subset selection using Niche based bat algorithm. Computation 2019, 7, 49.
29. Jeong, I.; Kim, Y.; Cho, N.J.; Gil, H.W.; Lee, H. A Novel Method for Medical Predictive Models in Small Data Using Out-of-Distribution Data and Transfer Learning. Mathematics 2024, 12, 237.
30. Conrad, F.; Mälzer, M.; Schwarzenberger, M.; Wiemer, H.; Ihlenfeldt, S. Benchmarking AutoML for regression tasks on small tabular data in materials design. Sci. Rep. 2022, 12, 19350.
31. Dey, N.; Rajinikanth, V. Applications of Bat Algorithm and Its Variants; Springer: Berlin/Heidelberg, Germany, 2021.
32. Yang, X.S. A new metaheuristic bat-inspired algorithm. In Nature Inspired Cooperative Strategies for Optimization (NICSO 2010); Springer: Berlin/Heidelberg, Germany, 2010; pp. 65–74.
33. Gupta, D.; Arora, J.; Agrawal, U.; Khanna, A.; de Albuquerque, V.H.C. Optimized Binary Bat algorithm for classification of white blood cells. Measurement 2019, 143, 180–190.
34. Kramer, O. K-nearest neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–23.
35. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9.
Gouda, S.G.; Hussein, Z.; Luo, S.; Yuan, Q. Model selection for accurate daily global solar radiation prediction in China. J. Clean. Prod. 2019, 221, 132–144. [Google Scholar] [CrossRef]
Zhu, Z.; Liu, Y.; Cong, W.; Zhao, X.; Janaun, J.; Wei, T.; Fang, Z. Soybean biodiesel production using synergistic CaO/Ag nano catalyst: Process optimization, kinetic study, and economic evaluation. Ind. Crop. Prod. 2021, 166, 113479. [Google Scholar] [CrossRef]
Perilli, E.; Baleani, M.; Öhman, C.; Baruffaldi, F.; Viceconti, M. Structural parameters and mechanical strength of cancellous bone in the femoral head in osteoarthritis do not depend on age. Bone 2007, 41, 760–768. [Google Scholar] [CrossRef] [PubMed]
Salodkar, V. Body Fat Percentage of Women Dataset. 2023. Available online: https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset (accessed on 21 October 2024).

Figure 1. Predicted (red points for test data and blue points for train data) compared to actual values (green line) of GridCV optimized KNN on the Servo dataset.

Figure 2. Predicted (red points for test data and blue points for train data) compared to actual values (green line) of the proposed method on the Servo dataset.

Figure 3. Predicted (red points for test data and blue points for train data) compared to actual values (green line) of the proposed method on the BONE dataset.

Figure 4. Predicted (red points for test data and blue points for train data) compared to actual values (green line) of the proposed method on the FAT dataset.

Table 1. Datasets for experimental assessment.

Dataset	Instances	Features
Biodiesel Production (BP) [37]	20	2
Servo	167	4
Yacht Hydrodynamics (YH)	308	7

Table 2. Comparison of KNN with GridCV and the proposed method.

Dataset	Method	Train $R^{2}$	Test $R^{2}$	RMSE
BP	KNN (GridCV)	$0.999$	$0.662$	$1.74 \times 10^{-4}$
	Proposed Method	$0.987$	$0.696$	$1.74 \times 10^{-4}$
Servo	KNN (GridCV)	$1.0$	$0.969$	$2.49 \times 10^{-1}$
	Proposed Method	$0.970$	$0.987$	$1.57 \times 10^{-1}$
YH	KNN (GridCV)	$1.0$	$0.792$	$5.042$
	Proposed Method	$0.947$	$0.812$	$5.014$
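
The test-set gains in Table 2 can be expressed relative to the grid-search baseline. The short sketch below is only a reading aid: it assumes the improvement is measured as a relative change in test $R^{2}$, which the table itself does not state, and the helper name relative_gain is hypothetical. Under that assumption the gains come to roughly 5.1% for BP, 1.9% for Servo, and 2.5% for YH.

```python
# Test R^2 values copied from Table 2: (GridCV-tuned KNN, proposed method).
test_r2 = {
    "BP": (0.662, 0.696),
    "Servo": (0.969, 0.987),
    "YH": (0.792, 0.812),
}


def relative_gain(baseline, proposed):
    """Relative change of the proposed score with respect to the baseline."""
    return (proposed - baseline) / baseline


# Assumption: "improvement" means relative change in test R^2.
gains = {name: relative_gain(b, p) for name, (b, p) in test_r2.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
```

The absolute differences (0.034, 0.018, and 0.020) are the alternative reading; both follow directly from the table values.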

Table 3. Datasets for comparative assessment.

Dataset	Instances	Features	Source
Prediction of trabecular bone strength in severe osteoarthritis (BONE)	35	5	[38]
Prediction of body fat percentage of women (FAT)	24	7	[39]

Table 4. Comparison of state-of-the-art methods with the proposed method ($R^{2}$).

Dataset	Method	Train $R^{2}$	Test $R^{2}$	Total $R^{2}$
BONE	Izonin et al. [26]	-	-	0.77
	Proposed Method	0.9775	0.7681	0.9285
FAT	Izonin et al. [26]	-	-	0.86
	Proposed Method	0.9967	0.8351	0.9835

Table 5. Comparison of state-of-the-art methods with the proposed method (error rates).

Dataset	Method	RMSE	MAE	MAPE
BONE	Izonin et al. [26]	3.15	2.30	0.20
	Proposed Method	1.907	1.218	0.184
FAT	Izonin et al. [26]	1.04	1.00	0.07
	Proposed Method	0.635	0.389	0.032
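
For readers who want to reproduce the kind of figures reported in Tables 4 and 5, the following is a minimal sketch of the standard definitions of $R^{2}$, RMSE, MAE, and MAPE. It assumes the usual textbook formulas and reports MAPE as a fraction, which matches the scale of the values in Table 5; the function name regression_metrics and the toy values are illustrative and do not come from the BONE or FAT datasets.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE, and MAPE (as a fraction) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when any true value is zero.
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Toy values for illustration only (assumed, not taken from the paper).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE squares each error before averaging, it is always at least as large as MAE on the same predictions, which is consistent with every row of Table 5.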

Table 6. Cross-validation results ($R^{2}$).

Dataset	Mean CV $R^{2}$	$R^{2}$ Standard Deviation
BONE	0.9694	0.01465
FAT	0.9915	0.00882
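
The mean and standard deviation in Table 6 summarize per-fold $R^{2}$ scores. The sketch below shows how such a summary could be produced with a plain KNN regressor and k-fold cross-validation. It is not the SBNNR pipeline: the fold layout, the value of k, the number of folds, the use of the population standard deviation, and the toy data are all assumptions made for illustration.

```python
import math
import statistics


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate_knn(X, y, n_folds=5, k=3):
    """Return one R^2 score per fold using interleaved (unshuffled) folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        preds = [knn_predict(train_X, train_y, X[i], k) for i in test_idx]
        scores.append(r2_score([y[i] for i in test_idx], preds))
    return scores


# Toy data (assumed, not from the paper): a noiseless linear target on 30 points.
X = [(float(i),) for i in range(30)]
y = [2.0 * i + 1.0 for i in range(30)]

scores = cross_validate_knn(X, y, n_folds=5, k=2)
mean_r2 = statistics.mean(scores)
std_r2 = statistics.pstdev(scores)  # population std; statistics.stdev gives the sample std
print(f"mean R^2 = {mean_r2:.4f}, std = {std_r2:.4f}")
```

A small standard deviation across folds, as in Table 6, indicates that the score does not depend strongly on which instances end up in the held-out fold.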

Table 7. Execution times for the SBNNR method across various datasets on a Core-i7 Intel CPU, detailing overall and final training durations.

Dataset	Overall SBNNR Execution (Seconds)	Final Training Time (Seconds)
BONE	5195	8
FAT	5608	8
BP	4340	8
Servo	7764	14
YH	10,525	23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

