1. Introduction
The worldwide automotive aftermarket industry was estimated to be worth 342 million dollars in 2022. According to projections, the market encompassing light, medium, and heavy-duty vehicles is expected to grow significantly, with a compound annual growth rate of 5.5%. This expansion signals a promising future for the automotive aftermarket industry. Currently, vehicle part suppliers contribute significantly to this industry, manufacturing 77% of the value found in contemporary automobiles. Looking ahead to 2030, a transformative shift is anticipated, as an estimated 95% of newly sold vehicles worldwide are expected to be connected. This projection underscores the increasing integration of technology and connectivity within the automotive sector [
1].
In the automotive aftermarket industry, visual inspections and sensor data have traditionally played a significant part in our diagnostic capacity to assess, for example, the condition of internal combustion engines [
2]. However, as modern engines become more complex, more advanced techniques are needed to diagnose these signals [
3]. Engineers and researchers can use the diverse noises made by engines at various stages of operation as an audible fingerprint to detect problems such as ignition faults [
4], irregular valve timing, and even approaching component failures [
5]. Using audio categorization adds a new level of diagnostic capacity and provides a real-time, non-intrusive method to check engine health [
6].
Although automotive engine ignition systems are built differently, they always follow the same basic principles of operation. Every system has a primary circuit that causes the secondary circuit to spark. The next step is to transfer this spark to the appropriate spark plug at the exact moment. Conditions in the cylinder and ignition system impact the secondary circuit’s ignition pattern, also known as the scope pattern. A scope meter, often called an automotive oscilloscope, is a useful tool for diagnosing engine and ignition problems. It does this by displaying scope patterns that make it possible to conduct a thorough examination of the ignition system’s operation [
7].
Regarding the detection of faults in internal combustion engines in automobiles, it should be noted that the process of capturing these ignition patterns often involves manual intervention by mechanics [
8]. Subsequently, the captured pattern is compared with reference samples from handbooks to facilitate diagnosis. However, this diagnostic process heavily relies on domain knowledge and user experience since handbook samples serve as references only. The challenge arises from the fact that ignition patterns are dynamic and non-stationary time series [
9], varying in amplitude and duration across different engine models experiencing the same ignition system trouble. Even within the same engine, different patterns may arise under distinct operating conditions, confounding diagnosis. Additionally, many engine faults manifest similar ignition patterns, further hindering the accurate identification of issues [
10].
Mechanics expend significant time and effort due to the inherent inaccuracies in human diagnosis and the need for multiple trials involving the disassembly and reassembly of engine parts. To address this, our proposed solution is an ML-based pattern classification system aimed at assisting automotive mechanics. In the context of internal combustion engines, ML models can analyze and interpret the acoustic signals produced during engine operation for audio classification. Such algorithms can be designed to distinguish between multiple engine states, detecting anomalies, probable failures, and normal operation [
11].
To extract meaningful information from the complex auditory landscape of engines, researchers employ methodologies such as spectrum analysis, frequency domain approaches, and deep learning models. Combining acoustic data with artificial intelligence not only enhances the understanding of engine behavior but also supports more informed decisions in terms of maintenance, efficiency optimization, and overall performance. While audio classification in internal combustion engines offers many potential benefits, the field also faces challenges. The inherently noisy environment of an operating engine poses difficulties in isolating specific sounds and patterns [
12].
Variations in engine designs, fuel types, and operating conditions add complexity to the task of developing robust and generalized classification models. Overcoming these challenges requires a multidisciplinary approach, combining acoustics, signal processing, and ML to generate efficient solutions. Audio categorization of internal combustion engines thus offers a significant advancement in the search for a better understanding of vehicle systems. Based on these aspects, the use of acoustics and artificial intelligence has the potential to revolutionize the future of vehicle maintenance [
13].
Compression ignition engines are known to be both reliable and efficient. However, like any complex device, engines are subject to various failures that can affect their performance and durability. For example, mechanical failures can include problems with the piston, connecting rod, crankshaft, valves, and piston rings, among other internal components. These faults can result in abnormal vibrations, metallic noises, or engine knocking. Another example is injector problems, such as clogging, leakage, or malfunction, which can lead to inadequate performance, including lack of power and altered emissions [
14].
Selecting the most appropriate classifier can be challenging, as each model can perform better depending on its application [
15]. To obtain the best possible result, several state-of-the-art classifiers are considered in this paper. The final classifier is selected based on the best performance achieved with optimized hyperparameters, ensuring that an adequate model is obtained for the proposed task.
Klaar et al. [
16] integrate ROCKET with ML classifiers and empirical mode decomposition techniques such as Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), Empirical Wavelet Transform (EWT), and Variational Mode Decomposition (VMD). Their results demonstrate the accuracy improvement achieved by combining these methods, with accuracies of 0.992 using CEEMDAN, 0.995 using EWT, and 0.980 using VMD, highlighting the enhanced potential for insulator failure detection in power systems.
Analyzing the audio signature of these motor failures can help correctly diagnose which component is at fault. In this paper, audio data were collected during normal operation and under controlled failure conditions of compression ignition engines of different vehicle models. An analysis of the audio signals is performed to identify the features that can be used to classify standard operations and specific faults based on ML approaches. The main contributions of this paper are summarized as follows:
Experiments were performed on three different vehicle models, and original data were collected from audio signals containing normal and engine failure situations.
A hybrid approach was proposed, combining the wavelet packet transform, Markov blanket feature selection, the RandOm Convolutional KErnel Transform (ROCKET), and the tree-structured Parzen estimator for hyperparameter tuning with ten machine learning classifiers.
Wavelet packet transform was used to extract features from the audio signals, providing a detailed analysis of the frequencies associated with mechanical failures.
Audio signals were classified by comparing the performance of ten machine learning models using the tree-structured Parzen estimator to explore the hyperparameter space. Hold-out and k-fold cross-validation strategies were applied.
The proposed hybrid approach evaluates multiple classifiers, highlighting that three models perform well. These models demonstrate balanced performance, making them suitable for the engine fault diagnosis task evaluated here.
Results indicate that RF, GBM, and LightGBM are promising alternatives for diagnosing engine faults from acoustic signals.
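As a simplified illustration of the wavelet packet transform used in the contributions above, the sketch below performs a two-level Haar wavelet packet decomposition in pure Python and uses the energy of each terminal node as a feature. The wavelet family, decomposition depth, and feature definition are assumptions for illustration; the actual configuration used in this work may differ.

```python
def haar_step(signal):
    """One Haar analysis step: split a signal into approximation
    (low-pass) and detail (high-pass) halves."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_packet_features(signal, levels=2):
    """Wavelet *packet* tree: unlike the plain DWT, both the approximation
    and the detail branches are decomposed further at each level.
    Returns the energy of each terminal node as one feature."""
    nodes = [signal]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_step(node)]
    return [sum(x * x for x in node) for node in nodes]

# Toy audio frame: a low-frequency ramp plus an alternating "high" component.
frame = [i % 2 + i / 8 for i in range(16)]
features = wavelet_packet_features(frame, levels=2)
print(len(features))  # 2 levels -> 4 terminal nodes -> 4 features
```

Each terminal-node energy summarizes how much of the signal's power falls in one frequency sub-band, which is the kind of frequency-localized information associated with mechanical failures.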
The remainder of this paper is organized as follows:
Section 2 briefly presents the literature on engine fault diagnosis based on audio signal processing.
Section 3 introduces the audio data source.
Section 4 explains the proposed method, considering the feature selection method, preprocessing, hypertuning, optimization methods, and the considered classifiers. Numerical experiments applying different setups on the proposed method are presented and discussed in
Section 5. Lastly,
Section 6 summarizes this paper and outlines future work.
2. Related Works
Artificial intelligence classification algorithms may be applied to engine diagnosis, using machine and deep learning models to categorize and identify potential issues or conditions within an internal combustion engine based on input data. It is important to note that the success of engine diagnosis classification relies on the quality and representativeness of the training data, the choice of appropriate features, and the selection of an effective ML model. Additionally, domain knowledge and expertise play a crucial role in interpreting the results and refining the model for accurate and reliable engine fault diagnosis [
17].
Usually, standard classification models face the following drawbacks: (i) weak interpretability; (ii) underfitting behavior; (iii) unbalanced datasets; (iv) susceptibility to outliers; and (v) the need for hyperparameter tuning. Comprehending these limitations is imperative for specialists to make informed decisions regarding the selection and execution of classification models, considering the distinctive features and difficulties of their datasets and applications. Studies in the automotive literature on audio classification differ in terms of the sensor types used to collect data [
18], how features are extracted, and how classification approaches are used. In general, many frequency-domain approaches use image processing techniques for classification, while time-domain approaches can be combined with ML, deep learning, and hybrid models.
In recent years, researchers have actively adopted a variety of machine learning techniques for diagnosing engine faults. Cruz-Peragon et al. [
19] utilized instantaneous engine speed and an artificial neural network (ANN) to identify the faulty cylinder in cases of misfire or abnormal combustion. The results demonstrated accuracy in obtaining engine characteristics such as the cylinder pressure curve, fuel consumption, and ignition time, allowing for the successful isolation of the faulty cylinder. In a diverse traffic environment characteristic of India, George et al. [
20] employed an ANN and
k-NN to detect and categorize about 160 cars belonging to three categories—heavy, medium, and light—from an audio signal. Mel-Frequency Cepstral Coefficients (MFCCs) were extracted for the detection of regions around peaks. The average accuracy obtained was 73.42%. Alexandre et al. [
21] proposed an application capable of classifying vehicles based on the sound they produce, in order to keep traffic noise within levels tolerable for human health and to improve intelligent transportation systems for short-term traffic flow forecasting. The classifier, based on an extreme learning machine and a genetic algorithm, performs with an average classification probability of 93.74% when using the optimal subset of selected features.
Kerekes et al. [
22] explored a multimodal detection technique using 14 characteristics, sensor data with electromagnetic emanations, and audio signatures for vehicle classification and identification. A supervised kernel regression method was used to classify and identify vehicles without the need for invasive images, with an average accuracy of 86% to 94%. Using frequency spectrum data from acoustic signals in wireless acoustic sensor networks, Ntalampiras [
23] developed an acoustic classifier based on the echo state network for moving automobiles, achieving an average accuracy of 96.3%. To collect fault data without inducing failure in a real engine, Rubio et al. [
24] constructed a diesel engine failure simulator based on a one-dimensional thermodynamic model, replicating engine behavior under failure conditions, with most of the simulated responses aligning closely with the experience database.
The behavior of drivers near control points was observed by Kubera et al. [
25] to check whether driving is safe both when approaching the radar and after passing it. The data were classified into three classes: acceleration, deceleration, or constant speed. SVM, RF, and ANNs were used as classifiers through time-series-based approaches. An ensemble classifier was proposed and achieved an accuracy of almost 95%.
Yiwere and Rhee [
26] investigated sound source distance estimation using a convolutional recurrent neural network, treating it as an image classification task. After converting the audio signals into MFCC format, the classification model achieved an accuracy of 88.23%. Tran and Tsai [
27] developed an automatic detection system to recognize emergency vehicles by their sirens, alerting other drivers to pay attention. Using WaveNet and MLNet, a convolutional neural network (CNN)-based ensemble model was built to identify sounds from a combined MFCC and log-mel spectrogram feature, achieving 98.24% classification accuracy. An innovative method for defect diagnostics of automobile power seats using a smartphone was proposed by Huang et al. [
28]. The
k-NN and SVM models were utilized to identify faults with superior performance. Recently, a 1D-CNN model with 98.37% accuracy was presented by Parineh et al. [
29] for emergency vehicle identification.
Zhao et al. [
30] introduced a novel diesel engine fault diagnosis method for multiple operating conditions. The researchers enhanced the condition adaptability of fault diagnosis by incorporating the Mel frequency transform and adaptive correlation threshold processing into vibrational mode decomposition and MFCC frameworks. Subsequently, they employed the
k-NN for classification. Cai et al. [
31] integrated a rule-based algorithm with Bayesian networks/back propagation neural networks for diagnosing faults across a broad spectrum of rotation speeds, utilizing training data derived from fixed speeds. Furthermore, they employed a novel data-driven diagnostic method based on Bayesian networks for diagnosing permanent magnet synchronous motor issues [
32], demonstrating that the proposed methods exhibited effective diagnostic performance for early faults.
Stoumpos and Theotokatos [
33] introduced a methodology that combined thermodynamic, functional control, and neural networks data-driven models for engine health management. The proposed method demonstrated the ability to capture engine sensor anomalies and make corrections. Kong et al. [
34] conducted fault diagnosis studies on closed-loop control systems using dynamic Bayesian networks. Analyzing the studies presented previously, it is possible to highlight that ANNs have made considerable progress in the recent decade when applied to vehicle audio analysis. However, this literature review identifies research gaps that should be addressed. Common classification algorithms such as RF,
k-NN, SVM, and neural networks are viable; however, each has advantages and disadvantages.
While ANNs may achieve high accuracy, they suffer from prolonged training times. Furthermore, the learning process of a neural network usually remains unobservable, and the resulting outputs are challenging to interpret, adversely affecting the credibility and acceptability of the results. The
k-NN, categorized as lazy learning, demands extensive computation and lacks interpretability for classification results. The SVM, though needing fewer samples, primarily applies to two-class classification. However, diesel engine fault diagnosis requires multi-class classification, typically necessitating the combination of multiple two-class SVMs. In contrast, the RF can elucidate the classification process, support the theoretical analysis of diesel engine faults, and satisfy the requirements of multi-class classification for diesel engine faults. There is a trade-off between the complexity of the models and the computational processing cost [
35]. Consequently, this study employs ten ML models for classifying the working state of the engine. Using an optimized hybrid method, the best combinations of models are employed.
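The point above, that multi-class diagnosis typically requires combining multiple two-class classifiers, can be illustrated with a generic one-vs-rest wrapper. The binary learner below is a deliberately trivial nearest-centroid rule rather than an SVM, so the sketch shows only the combination scheme, not the classifiers used in this work.

```python
def train_binary(samples, labels):
    """Trivial two-class learner: store the mean feature vector
    ("centroid") of the positive and the negative samples."""
    def centroid(cls):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return {"pos": centroid(1), "neg": centroid(0)}

def binary_score(model, x):
    """Higher score = closer to the positive centroid than the negative one."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return dist(model["neg"]) - dist(model["pos"])

def one_vs_rest(samples, labels, classes):
    """Combine two-class learners into a multi-class classifier:
    one binary model per class; predicted class = highest score."""
    models = {c: train_binary(samples, [1 if y == c else 0 for y in labels])
              for c in classes}
    return lambda x: max(classes, key=lambda c: binary_score(models[c], x))

# Three well-separated toy classes in 2-D feature space.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = [1, 1, 2, 2, 3, 3]
predict = one_vs_rest(X, y, classes=[1, 2, 3])
print(predict([0.2, 0.5]))  # -> 1
```

Replacing `train_binary`/`binary_score` with a two-class SVM yields exactly the one-vs-rest construction described above for diesel engine fault classes.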
The use of a suitable sensor can improve the quality of a measurement, thus improving the model’s performance results. A lot of technology has been presented to obtain the best possible measurement based on the latest technology sensors. According to Zhao et al. [
36], the quality of data plays a crucial role in the accurate classification of faults in systems that rely on sound data. High-quality sensors capture a broader and more precise range of frequencies, enabling the detection of subtle anomalies or variations in sound patterns that may signal mechanical or operational faults.
In contrast, low-quality sensors may introduce noise, distort important sound frequencies, or fail to capture certain acoustic signatures, leading to inaccurate or incomplete data [
37]. This can result in the misclassification of faults or missed detections, reducing the overall effectiveness of fault diagnosis systems. High-resolution sensors, therefore, provide more reliable data for advanced analysis approaches, such as ML models, improving the accuracy and robustness of fault classification.
Regarding fault classification in combustion motors, in [
38], the analysis utilizes time-frequency signal processing techniques like the fast Fourier transform and short-time Fourier transform, combined with ML algorithms, to detect and classify faults such as scuffing in engine components. The hybrid approach enhances diagnostic accuracy and efficiency, offering robust solutions for real-time engine fault identification, as demonstrated by high-performance results.
An overview of the integration of ML models in diagnosing faults in mechanical systems is presented in [
39]. This review highlights the advancements in ML, particularly focusing on how deep learning and transfer learning enhance fault detection. The authors identify the challenges posed by imbalanced data in industrial systems and propose solutions to improve diagnostic performance. The selection of the appropriate model for fault identification can be a difficult task, and several approaches have been proposed to improve the classification of faults in engines and monitor their condition. Kefalas et al. [
40] proposed the use of GBM combined with the discrete wavelet transform. By using extreme gradient boosting, they obtained promising results.
When analyzing the methods that make up a hybrid method, no individual method can be singled out as the best; rather, it is the combination of methods that makes the performance of the proposed hybrid approach superior to that of the individual methods. Thus, the suggested hybrid approach combines hyperparameter tuning, feature selection (Markov blanket), data processing using WPT, and ROCKET with ten machine learning techniques for audio classification. This combination ensures that the performance of the classifiers is maximized while allowing the model to capture the audio information accurately. The main advantage of the hybrid method is therefore the combination of complementary techniques, enabling both adequate feature extraction and precise model tuning, which leads to better classification performance.
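The hyperparameter tuning stage of such a hybrid method can be pictured as a propose-evaluate-keep-best loop. For brevity, the sketch below uses plain random search over a toy objective with hypothetical hyperparameters (a tree depth and a learning rate) instead of the TPE sampler used in this work; TPE changes how candidates are proposed, not the shape of the loop.

```python
import random

def evaluate(params):
    """Toy stand-in for a cross-validated score: by construction it peaks
    at depth = 6 and learning rate = 0.1 (a hypothetical optimum)."""
    return -((params["depth"] - 6) ** 2) - 100 * (params["lr"] - 0.1) ** 2

def random_search(n_trials, seed=0):
    """Propose hyperparameters, evaluate, keep the best trial.
    A TPE sampler would bias proposals toward promising regions,
    but the surrounding loop is identical."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {"depth": rng.randint(2, 12), "lr": rng.uniform(0.01, 0.5)}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

params, score = random_search(300)
print(params["depth"], round(params["lr"], 3))
```

With 300 trials (the budget reported later for the Optuna study), the loop reliably lands near the toy optimum.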
3. Description of the Multi-Classification Task
The raw audio signal was collected experimentally from three vehicle models, with two vehicles selected from each model, all with compression ignition engines. Data on the sound of vehicle engines were collected across 50 specialized mechanical workshops, yielding 100 samples. Considering that this study seeks to enable users of vehicles with possible engine failures to reproduce the experiments, we decided to use a smartphone for data collection; the device had a directional microphone located in its lower part, with a frequency response ranging from 20 Hz to 20 kHz, for recording audio. Audio signals were collected from the vehicles at two different engine speeds: idle (800–1000 rpm) and full load (2500–3000 rpm). The rotations chosen to capture engine audio were influenced by three factors:
(i) The first factor is the ease with which a user can capture the vehicle’s audio, leaving the vehicle in a safe condition without needing to be inside it;
(ii) The second factor relates to one of the impediments caused by injector failure: when the vehicle’s electronic control unit detects a loss of engine power, it limits the engine speed to ensure that there is no damage;
(iii) Finally, both rotation ranges were chosen to evaluate and understand the performance of the vehicle types.
The audio data collected contain ambient noise, which makes classification more challenging. To address this issue, filters were used to reduce the non-linearities in the time series, as explained in the next section. For the selected vehicles, three different conditions were considered: the intake hose disconnected (class 1), one of the cylinders with its injector turned off (class 2), and no anomalies (class 3), as illustrated in
Figure 1.
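A minimal example of the kind of smoothing filter mentioned above is a moving average; the actual filter design used in this work may differ, so this is only illustrative of how short noise spikes are attenuated.

```python
def moving_average(signal, window=3):
    """Smooth a 1-D signal by averaging each sample with its neighbors,
    attenuating short spikes such as transient ambient noise.
    Edges use a truncated window."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

noisy = [0.0, 0.0, 9.0, 0.0, 0.0]       # a single spike
print(moving_average(noisy))             # spike spread out and attenuated
```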
Figure 2 presents the audio collection positions in the vehicles. Recordings were obtained in an external environment at three different positions: two external, in front of the vehicle and next to the driver, and one internal. Specifically, position (1) is at the driver’s side (left or right) of the vehicle, position (2) is in front of the vehicle, and position (3) is inside the vehicle, at the front passenger seat. The recordings were made at a sampling rate of 22.05 kHz in the M4A format; since M4A is a compressed format, these files were converted to the WAV format for processing.
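Once converted to WAV, recordings can be inspected with Python’s standard-library `wave` module. The snippet below writes a short synthetic mono tone in memory at the same 22.05 kHz sampling rate and reads back its rate and duration; the sine tone is, of course, only a stand-in for an engine recording.

```python
import io
import math
import struct
import wave

RATE = 22050  # Hz, matching the 22.05 kHz recordings described above

# Write one second of a 440 Hz sine tone as a 16-bit mono WAV, in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(RATE)
    for i in range(RATE):
        sample = int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / RATE))
        w.writeframes(struct.pack("<h", sample))

# Read it back, as one would read a converted engine recording.
buf.seek(0)
with wave.open(buf, "rb") as w:
    rate = w.getframerate()
    duration = w.getnframes() / rate
print(rate, duration)  # 22050 1.0
```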
Three different vehicle models were selected for data collection: two vans and a truck. To balance the classes, 550 s of each audio signal was segmented, and the dataset was normalized.
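The class-balancing step described above (fixed-length segmentation followed by normalization) can be sketched as follows; the segment length and the min-max scheme below are assumptions for illustration, since the exact parameters are not spelled out here.

```python
def segment(signal, seg_len):
    """Cut a long recording into equal, non-overlapping segments,
    dropping the incomplete tail."""
    return [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, seg_len)]

def min_max_normalize(signal):
    """Rescale samples to [0, 1]; one common normalization choice."""
    lo, hi = min(signal), max(signal)
    return [(x - lo) / (hi - lo) for x in signal]

# Toy "recording" of 100 samples, segmented and normalized.
recording = [float(i % 7) for i in range(100)]
segments = [min_max_normalize(s) for s in segment(recording, seg_len=22)]
print(len(segments), len(segments[0]))  # 4 22
```

Taking the same total duration from each class's recordings, as done with the 550 s segments, keeps the classes balanced before training.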
Figure 3 shows a one-second excerpt of the original audio signal for the normal condition, injector off, and air intake hose off, respectively; the full recordings span a total of 20 min.
The sound signal is represented in the temporal domain by one-dimensional data vectors (X), where n is the total number of data samples in the dataset. Each data sample consists of a vector x_i with length l, accompanied by a label y_i ∈ Y denoting its class, and the dataset can be represented as

D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)},

where predicting the class label Y from the input vector X is the main goal of the proposed classification models.
The considered method is designed to be applied using a smartphone and could, in the future, be embedded in an application that evaluates the engine condition based on the audio it generates. The performance of this approach depends on the quality of the device, since low-quality microphones would yield poorer performance than the high-quality microphones available on current smartphones. Microphone quality can introduce variability into the acoustic features extracted from audio recordings: microphones pick up different ranges of sound frequencies with different levels of accuracy, their sensitivity determines how effectively they capture sound pressure levels, and they differ in inherent noise generation, including self-noise and susceptibility to environmental noise. For this reason, smartphone microphones were used, and measurements were made in a controlled environment and in triplicate to extract clear and reliable features. The influence of microphone variability on classifier performance can be assessed using performance metrics (accuracy, precision, and recall), making it possible to identify whether certain microphone types lead to a substantial degradation in classifier performance.
5. Results and Discussion
The experiments were computed using Google Colab with a 2.30 GHz central processing unit with 2 cores and 12 GB of random access memory. The analysis was performed using 32, 28, and 40 of the 100 experimentally obtained samples for classes 1, 2, and 3, respectively.
In terms of the audio signal transformations, the WPT generates 576 features. Afterwards, feature selection based on the Markov blanket approach retains 56 predictive features. The selected features are fed into ROCKET, which utilizes 3000 kernels, balancing the trade-off between classification accuracy and processing efficiency. The output signal produced by ROCKET has dimension 2940 and was fed into the ten ML classifiers under evaluation.
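ROCKET can be illustrated in miniature: random one-dimensional kernels are convolved with the input, and each convolution is summarized by two statistics, the maximum and the proportion of positive values (PPV), so k kernels yield 2k features. The pure-Python sketch below uses far fewer kernels than the 3000 reported above and omits refinements such as dilation and padding.

```python
import random

def rocket_transform(signal, n_kernels=4, seed=0):
    """Miniature ROCKET: random kernels -> 1-D convolution -> (max, PPV)."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_kernels):
        length = rng.choice([3, 5, 7])                       # random kernel length
        weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
        bias = rng.uniform(-1.0, 1.0)
        conv = [
            sum(w * signal[i + j] for j, w in enumerate(weights)) + bias
            for i in range(len(signal) - length + 1)
        ]
        features.append(max(conv))                            # max pooling
        features.append(sum(c > 0 for c in conv) / len(conv)) # PPV in [0, 1]
    return features

signal = [((i * 37) % 11) / 10 for i in range(64)]  # toy pseudo-random series
feats = rocket_transform(signal, n_kernels=4)
print(len(feats))  # 4 kernels -> 8 features
```

The resulting feature vector is what a linear or tree-based classifier then consumes, which is why ROCKET pairs naturally with the ten ML models evaluated here.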
Table 1 presents the classification results for various classifiers used in the ROCKET structure with a dataset split of 90% for training and 10% for testing. In terms of accuracy, ROCKET, using the ridge model, presented a training accuracy of 100%, while the testing accuracy was 80%. This suggests a potential overfitting issue. QDA obtained both training and test accuracies that were relatively low, indicating that the model may not be capturing the underlying patterns in the data well. Other classification models including NB,
k-NN, SVM, MLP, RF, ET, GBM, and LightGBM performed with high training and test accuracies, suggesting solid overall model performance.
Regarding recall, the ridge had high values for both training and test sets, indicating an effective ability to capture true positive instances. QDA presented low values, especially on the testing set. Possibly, the model might have been missing a significant number of positive instances. The other evaluated models exhibited generally high recall values, showing good performance in capturing true positive instances. Ridge demonstrated high precision on both training and test sets, signifying a minimal occurrence of false positives. QDA exhibited low precision values, particularly on the test set, implying a higher incidence of false positives. Other evaluated classification models showcased consistently high precision values, indicating a low frequency of false positives across these models.
Observing the F1-score, the ridge model achieved promising values, maintaining a harmonious balance between precision and recall. QDA registered suboptimal F1-score values, signaling inadequate overall performance in terms of precision and recall. The other models consistently demonstrated robust F1-score values, indicative of an equilibrium between precision and recall. The time column indicates the training time of each model. MLP has the highest training time, followed by LightGBM and RF; the other evaluated models have relatively lower training times. The fastest model was NB, requiring 27 s to be trained.
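For reference, the metrics discussed here follow directly from per-class confusion counts; a minimal macro-averaged computation (each metric computed per class and then averaged over classes, each class weighted equally) is:

```python
def macro_metrics(y_true, y_pred, classes):
    """Macro-averaged precision, recall, and F1 from label lists."""
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(classes)
    return tuple(sum(m[i] for m in per_class) / n for i in range(3))

# Toy 3-class example: one class-2 sample misclassified as class 3.
p, r, f1 = macro_metrics([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3], classes=[1, 2, 3])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.889 0.833 0.822
```

The worked example makes the asymmetry visible: the missed class-2 sample lowers recall for class 2 and precision for class 3 simultaneously.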
The three best-performing models (ridge, GBM, and LightGBM) perform well on the training dataset, achieving high accuracy and minimal misclassifications. On the testing set, these models maintain good accuracy, but some misclassifications occur. The ridge model has a few misclassifications in Class 1 during testing. The GBM and LightGBM models have no misclassifications in testing but show a small difference in predictions for Class 3. The LightGBM model performs slightly better on the testing dataset than the other models.
The search space used in hyperparameter tuning in all experiments of hold-out and
k-fold cross-validation strategies via the Optuna framework using TPE with 300 trials is presented in
Table 2. The best tuning of the hyperparameters using the Optuna TPE optimizer in combination with ROCKET for 90% training and 10% test splitting is presented in
Table 3. The ridge, QDA, and NB models have relatively simple configurations, with a single hyperparameter being tuned. There was no great difference between the cross-validation and hold-out strategies in the classification assessment. In the data splitting variations (90/10 and 70/30), cross-validation was not considered.
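The two validation strategies compared here differ only in how sample indices are partitioned; a minimal stdlib sketch of both:

```python
def hold_out_split(n, test_fraction=0.1):
    """Single split: the first (1 - f) share of indices train, the rest test.
    (In practice indices would be shuffled first.)"""
    cut = round(n * (1 - test_fraction))
    return list(range(cut)), list(range(cut, n))

def k_fold_splits(n, k=5):
    """k-fold cross-validation: each fold serves as the test set
    exactly once while the remaining folds train."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

train, test = hold_out_split(100, test_fraction=0.1)
print(len(train), len(test))  # 90 10

for train, test in k_fold_splits(100, k=5):
    assert len(train) == 80 and len(test) == 20
```

Hold-out gives one estimate cheaply; k-fold averages k estimates, trading computation for lower variance, which is the trade-off weighed in the experiments above.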
The k-NN had two hyperparameters tuned, suggesting a moderate level of complexity in controlling the number of neighbors and the leaf size of the tree. SVM shows a higher level of complexity with multiple hyperparameters tuned, indicating flexibility in kernel selection and regularization. MLP has three hyperparameters tuned, reflecting the architecture of an ANN with specified units, activation function, and regularization strength. RF, ET, GBM, and LightGBM had a considerable level of complexity, with numerous hyperparameters tuned.
Classification results from multiple classifiers in the ROCKET framework are presented in
Table 4, where the dataset is split 80% for training and 20% for testing. An analysis of the key performance metrics for each model showed that the ridge had high training accuracy but lower test accuracy (0.60), indicating potential overfitting; the same pattern was observed for recall and precision. QDA obtained moderate accuracy for both training and test sets, and low precision for both sets, suggesting a higher rate of false positives. The F1-score was relatively low for QDA as well. NB exhibited moderate accuracy, with better performance on the training dataset, and achieved balanced recall and precision for both sets.
Higher accuracy on the test set than on the training set was achieved by the k-NN classifier. Both sets exhibit high recall, but the test set provided higher precision. The SVM presented excellent training accuracy but lower test accuracy, indicating possible overfitting, with high precision and recall for both sets. MLP appeared to exhibit overfitting behavior, evidenced by its high training accuracy but inadequate test accuracy; nevertheless, MLP showed consistent F1-score, recall, and precision values for both sets.
For the tree-structured models, RF performed well on both datasets and revealed excellent classification results with good precision, recall, and F1-score. On both the training and test sets, ET achieved moderate accuracy with good precision, recall, and F1-score. The GBM produced slightly lower test accuracy but high training accuracy; additionally, for both sets, GBM exhibited outstanding recall, precision, and F1-score. On both sets, LightGBM achieved good accuracy along with consistent precision, recall, and F1-score. In terms of processing time, models such as SVM and MLP required longer training times than the other classification models.
Like ridge, the GBM model correctly predicts one instance of Class 2 on the training data but misclassifies other instances related to Classes 1 and 3. All three models adapt well to the testing data, with only a few misclassifications. Compared to ridge and GBM, the LightGBM model exhibited slightly more misclassifications.
Table 5 presents the best hyperparameter tuning for the 80% training and 20% test split using the Optuna TPE together with the ROCKET approach and the ML classifiers. A model’s complexity needs to be taken into account when evaluating computing resources, training time, and overfitting risk.
Table 6 presents classification results from multiple classifiers in the ROCKET framework with a split of 70% for training and 30% for testing. The ridge model achieved 100% accuracy on the training set but dropped significantly to 0.57 on the test set, suggesting overfitting of the training data. QDA produced comparable results on the training and test sets, suggesting a well-balanced model, albeit with poor overall performance. Naive Bayes' accuracy decreased on the test set. k-NN achieved perfect accuracy on the training set; however, its test accuracy dropped to 0.57. The SVM obtained 100% training accuracy but fell significantly on the test set. RF and ET showed balanced accuracy on both sets. GBM achieved perfect accuracy on the training set but dropped slightly on the test set; in general, GBM presented high accuracy at the cost of longer training times. With hyperparameter tuning, LightGBM delivered comparable performance.
The ridge model scored 100% accuracy, precision, and recall on the training and testing datasets, suggesting outstanding results. GBM also scored well, reaching perfect accuracy along with high precision and recall on both datasets. LightGBM performed well overall, displaying high accuracy, precision, and recall, particularly on the testing dataset.
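The ROCKET pipeline underlying these experiments pairs random convolutional kernels with a linear classifier. The sketch below is a deliberately simplified ROCKET-style transform (random kernels, with PPV and max pooling per kernel, as in the original method) on hypothetical synthetic signals standing in for ignition audio; it is not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import train_test_split

def make_kernels(n_kernels=100, seed=0):
    # Random kernels of random length, as in ROCKET (simplified: no
    # dilation, bias, or padding).
    rng = np.random.default_rng(seed)
    return [rng.standard_normal(rng.choice([7, 9, 11]))
            for _ in range(n_kernels)]

def transform(X, kernels):
    # Two pooled statistics per kernel: proportion of positive values
    # (PPV) and the maximum, giving 2 * n_kernels features per signal.
    feats = []
    for x in X:
        row = []
        for k in kernels:
            conv = np.convolve(x, k, mode="valid")
            row.extend([(conv > 0).mean(), conv.max()])
        feats.append(row)
    return np.asarray(feats)

# Hypothetical signals: three classes of noisy sinusoids, 40 each.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 128)
X = np.stack([np.sin(2 * np.pi * (c + 1) * t) + 0.1 * rng.standard_normal(128)
              for c in range(3) for _ in range(40)])
y = np.repeat(np.arange(3), 40)

kernels = make_kernels()
F = transform(X, kernels)
X_tr, X_te, y_tr, y_te = train_test_split(F, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = RidgeClassifierCV().fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
```

In practice, a library implementation such as sktime's `Rocket` transformer (thousands of dilated kernels) replaces the toy transform, and any of the classifiers in Table 6 can be fitted on the resulting features.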
Table 7 presents the best hyperparameters found for the 70%/30% train/test split.
A cost–benefit assessment helps balance the trade-off between computational efficiency and accuracy. The Markov blanket approach proved useful for feature selection, reducing the feature count from 576 to 56 while retaining the features most significant to the model; this reduced dimensionality, training time, and memory usage, and increased computational efficiency. As datasets grow, scalability becomes critical, so a cost–benefit analysis of the classification models in terms of training duration, validation duration, accuracy, and scalability is relevant.
Table 1,
Table 4, and
Table 6 illustrate that while certain models, such as GBM and LightGBM, achieve high accuracy and F1-scores, their computational time varies significantly, especially as the dataset split shifts from 90/10 to 70/30 training/test. Tree-based models such as GBM and LightGBM often attain high test accuracy (up to 1.00 in
Table 1 with the 90/10 split). However, as the size of the training data increases, so does the computation time; for example, the training time of the GBM model increases from 97 to 122 s.
Models such as ridge and k-NN have lower computation times; however, they often struggle to maintain performance. Additionally, high-performing models need more processing time as the test set size grows, underscoring scalability as an essential challenge in handling larger datasets. All data splits show comparatively quick processing times for computationally simpler models such as ridge and naive Bayes. Even though these models are not as accurate as advanced tree-based ensemble learners such as GBM or LightGBM, they may be suitable in situations where efficiency matters more than accuracy.
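The efficiency-versus-accuracy trade-off can be quantified by simply timing each model's fit. A minimal sketch, again on hypothetical synthetic data: the linear ridge model trains orders of magnitude faster than boosting, which is the pattern the cost–benefit discussion above relies on.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical data sized loosely after the reduced 56-feature set.
X, y = make_classification(n_samples=500, n_features=56, n_informative=20,
                           n_classes=3, random_state=0)

timings = {}
for name, clf in [("ridge", RidgeClassifier()),
                  ("GBM", GradientBoostingClassifier(n_estimators=50,
                                                     random_state=0))]:
    t0 = time.perf_counter()
    clf.fit(X, y)
    timings[name] = time.perf_counter() - t0  # wall-clock training time
```

Pairing these timings with the accuracies from the tables gives a simple accuracy-per-second view of each model's cost–benefit profile.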
Table 8 presents classification results for a stratified k-fold cross-validation strategy, with the mean (μ) and standard deviation (σ) reported for k = 5. Models achieving 100% accuracy, recall, precision, and F1-score included k-NN, SVM, GBM, and LightGBM. Ridge, RF, and ET also performed well, with high accuracy and balanced precision and recall. Lower performance was observed for QDA, NB, and MLP, especially in terms of accuracy and recall. Compared to GBM and MLP, SVM and ET were comparatively faster to compute.
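The stratified 5-fold protocol behind Table 8 maps directly onto scikit-learn. A minimal sketch with hypothetical stand-in data for the ROCKET feature matrix; stratification keeps the class proportions equal in every fold, and the mean and standard deviation over the k = 5 folds are what the table reports.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the ROCKET feature matrix.
X, y = make_classification(n_samples=300, n_features=56, n_informative=20,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
mu, sigma = scores.mean(), scores.std()  # reported as mean ± std over k = 5
```

Repeating this with `scoring` set to precision, recall, and F1 (macro-averaged) yields the remaining columns of Table 8.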
According to the analyses presented here, the best performance was achieved by the RF, GBM, and LightGBM models. One reason RF often outperforms other classifiers is its use of bagging: multiple decision trees are trained on different random bootstrap subsets of the data, and their predictions are aggregated. This reduces variance and mitigates overfitting, making RF robust compared to individual decision trees, which tend to overfit. GBMs build trees sequentially, with each tree trying to correct the errors of the previous one; this boosting process lets GBMs focus on the difficult-to-predict samples, reducing the bias in the model [
61]. LightGBM builds trees based on the gradients of the loss function. It efficiently handles large datasets by using a histogram-based approach, which speeds up training and reduces memory consumption. Unlike traditional gradient boosting methods that grow trees level-wise, LightGBM grows trees leaf-wise. This approach can lead to a more complex tree structure and can capture interactions better, potentially improving predictive performance [
62].
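The bagging-versus-boosting contrast described above can be made concrete with scikit-learn's implementations of the two ensemble families (LightGBM's leaf-wise, histogram-based variant is omitted here to keep the sketch dependency-free). The data are again a hypothetical synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical multi-class data standing in for the engine-fault features.
X, y = make_classification(n_samples=400, n_features=30, n_informative=15,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Bagging: independent trees on bootstrap samples, predictions aggregated
# by voting -- variance reduction.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Boosting: trees fitted sequentially, each one correcting the residual
# errors of the ensemble so far -- bias reduction.
gbm = GradientBoostingClassifier(n_estimators=200,
                                 random_state=0).fit(X_tr, y_tr)

rf_acc, gbm_acc = rf.score(X_te, y_te), gbm.score(X_te, y_te)
```

Which family wins depends on the data: bagging tends to help when individual trees overfit, while boosting helps when single trees underfit, which is consistent with the strong results of RF, GBM, and LightGBM reported above.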