1. Introduction
Attention-deficit/hyperactivity disorder (ADHD) is one of the most common childhood psychiatric conditions, estimated to affect around 5% of children worldwide [1]. The core symptoms include inattention, hyperactivity, impulsivity, and emotional dysregulation, which impair daily functioning. Although standardized clinical evaluations are the gold standard for diagnosis, ADHD assessment remains challenging due to subjective biases and a lack of definitive biomarkers. Prior research has sought to augment behavioral assessments with neurophysiological features from electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and genetics [2]. In particular, advanced analytical techniques applied to non-invasive EEG measurements hold promise for objective ADHD diagnostics [3,4,5,6]. The Raising Children Network, an Australian parenting resource, reports that the brain undergoes a period of rapid development during early childhood: by the age of six, the brain reaches around 90–95% of its adult size. The early years therefore play a vital role in shaping the brain, although substantial restructuring is still required before it attains the level of functionality seen in adult brains [7]. The human brain may be conceptualized as a complex and extensive network that efficiently regulates the whole body. The anatomical development of neural tissue in the brain changes from childhood to adolescence, accompanied by alterations in oscillatory patterns and brain imaging data. These changes can be measured using both EEG and fMRI, as demonstrated by Smit et al. [8,9] and Power et al. [10], and such measurements can also be used to compare patients with healthy individuals. The brain network of individuals diagnosed with ADHD exhibits several anomalies and divergences compared to that of neurotypical individuals; these developmental problems have been identified in fMRI assessments, as reported by Tang et al. [11]. Numerous EEG investigations have shown atypical amplitude patterns in the brain waves of individuals with ADHD [12,13,14]. Individuals diagnosed with ADHD exhibit distinct EEG patterns that indicate deviations in neuropsychological functioning compared to neurotypical individuals. These differences can be effectively identified through machine learning (ML) algorithms, which are recognized as a valuable approach for analyzing complex datasets [15,16,17]. EEG signals offer intrinsic benefits, such as universality, uniqueness, affordability, and accessibility, compared to other biometric measures [18,19]. Consequently, EEG devices may be conveniently used in many settings, including educational and medical institutions.
Consequently, artificial intelligence (AI) techniques have been proposed as a means of automating this procedure and as a tool for aiding the examination and identification of mental disorders. These methodologies fall into two distinct subfields of AI, namely machine learning (ML) and deep learning (DL), the latter being a subset of the former [20,21,22]. Classification is a prominent problem in the application of EEG to mental disorders: an ML model takes diverse properties derived from EEG data as input and produces a prediction, such as the presence or absence of a mental disorder in a patient. Feature extraction (FE) is employed to obtain these input characteristics from the unprocessed EEG data. The ability to extract and select a suitable set of features for a given problem is a crucial factor, as it may determine the usability and effectiveness of an ML model. In other words, FE has significant importance, especially in the context of EEG data analysis.
The limitations of current systems constrain their performance in classifying ADHD using EEG datasets. This is evident from previous studies [23,24], which reported accuracies of 93.91% and 91% using SVM and graph neural networks, respectively. We have therefore built an upgraded system aimed at boosting the accuracy of existing methods. The main contributions of this research are as follows:
This study implemented a comprehensive ML and DL pipeline using EEG data to classify ADHD accurately from healthy brain function.
Raw multichannel EEG recordings from 61 children with ADHD and 60 control children performing a visual attention task were utilized. Rigorous preprocessing, time-frequency feature extraction, feature selection, classifier optimization, and validation techniques were applied to enhance the classification algorithms.
ML and DL models were developed to detect ADHD based on the features obtained from the feature selection methods.
We demonstrated the efficacy of combining EEG biomarkers and sophisticated classification algorithms in robust ADHD detection compared with different existing systems.
The methodology and results establish guidelines and performance benchmarks to inform future research and the translation of these techniques to improve clinical practice.
2. Background of the Study
The examination of EEG characteristics associated with ADHD has attracted considerable attention, resulting in a substantial body of research [25,26,27,28,29,30,31,32,33,34,35,36,37]. The bulk of studies in this field primarily investigate frequency-domain indicators, which often include absolute and relative power estimates across different frequency bands or power ratios between bands [38,39,40,41,42,43,44]. Although these approaches are computationally efficient and visually interpretable, they lack the ability to evaluate the nonlinear characteristics of EEG brain dynamics. Researchers have therefore used techniques derived from nonlinear dynamics and chaos theory to investigate the nonlinear characteristics of brain dynamics. The measurement of EEG coherence provides significant insights into the functional connectivity between different regions of the brain, and nonlinear measures capture unique facets of localized brain dynamics and the synchronization interplay between brain regions. With applications spanning no-task resting states, perceptual processing, cognitive task execution, and various sleep stages, nonlinear time series evaluations of EEG and MEG have provided insights into the brain's fluctuating dynamics [45]. Nevertheless, coherence is insufficient for defining nonlinear interdependencies, especially for nonstationary time series, and nonlinear synchronization techniques are used in lieu of conventional methods to investigate functional brain connectivity [25].
According to Stam et al. [39], distinct patterns of brain activity exhibit distinct chaotic dynamics. These dynamics may be described by nonlinear measures, such as entropy and Lyapunov exponents. Research has shown that the approximate entropy metric is particularly advantageous for characterizing short, noisy time series, as it provides a dependable, model-free evaluation of dynamical complexity grounded in information theory [36,40,41]. Numerous studies have shown that brain activity, as a highly intricate dynamic system, has a multifractal organization. Previous research has demonstrated the efficacy of fractal analysis of EEG time series as a viable approach for elucidating the neural dynamics associated with sleep [36]. A study conducted by Fetterhoff et al. [43] revealed that the multifractal firing patterns seen in hippocampal spike trains exhibited increased complexity during performance of a working memory task by rats, whereas these patterns decreased significantly when the rats suffered from memory impairment. Zorick et al. [41] showed that multifractal detrended fluctuation analysis has the potential to detect changes in an individual's state of consciousness.
Feature extraction is a fundamental technique in digital signal processing. It involves selecting an appropriate analysis domain, such as time, frequency, or space, and then using mathematical functions to derive synthetic, highly informative values from the input signals. The feature extraction methodologies reviewed here come from electroencephalographic (EEG) investigations focused on the diagnosis and treatment of ADHD in pediatric populations. This research was conducted at the level of executive function in order to examine the effort involved in identifying the neural correlates of a diverse range of illnesses, such as ADHD. In some cases, the characteristics retrieved in this manner undergo further transformation and/or calibration in order to enhance detection or classification [46,47,48].
Researchers have used several techniques for feature extraction in the analysis of EEG data, among which statistical features and deep-learning-based features have been extensively utilized [49,50,51,52]. ADHD may also be diagnosed using EEG data, which necessitates the extraction of characteristics from these signals [53,54]. Linear and non-linear characteristics are extensively used for diagnosing children with ADHD [55], and a range of morphological, temporal-domain, frequency, and non-linear properties have been extracted from EEG signals to facilitate the diagnosis of ADHD in children. Altınkaynak et al. [56] used morphological, non-linear, and wavelet characteristics as diagnostic tools for the identification of ADHD in children. In the present investigation, we have further derived temporal-domain, morphological, and non-linear characteristics based on prior research [57]. Some researchers have used alterations in power measured through the theta/beta ratio (TBR), a characteristic proposed in a number of studies [58,59,60,61,62]. However, TBR has limitations as a universal ADHD diagnostic marker. Elevated TBR is not evident in all ADHD patients, while non-ADHD individuals may also demonstrate heightened ratios [58,62]. Moreover, factors like fatigue or medication can confound TBR, underscoring the need to consider influencing variables. Nonetheless, within a holistic assessment, TBR remains a widely studied potential EEG biomarker warranting ongoing scientific attention.
Some researchers have used feature selection approaches to identify putative characteristics associated with ADHD. Feature selection is important because it eliminates redundant features and improves the performance of machine learning (ML) and deep learning (DL) models; it is also used to mitigate overfitting issues in the training/testing process. Within the existing body of literature, numerous feature selection techniques have been employed, such as PCA [63,64], minimum redundancy maximum relevance (mRMR) [65], mutual information (MI) [66,67], the t-test [56,57], support vector machine recursive feature elimination (SVM-RFE) [65], the least absolute shrinkage and selection operator (LASSO) [57], and logistic regression (LR) [57]. Khoshnoud et al. [64] used PCA as a technique for reducing the dimensionality of the data; through this process, they were able to select characteristics that had a high degree of correlation with one another.
DL and ML methodologies have gained significant traction in several real-world applications, such as medical imaging [62] and time series analysis [49,68,69]. ML techniques have been extensively used to differentiate ADHD from a control group of healthy individuals [56,57,63,65,70,71,72,73]. In a study by Muller et al. [44], a set of five classification models was used: logistic regression, support vector machine (SVM) with a linear kernel, SVM with a radial basis function kernel, random forest (RF), and XGBoost. The models demonstrated sensitivities ranging from 75–83% and specificities ranging from 71–77%. The variables used in this research included signal power across various frequency ranges under closed-eyes, open-eyes, and visual continuous performance test conditions, as well as the amplitudes and latencies of event-related potentials (ERPs). One possible explanation for the suboptimal efficacy in identifying ADHD lies in the inadequate selection of features for the models. One of the prevailing EEG features often seen in individuals with ADHD is an elevation in power at low frequencies, namely in the delta and theta bands, and a reduction in power at high frequencies, particularly in the beta band. In the majority of ADHD detection studies, nonlinear characteristics were retrieved by the authors and then classified using common classifiers, such as SVM, multilayer perceptron, and KNN [74]. In the study reported in [75], researchers conducted experiments using deep convolutional neural networks and DL networks to assess the diagnosis of ADHD in both adult and pediatric populations.
Table 1 summarizes existing ML- and DL-based systems for detecting ADHD.
By implementing prompt intervention and precise diagnosis, it is feasible to modify neuronal connections and improve symptomatology. Nevertheless, owing to the many characteristics of ADHD, its coexisting conditions, and the limited availability of diagnostic professionals worldwide, the identification of ADHD is often delayed. It is therefore vital to consider new methods, such as ML and DL models, to enhance the effectiveness of early detection. The research gaps identified pertain to the performance of existing systems. In the present work, we have examined supplementary characteristics and used a varied range of ML and DL models in order to enhance accuracy. Furthermore, it is important to identify the feature extraction approaches most appropriate for these outcomes.
3. Materials and Methods
This research proposes a novel methodology for automated classification of pediatric ADHD from EEG signals. The approach comprises a systematic pipeline with each phase carefully designed to contribute to a robust performance, as depicted in
Figure 1. Initially, meticulous preprocessing of the raw EEG recordings is performed through filtering and artifact removal techniques. These are critical to isolating the neurophysiologically relevant signals from potential biochemical and environmental contaminants and establishing a firm foundation for subsequent analysis. Informative features are then strategically extracted from the preprocessed EEG data to enable a nuanced characterization of the brain dynamics. The set of time, frequency, entropy, and power signal features provide a comprehensive encapsulation of the salient neural properties. This phase transforms the data into an informative representation suitable for machine learning. The feature set is further refined through rigorous selection techniques to retain only the most diagnostically relevant variables. By eliminating redundant and uninformative features, the efficiency, generalizability, and interpretability of later processes can be enhanced. Subsequently, optimized models are developed that are tailored to effectively learn from the EEG feature space. Model hyperparameters and architectures are tuned to maximize classification performance on these specific neural data. Finally, rigorous benchmarking on unseen holdout test data provides unbiased insights into real-world effectiveness. Multifaceted metrics quantitatively validate the methodology’s strengths and limitations on this crucial diagnostic task. This strategic methodology holds substantial promise for enabling robust EEG-based ADHD classification. Each phase addresses a key aspect of the overall pipeline, working synergistically to unlock the full potential of data-driven analytics on these neural signals. Detailed empirical evaluations in this paper demonstrate the methodology’s capacity to instigate major advances in computational healthcare.
3.1. Participant Recruitment and EEG Data Acquisition
The dataset consisted of EEG recordings from 61 children diagnosed with ADHD (48 males, 13 females; mean age 9.62 ± 1.75 years) and 60 controls (50 males, 10 females; mean age 9.85 ± 1.77 years), as represented in
Figure 2.
The participant cohort for this study comprised 61 children diagnosed with ADHD recruited through psychiatric referrals at Roozbeh Hospital in Tehran, Iran. Clinical evaluations were conducted by experienced child and adolescent psychiatrists to confirm DSM-IV diagnoses of ADHD based on established criteria (American Psychiatric Association [APA]) [91,92]. Adherence to standardized DSM-IV guidelines ensured consistent and reliable ADHD assessment across all subjects. The control group consisted of 60 healthy children, with 50 males and 10 females, selected from two Tehran primary schools following psychiatric verification of no neurological disorders.
EEG signals were recorded using a 19-channel system (SD-C24) at a 128 Hz sampling rate with 16-bit analog-to-digital resolution [93]. During recording, participants engaged in a visual sustained attention task responding to a series of images. Each image contained between 5 and 16 randomly positioned, age-appropriate animal cartoon characters. Per the experimental protocol, images were presented uninterrupted and immediately following responses to maintain engagement throughout the recording session [93]. The resulting EEG recordings varied in duration according to individual response speeds: since trial length depended on the time each child took to count the animals and provide a response, the total trial duration varied across subjects, from a minimum of 50 s for one control subject to a maximum of 285 s for one subject with ADHD. No additional incentives or penalties linked to performance were provided [93].
This cohort and experimental design allowed the collection of multi-channel EEG data from a sample of well-characterized and matched ADHD and control subjects undertaking a clinically-relevant cognitive task known to elicit ADHD-related neural patterns.
The 19 active EEG electrodes were positioned on the scalp according to the internationally standardized 10–20 system [93]. This allowed reliable coverage of frontal (Fp1, Fp2, F7, F3, Fz, F4, and F8), central (C3, T3, C4, and T4), parietal (P3, Pz, and P4), temporal (T5 and T6), and occipital (O1 and O2) sites. Reference electrodes were placed on the left (A1) and right (A2) earlobes.
Table 2 presents channels and their corresponding regions on the scalp, and
Figure 3 illustrates the 10–20 electrode locations, which optimized the recording of brain dynamics across cortical regions relevant for EEG analysis.
3.2. Preprocessing of EEG Signals
The raw EEG signals required extensive preprocessing before analysis to isolate clinically relevant neural activity. The multi-stage preprocessing pipeline consisted of:
3.2.1. Digital Filtering
The continuous EEG time series was filtered to remove frequencies outside typical neural bands that represent noise (a minimal filtering sketch follows this list):
- Bandpass filter (0.5–63 Hz): removes very low and very high frequency components outside the primary EEG range of interest and attenuates unwanted noise outside this bandwidth.
- Notch filter (49–51 Hz): removes power line interference at 50 Hz specifically. This narrow stopband targets just the 50 Hz noise while preserving nearby EEG content.
- Butterworth response: used for both the bandpass and notch filters. Provides a maximally flat frequency response in the passband to avoid distortion of EEG frequencies.
- 4th-order filters: the higher order improves the roll-off steepness of the filters, allowing sharper attenuation at the bandpass cutoffs and tighter rejection of 50 Hz in the notch filter.
- Zero-phase filtering: applied to the Butterworth filters to prevent phase distortion. This approach processes the input data in both forward and reverse directions, eliminating phase shifts and ensuring the filter output is aligned with the input.
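The sketch below illustrates one way this filtering stage could be implemented in Python with SciPy; it is a minimal example under the assumptions stated in the comments (the 128 Hz sampling rate from the acquisition description and a forward-backward `filtfilt` pass for zero-phase filtering), not the authors' actual code, and names such as `preprocess_channel` are illustrative.

```python
# Hedged sketch: zero-phase 4th-order Butterworth band-pass (0.5-63 Hz) and
# 50 Hz notch (49-51 Hz band-stop) filtering of EEG channels, assuming the
# 128 Hz sampling rate reported for this dataset.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # sampling rate (Hz)

def preprocess_channel(x, fs=FS):
    """Apply band-pass and notch filtering to a 1-D EEG signal."""
    nyq = fs / 2.0
    # 4th-order Butterworth band-pass, 0.5-63 Hz
    b_bp, a_bp = butter(4, [0.5 / nyq, 63.0 / nyq], btype="bandpass")
    x = filtfilt(b_bp, a_bp, x)            # filtfilt -> zero-phase filtering
    # 4th-order Butterworth band-stop (notch) around the 50 Hz line noise
    b_ns, a_ns = butter(4, [49.0 / nyq, 51.0 / nyq], btype="bandstop")
    return filtfilt(b_ns, a_ns, x)

# Example: filter a synthetic 19-channel, 60-second recording channel by channel
eeg = np.random.randn(19, 60 * FS)
filtered = np.vstack([preprocess_channel(ch) for ch in eeg])
```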
3.2.2. Frequency Band Separation
The filtered EEG signals were decomposed into conventional delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), and beta (13–30 Hz) frequency bands using 4th-order Butterworth bandpass filters. This enabled examination of ADHD-related neural oscillations in each band.
The delta, theta, alpha, and beta bands cover the spectrum of normal neural oscillations observable in surface EEG recordings and have formed the focus of most ADHD neurophysiological research. Delta waves (0.5–4 Hz) are linked to deep sleep and unconsciousness. Theta waves (4–8 Hz) are associated with drowsiness and memory recall. Alpha waves (8–13 Hz) indicate wakeful relaxation. Beta waves (13–30 Hz) are linked to active concentration and problem-solving.
Critically, these four bands show robust differences in ADHD populations compared to controls. Elevated theta and reduced beta band power are considered diagnostic biomarkers of ADHD. Alpha fluctuations index impaired attentional processes in ADHD. Delta waves also show atypical activity in ADHD. In contrast, higher frequency gamma waves are more difficult to characterize in surface EEG and less utilized in ADHD research.
3.2.3. Signal Segmentation
The continuous raw EEG signals recorded during the visual attention task ranged in duration from 50 to 285 s for each subject. To enable extraction of transient features from the continuous EEG recordings, it is standard practice to partition the time-series into fixed-duration segments. The selection of an appropriate window length involves balancing sufficient signal characterization against attaining adequate samples. In this work, the multi-channel EEG data were segmented into 2-second epochs to analyze localized waveform patterns. Furthermore, 50% overlap between windows was implemented, so each 2-second segment shared 1 s with the prior. This overlap enhances continuity between successive windows. The 2-second duration provides a reasonable compromise between encapsulating salient EEG phenomena while yielding sufficient samples from even the shortest recordings. This aligns with typical EEG analysis values. Specifically, a 2-second window balances capturing transient neural events against characterizing slower oscillatory dynamics. Overly short windows risk missing informative lower frequency patterns, while long windows can smooth valuable fast fluctuations. A 2-second window enables the differentiation of standard EEG bands while providing stable feature calculations. Many event-related potentials and EEG processes of interest manifest over a few seconds, making 2-second segmentation suitable for capturing their time course. Computationally, this window size is efficient and reduces boundary effects sometimes seen with narrow windows. The 2-second window with 50% overlap achieves an effective balance between transient and oscillatory data, frequency resolution, statistical stability, capturing EEG phenomena time courses, computational demands, and boundary effects.
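As an illustration of this segmentation step, the following minimal sketch (assuming the 128 Hz sampling rate; the `segment` helper is hypothetical, not the authors' code) cuts a multichannel recording into 2-second epochs with 50% overlap.

```python
# Hedged sketch: segmenting a (channels x samples) EEG recording into 2-second
# epochs with 50% overlap, assuming a 128 Hz sampling rate.
import numpy as np

FS = 128
WIN = 2 * FS          # 2-second window = 256 samples
STEP = WIN // 2       # 50% overlap -> hop of 1 second (128 samples)

def segment(eeg, win=WIN, step=STEP):
    """Return an array of shape (n_epochs, channels, win)."""
    n_channels, n_samples = eeg.shape
    starts = range(0, n_samples - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

# Example: a 50-second recording (the shortest trial) yields 49 overlapping epochs
epochs = segment(np.random.randn(19, 50 * FS))
print(epochs.shape)   # (49, 19, 256)
```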
3.3. Feature Extraction
This study extracted a set of 11 time-, frequency-, and information-theoretic features from each two-second epoch of the preprocessed EEG signals, for every frequency band and electrode. The time-domain features [94] captured important statistics about the waveform amplitude and distribution in the time series; specifically, the mean, variance, skewness, and kurtosis were calculated. The mean and variance provide information about signal energy and variability over time, skewness indicates the asymmetry of the distribution, and kurtosis measures the heaviness of the tails. For frequency features [95], the Hjorth parameters of activity, mobility, and complexity were derived: activity reflects the signal power, mobility represents the proportion of faster to slower frequencies, and complexity quantifies variability and change in the frequency domain. Shannon entropy and spectral entropy measured the unpredictability and information content of the signals, with higher values indicating more randomness. Power spectral density entropy assessed the flatness versus peakedness of the power distribution. Additionally, the relative power in each frequency band was computed as the ratio of the absolute power in that band to the total power across bands, quantifying the contribution of each EEG rhythm. In total, these 11 features were extracted for each of the 19 channels in the delta, theta, alpha, and beta bands, resulting in 836 feature variables for each two-second window. By concatenating the feature vectors, the high-dimensional EEG time series data were transformed into a consolidated set of informative features for input into the ML classifier.
The selection of features, encompassing statistical, spectral, entropic, and power-related attributes, was purposefully designed to capture the diverse time-domain, frequency-domain, and information-theoretical characteristics of the EEG data. The ultimate goal was to use these features to effectively differentiate between experimental conditions (ADHD vs. Control) using machine learning algorithms.
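To make the feature definitions concrete, the sketch below computes a representative subset of the per-epoch, per-channel features described above (statistical moments, Hjorth parameters, entropies, and relative band power) for one band-filtered segment. It is an assumption-laden illustration rather than the authors' implementation: the entropy estimators, Welch settings, and helper names are our choices.

```python
# Hedged sketch: per-epoch, per-channel feature computation for one band-filtered
# 2-second segment. Function names are illustrative; SciPy/NumPy are assumed.
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.signal import welch

FS = 128

def hjorth(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal."""
    dx, ddx = np.diff(x), np.diff(np.diff(x))
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def spectral_entropy(x, fs=FS):
    """Shannon entropy of the normalized power spectral density."""
    _, psd = welch(x, fs=fs, nperseg=len(x))
    p = psd / psd.sum()
    return -np.sum(p * np.log2(p + 1e-12))

def shannon_entropy(x, bins=32):
    """Shannon entropy of the amplitude distribution."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def band_relative_power(x, band, fs=FS, total_range=(0.5, 30)):
    """Relative power of `band` (lo, hi) with respect to the 0.5-30 Hz total power."""
    f, psd = welch(x, fs=fs, nperseg=len(x))
    band_p = psd[(f >= band[0]) & (f < band[1])].sum()
    total_p = psd[(f >= total_range[0]) & (f < total_range[1])].sum()
    return band_p / total_p

def epoch_features(x, band):
    act, mob, comp = hjorth(x)
    return [np.mean(x), np.var(x), skew(x), kurtosis(x),
            act, mob, comp,
            shannon_entropy(x), spectral_entropy(x),
            band_relative_power(x, band)]

# Example: features of one theta-filtered 2-second segment from one channel
segment = np.random.randn(2 * FS)
print(epoch_features(segment, band=(4, 8)))
```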
3.4. Feature Selection
After extracting a substantial feature set from the EEG data, two approaches were pursued for feature selection to derive an optimal diagnostic biomarker subset. The first technique applied a sequential wrapper-filter process. Recursive feature elimination (RFE) was utilized as the wrapper method for feature selection in this study. RFE operates by iteratively training a model, ranking features by importance, and pruning the least important features until the desired number remains. Specifically, a random forest (RF) model was employed within RFE to evaluate feature significance. The RF ensemble consisted of 100 decision trees, each trained on a subset of the data and features. Feature importance was determined based on the decrease in impurity (Gini criterion) conferred across trees. At each RFE iteration, the full set of 836 extracted features was used to train the RF model. The 10% least important features based on the ranking were then eliminated, and the process repeated until 95% of the original features were retained. This reduced feature subset was subsequently passed to principal component analysis (PCA) as the filter method for further dimensionality reduction. PCA transforms correlated features into a smaller set of orthogonal principal components that account for maximal variance. PCA was applied, retaining 95% of the explained variance, yielding the final feature set used for classification. The combination of RFE for initial feature pruning followed by PCA provided a robust data-driven approach for feature engineering. RFE removed irrelevant and redundant features, while PCA identified key patterns and reduced overfitting. This integration of wrapper and filter methods enabled optimal feature selection from the high-dimensional EEG biomarkers [96,97].
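A minimal scikit-learn sketch of this wrapper-filter combination is given below, assuming a 100-tree random forest ranker inside RFE (dropping 10% of features per iteration and keeping 95% of them) followed by PCA retaining 95% of the variance; the placeholder data and pipeline layout are illustrative, not the study's exact configuration.

```python
# Hedged sketch: RFE with a random-forest ranker followed by PCA, mirroring the
# wrapper-filter procedure described above. scikit-learn is assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

n_features = 836
X = np.random.randn(500, n_features)           # epochs x features (placeholder data)
y = np.random.randint(0, 2, 500)               # 0 = control, 1 = ADHD

rfe_pca = Pipeline([
    # Wrapper: rank features with a 100-tree random forest (Gini importance)
    # and prune 10% of the remaining features per iteration.
    ("rfe", RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                n_features_to_select=int(0.95 * n_features),
                step=0.10)),
    # Filter: project the surviving features onto components explaining 95% variance.
    ("pca", PCA(n_components=0.95)),
])

X_reduced = rfe_pca.fit_transform(X, y)
print(X_reduced.shape)
```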
The second approach involved the least absolute shrinkage and selection operator (LASSO), a regularization technique that performs both feature selection and regularization to enhance model generalization and interpretability. Unlike standard regression, which minimizes the residual sum of squares, LASSO adds a penalty term to the loss function equal to the sum of the absolute values of the coefficients multiplied by a tuning parameter lambda. This penalizes model complexity and shrinks coefficients towards zero. As lambda increases, more coefficients are shrunk to exactly zero and eliminated from the model, which inherently performs embedded feature selection by removing uninformative variables. Only features with nonzero coefficients are retained, ideally identifying the most relevant biomarkers for the task. LASSO therefore serves a dual role: the L1 regularization helps prevent overfitting, while the coefficient shrinkage enforces parsimony by selecting a sparse feature subset.
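The following sketch shows one way such LASSO-style embedded selection could be realized in scikit-learn. Because the task here is binary classification, it uses an L1-penalized logistic regression inside `SelectFromModel` as the sparse selector; this substitution, the penalty strength `C`, and the placeholder data are assumptions rather than the authors' implementation.

```python
# Hedged sketch: embedded feature selection via an L1 (LASSO-style) penalty,
# keeping only features whose coefficients remain nonzero after fitting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X = np.random.randn(500, 836)                  # epochs x features (placeholder data)
y = np.random.randint(0, 2, 500)               # 0 = control, 1 = ADHD

lasso_selector = Pipeline([
    ("scale", StandardScaler()),               # L1 penalties are scale-sensitive
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
        threshold=1e-5)),                      # keep features with nonzero weights
])

X_sparse = lasso_selector.fit_transform(X, y)
print("retained features:", X_sparse.shape[1])
```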
We chose these approaches as the most effective methods for selecting the best features from the EEG data after the ADHD feature extraction. This feature selection process was vital in enhancing the accuracy and efficiency of our model, ensuring that we could derive a concise and highly informative set of biomarkers for ADHD classification.
3.5. Data Partitioning and Balancing
In the process of evaluating model performance, the consolidated dataset was systematically divided into separate training and testing subsets, utilizing an 80/20 stratified division. Specifically, 80% of the data was earmarked for model training, while the residual 20% was reserved for the test set to facilitate an unprejudiced assessment of the model. Upon examination, the original dataset revealed a class imbalance, with a disproportionate number of samples in the ADHD group relative to the control group. It is noteworthy that the duration of each trial was contingent on the time each child took to enumerate the animals and register their responses, resulting in variability in trial lengths. The shortest trial duration recorded was 50 s for a control subject, whereas the longest spanned 285 s for an ADHD participant. Furthermore, the dataset comprised 61 ADHD subjects and 60 controls. Given the observed asymmetry, it was imperative to rectify this bias to ensure rigorous model training. To ameliorate this, the Synthetic Minority Over-Sampling Technique (SMOTE) was judiciously applied to the underrepresented control class within the training data. The SMOTE algorithm operates by first identifying the k nearest neighbors for each minority class sample based on proximity within the feature space. Next, one of these k neighbors is randomly chosen, and a new synthetic sample is computed along the line segment joining the minority sample and its selected neighbor. This process is repeated until the minority class representation matches the desired prevalence.
This method generated synthetic samples, effectively equilibrating the representation of control and ADHD classes in the training subset. It is paramount to emphasize that the application of SMOTE was circumscribed solely to the training data, safeguarding against potential biases in the test data. The undisturbed test subset sustained the intrinsic class imbalance, a strategic decision made to gauge the model’s generalizability to real-world, imbalanced scenarios. This inherent imbalance also furnished an unvarnished evaluation of the model’s aptitude in handling the intrinsic challenges of the dataset. The judicious application of SMOTE to balance only the training dataset facilitated robust model tuning, while the unaltered test set ensured a candid evaluation of model performance on genuine imbalances. This meticulous approach mitigated biases, offering a transparent view of model efficacy and significantly enhancing the models’ performance.
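A minimal sketch of this partitioning and balancing scheme is shown below, assuming scikit-learn's `train_test_split` for the stratified 80/20 split and imbalanced-learn's `SMOTE` applied only to the training partition; the random seeds, neighbor count, and placeholder data are illustrative.

```python
# Hedged sketch: stratified 80/20 split followed by SMOTE applied only to the
# training partition, as described above.
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = np.random.randn(3000, 120)                 # epoch-level feature matrix (placeholder)
y = np.random.randint(0, 2, 3000)              # 0 = control, 1 = ADHD

# 80/20 stratified split keeps the class ratio comparable in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Oversample only the minority class of the *training* data; the test set keeps
# its natural imbalance so evaluation reflects real-world conditions.
X_train_bal, y_train_bal = SMOTE(k_neighbors=5, random_state=42).fit_resample(
    X_train, y_train)
print(np.bincount(y_train), "->", np.bincount(y_train_bal))
```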
3.6. Machine Learning Algorithms
The automatic detection and classification of ADHD versus healthy controls was performed by applying various ML and DL algorithms, including the decision tree model, AdaBoost model, gradient boosting model, extra trees model, RF model, LightGBM model, CatBoost model, KNeighbors model, multilayer perceptron (MLP) model, CNN-LSTM model, LSTM-transformer model, and CNN model, to EEG biomarker datasets.
This study implemented and evaluated a diverse set of ML algorithms for EEG-based ADHD classification, including both single models and ensemble techniques. Specifically, a single decision tree model was tested as a baseline nonlinear classifier that makes predictions by recursively partitioning the feature space based on optimal splits. For ensemble learning, the study utilized adaptive boosting (AdaBoost), which combines multiple weak learners in a sequential manner, focusing on misclassified instances. Gradient boosting was also used, which produced an additive ensemble model minimizing a loss function via gradient descent optimization.
In addition, extra trees and the RF ensemble method [95] were applied, both of which aggregate predictions across multiple randomized decision trees to improve generalization capability. LightGBM, a gradient boosting framework optimized for efficiency with high-dimensional data, and CatBoost, a boosting technique adept at handling categorical variables, were also implemented. Finally, KNN, a simple yet effective algorithm that predicts based on proximity in feature space, was evaluated [96]. This diverse collection of single and ensemble tree-based models and distance-based techniques aimed to thoroughly evaluate a wide range of modern ML approaches for EEG-based ADHD classification.
Table 3 shows the important parameters of the ML model for detecting ADHD.
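The short sketch below indicates how such a bank of classical classifiers might be trained and compared on the selected features; the hyperparameters shown are illustrative defaults rather than the values in Table 3, and the placeholder arrays stand in for the real EEG feature matrices.

```python
# Hedged sketch: training and comparing the classical ML classifiers named above.
# lightgbm and catboost are assumed to be installed.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Placeholder data standing in for the selected EEG features after SMOTE balancing.
X_train_bal = np.random.randn(400, 50); y_train_bal = np.random.randint(0, 2, 400)
X_test = np.random.randn(100, 50); y_test = np.random.randint(0, 2, 100)

models = {
    "Decision tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=100),
    "Extra trees": ExtraTreesClassifier(n_estimators=100),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "LightGBM": LGBMClassifier(n_estimators=100),
    "CatBoost": CatBoostClassifier(iterations=100, verbose=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```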
3.7. Deep Learning Algorithms
The advanced DL models proposed for the classification of ADHD using EEG, including the MLP model, CNN-LSTM model, LSTM-transformer model, and CNN model, were applied to the EEG biomarker datasets.
3.7.1. Convolutional Neural Networks (CNNs) Model
A one-dimensional convolutional neural network (1D CNN) architecture was developed for EEG-based classification of ADHD in this study [97]. The tailored CNN model comprised multiple layers optimized for learning salient features and patterns from the EEG biomarkers to accurately discriminate ADHD cases. Specifically, the model contained two 1D convolutional layers with 128 and 64 filters, respectively, to capture distinctive spatial patterns along the temporal dimension within the extracted EEG features. The convolutional filters learned to recognize localized waveform motifs of diagnostic relevance from the raw biomarker time series. To mitigate overfitting, a dropout regularization layer (rate 0.5) and max pooling layer (pool size 2) were incorporated. The dropout randomly omitted units during training to prevent co-adaptation, while max pooling reduced feature map dimensionality by retaining only the most salient elements. The CNN model then passed the extracted features through fully connected layers, including a 1024-unit layer and a 128-unit layer, with rectified linear unit (ReLU) activation to introduce non-linearity. Finally, a softmax output layer provided binary classification probabilities for ADHD versus control. The model was compiled using categorical cross-entropy loss and the Adam optimizer. Performance was evaluated by classification accuracy on held-out data. Early stopping with a patience of 10 epochs avoided overfitting during training. The CNN model architecture is shown in
Figure 4, and the model parameters are shown in
Table 4.
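A Keras sketch consistent with this description is given below (two Conv1D layers with 128 and 64 filters, dropout 0.5, max pooling of 2, 1024- and 128-unit dense layers, softmax output, Adam with categorical cross-entropy, and early stopping with a patience of 10). The kernel sizes, input length, and training settings are assumptions; Table 4 lists the authors' parameters.

```python
# Hedged sketch: 1D CNN for binary ADHD vs. control classification of the
# selected EEG feature vectors. Input length and kernel sizes are assumed.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 120                               # selected-feature vector length (assumed)

model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.Dropout(0.5),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),     # ADHD vs. control probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

# Placeholder data standing in for the selected EEG feature vectors.
X = np.random.randn(500, n_features, 1)
y = keras.utils.to_categorical(np.random.randint(0, 2, 500), 2)
model.fit(X, y, validation_split=0.2, epochs=100, batch_size=32,
          callbacks=[early_stop], verbose=0)
```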
3.7.2. Convolutional Neural Network, Long Short-Term Memory (CNN-LSTM)
This hybrid CNN-LSTM architecture utilized the strengths of both CNNs for feature extraction and LSTM for sequence modeling. The model starts with a one-dimensional (1D) convolutional layer with 128 filters and a kernel size of 3. The Conv1D layer can identify localized patterns in sequential EEG data using the sliding window approach. ReLU activation introduces nonlinearity to the convolved features. A second Conv1D layer follows with 64 filters and a kernel size of 3 to extract higher-level representations of spatial patterns. Stacking convolutional layers allows for the learning of hierarchical features. MaxPooling1D downsamples the feature maps by 2, reducing computational requirements while retaining the most salient features. The model then utilizes an LSTM [98] recurrent layer with 100 memory units. LSTMs can learn long-range temporal relationships from sequential data, such as EEG. Furthermore, the feature maps are flattened into a 1D vector in preparation for fully connected layers. This condenses the data while preserving feature information. The final softmax output layer contained two nodes for binary classification into ADHD and control groups. The CNN-LSTM model architecture is shown in
Figure 5, and the model parameters are shown in
Table 5.
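A corresponding Keras sketch of the hybrid CNN-LSTM is shown below; the input length and training settings are assumptions, and Table 5 gives the authors' parameters.

```python
# Hedged sketch: hybrid CNN-LSTM (Conv1D 128 -> Conv1D 64 -> MaxPooling1D ->
# LSTM with 100 units -> Flatten -> softmax), as described above.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 120                               # selected-feature vector length (assumed)

cnn_lstm = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(100, return_sequences=True),   # 100 memory units over the pooled sequence
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),     # ADHD vs. control
])
cnn_lstm.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])
cnn_lstm.summary()
```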
3.7.3. LSTM-Transformer Model
The developed neural architecture comprises a hybrid recurrent-transformer topology optimized for EEG-based ADHD classification. The model is structured into three key sections. The base of the network uses two sequential LSTM layers with 100 and 50 units to capture the temporal dynamics in the EEG input sequences. LSTM units [99] contain memory cells and gates that enable the learning of temporal dependencies and long-range sequential patterns. The paired LSTM layers provide a robust foundation for modeling temporal information in EEG biomarkers.
Following the recurrent layers, a transformer block is applied that employs multihead self-attention to identify informative components across the EEG biomarker sequence. Self-attention draws global dependencies between sequence elements [99]. The multihead mechanism runs parallel attention layers to focus on different parts of the sequence. This block has shown effectiveness in diverse sequence modeling tasks. Furthermore, the transformer contains feedforward layers with residual connections and layer normalization to stabilize activations. Residual connections propagate signals directly across network layers, while layer normalization rescales outputs for consistency. The output of the transformer is flattened into a 1D representation and passed to a final dense layer with softmax activation to produce binary ADHD classification probabilities. The model is compiled with Adam optimization at a learning rate of 0.001 and categorical cross-entropy loss [100,101]. Early stopping halts training if validation loss shows no improvement for 10 epochs to prevent overfitting. The LSTM with multihead self-attention model architecture is shown in
Figure 6, and the model parameters are shown in
Table 6.
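The sketch below outlines one plausible Keras realization of this recurrent-transformer hybrid (stacked LSTM layers of 100 and 50 units, a multi-head self-attention block with residual connections and layer normalization, a feed-forward sub-layer, and a softmax head trained with Adam at a 0.001 learning rate). The number of heads, key dimension, and feed-forward width are assumptions; Table 6 lists the authors' parameters.

```python
# Hedged sketch: LSTM layers followed by a transformer block (multi-head
# self-attention + residuals + layer normalization) and a softmax head.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 120                               # selected-feature vector length (assumed)

inputs = keras.Input(shape=(n_features, 1))
x = layers.LSTM(100, return_sequences=True)(inputs)
x = layers.LSTM(50, return_sequences=True)(x)

# Transformer block: multi-head self-attention with a residual connection and
# layer normalization, then a position-wise feed-forward sub-layer with its own
# residual connection.
attn = layers.MultiHeadAttention(num_heads=4, key_dim=50)(x, x)
x = layers.LayerNormalization()(layers.Add()([x, attn]))
ff = layers.Dense(128, activation="relu")(x)
ff = layers.Dense(50)(ff)
x = layers.LayerNormalization()(layers.Add()([x, ff]))

x = layers.Flatten()(x)
outputs = layers.Dense(2, activation="softmax")(x)

lstm_transformer = keras.Model(inputs, outputs)
lstm_transformer.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                         loss="categorical_crossentropy", metrics=["accuracy"])

# Callback to pass to fit(...): stop if validation loss stalls for 10 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
lstm_transformer.summary()
```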
3.8. Multilayer Perceptron (MLP)
The perceptron is restricted to binary classification tasks using a simple linear predictor. The multilayer perceptron (MLP) provides a significantly more flexible architecture capable of modeling complex nonlinear relationships for both classification and regression problems. The MLP contains multiple layers of computational nodes, starting with an input layer to receive data, followed by one or more hidden layers, and ending with an output layer that produces the prediction. The addition of one or more hidden layers enables the network to learn sophisticated data representations and feature hierarchies directly from the inputs. The parameters of the MLP model for detecting ADHD are shown in
Table 7.
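For completeness, a minimal Keras sketch of such an MLP operating on the selected feature vectors is given below; the hidden-layer widths are assumptions, with the authors' settings given in Table 7.

```python
# Hedged sketch: multilayer perceptron with two hidden layers for binary
# ADHD vs. control classification of the selected EEG features.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 120                               # selected-feature vector length (assumed)

mlp = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(256, activation="relu"),      # hidden layer 1
    layers.Dense(64, activation="relu"),       # hidden layer 2
    layers.Dense(2, activation="softmax"),     # ADHD vs. control
])
mlp.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
mlp.summary()
```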
5. Discussion
ADHD is a neurodevelopmental disorder that may have detrimental effects on an individual's sleep patterns, mood regulation, anxiety levels, and academic performance. Individuals diagnosed with ADHD may find it easier to carry out their routine tasks when promptly diagnosed and initiated on appropriate therapeutic interventions. ADHD may be diagnosed by neurologists via the analysis of abnormalities seen in EEG data. However, EEG signals exhibit complex, nonlinear, and nonstationary behavior, and differentiating the subtle variations in EEG patterns between individuals with and without ADHD can pose considerable difficulty.
The comparative classification results provide valuable insights into the utility of LASSO regularization versus RFE-PCA for EEG biomarker selection in ADHD detection models. The deep neural networks, including CNN architecture, demonstrated notable performance improvements on all metrics using the LASSO-derived feature set compared to the RFE-PCA. These consistent gains suggest that the LASSO biomarkers offered more useful information for DL-based ADHD classification, likely due to the selection of a sparse and diagnostically relevant feature subset. These comparative findings highlight the importance of tailored feature selection to match informative biomarkers with optimal model classes.
Figure 17 displays the receiver operating characteristic (ROC) percentages for the most effective machine learning algorithms when using the recursive feature elimination-principal component analysis (RFE-PCA) and least absolute shrinkage and selection operator (LASSO) approaches. It is worth mentioning that the K-nearest neighbors (KNN) algorithm with RFE-PCA achieved a ROC of 98%, while the ROC of KNN with LASSO reached 99%.
Figure 18 illustrates the ROC of deep learning models using RFE-PCA. The results indicate that the CNN model with RFE-PCA achieved a score of 99%, while the CNN-LSTM model achieved a score of 95%. Ultimately, the study examined the comparative effectiveness of the RFE-PCA approach in conjunction with CNN for the detection and diagnosis of ADHD in children.
The ROC results of deep learning with the LASSO method are shown in
Figure 19. The CNN model with LASSO scored 100%, and the CNN-LSTM model scored 95%. The study thus compared the RFE-PCA and LASSO approaches in conjunction with CNN for ADHD detection and diagnosis in children.
Table 12 presents a comparison of the outcomes obtained from the ML and DL models in relation to several pre-existing systems. Benchmarking against prior EEG studies further validates the efficacy of the proposed pipeline. Alim et al. [79] achieved 93.2% accuracy using analysis of variance and PCA features with a Gaussian SVM model. Ekhlasi et al. [80] obtained 91.2% accuracy with graph neural networks on theta and delta bands. By comparison, our CNN model attained 97.75% testing accuracy using LASSO-regularized biomarkers, showcasing the strengths of DL and tailored feature selection for unlocking discriminative information from complex neurophysiological data.
These results underscore the potential of our integrated framework, encompassing data preprocessing, class balancing, and custom feature selection techniques to extract maximally informative biomarkers from high-dimensional EEG for enhanced ADHD screening using deep networks.
The feature selection stage was critical for improving model performance by extracting key biomarkers from the extensive initial feature set. This process provided two main benefits—preventing overfitting and reducing complexity. By eliminating redundant, irrelevant, and noisy features, the models could focus on salient EEG variables that robustly distinguished ADHD from control patterns. Removing these uninformative features enabled better generalization and testing accuracy by retaining only meaningful biomarkers. Additionally, feature selection substantially decreased the dimensionality of the input space, lessening computational demands and training times. Simpler models with fewer features are also more interpretable, concentrating on core explanatory EEG markers.
In this work, RFE-PCA and LASSO regularization effectively refined the EEG features. The considerable performance gains after feature selection demonstrated the utility of these techniques for identifying concise yet highly informative feature subsets, enabling enhanced ADHD detection.
The comparative results also provided insights into matching feature selection approaches with machine learning algorithms. The deep networks showed notable improvements using LASSO-selected features versus RFE-PCA. Their consistent gains suggest the LASSO biomarkers offered more useful information for DL-based classification, likely due to identifying a sparse, diagnostically relevant feature subset. These findings highlight the importance of tailored feature selection to complement the strengths of different model classes. The LASSO features improved CNN accuracy to 97.75%, displaying the value of an optimized feature selection approach.
6. Conclusions
In the context of this research, we delineate a novel computational architecture that ingeniously integrates ML and DL paradigms for the nuanced differentiation of ADHD profiles and normative developmental patterns in children, as discerned through meticulous EEG data analysis. This comprehensive infrastructure embraces a series of sophisticated processes, including data preprocessing, astute feature extraction, strategic feature selection, and advanced classification techniques.
The models conceived in this study stand as paragons of technological innovation, substantiating the transformative impact of refined ML pipelines in amplifying the precision of ADHD diagnostic mechanisms, thereby signaling a departure from traditional analytical methods. Leveraging the rich repository of the ADHD EEG dataset, our strategy unveils potent CNN and MLP algorithms synergized with RFE-PCA for an optimized feature selection process, registering remarkable accuracies of 94.93% and 94.68%, respectively. Moreover, our CatBoost and CNN models, orchestrated with LASSO methodologies, demonstrate sterling accuracy metrics, achieving 95.13% and 97.75%, respectively.
These results profoundly underscore the quintessential role of feature optimization and meticulous data management in extracting clinically salient biomarkers from the complex labyrinth of EEG data arrays. Our tailored preprocessing, class equilibration, and feature selection techniques create a harmonized blend of RFE-PCA and LASSO methodologies, serving to foster a robust delineation between ADHD manifest patterns and standard neurological frameworks.
This study augments the burgeoning body of literature, emphasizing the promising synergy of EEG analytics and ML as vital adjuncts in facilitating nuanced clinical evaluations of ADHD. Our optimized models, envisioned as potent diagnostic allies, promise to confer reliable support for diagnostic trajectories, notably in delineating complex case scenarios, thus warranting further empirical validation across a broader spectrum of patient demographics. This trajectory potentially foretells the advent of earlier and more individualized intervention strategies, thereby enhancing the adaptive functioning and quality of life of individuals navigating the challenges of ADHD. In summation, our research initiative lays a seminal foundation for forthcoming translational ventures aimed at unlocking the maximal diagnostic potential of ML in tandem with neurophysiological data analytics in clinical arenas.