2. Materials and Methods
Implementing deep neural networks in autism spectrum disorder diagnosis can involve the following steps, as shown in Figure 1:
1. Data collection: The first step in using AI for ASD diagnosis is to collect a large amount of relevant data, such as patient symptoms, medical history, test results, and diagnosis information.
2. Data pre-processing: This step involves cleaning and transforming the data into a format suitable for models to process.
3. Model selection and training: Select an appropriate model and train it on the pre-processed data. The goal is to train the model to predict the diagnosis based on the input data accurately.
4. Model evaluation: Evaluate the performance of the model by testing it on a separate set of data and comparing the results to actual diagnoses. This step helps to determine the accuracy and reliability of the AI-based diagnostic system.
5. Explainability: Explainable AI concerns the ability to understand how a model arrived at a particular output and to what extent that output is trustworthy, thereby increasing the transparency and accountability of machine learning models.
The remainder of this section discusses the above-mentioned steps as employed to execute the experiment.
2.1. Dataset
DL models require a very large training dataset to reach a high level of performance; if the model is trained on almost all possible scenarios, its accuracy increases dramatically [36]. In this experiment, the autistic children dataset from the Kaggle repository [21,22], named Kaggle ASD in Table 2, was employed to train deep CNNs. Kaggle ASD consists of a total of 3014 photos of children between the ages of 2 and 14, most of them 2 to 8 years of age. Although the number of facial images of males is three times that of females, the ratio of the autistic to normal control class is 1:1. The contributor, Gerry Piosenka, gathered the photographs from different online sources. The dataset does not contain any medical profiling, severity of illness, or nationality information for the children. A few images are subpar in terms of facial alignment, brightness, and image size.
2.2. Transfer Learning for ASD Diagnosis
2.2.1. Dataset Pre-Processing
To maintain the accuracy and consistency of results, the dataset used for model training should include an all-inclusive group of images that depicts all conceivable scenarios for extracting ASD diagnostic features. Prior literature indicates that the photos in the dataset have noisy backgrounds and alignment issues, affecting the DL models' accuracy. To solve this issue, our dataset required cleansing and alignment [37].
2.2.2. Align Dataset
A few steps are taken to align the images, employing a CNN. Face recognition is performed as an auxiliary task using multi-task cascaded convolutional networks (MTCNNs), a deep CNN designed as both a face detection and alignment solution. The MTCNN consists of three stages of CNNs that can identify faces and landmarks, such as the eyes, nose, and mouth [38]. The initial stage of MTCNN, called P-Net, is a fully convolutional network (FCN); it differs from ordinary CNNs in lacking dense layers at all stages of its architecture. Bounding box vectors are created around the candidate face regions, and overlapping regions are excluded to reduce the number of boxes. A second CNN, called R-Net, further reduces the number of bounding boxes by merging overlapping regions using non-maximum suppression (NMS); it decides whether the input contains a face and returns a ten-element vector to locate the landmarks of the face. The last stage, called O-Net, is very similar to R-Net and returns the five landmarks of the face: the left eye, right eye, nose, left corner of the mouth, and right corner of the mouth. The first task of this process is face identification, where the cross-entropy loss for each sample is given by
$$L_i^{\text{det}} = -\left( y_i^{\text{det}} \log(p_i) + \left(1 - y_i^{\text{det}}\right) \log\left(1 - p_i\right) \right)$$

where $p_i$ gives the probability, decided by the P-Net, that sample $i = 0, 1, \ldots, n$ is a face, and $y_i^{\text{det}} \in \{0, 1\}$ is the ground-truth label.
For R-Net to create a bounding box, its four corners must be located, which is treated as a regression problem, and the Euclidean loss for each sample is calculated by

$$L_i^{\text{box}} = \left\lVert \hat{y}_i^{\text{box}} - y_i^{\text{box}} \right\rVert_2^2$$

where $\hat{y}_i^{\text{box}}$ is the predicted output obtained from the neural network and $y_i^{\text{box}}$ is the ground-truth coordinate. The bounding box is described by four coordinates (left, top, height, and width), so $y_i^{\text{box}} \in \mathbb{R}^4$.
In the last step, the Euclidean loss is again minimized, as per the equation below, to formulate the task of facial landmark detection:

$$L_i^{\text{landmark}} = \left\lVert \hat{y}_i^{\text{landmark}} - y_i^{\text{landmark}} \right\rVert_2^2$$

where $\hat{y}_i^{\text{landmark}}$ contains the predicted coordinates of the facial landmarks (left eye, right eye, nose, left corner of the mouth, and right corner of the mouth) and $y_i^{\text{landmark}}$ is the ground-truth coordinate for the $i$th image; thus, $y_i^{\text{landmark}} \in \mathbb{R}^{10}$.
After detecting the left- and right-eye coordinates, we can obtain the angle $\theta$ from the lengths of the triangle's three sides, where the length of each edge is found from the Euclidean distance [39]. The image then has to be rotated anti-clockwise by the angle $\theta$, as shown in Figure 2a.
Algorithm 1 presents the pseudocode for image rotation, which takes an input path for images and an output path for rotated images. For each image in the input path, the algorithm detects the face and landmarks using the detect_face and detect_landmarks functions, respectively. The x and y coordinates of the left- and right-eye landmarks are stored as $x_l$, $y_l$, $x_r$, and $y_r$. The rotation angle $\theta$ is then calculated using the arctan function as $\theta = \arctan\!\left(\frac{y_r - y_l}{x_r - x_l}\right)$. Finally, the image is rotated by the angle $\theta$ using the rotate_image function and saved to the output path using the save_image function.
Algorithm 1 Image rotation algorithm
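For reference, the following is a minimal Python sketch of the alignment step described above, not the authors' exact implementation. It uses the open-source mtcnn package for landmark detection and PIL for rotation; the align_faces function name and the file-handling details are illustrative assumptions standing in for the detect_face, detect_landmarks, rotate_image, and save_image helpers of Algorithm 1.

```python
import math
import os

import numpy as np
from mtcnn import MTCNN   # pip install mtcnn
from PIL import Image

detector = MTCNN()

def align_faces(input_path, output_path):
    """Rotate each image so that the detected eyes lie on a horizontal line."""
    os.makedirs(output_path, exist_ok=True)
    for name in os.listdir(input_path):
        img = Image.open(os.path.join(input_path, name)).convert("RGB")
        faces = detector.detect_faces(np.asarray(img))
        if not faces:
            continue                                  # skip images with no detected face
        keypoints = faces[0]["keypoints"]             # landmarks of the first detected face
        xl, yl = keypoints["left_eye"]
        xr, yr = keypoints["right_eye"]
        # Angle (in degrees) between the eye line and the horizontal axis.
        theta = math.degrees(math.atan2(yr - yl, xr - xl))
        # PIL's rotate() turns the image anti-clockwise for positive angles,
        # which brings the eye line back to horizontal as described in the text.
        rotated = img.rotate(theta, expand=False)
        rotated.save(os.path.join(output_path, name))
```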
2.2.3. Crop Dataset
Cropping is an essential procedure for enhancing the aesthetic quality of digital photographs, as it removes undesired regions outside a rectangular selection. The training set includes facial images with a noisy texture and superfluous patterns in the background, which can impair the model's training. Cropping is the process of removing a portion of an image to reframe it. As in the alignment procedure, MTCNN is used for face recognition, creating a bounding box around the face region; for cropping, the bounding box must be precise and tightly confined to the face region only. The upper-left point of the bounding box is called the origin, and the lower-right corner is called the end. The pixels inside the box are copied to a new image, yielding the cropped image [40], as shown in Figure 2b.
Algorithm 2 represents a face-cropping algorithm that takes the path for the input images and the output path where the cropped images are stored. The algorithm reads each image from the input path and detects the face in each image using the function detect_face. It then calculates the x and y coordinates of the top-left and bottom-right corners of the face's bounding box using the function convert_xywh. The image is then cropped using the function crop_face, with the width and height calculated as the differences between the x and y coordinates. Finally, the cropped image is saved to the output path using the function save_image. The algorithm iterates through all images in the input path and applies the same process to each image.
Algorithm 2 Face-cropping algorithm
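A minimal sketch of the cropping step is shown below, again assuming the mtcnn package and PIL rather than the authors' exact detect_face, convert_xywh, crop_face, and save_image helpers; the crop_faces function name and clamping of negative box coordinates are illustrative choices.

```python
import os

import numpy as np
from mtcnn import MTCNN
from PIL import Image

detector = MTCNN()

def crop_faces(input_path, output_path):
    """Crop each image to the bounding box of the detected face."""
    os.makedirs(output_path, exist_ok=True)
    for name in os.listdir(input_path):
        img = Image.open(os.path.join(input_path, name)).convert("RGB")
        faces = detector.detect_faces(np.asarray(img))
        if not faces:
            continue
        # MTCNN returns the box as (x, y, width, height); convert it to the
        # (origin, end) corner format used by Algorithm 2.
        x, y, w, h = faces[0]["box"]
        x1, y1 = max(x, 0), max(y, 0)     # upper-left corner (origin)
        x2, y2 = x1 + w, y1 + h           # lower-right corner (end)
        cropped = img.crop((x1, y1, x2, y2))   # copy the pixels inside the box
        cropped.save(os.path.join(output_path, name))
```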
2.3. Dataset Augmentation
Augmentation is a technique used to expand an existing dataset by modifying and manipulating the original data. This strategy enables the model to learn or anticipate all possible real-time data patterns. Both pre-processed and unprocessed photos from the Kaggle ASD dataset are augmented. Different types of augmentation include horizontal flip, greyscale conversion, resizing, rotation, shearing, zooming, addition or removal of noise, and changes in brightness, hue, and saturation.
2.3.1. Horizontal Flip
When working with facial images, the horizontal flip is the most typical augmentation technique. The human face is highly symmetrical, so a single-sided feature can frequently cause confusion during training; a complete horizontal flip allows the model to learn features from both sides of the face. The horizontal flip is obtained from the transforms module of the Torchvision library [41]. The input images are fed as PIL images rather than tensors. The image's width, height, and pixels are obtained through the PIL image functions, and then the image is transposed, as shown in Figure 3.
Algorithm 3 is an image-flipping algorithm that takes an image path, an output path, and a probability as input. For each image in the input path, the algorithm reads the image and creates a new image by copying the original image into a new variable, img_f. The algorithm then loops through the width and height of the original image and checks if a random number is less than the given probability. If the condition is satisfied, the pixel value at position (x, y) in the new image is replaced by the original image’s pixel value at position (width − x − 1, y).
Finally, the new image is saved to the output path.
Algorithm 3 Image-flipping algorithm
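The sketch below illustrates the per-pixel flipping procedure described above using PIL pixel access; the flip_images function name and file handling are assumptions, and with probability = 1.0 the loop reproduces a complete horizontal mirror of the face.

```python
import os
import random

from PIL import Image

def flip_images(input_path, output_path, probability=1.0):
    """Copy each pixel from its horizontally mirrored position with a given probability."""
    os.makedirs(output_path, exist_ok=True)
    for name in os.listdir(input_path):
        img = Image.open(os.path.join(input_path, name)).convert("RGB")
        width, height = img.size
        img_f = img.copy()                       # new image initialised as a copy
        src, dst = img.load(), img_f.load()      # pixel access objects
        for x in range(width):
            for y in range(height):
                if random.random() < probability:
                    # Replace (x, y) with the pixel at (width - x - 1, y).
                    dst[x, y] = src[width - x - 1, y]
        img_f.save(os.path.join(output_path, name))
```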
2.3.2. Add Pepper–Salt Noise
Noisy face images were identified as one of the primary causes of poor accuracy in previous investigations [12]. Some images are of such poor quality that their class cannot be predicted during testing. If noise is added manually to the training set, the model is able to learn and extract the corresponding features. Therefore, the PIL and NumPy libraries are utilized to alter the photos. PIL opens the input facial images, which are converted into NumPy arrays. Additionally, the user must provide the probability, or density, of the imposed noise. Random probabilities are compared to a threshold value for every image pixel in order to place noise in the required pixels [42].
The salt-and-pepper noise algorithm, presented in Algorithm 4, takes an input path for images, an output path for the noisy images, and a probability parameter. The algorithm randomly assigns black and white pixels to generate salt-and-pepper noise on the input images, and the probability parameter controls the degree of noise added. For each image in the input path, the algorithm reads the image, creates a new image, and then walks through each pixel. A random number between 0 and 1 is generated and compared against the probability and against a threshold set as the complement of the probability. If the generated number is less than the probability, the pixel is set to black (pepper noise); if it is greater than the threshold, the pixel is set to white (salt noise); otherwise, the original pixel value is kept. Finally, the noisy image is saved to the output path.
Algorithm 4 Salt-and-pepper noise algorithm
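A minimal NumPy/PIL sketch of this noise procedure is given below, following the threshold convention described above (threshold = 1 − probability); the add_salt_pepper_noise function name, the default density, and the file handling are illustrative assumptions.

```python
import os

import numpy as np
from PIL import Image

def add_salt_pepper_noise(input_path, output_path, probability=0.05):
    """Add salt-and-pepper noise to every image with the given density."""
    os.makedirs(output_path, exist_ok=True)
    threshold = 1.0 - probability                     # complement of the probability
    for name in os.listdir(input_path):
        img = np.array(Image.open(os.path.join(input_path, name)).convert("RGB"))
        # One random number per pixel, compared against probability and threshold.
        rnd = np.random.rand(img.shape[0], img.shape[1])
        noisy = img.copy()
        noisy[rnd < probability] = 0                  # pepper noise (black pixels)
        noisy[rnd > threshold] = 255                  # salt noise (white pixels)
        Image.fromarray(noisy).save(os.path.join(output_path, name))
```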
2.4. Convolutional Neural Networks
Convolutional neural networks (CNNs) have been utilized in image classification and recognition since the 1980s [43]. CNNs originated from research on the brain's visual cortex. In recent years, CNNs have achieved superhuman performance on some challenging visual tasks owing to the growth in computing power, the quantity of data samples available for training deep neural networks, and transfer learning for user-modified classification [44,45]. The main function of face-recognition or object-classification models is to extract an entity's features, making binary classification of the two classes, autistic and non-autistic faces, feasible by acquiring knowledge from a vast collection of pictures through a transfer learning approach [46]. A machine learning model can be reused for similar tasks by adapting the top layers of pre-trained models. The core convolutional layers of CNN-based models, previously trained on the ImageNet dataset, can be used to extract the features of autistic and normal faces, while the classification layers are modified for binary classification. This study is based on three pre-trained deep CNN models: MobileNetV2 [47], ResNet50V2 [48], and Xception [49]. These models were determined to perform best among contemporary works [12].
2.4.1. MobileNetV2 Model
MobileNetV2 is a lightweight deep CNN model, designed with mobile phone applications and on-device classification tasks in mind. The basic idea of MobileNetV2 is to establish connections from one bottleneck layer to another [50]. The inverted residual architecture consists of 19 residual bottleneck layers, preceded by an initial full convolution layer with 32 filters. These convolution layers perform depthwise convolutions, utilizing non-linear filter characteristics.
2.4.2. ResNet50V2 Model
ResNet50V2 comprises several units that propagate identity mappings in both the forward and backward directions and are residual in nature. Through block-to-block propagation, classification accuracy is maintained at a high level, and with the assistance of these residual mappings, training becomes substantially easier and more generalized. In ImageNet or COCO contests, ResNet models frequently have more than 100 layers and achieve outstanding accuracy.
2.4.3. Xception Model
This model has a very simple modular structure based on Google's Inception model. The model comprises three primary blocks, entry, middle, and exit, with separable convolutional layers and ReLU activation functions in each block. The input image size is $299 \times 299 \times 3$. The input is processed by the entry flow, which extracts feature maps of size $19 \times 19 \times 728$. The residual connections take the maximum value of each layer as output after every block. In the middle block, the feature map dimensions are preserved despite being passed through the convolution layers nine times. The output of the final block for a standard-size input image has 2048 features. Finally, the prediction layer receives the features via an FC layer, and the final layers are modified for binary classification.
2.4.4. Regularization
Some deep neural networks can have millions of parameters, although most have thousands. This allows DL networks unprecedented flexibility and the ability to accommodate a wide range of complex datasets. However, high adaptability increases the risk of overfitting while training on the dataset. Regularization is a process that can be implemented to avoid overfitting. Some regularization techniques are early stopping, batch normalization, $\ell_1$ and $\ell_2$ regularization, dropout, and max-norm regularization [51]. Also, choosing the best optimizer is another factor that helps prevent overfitting.
In this paper, we use AdaGrad, chosen based on the ablation study in the previous literature [12]. Gradient descent first rushes down the steepest slope, which does not point directly toward the global optimum, before slowly proceeding down to the valley floor. It would be better if the algorithm could correct its direction earlier, heading more directly toward the global optimum. To make this adjustment, the AdaGrad algorithm scales the gradient vector according to the equations below. First, the squared gradients are accumulated:

$$s \leftarrow s + \nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta)$$

Here, $s$ is a vector whose $i$-th element $s_i$ accumulates the squares of the partial derivatives of the cost function, $s_i \leftarrow s_i + \left(\partial J(\theta) / \partial \theta_i\right)^2$; thus, after each iteration, $s_i$ becomes larger. The next equation is almost the ordinary gradient descent update, but the gradient vector is scaled down by a factor of $\sqrt{s + \varepsilon}$:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta) \oslash \sqrt{s + \varepsilon}$$

whose $i$-th element is $\theta_i \leftarrow \theta_i - \eta \, \dfrac{\partial J(\theta) / \partial \theta_i}{\sqrt{s_i + \varepsilon}}$. Thus, it is evident that this algorithm reduces the learning rate much more rapidly along steep slopes than along gentle ones. Consequently, this adaptive learning rate facilitates the model's progress towards the global optimum, so minimal adjustment of the learning rate hyperparameter is required [52].
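For clarity, the following is a minimal NumPy sketch of the update rule above; in practice the built-in Keras Adagrad optimizer is used, so this is a conceptual illustration only, and the function name and default values are assumptions.

```python
import numpy as np

def adagrad_step(theta, grad, s, eta=0.001, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step.

    theta, grad, and s are NumPy arrays of the same shape; eta is the learning
    rate (0.001 in this study) and eps avoids division by zero.
    """
    s += grad ** 2                              # s_i accumulates squared partial derivatives
    theta -= eta * grad / np.sqrt(s + eps)      # per-parameter scaled gradient step
    return theta, s
```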
Dropout is another regularization technique that can be used to address the overfitting problem [53]. The dropout method is quite straightforward: at each training step, some neurons are disregarded and may be reactivated in the subsequent step. The dropout rate is normally between 10% and 50%, controlled by the probability p; it is typically limited to 40% to 50% for CNNs, and a 50% dropout is employed in our transfer learning models just before the topmost decision layer. It considerably lessens the training load and prevents overfitting.
2.5. Explainable AI
Explainable AI is a branch of artificial intelligence that focuses on developing systems that can produce accurate predictions and provide human-understandable explanations of their decisions. Explainable AI aims to increase the transparency, accountability, and interpretability of machine learning models and algorithms [54,55].
Grad-CAM (gradient-weighted class activation mapping) is a visualization technique that can explain the predictions made by deep neural networks [56]. Grad-CAM generates a heatmap that highlights the regions of the input image that are most important for the prediction made by the network. The heatmap is generated by computing the gradients of the output class scores with respect to the feature maps in the last convolutional layer of the network. These gradients are then used to weight the feature maps, and the resulting weighted feature maps are combined to produce the final heatmap. Grad-CAM can be used to visualize the internal workings of deep neural networks and explain the predictions made by the network. This can be particularly useful in medical imaging, where accurate predictions are important, but it is equally important to understand why the network made a certain prediction.
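As a hedged sketch rather than the authors' code, a Grad-CAM heatmap for a Keras model can be computed with tf.GradientTape as shown below; the last-conv-layer name given for Xception is an example, and a single sigmoid output is assumed.

```python
import numpy as np
import tensorflow as tf

def grad_cam_heatmap(model, image, last_conv_layer_name="block14_sepconv2_act"):
    """Compute a Grad-CAM heatmap for a binary-classification CNN.

    `model` is a Keras model, `image` a preprocessed array of shape (1, H, W, 3),
    and `last_conv_layer_name` the name of the final convolutional layer.
    """
    # Model mapping the input to the last conv feature maps and the prediction.
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer_name).output, model.output],
    )

    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image)
        class_score = preds[:, 0]                 # sigmoid output for the ASD class

    # Gradients of the class score with respect to the conv feature maps.
    grads = tape.gradient(class_score, conv_maps)
    # Global-average-pool the gradients to get one weight per channel.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted combination of the feature maps, followed by ReLU.
    heatmap = tf.nn.relu(tf.reduce_sum(conv_maps[0] * weights, axis=-1))
    # Normalize to [0, 1] for visualization as an overlay.
    heatmap /= (tf.reduce_max(heatmap) + 1e-8)
    return heatmap.numpy()
```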
When working with a large number of similar image samples for recognizing or extracting a given pattern, the mean image can play a crucial role [57]. The mean image is simply the average of all images in the dataset; it is computed by summing the pixel values of all images and dividing by the total number of images. The resulting image represents the average intensity of each pixel in the dataset.
For a set of $N$ images $I_1, I_2, \ldots, I_N$, the mean image $M$ is calculated as

$$M = \frac{1}{N} \sum_{n=1}^{N} I_n$$
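A small NumPy/PIL sketch of this computation is shown below; the 224 × 224 resize target and the list of image paths are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def mean_image(image_paths, size=(224, 224)):
    """Average a group of images pixel-wise to obtain the mean image M."""
    stack = np.stack([
        np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float64)
        for p in image_paths
    ])
    return stack.mean(axis=0).astype(np.uint8)   # M = (I_1 + ... + I_N) / N
```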
The importance of mean images in explainable AI is increasingly recognized in the recent literature. One of the key benefits of the mean image in explainability is its ability to facilitate feature visualization [58]. Feature visualization refers to the process of visualizing the features that the AI system learned during training. By subtracting the mean image, we can visualize the features that the AI system has learned, which can help interpret its decision-making processes. These visualizations can also be used to identify features important for classification, which can help improve the accuracy of the AI system.
2.6. Experimental Setup
The three primary phases of this research are (1) pre-processing and augmentation of the dataset; (2) optimizing the deep CNN models with hyperparameters; and (3) assessing the models with appropriate performance metrics. The success of artificial intelligence is highly dependent on optimal training and on the quantity of high-quality, categorized data; thus, the dataset is a critical aspect in this regard. The more data algorithms have to work with, the faster they may learn and the better they may judge future outcomes. For machine learning to be effective, a large and diverse data collection must be analyzed; when sufficient high-quality data are available, AI systems can readily outperform baseline methods. Thus, in this study, the experiments are mostly data-centric, and the dataset is the main focus. It is well recognized that medical datasets are very difficult to acquire, making it difficult to obtain the amount of data required to train DL models. Here, the autistic children dataset from the Kaggle repository is used, named Kaggle ASD as stated in Table 2. With a view to enhancing the training, the training set, consisting of the facial images of a total of 2654 children (1327 ASD and 1327 normal), was pre-processed as described in Table 3.
The pre-processing algorithms and techniques are explained at the start of this section. The images were first aligned and then cropped, as indicated in Figure 4; the resulting set is named ASDp. This ASDp set was fed through the algorithms that flip and add noise to the 2654 facial images, producing the sets termed Flippedkp and Noisykp. Two additional datasets, Flippedk and Noisyk, were generated by applying the flip and noise-addition augmentations to the original Kaggle ASD, respectively, as shown in Figure 5, for the purpose of comparison. The details of all datasets are shown in Table 3.
During training, we merged the Kaggle ASD with each training set, since the Kaggle ASD is the original uncurated set we obtained (Table 2), resulting in a twofold increase in the number of training samples (Table 4). As an illustration, the flipped training set for the augmentation-only approach combines the Kaggle ASD and Flippedk datasets presented in Table 3. Thus, the Train column in Table 4 gives the number of training samples obtained after pre-processing or augmentation. The datasets labeled All, for both the augmentation-only and the augmentation-with-pre-processing approaches, represent the combination of the datasets listed in Table 3 according to the provided Sl No. The test set and validation set used for testing and validation are the same across all experimental setups, as we wish to compare our results to those of contemporary research.
We employed the deep CNN-based MobileNetV2, ResNet50V2, and Xception models pre-trained on the ImageNet dataset of over a million photos across 1000 classes. The models were modified so that the prediction layer receives the features via an FC layer, and the final layers were adjusted for binary classification. The main reason for choosing these models is that, among recent research, only one study performed a complete ablation investigation on five different models, finding that these three performed the best. The hyperparameters listed below were also fine-tuned as part of that research, and the same values were retained for this study.
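The following Keras sketch illustrates this transfer learning setup for one backbone (Xception), using the hyperparameters reported in Table 5 and Section 3 (Adagrad, learning rate 0.001, 50% dropout before the decision layer, sigmoid output, binary cross-entropy); the 224 × 224 input shape and the exact head layout are assumptions for illustration, not the authors' published code.

```python
import tensorflow as tf
from tensorflow import keras

def build_asd_classifier(input_shape=(224, 224, 3)):
    """Binary ASD/NC classifier built on an ImageNet-pre-trained Xception base."""
    base = keras.applications.Xception(
        weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                         # keep the pre-trained convolutional features

    inputs = keras.Input(shape=input_shape)
    x = keras.applications.xception.preprocess_input(inputs)
    x = base(x, training=False)
    x = keras.layers.GlobalAveragePooling2D()(x)   # feature vector fed to the FC head
    x = keras.layers.Dropout(0.5)(x)               # 50% dropout before the decision layer
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adagrad(learning_rate=0.001),
        loss=keras.losses.BinaryCrossentropy(),
        metrics=["accuracy", keras.metrics.AUC(name="auc"),
                 keras.metrics.Precision(), keras.metrics.Recall()])
    return model
```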
We employed a handful of performance metrics to evaluate the models' effectiveness. The most evident one is the binary classification accuracy, stated as "accuracy". Another evaluation metric is the area under the curve (AUC), used in some earlier studies to measure how well a model predicts outcomes; since it is based on the ROC curve, the AUC is more convincing evidence of a model's efficacy than accuracy alone. Precision and recall, the other two metrics, reflect the accuracy with which the desired classes could be predicted. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), accuracy, precision, recall, and F1-score can be expressed mathematically as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
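As an illustration, assuming y_true and y_prob are NumPy arrays of ground-truth labels and sigmoid outputs, the same metrics (plus AUC) can be computed with scikit-learn, one of the analysis tools reported in the Results section; the function name and threshold are assumptions.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def evaluate_predictions(y_true, y_prob, threshold=0.5):
    """Compute the evaluation metrics of Equations (6)-(9) plus AUC."""
    y_pred = (y_prob >= threshold).astype(int)    # sigmoid output -> class label
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),
    }
```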
3. Results
The code was developed in Python and run on the Kaggle platform. The results were analyzed using several tools for data analysis, including matplotlib, sklearn, and Pandas. We trained the models using deep transfer learning from the Keras API library [59]. The performance of three distinct DL models (MobileNetV2, ResNet50V2, and Xception) was evaluated in terms of accuracy, precision, recall, and F1-score using Equations (6)–(9). The DL networks were selected based on the ablation study conducted by M. S. Alam et al. (2022), retaining their optimal hyperparameters and optimizer [12], as shown in Table 5. The batch size was set to 32, and Adagrad was utilized as the optimizer. The CNNs were trained for a maximum of 50 epochs with a learning rate of 0.001 to facilitate effective learning and accurate prediction of samples associated with autism spectrum disorder (ASD). For binary classification, BinaryCrossentropy was selected as the loss function, accompanied by ReLU activations and a sigmoid function in the final layer. The best values of the performance metrics derived from the various data-centric approaches are displayed in bold font in the tables. Our experiments are primarily data-driven. The initial part of this study is titled the augmentation-only approach because the training sets were generated by applying two fundamental augmentations to the Kaggle ASD dataset: flip and noise addition. As shown in Table 4, each image was labeled '0' for normal control (NC) children and '1' for ASD children while producing the data frame, and no dataset pre-processing was applied.
Table 6 summarizes the comparative training and test evaluation metrics of the deep learning models using the set of hyperparameters stated in Table 5.
Flip augmentation was used for the Kaggle ASD training dataset, and the best training and testing performance was achieved. The ResNet50V2 model performed the best for training with a 99.9% accuracy and 100% AUC value, while the Xception model ranked first in testing with 92.5% accuracy and 97.9% AUC.
Figure 6 displays the training and validation accuracy along with the loss graphs for all three models. The training and validation curves clearly demonstrate that the models began to overfit beyond a certain point, because the validation loss fluctuates up and down while the training loss keeps dropping. Clearly, the models cannot extract features for every potential scenario from these training and validation sets.
Table 7 demonstrates the performance on the evaluation metrics after pre-processing of the Kaggle ASD dataset. The pre-processing flow is shown in Figure 3, where the facial image is first aligned and then cropped, keeping only the facial region. The training accuracy is maximal when using ResNet50V2, with a value of 99.5% and an AUC of 100%. Similarly, for this approach, the testing performance is again best when predicting ASD with the Xception algorithm: accuracy is 97.9%, AUC is 99%, and the precision, recall, and F1-score are all reported to be 97.9%.
Figure 7 depicts the training and validation accuracy as well as the loss graphs for the three models with the pre-processing-only approach. Unlike Figure 6, Figure 7 displays a highly orderly increase in training accuracy over time, and the validation accuracy graph runs parallel to the training accuracy graph. For the Xception model, these two lines tend to overlap, illustrating the consistency of the training and validation trends. For the training and validation loss graphs, the pre-processing-only approach yields symmetric plots, indicating that the overfitting of the prior approach is minimized.
After applying augmentation to the pre-processed picture dataset, the performance of the models is outlined in Table 8. ResNet50V2 achieves the highest training accuracy, precision, recall, and F1-score, while its AUC is reported to be 100%. Xception demonstrates the highest testing accuracy, 98.9%, with a 99.9% AUC. The evaluation metrics for this approach yielded the highest values among the three data-centric approaches.
Figure 8 depicts the accuracy and loss performance for the training and validation sets. For the Xception model, the training and validation graphs grow in a highly similar manner, with only minor variation, and the curves are nearly a perfect match. The ResNet50V2 model demonstrates superior performance when comparing training and validation loss while maintaining minimal overfitting. Overall, this experimental setup performs the best.
AUC is defined as the area under the ROC curve, with a greater AUC indicating a greater likelihood of accurate prediction. Figure 9 depicts the ROC curves of the best data-centric approaches: Figure 9a depicts the curves for the flip augmentation approach without pre-processing, Figure 9b illustrates the AUC of the three models for the pre-processing-only approach, and Figure 9c depicts the AUC performance of the pre-processing with augmentation approach when mixing the All training set. The Xception model performs the best in terms of accuracy and AUC across all three approaches, indicating that the prediction rate for diverse test samples in a real-world scenario would be greater.
Figure 10 depicts the confusion matrices for the 280 test samples, where blue boxes represent accurate predictions of the autistic or non-autistic classes, and white boxes represent incorrect predictions, i.e., individuals who were incorrectly identified as autistic or non-autistic despite belonging to the opposite class. In Figure 10a, the first row displays the confusion matrices for the augmentation-only approach (flip augmentation) of the ResNet50V2, MobileNetV2, and Xception models, from left to right. Here, the performance of the models is abysmal, as Xception has a total of 21 misclassified test samples, which is the lowest of the models.
The confusion matrices for the pre-processing-only approach of the ResNet50V2, MobileNetV2, and Xception models are displayed from left to right in the second row, in Figure 10b. Here, the performance of the models is significantly enhanced by the pre-processing of the training dataset, and the Xception model has the fewest misclassified test samples, with only six.
The confusion matrices of the ResNet50V2, MobileNetV2, and Xception models for the augmentation with pre-processing approach, when mixing the entire training set, are depicted from left to right in the last row, as stated in Figure 10c. Applying the augmentation after pre-processing of the training dataset results in the greatest model performance, and the Xception model has the highest prediction accuracy, with only three misclassifications.
Table 9 depicts the total number of incorrectly predicted samples during training and testing for the various data-centric approaches. The total number of misclassifications is the sum of false positive and false negative classifications. MTr is the number of incorrectly predicted samples during CNN model training, and MTs is the number of incorrectly predicted samples when evaluating the model's performance on the unique test set. As the number of test samples is the same in all scenarios, the Xception model achieves the best result among all data-centric approaches, with only three mispredictions, when the training set is pre-processed, augmented (both flipped and noise added), and then fully combined.
Table 10 details the training duration for the three DL models. Te denotes the time required for model training per epoch in seconds. As training set sizes vary amongst approaches, a new parameter, Tes, represents the training time per epoch per sample that the model requires; this parameter hints at which model is time-efficient for a certain data-centric approach. The per-sample training time is highest for Xception on the unprocessed dataset, at 18.23 milliseconds, and lowest for ResNet50V2 after pre-processing and augmentation, at 13.46 milliseconds. Xception's training time is longer due to its complicated structure and large number of layers, although its accuracy is the highest of any approach.
Table 11 displays the accuracy and loss results for the validation dataset. The graphical representation for each epoch is shown in Figure 6, Figure 7 and Figure 8 for the three different approaches with the best accuracy performance. ResNet50V2 demonstrates the best result for the validation set, with 100% validation accuracy and nearly no validation error. This demonstrates the quality of the model's training and learning for feature extraction.
Explainable AI
Explainable AI refers to the ability of an AI system to explain its reasoning and decision-making processes in a way that is understandable to humans. Transfer learning, on the other hand, refers to the ability of an AI system to transfer knowledge learned from one task to another related task. These two concepts are related in that explainability can enhance the effectiveness of transfer learning by providing insights into the decision-making process of the AI system.
One method of explaining the decision-making process of a neural network is through visualization techniques such as Grad-CAM, which highlights the most important regions of an input image that contribute to the neural network's prediction. Thus, for ASD classification based on facial images, Grad-CAM can be used to explain where the transfer learning models concentrate when extracting ASD-specific features. By visualizing the acquired knowledge, we can comprehend how and where the models focus on facial images, which can aid in debugging transfer learning models and discovering transferability restrictions.
Figure 11 depicts the facial feature regions of autistic and non-autistic individuals. These two samples were selected at random from the 280 test samples. Grad-CAM shows where the various models extract the characteristics that identify the children as ASD or normal control. The Grad-CAM heatmaps were built using the model weights from the data-centric models with the highest performance. With Xception, the primary focus area is between the eyes, whereas ResNet50V2 focuses mostly on the nose and lips for autistic children. For a normal child, the Xception and ResNet50V2 models focus mostly on the nose, the area below the nose, the area between the eyes, and the upper and lower lips. In contrast, the MobileNetV2 model focuses primarily on the upper parts of the face, such as the eyes and upper nose.
Because the test set contains numerous sorts of images, it is difficult to draw a simple conclusion from individual results, as there are many differentiating aspects, such as gender and face pose, which might lead to varied Grad-CAM outcomes. To generalize the focus areas for autistic and non-autistic children, we require a generalized image pattern, i.e., a mean image. The mean image is a critical component of the pre-processing pipeline in image classification tasks, which helps to normalize the data; it is composed of the average value of every pixel in a collection of photos. All the photos in the test set are either autistic or non-autistic. Each sort of image is further distinguished by gender, as the demographics of male and female faces should differ. Moreover, images can be divided into two categories based on the position of the face landmarks, namely the frontal pose and the side pose. In the frontal pose, the face is at a zero-degree angle to the image plane, so the pictured face looks directly at the viewer; it is assumed to be the perfect candidate for testing, as the facial features may be easily extracted from the frontal view. There are photos in the test set that are not precisely straight and whose facial alignment is skewed to the left or right rather than the front; this is referred to as the side pose.
Additional factors may be responsible for the distinguishing features and the distinct focus areas in the Grad-CAM output. We therefore separated the images of the test set into autistic and non-autistic groups. The photos of each group were then divided in two based on gender. For both autistic and non-autistic samples, the photos were afterward grouped according to frontal and side face poses.
Table 12 describes the grouping details for deriving the mean images. Hence, based on this grouping, we obtained five mean images for the autistic class, representing the average of all autistic samples, the male and female samples, and the frontal- and side-pose samples. Similarly, for the non-autistic group, we derived the corresponding mean images. Subsequently, Grad-CAM was applied to these mean images to locate the normalized regions of interest of the various deep neural network models when extracting autistic or non-autistic characteristics.
Using the Grad-CAM heatmap, Figure 12 depicts the focal regions of the autistic facial images for the three CNN models. The first column contains the mean images computed from the total number of images of autistic children, as seen in Table 11. These images contain the facial characteristics of autism for various scenarios, such as gender and pose. So, we may conclude that the heatmap derived from the trained model with Grad-CAM on these mean images indicates the region of interest or features to be discovered in the autistic samples. Instead of selecting a random sample of male autistic children, it is justified to examine the heatmap on the mean image of all male autistic children in the test dataset. The first row of Figure 12 depicts the common areas where the different models focus on male autistic samples: the nose, the middle of the eyes, and the upper lips for the Xception model; a portion of the nose and primarily the lips for ResNet50V2; and the forehead for MobileNetV2. Even for the other types of mean images, Xception focuses mostly on the nose, the center of the eyes, and the upper lips.
For the frontal pose, which is regarded as the optimal condition for this type of binary classification, Xception focuses mostly on the nose, the center of the eyes, and the upper lips, whereas ResNet50V2 focuses primarily on the nose and lips, and MobileNetV2 focuses on the center of the forehead. The final row of Figure 12 is the mean of all autistic samples in the test set, and the general focus for autistic children is identical to that of the preceding mean images.
Although the second row of Figure 13 primarily shows the mean of the three models, it can be regarded as the general compiled area of focus for recognizing autistic children. The mean image of the three models' Grad-CAM heatmaps illustrates all the regions where the models concentrate while extracting autistic traits. A "T"-shaped region consisting of the forehead, the center of the eyes, the nose, and the lips is the frequent region of focus when attempting to predict the autistic sample, irrespective of gender and posture. Occasionally, the models may also concentrate on the cheeks.
Figure 14 displays the mean images of non-autistic samples derived using the same methods as for the preceding autistic images. The center of the eyes, the nose, and the upper lip are the principal focal points for all non-autistic samples of the Xception model, which remain the same as those identified previously for autistic children. With the other two models, ResNet50V2 and MobileNetV2, non-autistic characteristics are identified from the peripheral area of the face rather than the center.
Hence, the overall performance and feature isolation on the face for Xception is rather consistent for almost all the cases, regardless of gender, face pose, or autistic and non-autistic samples. The Grad-CAM results conclude that the Xception model selected the same area for feature extraction and that the classification is based on nearly identical facial regions. For the other two models, however, the autistic traits are nearly identical and isolated. In contrast, the non-autistic features originate from many regions of the face, as indicated by the decentralized Grad-CAM heatmap for the mean photos of non-autistic samples.
Figure 15 shows the Grad-CAM results, or focal areas of the face, for the incorrectly predicted autistic samples under the pre-processing with augmentation approach, where we obtained the highest prediction accuracy. The fundamental cause of these errors is that the models focus on the incorrect areas. Although it is uncertain why the models failed to concentrate on the areas required to extract facial features, we can assume that these images are in a side pose or show an extreme facial expression. The fact that there is no repetition among the failed images is intriguing, since it indicates that the various models fail to predict distinct test samples.
4. Discussion
This research discusses the use of DL and explainable artificial intelligence (AI) to diagnose ASD using facial images. The paper highlights the importance of the early diagnosis of ASD and focuses mainly on the use of AI in this emerging medical application. The paper proposes a data-centric approach that involves pre-processing and synthesizing a large dataset of facial images of children with and without ASD. We then train several DL models on this dataset to accurately diagnose ASD from facial images using different pre-processing and augmentation techniques. In addition to providing insights into the model's decision-making process and the components that contribute to the diagnosis, explainable AI techniques are also applied. Finally, we discuss the efficiency of this approach and compare it to other state-of-the-art methods to show that it outperforms them in terms of accuracy and efficiency.
For data pre-processing, we adopt two important steps: alignment and cropping. Alignment is the process of adjusting the image orientation so that the object of interest is in a consistent position across all images in the dataset. Cropping is the process of removing unwanted parts of the image, such as the background or other objects that are not relevant to the task at hand. In addition to improving model accuracy, alignment and cropping can also help to reduce the computational complexity of CNN models: by removing unwanted parts of the image, cropping reduces the input size of the CNN model, which can significantly reduce the number of parameters and the computation required. Several studies have shown that these pre-processing steps can significantly improve the accuracy of CNN models trained on image datasets. A study by Junliang Xing et al. [60] showed that alignment improved the accuracy of CNN models on a face recognition dataset by up to 6%. Similarly, a study by Ruoning Song et al. [61] showed that cropping improved the accuracy of CNN models on an object recognition dataset by up to 1%.
Quite a few studies have been undertaken exclusively in this area. However, to our knowledge, this data-centric approach has never been tried for ASD diagnosis using facial image datasets to achieve higher accuracy. Previous research showed that poor image quality in the training dataset substantially contributes to inaccurate model outcomes; pictures of children's faces often suffer from noise, poor resolution, misalignment, and other issues. Instead, most researchers concentrate on optimizing the models or the set of hyperparameters, with no promising improvement in accuracy. The results of the most recent studies in this area are compared in Table 13.
Mohammad-Parsa et al. [15] and Zeyad A. T. Ahmed et al. [16] both applied the same strategy utilizing the MobileNet model, obtaining a highest accuracy of 95%. Although M. S. Alam et al. [12] conducted an exhaustive ablation study to determine the optimal models and hyperparameters, they were only able to achieve an accuracy of 95% at best. In a later study, Basma R. G. Elshoky et al. [17] employed the automated tool Hyperopt with tree-based pipeline optimization to attain a prediction accuracy of 96.6%. The 98% success rate claimed by Mohamed Ikermane et al. [19] is not backed up by the data. The comparison with two other studies, by Taher M. Ghazal et al. [18] and Narinder Kaur et al. [20], whose claimed accuracies were only 87.7% and 70%, respectively, is not so significant in this regard. Compared to the previous research, our augmentation-only approach has a prediction accuracy of 92.5% with the Xception algorithm. Subsequently, after the training dataset was pre-processed, the performance increased to 97.9% with the same CNN model, which is a substantial improvement in this regard. When both pre-processing and augmentation are applied to the training dataset, we obtain a prediction accuracy of 98.9%, which clearly outperforms all prior ASD diagnosis research results.
The implementation of Grad-CAM, an artificial intelligence (AI) tool that exposes the diagnostic reasoning of transfer learning models, carries substantial clinical ramifications for the domain of medical diagnosis, specifically within the realm of assessing autism spectrum disorder (ASD). This explainable AI enables healthcare practitioners to make more informed and precise evaluations, improving patient care and facilitating well-informed treatment decisions. Lastly, we highlight the importance of carefully observing distinct facial characteristics, including the forehead, the area between the eyes, the nostrils, the lips, and occasionally the cheeks, in children diagnosed with ASD as well as in normal control individuals. The identification of reliable and readily observable facial markers linked with ASD can contribute to the early detection of the condition, facilitating prompt interventions and care for affected children. If these non-intrusive visual cues are confirmed and integrated into clinical practice, they have the potential to function as an additional screening tool that complements current diagnostic approaches. This, in turn, can have substantial advantages for individuals affected by ASD and their families, as it enables prompt access to suitable interventions and support services.
Limitation of the Study
During our research, we encountered a number of limitations that can be addressed in future studies as follows:
Firstly, there are some potential drawbacks to dataset pre-processing, as alignment and cropping can introduce some loss of information when parts of the image are removed or altered. Training time is greatly decreased after pre-processing; however, processing big datasets with high-resolution photos can be computationally expensive. Hence, future work could automatically highlight specific facial areas using improved networks rather than eliminating portions of the image.
Secondly, this Kaggle ASD dataset is the only openly accessible dataset in this regard on the internet and is not backed by clinical evidence. Moreover, the dataset consists only of RGB modality, not 3D (depth or shape) facial images. The lack of supporting data, such as gender, age, nationality, and sibling information, for each sample makes it impossible to validate the results demographically.
Thirdly, the Kaggle ASD dataset is not distributed symmetrically regarding gender, facial poses, or emotions. Additionally, the data were not collected using a defined protocol or attention mechanism. Hence, when analyzing the explainability of CNN models, it is quite difficult to derive a normalized pattern or specific facial regions to focus on. We hope that a medical research institute will publish or share a comprehensive dataset that can address all of these issues.