6.1. Discussion
As a concluding remark on the interpretation of the results, it was seen that a classifier’s performance does not rely solely on its test and cross-validation accuracies. Although cross-validation is a robust way to build classifiers and gives a general indication of how well a model will perform, models should also be chosen based on their practical application. For example, the classifier built for this study was intended for the screening of ADHD, which makes other metrics highly relevant for assessing the model. Such metrics include sensitivity, specificity and precision.
According to the statistical analysis that was done to estimate a sample size, a total of 200 subjects was recommended in order to achieve a model accuracy of 84%. The main aim of the study was to conduct a clinical trial with this sample size. The first step was to perform beta-tests on a smaller population (N = 30) to demonstrate the validity of using machine learning models. Although the beta-test results were preliminary, they were indicative enough to demonstrate that the research question could be answered. Given the time constraints of the study, it was decided that clinical trials would form part of future work. The aim was to develop a screening tool for ADHD, and the beta-test was able to provide a solution for that.
The machine learning model that was implemented was an SVM with a linear kernel. Due to the high dimensionality of the dataset, features were extracted through statistical and morphological analysis, and feature selection was then performed to obtain the most representative feature subset. Due to the small size of the dataset, leave-one-out cross-validation (LOOCV) was chosen to estimate the generalization error of the classifier, as well as to tune the regularization parameter. The chosen feature set consisted of 21 features selected using sequential forward selection, which outperformed the other three selection methods that were tried. The selected features included 11 of the game-play features and 10 of the features extracted from the accelerometer.
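As an illustration of the selection procedure described above, the following sketch implements sequential forward selection scored by leave-one-out cross-validation. The toy data, the choice of k, and the nearest-centroid stand-in classifier are all hypothetical; the study itself used a linear-kernel SVM on the real feature set.

```python
import math

def nearest_centroid_predict(train_X, train_y, x):
    # Stand-in for the study's linear SVM: assign x to the closer class centroid.
    best_label, best_dist = None, math.inf
    for label in set(train_y):
        pts = [xi for xi, yi in zip(train_X, train_y) if yi == label]
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        d = math.dist(x, centroid)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def loocv_accuracy(X, y, feats):
    # Leave-one-out cross-validation restricted to the chosen feature subset.
    correct = 0
    for i in range(len(X)):
        train_X = [[row[f] for f in feats] for j, row in enumerate(X) if j != i]
        train_y = [y[j] for j in range(len(y)) if j != i]
        pred = nearest_centroid_predict(train_X, train_y, [X[i][f] for f in feats])
        correct += pred == y[i]
    return correct / len(X)

def sequential_forward_selection(X, y, k):
    # Greedily add, one at a time, the feature that most improves LOOCV accuracy.
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < k and remaining:
        _, best_f = max((loocv_accuracy(X, y, selected + [f]), f) for f in remaining)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [1.0, 2.0], [0.5, 9.0],
     [10.0, 1.0], [11.0, 7.0], [10.5, 3.0]]
y = [0, 0, 0, 1, 1, 1]
chosen = sequential_forward_selection(X, y, k=1)
# chosen == [0]; loocv_accuracy(X, y, chosen) == 1.0
```

In the study itself the same greedy loop would run until 21 features were selected, with the LOOCV score computed by the linear SVM rather than by this stand-in.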
It can be seen that the test set accuracy and the LOOCV accuracy are both high, which is expected given a small dataset. The sensitivity (TPR) relates to the classifier’s ability to classify ADHD test subjects as having ADHD, and was therefore an important characteristic of the classifier, especially for screening. Good classifier performance requires that subjects who have ADHD be identified correctly. Here the sensitivity is 0.75, meaning that the classifier detects the presence of ADHD 75% of the time. Although ADHD is sometimes difficult to detect, even with classical methods, a sensitivity of 75% is quite low. The specificity of 1 shows that all non-ADHD test subjects were correctly classified.
Performance metrics of the classifier revealed that although the test and LOOCV accuracies were good (85.7% and 83.5%, respectively), care had to be taken when selecting a classifier as optimal. Important metrics, especially for diagnosing or screening conditions, include sensitivity and specificity, which describe how well a classifier correctly includes positives and correctly rules out negatives. From a screening point of view the penalty for an error is not as large as for diagnosis, but a very high sensitivity and acceptable-to-high specificity are most desirable. It was seen that the sensitivity was 75% while the specificity was 100%. The sensitivity was considered low, while the specificity, although high, was specific to this small dataset and would most likely decrease with a bigger dataset.
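Both metrics follow directly from the confusion matrix. In the sketch below, the counts (TP = 3, FN = 1, TN = 3, FP = 0) are implied by the reported figures (4 ADHD subjects of whom one was missed, 3 non-ADHD subjects all correctly rejected) rather than taken from a table:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    # Sensitivity = true-positive rate; specificity = true-negative rate.
    return tp / (tp + fn), tn / (tn + fp)

# Counts implied by the reported metrics on the 7-subject test set.
sens, spec = sensitivity_specificity(tp=3, fn=1, tn=3, fp=0)
# sens == 0.75, spec == 1.0; test accuracy == (3 + 3) / 7 ≈ 0.857
```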
The positive predictive value (PPV), or precision, relates to the relevance of the positive predictions. A precision of 1 means that every subject classified as having ADHD did in fact have ADHD. The negative predictive value (NPV) of 0.75 shows that 75% of the subjects classified as non-ADHD were truly non-ADHD.
The F1 score is the harmonic mean of precision and recall and reflects the balance between them: it is high only when both are reasonably high. The F1 score of 85.7% obtained here reflects the combination of the classifier’s perfect precision (1.0) with its lower recall (0.75).
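Using the same implied confusion-matrix counts, the predictive values and F1 score can be verified with a short sketch (the function name is illustrative):

```python
def predictive_values_f1(tp, fn, tn, fp):
    ppv = tp / (tp + fp)                    # precision: true predicted positives
    npv = tn / (tn + fn)                    # true predicted negatives
    recall = tp / (tp + fn)                 # equals the sensitivity
    f1 = 2 * ppv * recall / (ppv + recall)  # harmonic mean of precision and recall
    return ppv, npv, f1

ppv, npv, f1 = predictive_values_f1(tp=3, fn=1, tn=3, fp=0)
# ppv == 1.0, npv == 0.75, f1 == 6/7 ≈ 0.857
```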
The type I error (false positive rate) of 0 is consistent with the specificity of 1: no non-ADHD subjects were incorrectly classified as having ADHD. Although this metric is not indicative given the size of the dataset, it would have been approximately equal to 0.05 for a larger set. The type II error (false negative rate) of 0.25 is quite large and suggests that there is a 25% probability that the classifier produces a false negative, i.e. fails to detect a subject who has ADHD.
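The two error rates are simply the complements of specificity and sensitivity, which a minimal sketch (using the same implied counts) makes explicit:

```python
def error_rates(tp, fn, tn, fp):
    type_i = fp / (fp + tn)   # false-positive rate = 1 - specificity
    type_ii = fn / (fn + tp)  # false-negative rate = 1 - sensitivity
    return type_i, type_ii

t1, t2 = error_rates(tp=3, fn=1, tn=3, fp=0)
# t1 == 0.0, t2 == 0.25
```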
In addition to the performance metrics that were discussed, a comparison of the test set distribution and target set distribution was made. The following observations were made: (1) the target values comprised 4 ADHD and 3 non-ADHD subjects; (2) the predicted values comprised 3 ADHD and 4 non-ADHD subjects; (3) the test set contained 2 boys, 1 of whom had ADHD; (4) the test set contained 5 girls, 3 of whom had ADHD; (5) all the boys with ADHD were classified correctly; (6) all the boys without ADHD were classified correctly; (7) out of the 3 girls with ADHD in the test set, 2 were classified correctly; (8) all the girls without ADHD were classified correctly.
Although no major conclusions can be drawn from these few observations, it is interesting to note that the classifier was able to correctly reject all the boys and girls who did not have ADHD, as reflected by the 100% specificity. Contrary to the claim that boys are more often misdiagnosed than girls, all boys in the test set were correctly classified. This observation does not resolve the claim, however, since the dataset was not representative enough of the wider ADHD population.
A comparison of this study’s results with other studies and existing tools for the objective diagnosis of ADHD reveals that the results are reasonably close, especially considering the small size of the dataset. More specifically, the sensitivity of the proposed method was outperformed by the other methods by at least 5%. The specificity of 100% was seen as a biased result that could not be taken as representative of the method. The accuracy of the proposed method was moderate, being lower than that of the other methods by at least 5–7%.
6.2. Conclusion
The biggest disadvantage of the method is the small sample size. The significance of this is that the results cannot be treated as conclusive but only indicative. The confidence in the classification is not great, as over-fitting is likely to occur with such a small sample size. However, it has been demonstrated by [2] that SVM can be used for ADHD diagnosis with a sample size of 42, which is the closest study in terms of sample size to date.
An advantage of this method is that it has not yet been explored, in the sense that a game has not previously been used for ADHD screening. Another advantage is the ability to provide screening without a visit to a specialist, as this tool could be used by parents and teachers. The method could curb costs quite significantly through early screening and possible detection, as well as limit over-diagnosis.
Due to the complexity of game development, a simple game with minimal features was implemented. It is recommended that a more interactive and complex game be developed next, from which more features can be extracted and more parameters monitored. A more complex implementation would yield a higher-quality feature set and possibly better classifier performance. Furthermore, many studies have shown that multivariate time series (MTS) can help accurately classify diseases such as cancer and even ADHD. Such MTS data is found in the signals of electroencephalograms (EEG), electrocardiograms (ECG) and electromyograms (EMG), which could be incorporated into the game by placing sensors and electrodes on subjects. Additional physiological markers could also be added, such as eye tracking and heart rate.
The study that was conducted was able to suggest an answer to the research question that was presented, namely that a person can be screened for ADHD using quantitative methods. The classifier showed acceptable results, especially considering that those results were only preliminary. It was demonstrated that, given a data acquisition method, in this case the tablet game, meaningful data could be extracted and used to build a predictive model. The methods used to build the model were based on an extensive literature review, which showed that those methods could be applied reliably and repeatably. The classifier developed for the study was therefore not novel in itself; rather, the novelty lay in the design process as a whole.