1. Introduction
Autism spectrum disorder (ASD) is a deficit of social communication or sensory-motor function based on genetic association and other causations [
1]. The genetic association is supported by the inheritance rate observed by Tick et al.’s [
2] meta-data analysis on twins, which determined the inheritance of ASD to range between 64 and 91 percent. Tick et al. also associated a 0.98 correlation between genetics and neurodevelopmental disorders. De novo mutations further express the relationship inheritance has on ASD because these genetic mutations happen specifically with stem cell divisions and maturation of the female gametes within the trio (father, mother, and child) [
3]. These genetic mutations are based on mutations found between the trio, determining the de novo mutation that carries the high inheritance seen within the Tick et al. analysis [
4,
5,
6].
In the
Handbook of Clinical Neurology [
7], ASD connection is associated with an estimated 1000 genes determined based on genetic linkage between chromosome location (loci) and possible genetic risk. Alarcón et al. [
8] performed a comprehensive study on chromosome regions 7q and 17q, finding that region 7q2-35 has a strong possibility of associating with ASD while also noting that other areas, like chromosome 3q, might also have an association with ASD. Copy number variants (CNV) is the process of adding, removing, or copying portions of deoxyribonucleic acid (DNA) [
9]. CNV shows the exact genetic correlation on specific regions of the chromosome band, further exemplifying the link between loci and ASD.
Another genetic association is a common variant which significantly affects ASD risk. Grove et al. [
10] used a genome-wide association study that determined common variants’ strong, robust connection with ASD. Common variants are significant factors in most diseases, as they relate to 90 percent of the differences between individuals [
11]. From an individual perspective, common variants create minimal significance, but by putting all the variants together, we determine a noticeable impact for risk [
12]. Common variants make up 40–60 percent of the risk factor when evaluating ASD [
13].
Between all of the variants and genetic mutations, we see that these mutations and variants are connected within a network. Kolchanov et al. describe gene networks as a group of genes functioning in a coordinated manner [
14]. As seen, common variants, de novo variants, and CNV are all the byproducts of genes functioning in coordination with one another, and that function creates variants with high ASD risk. These variants all combined create a gene network that links all of these variants together, showing us the association/non-association of a gene [
15]. Graph neural networks have recently been proposed to address classic yet challenging graph problems, such as node or graph classification, and have been used for graph structure learning and graph classification. The diverse applications of GNNs in these studies underscore their broad utility in addressing various problems across various domains, including social networks, healthcare, and other network structures [
16,
17,
18,
19]. Our proposed experiment uses these gene networks and graph neural networks to determine if a gene has an association and the level of risk associated with the gene. The specific graph neural network models we use are Nodeformer [
20], Graph Sage [
21], and graph convolutional network (GCN) [
22]. Within this experiment, we will use a binary and multi-class classifier to predict the likelihood of a gene being associated with autism risk. We will also use another binary classification approach to predict whether a gene is associated with a particular syndrome related to ASD. In addition, we conduct extensive experiments to determine the effect of gene location and gene network information on the classification.
2. Related Works
Genome-wide association study (GWAS) is a method used to find genetic variants that are associated with a particular disease or trait. It does this by comparing the DNA of people with the disease or trait to the DNA of people who do not have the disease or trait. Bralten et al. [
23] used GWAS to find a connection between genetic sharing between ASDs and the autistic traits ‘childhood behavior’, ‘rigidity’, and ‘attention to detail’. Grove et al. [
24] used this technique to find five genome-wide significant loci associated with autism risk. Krishnan et al. [
25] developed a machine learning approach based on a human brain-specific gene network to present a genome-wide prediction of autism risk genes, including hundreds of candidates for which there is minimal or no prior genetic evidence. Rahman et al. (2020) [
26] established a network known as a brain tissue-specific Functional Relational Network (FRN), which applies machine learning techniques to predict the genomic-activated autism-related genes. Furthermore, Ismail et al. (2022) [
27] proposed a hybrid ensemble-based classification model for predicting ASD genes using machine learning. Lin et al. [
28] employed a machine learning-based approach to predict ASD genes using features from spatiotemporal gene expression patterns in the human brain, gene-level constraint metrics, and other gene variation features. Furthermore, Brueggeman et al. [
29] utilized machine learning and genome-scale data to forecast autism gene discovery.
Genes that confer risk for ASD are likely to be functionally related, thus converging on molecular networks and biological pathways implicated in disease [
30]. Taking this idea further, Krumm et al. [
31] showed that ASD genes with de novo mutations converged on pathways related to chromatin remodeling and synaptic function. Some later studies showed that integrating known risk genes using a protein–protein interaction (PPI) network can identify novel genes involved in ASD [
32,
33]. In this paper, we are analyzing the risk assessment using binary risk association labels, multi-class risk association labels (based on a score of confidence), and binary syndromic association for whether the gene is associated with an overarching medical condition.
Use of Graph Neural Networks to Predict Disease
In machine learning, many techniques have been used to predict diseases using gene networks, risk assessment ASD, and overall disease risk discovery using machine learning techniques. The first is the interaction discovery made in protein–protein interaction networks by Fenq et. al, who discovered using omic data that they can determine new links with these interaction networks [
34]. Wang et al. used attention-based graph neural networks to identify ASD based on the activity within brain scans [
35]. Beyreli et al. created DeepND, a multitask graph convolutional network that used an autism network and an intellectual disability network to determine the risk for both [
36]. They achieved a median AUC of 87%. Lu et al. used a graph neural network and a patient network to classify whether a patient suffers from a chronic illness or not [
37]. Wang and Avillach used DeepAutism, a convolutional network designed to diagnose ASD based on the presence of high-risk gene variants [
38]. They achieved 88.6% accuracy. Motsinger et al. used gene–gene interaction networks and neural networks to classify the risk people carry for Parkinson’s disease [
39]. Laksshman et al. created deep bipolar, which specializes in identifying gene mutations to determine whether somebody is bipolar [
40].
3. Methodology
3.1. Dataset
The datasets used for these experiments were the Sfari dataset [
41] and protein interaction network (PIN) data [
15]. The Sfari dataset contains gene associations and rankings (labels). It also contains the chromosome band location of each specific gene. For our binary risk association and multi-class risk association classification, we used the confidence score on Sfari, which ranks from most confident to least confident association in ASD (ranking from 1 to 3).
Case (1) Binary risk association classification. If a gene is contained in the SFARI dataset, it is automatically considered to contain risk for ASD. The two classes for this are as follows:
- 1
Gene with associated risk;
- 2
Gene without associated risk.
Case (2) The multi-class risk association classification uses three risk levels for its labels. The classes for this classification are as follows:
- 1
No gene association;
- 2
Low gene association;
- 3
Moderate gene association;
- 4
High gene association.
Case (3) Syndromic gene classification. (Syndromes are collections of multiple, related medical signs and symptoms that occur together. Mutations in a syndromic gene can lead to a variety of problems affecting different parts of the body and causing a recognizable pattern of symptoms.) The syndromic risk association shows us the identification of syndromic and possible non-syndromic genes. (Sfari dataset lists all specifically syndromic gene associations). The classes are as follows:
- 1
Syndromic gene;
- 2
Non-syndromic gene.
The PIN dataset comprises data on protein–protein interactions, elucidating which proteins interact. These interactions can be represented as a network or graph, where proteins are nodes, and interactions between them are edges or connections. The PIN dataset contains all protein–protein interactions both associated and not associated with ASD.
3.2. Preprocessing
The first preprocessing step is to filter out anything in the PIN dataset that is not specified as being a human gene interaction. Next, we add the chromosome band location and labels (binary, multi-class, and syndromic) from the Sfari dataset to the genes in the PIN dataset to have our edges and associated labels. The chromosome band locations obtained from the Sfari dataset are used as node features. Equation (
1) shows the definition of the graph:
G: The graph representing genes and their interactions.
V: The set of nodes, where each node represents a gene and is associated with its location (feature).
E: The set of edges representing interactions or connections between genes in the graph.
C: The set of labels representing autism risk for each gene, where each gene node is associated with a label indicating its classification (high confidence, strong candidate, suggestive evidence).
The chromosome band is placed in a one-hot encoder that splits our feature into a list of 472 binary features representing every possible location where the gene has been observed. This is then also followed up with using our edge list to create an adjacency matrix and then passing our labels. In the case of multi-class classification, the high confidence class contains 214 genes, the strong candidate class includes 530 genes, and the suggestive evidence class has 69 genes. This is paired with 11,403 no associated genes, which shows a severe imbalance. To fix this imbalance, we upsample [
42] all of the genes to ensure we have a balanced classification among all four classes. For the binary class, we take the high confidence class, strong candidate class, and the suggestive evidence class together, which is 813 genes with the same paired 114,03 non associated genes. Lastly, the syndromic labels are split into 169 syndromic genes and 12,047 that are not syndromic. The preprocessing steps are visualized in
Figure 1.
3.3. Models
The models used for this experiment are graph convolutional network [
22], Graph Sage [
21], and Multi-Layer Perceptron (MLP). The models each contain two layers, each having a rectified linear unit (ReLu) between the layers and ending with a multilayer perceptron (MLP). These models are set up to use the adjacency matrix, feature matrix, and selected labels to create either a binary or multi-class classification.
3.4. Graph Convolutional Network
GCN uses the convolutional properties of a graph network to facilitate a connection of both the features and the network. This process is performed in a layer of the GCN called the propagation Layer (Equation (
2)), where l is the layer of the GCN operation;
is the feature vector of node v in the l-th layer;
is the set of neighboring nodes of node v;
is the weight matrix for the l-th layer;
is a normalization factor for node v; and
is an activation function (e.g., ReLU):
GCN is more of a top-down approach, which looks at the entire picture of the network and its feature matrix for performing calculations. This model ignores the effect a singular node has on the network but instead looks at all of the interactions together through matrix multiplication. Once this is done, we obtain our embedding matrix, allowing us to classify what class a node belongs to.
3.5. Graph Sage
Graph Sage takes an aggregation of all neighboring nodes in the network. The Graph Sage operation is defined in Equation (
3), where l is the layer of the GraphSAGE operation;
is the feature vector of node v in the l-th layer;
is the set of neighboring nodes of node v;
is the weight matrix for the l-th layer;
is the aggregation function (e.g., mean, sum, or LSTM-based aggregation); and
is the activation function (e.g., ReLU):
Mean aggregated features are then concatenated to create new features for every node in the feature matrix. This model infers the connection of neighboring nodes. This connection creates another approach but an essential method for interpreting graph networks. Instead of using a broad look like GCN, this takes a neighboring approach, which instead infers the connection a feature relationship that not only the current node but its neighboring nodes have with each other.
3.6. Graph Transformer (Nodeformer)
NodeFormer [
20] is a graph transformer model known for its all-pair attention mechanism, enabling efficient graph data processing. Traditional graph neural networks propagate signals over a sparse adjacency matrix, while graph transformers [
43] can be seen as propagating signals over a densely connected graph with layer-wise edge weights. The latter requires estimation for the N*N attention matrix and feature propagation over such a dense matrix. At each layer of the graph transformer, the embeddings of the current layer are mapped to query, key, and value vectors. Then, all pair attention is calculated for aggregating the features. Equation (
4) shows how the attention operation is applied to node u:
z denotes the node embeddings, and q, k, and v denote the query, key, and value vectors. The
,
, and
are learnable weights at the k-th layer. The all-pair attention operation in Equation (
4) has
complexity. However, when it gets applied to all nodes in the graph, its complexity explodes to
. To avoid this, Nodeformer decouples the summation operation from the dot product as shown in Equation (
5). In this approach, the summations are independent of the node u. Therefore, the summations can be precomputed in
complexity and then shared among all nodes. The layer embedding calculation will thus be of
complexity:
6. Discussion
The results highlight that Graph Sage surpasses the baseline model in every case. The GCN surpasses the baseline in most cases but not all. This implies that while graph structures offer valuable information for the three classification problems, their utility may differ in specific contexts. The graph transformer model does not surpass the Graph Sage model in our experiments. It remains to be seen whether a larger dataset will improve the graph transformer model performance beyond the other models. The featureless models perform significantly worse than their counterparts in every case. This shows that the node features (Chromosome band locations) also contribute heavily to the classification. From the t-SNE charts, we can see clear clusters forming for each of the classes. In the case of multi-class risk association, the moderate-risk genes seem to fall in between the high-risk and low-risk genes. This shows that the risk levels are a spectrum ranging from low to high. Analyzing the top genes shows that the ASD risk-associated genes discovered by the models tend to be genes connected to brain or neuronal development. This further shows the ability of the models to distinguish the risk levels. However, further study of the top genes is needed before we can confirm this.
Limitations of our study can stem from the source datasets. For example, autism datasets such as Sfari tend to contain data from clinical populations, and thus they could exclude individuals with mild levels of autism. Therefore, our models could also be biased towards severe autism levels. Another potential source for bias is the under-representation of certain groups such as minority groups from the datasets. This could affect the classification accuracy of models when it comes to those populations. The benefits of our approach include the ability for early identification and intervention. In the case of autism, such early interventions can improve therapy outcomes. Research has shown that involving pediatricians as initial diagnosticians in multidisciplinary evaluations for young children with ASD [
51]. The biomarkers identified by our method could be used alongside other behavioral and brain imaging techniques as a comprehensive early diagnosis workflow. Since our method has good explainability through predictive features and t-SNE visualizations, it gives confidence to doctors and patients when used in a diagnostic setting.
Our method also holds promise for targeted therapeutic interventions for individuals with autism. Precision medicine approaches can be developed to target underlying genetic mechanisms contributing to ASD. This may include pharmacological interventions aimed at modulating neurotransmitter systems or gene therapy strategies to correct genetic abnormalities.