1. Introduction
Single-cell RNA sequencing (scRNA-seq) measures the gene expression of individual cells using high-throughput sequencing and can profile millions of cells from different biological samples and conditions [1,2,3,4,5]. Unlike traditional bulk methods, which average the expression of thousands of cells in a sample and can therefore miss the variability of single cells, scRNA-seq reveals the diversity and characteristics of cells and helps identify their types [6]. It also provides new insights into how cells interact and function in complex systems [7,8,9] and supports research in healthcare, drug development, and biotechnology [10,11,12,13]. By providing powerful and straightforward access to the transcriptomes of individual cells [1], scRNA-seq has made it easier to characterize individual cell populations, leading to a better understanding of immunity and various diseases. Cancer is often considered the most heterogeneous and complex of all diseases [14]. Cancer stem cells are a major source of tumor formation, drug resistance, and metastasis [15], and their early detection is crucial for adequate diagnosis and treatment. scRNA-seq helps detect genetic information, gene regulation, and differences in gene expression within individual cells [16]. With its support, researchers can characterize the heterogeneity within tumors, analyze cancer stem cells, and map clones in the tumor, making scRNA-seq a valuable source of information that is increasingly used in cancer research. As the technology becomes more advanced and affordable, ever more scRNA-seq data are generated and used across biomedical fields.
Identifying cell types from single-cell transcriptomics data is an important step toward explaining the diversity and complexity of tissues and organisms. Several protocols can generate single-cell sequencing data, such as Drop-seq [17], inDrop [18], Chromium [19], and Smart-seq2 [20]. However, these protocols introduce different kinds of noise [21,22] and batch effects [23], which can reduce the accuracy of cell identification. Batch effects can also arise from differences in platforms [24,25], omics types [26], and species [4,27]. Cross-species single-cell analysis is an emerging field that studies how cells evolve and develop across species [28], but cross-species data can suffer from particularly severe batch effects. Reducing these batch effects is therefore a major challenge for cell identification methods.
One way to analyze scRNA-seq data is to cluster cells into different groups [29]. However, many clustering methods have limitations, such as requiring the number of groups in advance or consuming too much time and memory [30]. Several methods can identify cell types from various scRNA-seq data, such as Seurat (https://satijalab.org/seurat/, accessed on 26 August 2022) [31], Conos (https://www.nature.com/articles/s41592-019-0466-z, accessed on 26 August 2022) [32], scmap (https://www.nature.com/articles/nmeth.4644, accessed on 26 August 2022) [33], and CHETAH (https://academic.oup.com/nar/article/47/16/e95/5521789, accessed on 26 August 2022) [34]. Seurat [31] is a software toolkit for scRNA-seq data analysis that provides data quality control, single-cell clustering, differentially expressed gene identification, gene function annotation, pathway analysis, and visualization. In addition, Seurat v3 proposes an anchor-based cell-type identification method that can be applied to different single-cell datasets. Conos [32] integrates multiple scRNA-seq datasets, focusing on aligning homologous cell types across heterogeneous sample collections; it builds a joint graph representation through pairwise alignments, enabling labels to be propagated from one sample to another. Scmap [33] projects cells from one scRNA-seq experiment onto cell types identified in another experiment, assigning cell types by maximum similarity between a well-annotated reference dataset and an unknown dataset. CHETAH [34], guided by existing reference data, defines a classification tree for the top-down classification of unannotated data. Although these methods can be effective in various scenarios, they share a limitation: they use only the information from each individual cell and ignore the connections between cells. Graph neural networks (GNNs) are a recent approach to analyzing graph-structured data, such as cells and their connections [35]. GNNs can learn the features and relationships of cells and improve performance on different tasks. Graph convolutional networks (GCNs), a widely used class of GNNs, have been applied to single-cell and disease data [36,37,38,39,40], and GNNs have also been used for scRNA-seq analysis tasks such as imputation and clustering [41,42,43]. These results inspired us to apply GNNs to cell identification, a new and promising direction.
Most cell identification methods rely on the information from each cell in isolation, whereas scGCN [44] uses a graph convolutional network (GCN) to exploit the connections between cells. This improves identification performance and demonstrates the potential of graph-based models for this task. Different graph structures capture different information from scRNA-seq data, but scGCN depends on a single graph structure. Multi-view learning can combine multiple graph structures and thereby reduce the batch effects introduced by different sequencing methods. In this article, we propose a multi-view graph convolutional network model (scMGCN) for cell identification on scRNA-seq data.
Figure 1 provides a simplified workflow of our approach.
The main contributions of this article are the following:
We use multi-view learning and multiple graph construction methods to create different graph structures from raw scRNA-seq data. We then use graph convolutional networks to learn from these graphs and obtain comprehensive information about the cells. This also helps reduce the batch effects introduced by different sequencing methods in the cell identification task.
We develop scMGCN, a multi-view graph convolutional network that uses graph convolutional networks and attention mechanisms. scMGCN can learn the common and specific information from each cell and the relationships between cells.
Through benchmarking with other state-of-the-art cell type identification methods on single-species, cross-species, and cross-platform datasets, scMGCN consistently demonstrates superior accuracy in different tissues, platforms, and species.
This paper is structured as follows: Section 4 describes the data, data preprocessing, graph construction, and model details. Section 2 reports the performance evaluations. The Discussion section covers the strengths and limitations of our method, and the Conclusions section summarizes our main findings and contributions.
2. Results
2.1. Performance of Cell Type Identification on Single Dataset
We compare the performance of scMGCN with five other cell type identification methods on single-cell RNA sequencing (scRNA-seq) data from five datasets, which pose various challenges for cell type recognition, such as heterogeneity, batch effects, and rare cell types. The compared methods are Seurat v3 [31], Conos [32], scmap [33], CHETAH [34], and scGCN (https://www.nature.com/articles/s41467-021-24172-y, accessed on 26 August 2022) [44]. We use accuracy as the primary evaluation metric and perform five-fold cross-validation, reporting the average results. We also calculate F1-macro scores to measure the performance on minority cell types. We find that scMGCN outperforms the other methods in terms of both accuracy and F1-macro scores. The results for both metrics are shown in Table 1 and Table S1.
Our model achieves the highest mean accuracy (90.61) across all datasets, slightly better than Seurat v3 (89.84) and scGCN (87.44). scMGCN also outperforms Conos (83.85), scmap (77.33), and CHETAH (76.84) by a larger margin. On a few datasets, such as GSE98638 and SRP073767, scMGCN achieves only the second-best performance. The F1-macro scores of scMGCN vary more across datasets than those of the other methods, but its overall performance remains stable. These results demonstrate that scMGCN can effectively integrate multiple graphs and complement the information missing from any single graph for cell type identification.
2.2. Cell Type Identification across Datasets of Different Species
We evaluate the performance of scMGCN and four other methods, Seurat v3, Conos, scmap, and scGCN, for cross-species cell type identification on two pairs of human and mouse scRNA-seq datasets. We exclude CHETAH from this experiment because it does not perform data integration across species. We split the training set into a training subset (80%) and a validation subset (20%) within each cell type. We use accuracy and F1-macro scores as the evaluation metrics and perform five-fold cross-validation, reporting the average results.
Table 2 and Table S2 compare our method with the other methods on the cross-species datasets. The results show that scMGCN achieves the highest mean accuracy (75.12) across both pairs of datasets, considerably better than Seurat v3 (62.43), Conos (65.57), scmap (62.83), and scGCN (66.84). This indicates that scMGCN can extract and aggregate the shared high-order cell relationships from multi-view graph data. scMGCN also has the highest mean F1-macro score (71.95), indicating its ability to handle rare cell types across species. These results demonstrate the superior performance of scMGCN in cross-species cell type identification.
2.3. Cell Type Identification across Datasets of Different Platforms
We perform cross-platform analysis on six human peripheral blood mononuclear cell (PBMC) datasets and four human pancreatic cell datasets generated on diverse platforms and with different sequencing methods. To ensure a rigorous evaluation, we split the data from different platforms into training and testing sets. Within each cluster, we further set aside 20% of the training set as a validation set. Using the same evaluation criteria, we compare the accuracy and F1-macro scores of scMGCN with the other methods on the cross-platform datasets.
Table 3 shows the comparison results. Our results demonstrate that scMGCN outperforms the other methods in identifying cells across different experimental platforms. Specifically, on the PBMC datasets, scMGCN (88.52) achieves the highest mean accuracy, followed by scGCN (85.45), Seurat v3 (83.46), Conos (81.92), and scmap (64.03). scMGCN performs well even when the training set contains fewer than 100 cells, as in the PBMC (CEL-Seq2 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0938-8, accessed on 26 August 2022))-PBMC (10× v3 (https://www.10xgenomics.com/, accessed on 26 August 2022)) experiment. From Table 4, we can observe that scMGCN (98.98) also achieves the highest mean accuracy on the pancreatic datasets, followed by scGCN (98.87), Seurat v3 (98.79), scmap (86.75), and Conos (43.74). scMGCN also outperforms the other methods in mean F1 score for the classification of the human pancreatic cell datasets. These results indicate that scMGCN handles batch effects across platforms better than the other methods. The detailed results are shown in Table S3.
2.4. Ablation Experiments
To further evaluate scMGCN, we assess the contributions of multi-view fusion and of the different network modules across the different types of experiments.
2.4.1. The Performance of Multi-View Fusion
To validate the advantage of multi-view fusion, we compared models using multi-view and single-view inputs in the single-dataset, cross-species, and cross-platform experiments. The single-view results were obtained as follows: First, the graph structures of the reference and query datasets were built using the corresponding graph construction methods. Then, a graph convolutional network was trained on each single view to obtain the cell feature embeddings for each graph construction method. Finally, the cell types of the query dataset were predicted by the classification module. The experimental results are shown in Figure 2 and Figures S1–S4. The figures show that the model using multi-view fusion consistently outperformed the single-view models in the single-dataset, cross-species, and cross-platform experiments, and its performance was stable across datasets.
2.4.2. The Performance of Different Network Modules
To validate the effectiveness of each module of scMGCN, we compared scMGCN with two existing graph neural network-based multi-view fusion models in the single-dataset, cross-species, and cross-platform experiments: the heterogeneous graph attention network (HAN) and Multi-Omics Graph Convolutional NETworks (MOGONET). HAN is a model for heterogeneous graph data that uses multiple graph attention networks (GATs) to process the semantic information of different meta-path graphs and then fuses the different semantics with an attention mechanism; its structure can be applied directly to multi-view fusion. MOGONET uses multiple GCNs to extract information from different views and then aggregates them with a View Correlation Discovery Network (VCDN). The experimental results are shown in Figure 3 and Figures S5–S8. The figures show that scMGCN consistently outperformed the other two models in the single-dataset, cross-species, and cross-platform experiments, and its performance was stable across datasets.
2.5. Model Details and Computational Resources
The specific implementation of the scMGCN model consists of three main components: the Graph Convolutional Layer module, the Attention module, and the MLP module. The Graph Convolutional Layer module comprises six GCNs, with each GCN containing three layers of GraphConv and a dropout layer. The Attention module consists of two linear layers and a tanh activation function. Finally, the MLP module comprises a single linear layer. For specific parameters, please refer to the code.
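As a rough illustration of how these components fit together, the following PyTorch sketch mirrors the structure described above: per-view GCNs, an attention module built from two linear layers and a tanh, and a single-linear-layer MLP. It is a simplified stand-in rather than the released implementation; the dense, pre-normalized adjacency matrices, the two convolution layers per view, the dropout rate, and the layer sizes are illustrative assumptions.

```python
# Simplified sketch of the scMGCN architecture (not the authors' exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseGCN(nn.Module):
    """Per-view GCN operating on a pre-normalized dense adjacency matrix."""

    def __init__(self, in_dim, hid_dim, out_dim, dropout=0.5):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, a_hat, x):
        h = F.relu(a_hat @ self.w0(x))        # first propagation step
        h = self.dropout(h)
        return F.relu(a_hat @ self.w1(h))     # second propagation step


class ViewAttention(nn.Module):
    """Attention over per-view embeddings: two linear layers and a tanh."""

    def __init__(self, emb_dim, att_dim=64):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(emb_dim, att_dim), nn.Tanh(),
            nn.Linear(att_dim, 1, bias=False))

    def forward(self, z):                     # z: (P, N, emb_dim)
        w = self.project(z).mean(dim=1)       # mean importance per view, (P, 1)
        beta = torch.softmax(w, dim=0)        # normalized view weights
        return (beta.unsqueeze(-1) * z).sum(dim=0)   # fused embedding, (N, emb_dim)


class SCMGCN(nn.Module):
    """Per-view GCNs -> attention-based fusion -> single-linear-layer MLP."""

    def __init__(self, n_views, in_dim, hid_dim, emb_dim, n_classes):
        super().__init__()
        self.gcns = nn.ModuleList(
            [DenseGCN(in_dim, hid_dim, emb_dim) for _ in range(n_views)])
        self.attention = ViewAttention(emb_dim)
        self.mlp = nn.Linear(emb_dim, n_classes)

    def forward(self, a_hats, x):             # a_hats: list of (N, N) matrices
        z = torch.stack([gcn(a, x) for gcn, a in zip(self.gcns, a_hats)])
        return self.mlp(self.attention(z))
```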
We compared the time and memory consumption of Seurat v3, Conos, scmap, CHETAH, scGCN, and scMGCN on the single-dataset experiments on a system with 128 GB of memory, an Intel i7-13700F (8-core, 24-thread) CPU, and an NVIDIA RTX 4090 GPU. Seurat v3, Conos, scmap, and CHETAH run on the CPU, while scGCN and scMGCN run on the GPU. The results are summarized in Table 5.
3. Discussion
Several methods are available for identifying cell types from scRNA-seq data, such as Seurat, Conos, scmap, and CHETAH. However, these methods ignore the higher-order relationships among cells. Graph neural networks (GNNs) are a recent approach that uses graph structures to analyze data such as cells and their connections. A representative GNN model is scGCN, which learns higher-order relationships among cells but is limited to a single graph. We propose scMGCN, a model that uses multi-view graph convolutional neural networks to identify cell types in single-cell RNA sequencing data. The model adopts multi-view learning, generates different data views using various graph construction methods, and then uses graph convolutional neural networks to learn node representations. Finally, an attention-based multi-view embedding aggregation layer combines the learned node representations for cell type identification. We conducted comparative experiments on individual datasets and on data from different species and platforms. The results show that scMGCN performs well in cell identification tasks, especially in cross-species and cross-platform scenarios. Although the improvement over other methods is small in a few cases, such as the single-dataset cell type recognition task compared with scGCN, scMGCN achieves better performance than the other methods in most cases.
As an efficient cell type identification method, scMGCN will have widespread applications. Efficient cell type identification techniques are crucial in clinical diagnosis, cell development and differentiation research, and drug development efforts. For example, we can extend scMGCN to the diagnosis of blood diseases. By using scMGCN to classify blood cells from patients, we can determine their pathological types and provide a basis for clinical treatment. We can also use scMGCN to conduct in-depth research on specific cell types, helping researchers understand the role of different cell types in disease development and progression and thus providing new targets and strategies for drug development. In summary, the proposed cell type identification model, scMGCN, has broad and profound applications in medicine, biology, and related fields. It is of great significance for understanding the processes of life and preventing and treating diseases.
Although our method achieved good results on different datasets, it still has some limitations. First, regarding dataset quality, our method is influenced by the reference dataset: its quality directly affects the effectiveness of the final model, so in the future we could introduce preprocessing methods to improve it. Second, our method currently processes only single-cell RNA-seq data, and it achieved varying results on the single-species, cross-species, and cross-platform datasets, indicating a high sensitivity to the input data. In the future, we could further tune the relevant parameters and the number of network layers to reduce this sensitivity to data variations. Third, regarding data learning and interpretation, our model adopts an attention-based multi-view aggregation method to learn from different perspectives. Although this improves performance, scMGCN currently does not assign explicit edge weights to the relationships between cells in each graph; the graphs capture only whether a relationship exists, not its strength. Introducing edge weights may be considered in the future, although this would require handling the differences in how edge weights are defined by the different graph construction methods.
Finally, with continuing technological advancements, the analysis of single-cell RNA sequencing (scRNA-seq) data still faces challenges. First, the amount of data generated by scRNA-seq technology is vast and noisy. Second, analyzing cell heterogeneity, in particular accurately identifying and distinguishing different cell subpopulations, remains difficult. Furthermore, differences in experimental conditions, platforms, and batches make it hard to compare datasets directly. In this context, new algorithms and technologies such as the scMGCN model provide new opportunities for addressing these challenges. scMGCN uses different graph construction methods and graph convolutional networks to analyze single-cell data, which enables it to better capture the similarities and differences among cells for cell type identification and classification. Nevertheless, despite its many advantages, scMGCN still faces challenges, such as improving its generalization ability to handle larger and higher-dimensional data and integrating it with other advanced machine learning techniques to further enhance the accuracy and reliability of cell type identification.
4. Materials and Methods
4.1. Data Collection
The rapid development of single-cell technologies has led to a significant increase in single-cell omics data. As more single-cell datasets become available, there is an urgent need to leverage existing and newly generated data in a reliable and reproducible manner, learn from well-established single-cell datasets with clearly defined labels, and transfer these labels to newly generated datasets to assign cell-level annotations. However, existing and newly generated datasets are often collected from different tissues and species, under various experimental conditions, on different platforms, and across different omics types. Therefore, to meet the demands of practical applications, we conducted three types of experiments to evaluate the performance of scMGCN: on individual datasets and on datasets from different species and platforms. These datasets represent different scenarios and challenges for cell label transfer. To highlight the comparison of specific cell types, we first evaluated scMGCN on individual datasets. Next, we evaluated scMGCN on datasets from different species. Finally, since emerging single-cell datasets are generated on different experimental platforms, we tested whether scMGCN can accurately transfer labels between datasets from different platforms.
All datasets were obtained from public databases, as shown in Table 6. For the single-dataset experiments, we used the following datasets: GSE115746 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746, accessed on 26 August 2022), which contains 9035 cells from the mouse anterior lateral motor cortex (ALM); PHS001790, which contains 12,552 human cells from the middle temporal gyrus; GSE118389 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118389, accessed on 26 August 2022), which contains 1534 cells from six human triple-negative breast cancer (TNBC) tumors; GSE72056 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056, accessed on 26 August 2022), which contains 4645 human melanoma cells; GSE98638 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98638, accessed on 26 August 2022), which contains 5063 T cells from patients with hepatocellular carcinoma; GSE85241 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241, accessed on 26 August 2022), which contains 2122 pancreatic cells from human cadavers; GSE109774 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109774, accessed on 26 August 2022), which contains 54,865 single cells from 20 tissues of 3-month-old mice; SRP073767 (https://www.ncbi.nlm.nih.gov/Traces/index.html?view=study&acc=SRP073767, accessed on 26 August 2022), which contains 65,943 human peripheral blood mononuclear cells; and GSE120221 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120221, accessed on 26 August 2022), which contains 10,495 bone marrow mononuclear cells. To evaluate performance across data from different species, we used two dataset pairs. The first pair consists of the GSE115746 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746, accessed on 26 August 2022) and PHS001790 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001790.v2.p1, accessed on 26 August 2022) datasets, which contain 9035 mouse and 12,552 human brain cells. The second pair consists of the GSE120221 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120221, accessed on 26 August 2022) and GSE107727 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107727, accessed on 26 August 2022) datasets, which contain 10,495 human and 30,494 mouse bone marrow mononuclear cells. For each pair, we used one species as the training set and the other as the testing set. To evaluate performance across data from different platforms, we used two data types. One type is human peripheral blood mononuclear cell (PBMC) data sequenced by different methods: PBMC (10× v2), PBMC (Smart-seq2), PBMC (inDrop), PBMC (10× v3), PBMC (CEL-Seq2), and PBMC (Seq-Well). The six PBMC datasets are available from the Broad Institute Single Cell Portal (https://portals.broadinstitute.org/single_cell/study/SCP424/single-cell-comparison-pbmc-data, accessed on 26 August 2022) and the Zenodo repository (https://zenodo.org/, accessed on 26 August 2022). The other type is human pancreatic cell data, also sequenced by different methods: GSE85241 (CEL-Seq2) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241, accessed on 26 August 2022), GSE81608 (Smart-seq2) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81608, accessed on 26 August 2022), and E-MTAB-5061 (Smart-seq2) (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061, accessed on 26 August 2022). We chose these single-cell datasets because they are frequently used to evaluate the performance of cell type identification methods [31,33,44,45,46].
4.2. Data Preprocessing
Before constructing graphs and training models, the sequencing data must be preprocessed to reduce noise. We followed these steps. First, for droplet-based methods such as Drop-seq and inDrop, we removed doublets, i.e., cases where more than one cell is captured in a droplet; we used Scanpy to compute doublet scores and filter out the corresponding cells. Next, we kept only genes with at least one read and matched the gene sets of the reference and query data so that both had the same gene dimensions. Then, we selected highly variable genes using ANOVA as a multi-category differential analysis, which identifies the genes that vary most across cell types. After Bonferroni correction, we kept the 2000 genes in the reference set with the lowest adjusted p-values and removed all other genes from both sets.
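The following Scanpy-based sketch illustrates one way these preprocessing steps could be wired together. The helper names, the use of Scrublet through `sc.external.pp.scrublet`, and the ANOVA implementation via `scipy.stats.f_oneway` are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hedged sketch of the preprocessing steps described above (not the authors' code).
import numpy as np
import scanpy as sc
from scipy import stats


def remove_doublets(adata):
    """Score droplets with Scrublet (via Scanpy) and drop predicted doublets."""
    sc.external.pp.scrublet(adata)                    # adds obs['predicted_doublet']
    return adata[~adata.obs["predicted_doublet"].values].copy()


def preprocess(ref, query, n_hvg=2000):
    """ref/query: AnnData objects; ref.obs['cell_type'] holds reference labels."""
    ref, query = remove_doublets(ref), remove_doublets(query)

    # Keep genes with at least one read, then intersect the gene sets.
    sc.pp.filter_genes(ref, min_counts=1)
    sc.pp.filter_genes(query, min_counts=1)
    shared = ref.var_names.intersection(query.var_names)
    ref, query = ref[:, shared].copy(), query[:, shared].copy()

    # One-way ANOVA across reference cell types; Bonferroni-adjust the p-values
    # and keep the n_hvg genes with the smallest adjusted values.
    x = ref.X.toarray() if hasattr(ref.X, "toarray") else np.asarray(ref.X)
    groups = [x[(ref.obs["cell_type"] == ct).values]
              for ct in ref.obs["cell_type"].unique()]
    pvals = np.array([stats.f_oneway(*[g[:, j] for g in groups]).pvalue
                      for j in range(x.shape[1])])
    adj = np.minimum(pvals * len(pvals), 1.0)         # Bonferroni correction
    hvg = ref.var_names[np.argsort(adj)[:n_hvg]]
    return ref[:, hvg].copy(), query[:, hvg].copy()
```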
4.3. Graph Construction
After data preprocessing, we constructed graphs with different methods to calculate cell relationships, build adjacency matrices, and generate the combined graph adjacency matrices used as inputs to the graph convolutional layers. We used the Scanpy library to normalize the total counts of each cell, ensuring that every cell had the same total count after normalization, which helps minimize errors in the similarity calculations caused by cell heterogeneity.
In this study, we used six graph construction methods for multi-view aggregation, corresponding to six edge types in the multi-view graph: Approximate Nearest Neighbors Oh Yeah (ANNOY), CCA_MNN, Harmony, Scanorama, Scmap, and Autoencoder-KNN. The relevant details of these six methods are given in the Supplementary Materials. Each method first constructed an inter-data graph containing the cell relationships between the reference and query datasets, followed by an intra-data graph containing the cell relationships within the query dataset. Both graphs were represented as adjacency matrices, $A_{inter}$ and $A_{intra}$, whose dimensions are determined by $n_{ref}$ and $n_{query}$, the numbers of cells in the reference and query datasets, respectively. The two graphs were combined into a single adjacency matrix $A \in \mathbb{R}^{N \times N}$ with $N = n_{ref} + n_{query}$, where $A_{ij}$ represents the relationship between the $i$th and $j$th cells and $A_{ij} = 0$ indicates the absence of a relationship. In addition to the adjacency matrix, we used a feature matrix $X$, formed by stacking the preprocessed reference and query expression matrices, as the initial input features for the graph convolutional layer. Each row of $X$ is the initial feature vector of the corresponding node, and the same initial feature matrix is used for all graph construction methods.
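As an illustration of the block structure described above, the sketch below assembles one view's combined adjacency matrix and the stacked feature matrix. A plain scikit-learn k-nearest-neighbor search stands in for the actual graph construction methods (ANNOY, CCA_MNN, Harmony, Scanorama, Scmap, Autoencoder-KNN); the neighbor count and the symmetrization step are assumptions.

```python
# Sketch of one view's combined adjacency matrix A and feature matrix X.
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors


def build_view_adjacency(ref_x, query_x, k=15):
    """ref_x: (n_ref, genes) and query_x: (n_query, genes) dense matrices."""
    n_ref = ref_x.shape[0]

    # Inter-data graph: each query cell linked to its k nearest reference cells.
    inter = NearestNeighbors(n_neighbors=k).fit(ref_x).kneighbors_graph(query_x)

    # Intra-data graph: KNN graph among the query cells themselves.
    intra = NearestNeighbors(n_neighbors=k).fit(query_x).kneighbors_graph(query_x)

    # Combined (n_ref + n_query) x (n_ref + n_query) adjacency matrix:
    # [[0, inter^T], [inter, intra]], symmetrized so that A_ij = A_ji.
    zero_rr = sparse.csr_matrix((n_ref, n_ref))
    a = sparse.bmat([[zero_rr, inter.T], [inter, intra]], format="csr")
    a = ((a + a.T) > 0).astype(np.float32)

    # Shared feature matrix X: reference rows stacked on top of query rows.
    x = np.vstack([ref_x, query_x]).astype(np.float32)
    return a, x
```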
4.4. Related Comparative Methods
We compared five commonly used methods for cell type identification based on single-cell RNA sequencing data, including Seurat v3 [31], Conos [32], scmap [33], CHETAH [34], and scGCN [44]. Seurat v3 is a widely used and well-validated toolkit for single-cell genomics. Recently, Seurat v3 introduced an anchor-based label transfer method that has broad applicability and can be used with various single-cell samples. Conos utilizes the pairwise arrangement of samples to construct a joint graph representation, allowing label information to be transferred from one sample to another. Scmap learns and infers the cell types in an unknown dataset by comparing its maximum similarity with a reference dataset that has good cell annotations. With the guidance of reference data, CHETAH constructs a classification tree for the top-down classification of unannotated data. scGCN utilizes graph convolutional neural networks to learn complex relationships between cells from constructed graph data in order to complete the task of cell type recognition.
4.5. Model Design and Model Training
In this paper, we utilized a graph convolutional neural network for feature learning and employed multiple graph construction methods to describe the same cell data. By leveraging different graphs that capture information from diverse perspectives and can be represented as distinct types of edges, we consider the entire model as a multi-view graph convolutional network. To aggregate different embedding information, we adopt an attention mechanism.
4.5.1. Input Data for scMGCN
For the training of the graph convolutional layer, the $P$ graphs $\{G_p\}_{p=1}^{P}$, constructed by the different graph construction methods, are used as inputs, where $P$ corresponds to the number of graph construction methods, i.e., the number of views. Each input graph comprises an adjacency matrix and a feature matrix, obtained as described above, and the initial feature matrix is identical for all views.
4.5.2. Graph Convolutional Layer of scMGCN
During training, the graph-structured data are fed into the graph convolutional layer, where nodes receive information from their neighbors and update their representations, allowing the underlying features of the nodes to be learned. Graphs generated by the different graph construction methods are fed into the graph convolutional layer separately, yielding the corresponding node embeddings. The initial layer of a graph convolutional network (GCN) accepts the preprocessed graph data, and its processing procedure is shown in Equation (1):

$H^{(1)} = \sigma\big(\hat{A} X W^{(0)}\big)$ (1)

The output of the first graph convolution layer is $H^{(1)}$, which is obtained by applying a non-linear activation function $\sigma(\cdot)$ to the product of the normalized adjacency matrix $\hat{A}$, the input feature matrix $X$, and the weight matrix $W^{(0)}$. During training, the weight matrix $W^{(0)}$ is updated using stochastic gradient descent. $\hat{A}$ is a modification of the adjacency matrix that enables efficient training of the GCN; the specific modification is shown in Equation (2):

$\hat{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$ (2)

In Equation (2), $A$ represents the adjacency matrix of the original input graph, a square matrix whose dimension equals the total number of cells in the reference and query sets. $I$ denotes the identity matrix with the same dimensions as the adjacency matrix; its rows can be viewed as the one-hot encoded vectors of all nodes in the graph. $\tilde{D}$ is the diagonal degree matrix of $A + I$. As the number of layers in the GCN increases, the model learns higher-order neighbor information and aggregates it into the node representations. The specific process is shown in Equation (3):

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big)$ (3)

Here, $l$ indexes the graph convolutional layers, $H^{(l)}$ is the node feature representation output by the $l$th layer, and $H^{(0)} = X$. As the number of graph convolutional layers increases, the learned node embeddings become more abstract. However, our experiments show no significant improvement when the number of graph convolutional layers exceeds two; on the contrary, additional layers increase the running time and cause overfitting on some datasets. Hence, our experiments use a two-layer GCN.
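The following minimal sketch makes Equations (1)-(3) concrete: it normalizes the adjacency matrix and propagates features through two layers. Dense tensors and random weights are used purely for illustration; in the real model, the weight matrices are learned by stochastic gradient descent as described above.

```python
# Worked example of Equations (1)-(3) with dense tensors (illustrative only).
import torch


def normalize_adjacency(a):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_tilde = a + torch.eye(a.shape[0])
    d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt


def gcn_forward(a_hat, x, weights):
    """H^{(l+1)} = sigma(A_hat H^{(l)} W^{(l)}), starting from H^{(0)} = X."""
    h = x
    for w in weights:
        h = torch.relu(a_hat @ h @ w)
    return h


# Toy usage: 100 cells, 2000 genes, and two layers as used in the paper.
a = (torch.rand(100, 100) > 0.9).float()
a = ((a + a.T) > 0).float()                      # symmetric toy adjacency
x = torch.rand(100, 2000)
weights = [torch.randn(2000, 128) * 0.01, torch.randn(128, 64) * 0.01]
embeddings = gcn_forward(normalize_adjacency(a), x, weights)   # shape (100, 64)
```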
4.5.3. Multi-View Aggregation of scMGCN
Next, we need to fuse the information generated by the different graphs. In this paper, we propose an attention-based method to accomplish this. We take the $P$ sets of node embeddings $\{Z_1, Z_2, \ldots, Z_P\}$, learned by the graph convolutional layers, as input and obtain the weights $\{\beta_1, \beta_2, \ldots, \beta_P\}$ for the different node embeddings using Equation (4):

$(\beta_1, \beta_2, \ldots, \beta_P) = \mathrm{att}(Z_1, Z_2, \ldots, Z_P)$ (4)

In Equation (4), $P$ represents the number of graph construction methods, i.e., the number of views and edge types. The function $\mathrm{att}(\cdot)$ denotes a deep neural network that performs the attention mechanism. $Z_p \in \mathbb{R}^{N \times h}$ represents the output of the final graph convolutional layer for view $p$, where $h$ is the output feature dimension of the graph convolutional layers. To learn the importance of each graph, we first apply a nonlinear transformation to each set of embeddings. We then calculate the importance $w_p$ of each set of embeddings by measuring the similarity between the transformed embeddings and an attention vector, averaging the importance over all nodes. The specific process is shown in Equation (5):

$w_p = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} q^{\top} \tanh\big(W z_i^{p} + b\big)$ (5)

Here, $\mathcal{V}$ denotes the set of all nodes in graph $p$, $W$ represents the weight matrix, $b$ represents the bias vector, $q$ represents the attention vector used to calculate similarity, and $\tanh(W z_i^{p} + b)$ represents the embedding of node $i$ after the nonlinear transformation. To enable a fair comparison, these parameters are shared among the different node embeddings. We use the softmax function to normalize the importance values over all groups of node embeddings and obtain the weight of each node embedding, which represents the weight of the corresponding graph construction method. The specific process is shown in Equation (6):

$\beta_p = \frac{\exp(w_p)}{\sum_{p=1}^{P} \exp(w_p)}$ (6)

Here, $\beta_p$ represents the weight of the $p$th graph construction method; the higher $\beta_p$, the more important the corresponding graph construction method. The weights of the different graph construction methods also vary across task types. Using these weights as coefficients, we fuse the node embeddings of the different graph construction methods to obtain the final embedding $E$. The specific process is shown in Equation (7):

$E = \sum_{p=1}^{P} \beta_p Z_p$ (7)
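A direct transcription of Equations (5)-(7) is sketched below. The attention parameters W, b, and q are randomly initialized here as stand-ins for the learned, shared attention parameters.

```python
# Illustrative transcription of Equations (5)-(7) for P per-view embeddings.
import torch


def fuse_views(z_list, W, b, q):
    """z_list: list of P tensors of shape (N, h); W: (h, d); b: (d,); q: (d,)."""
    # Eq. (5): per-view importance, averaged over all nodes.
    w = torch.stack([(torch.tanh(z @ W + b) @ q).mean() for z in z_list])
    # Eq. (6): softmax normalization over the P views.
    beta = torch.softmax(w, dim=0)
    # Eq. (7): weighted sum of the per-view embeddings.
    return sum(b_p * z for b_p, z in zip(beta, z_list)), beta


# Toy usage: three views, 50 cells, 64-dimensional embeddings.
P, N, h, d = 3, 50, 64, 32
z_list = [torch.randn(N, h) for _ in range(P)]
E, beta = fuse_views(z_list, torch.randn(h, d), torch.zeros(d), torch.randn(d))
```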
4.5.4. Result Output of scMGCN
As shown in Equation (8), after aggregating the node embeddings of the multiple views, we use a multilayer perceptron (MLP) to obtain the predicted labels:

$\hat{Y} = \mathrm{MLP}(E)$ (8)

We calculate the cross-entropy between the predicted and true cell types and minimize it as the model loss. The loss function of the model is shown in Equation (9):

$L = -\sum_{i \in \mathcal{Y}_L} Y_i \ln\big(C \cdot E_i\big)$ (9)

Here, $C$ is the parameter of the multilayer perceptron classifier, $\mathcal{Y}_L$ is the index set of labeled nodes, $Y_i$ is the label (i.e., the cell type) of labeled node $i$, and $E_i$ is the final embedding of labeled node $i$. We optimize the model and learn the node embeddings by backpropagation. Because the output of the multi-view aggregation is a node embedding vector, the loss function is highly adaptable and can be customized for other tasks, such as link prediction.
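The sketch below shows how such a loss is typically realized in practice: a cross-entropy computed only over the labeled (reference) nodes and minimized by backpropagation. The function names, the optimizer choice, and the use of a model with the same forward(a_hats, x) interface as the SCMGCN sketch shown earlier are illustrative assumptions.

```python
# Minimal training/prediction sketch for the classification head and loss.
import torch
import torch.nn as nn


def train_step(model, a_hats, x, labels, labeled_idx, optimizer):
    """labels: (n_labeled,) cell-type indices; labeled_idx: reference-node indices."""
    model.train()
    optimizer.zero_grad()
    logits = model(a_hats, x)                                  # (N, n_classes)
    loss = nn.functional.cross_entropy(logits[labeled_idx], labels)
    loss.backward()                                            # backpropagation
    optimizer.step()
    return loss.item()


@torch.no_grad()
def predict(model, a_hats, x, query_idx):
    """Return predicted cell-type indices for the (unlabeled) query cells."""
    model.eval()
    return model(a_hats, x)[query_idx].argmax(dim=1)
```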
The scMGCN model adopts a multi-view approach, casting cell type identification as learning on a multi-view graph convolutional network. This allows us to integrate cell relationships from different perspectives, improving the accuracy of cell recognition and yielding a more stable and effective cell recognition system.
4.6. Performance Metrics
The model predicts a label for each cell, which represents its cell type. In this paper, the primary metrics used to assess the model's performance in predicting cell labels are the $F_1$ score and accuracy. Accuracy is the proportion of correctly predicted samples out of all samples. Equation (10) gives the calculation of Precision, and Equation (11) gives the calculation of Recall:

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (10)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (11)

Here, $FP$ stands for false positives, $FN$ represents false negatives, and $TP$ denotes true positives. In general classification tasks, Precision and Recall are calculated separately for each cell type, while accuracy is calculated over all samples. The calculation of the $F_1$ score is shown in Equation (12):

$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (12)
12):
Regarding cell type identification, each task is a binary classification task. To calculate the overall score, the scores of negative and positive samples need to be combined. The sub-library metrics in the sklearn module of Python provide two different ways of combining, namely macro and micro. Macro first calculates the precision and recall by cell type, then takes the average of all scores. This approach can reflect the model’s performance in predicting cell types that comprise a small proportion of the total population. Micro, on the other hand, does not explicitly distinguish between types when calculating.