A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data

An, Sijing; Shi, Jinhui; Liu, Runyan; Wang, Jing; Hu, Shuofeng; Dong, Guohua; Ying, Xiaomin; He, Zhen

doi:10.3390/math11244901

Open AccessArticle

A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data

by

Sijing An

,

Jinhui Shi

,

Runyan Liu

,

Jing Wang

,

Shuofeng Hu

,

Guohua Dong

,

Xiaomin Ying

^*

and

Zhen He

^*

Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China

^*

Authors to whom correspondence should be addressed.

Mathematics 2023, 11(24), 4901; https://doi.org/10.3390/math11244901

Submission received: 11 October 2023 / Revised: 27 November 2023 / Accepted: 6 December 2023 / Published: 7 December 2023

(This article belongs to the Special Issue Mathematical Models and Computer Science Applied to Biology)

Download

Browse Figures

Versions Notes

Abstract

:

Clustering analysis for single-cell RNA sequencing (scRNA-seq) data is essential for characterizing cellular heterogeneity. However, batch information caused by batch effects is often confused with the intrinsic biological information in scRNA-seq data, which makes accurate clustering quite challenging. A Deep Adaptive Clustering with Adversarial Learning method (DACAL) is proposed here. DACAL jointly optimizes the batch correcting and clustering processes to remove batch effects while retaining biological information. DACAL achieves batch correction and adaptive clustering without requiring manually specified cell types or resolution parameters. DACAL is compared with other widely used batch correction and clustering methods on human pancreas datasets from different sequencing platforms and mouse mammary datasets from different laboratories. The results demonstrate that DACAL can correct batch effects efficiently and adaptively find accurate cell types, outperforming competing methods. Moreover, it can obtain cell subtypes with biological meanings.

Keywords:

batch effect correction; clustering analysis; single-cell RNA sequencing; adversarial learning; Dirichlet process; deep learning

MSC:

62P10; 92-08

1. Introduction

scRNA-seq is a technology measuring gene expression at a single-cell level [1]. The analysis of scRNA-seq data facilitates the discovery of biological processes in cells, the investigation of gene regulatory mechanisms that vary by cell subtypes [2], the deciphering of organism development processes [3], and many other biological discoveries. As an essential step in identifying cell types and revealing cellular heterogeneity, the clustering analysis of scRNA-seq data is critical in scRNA-seq data analysis.

However, the clustering analysis of scRNA-seq data is challenging. scRNA-seq data are sometimes compiled from different experiments. With the rapid development of single-cell RNA sequencing technologies, many options exist for different experiments, including different sequencing platforms, instruments, and reagents. Together with different operators and sample sources, all these factors will cause batch effects. The batch information caused by batch effects is often confused with biological information, which makes it challenging to decipher intrinsic biological heterogeneity by clustering analysis. Therefore, the batch effects are big barriers in the clustering analysis of scRNA-seq data.

Several methods have been proposed to perform clustering analysis on multi-batch scRNA-seq data. Most of these methods take batch correction and clustering as two separate steps, which remove batch effects first and then conduct clustering algorithms. Common approaches to alleviating the influence of batch effects include Seurat CCA [4], Seurat RPCA, LIGER [5], and Harmony. Seurat CCA achieves batch correction of scRNA-seq data by combining Mutual Nearest Neighbors (MNN) and diagonal Canonical Correlation Analysis (CCA). CCA can be replaced by Reciprocal PCA (RPCA) to improve efficiency. The LIGER (Linked Inference of Genomic Experimental Relationships) method uses integrative Non-negative Matrix Factorization (iNMF) to achieve the purpose of batch removal. Korsunsky et al. [6] proposed a cluster-guided batch alignment method named Harmony. It uses soft k-means clustering in a shared low-dimensional space of the original count matrix to constrain the batch removal process. In addition, adversarial learning has been used for batch correction in some studies. AD-AE [7] (Adversarial Deconfounding AutoEncoder) is a method based on Autoencoder (AE) [8]. It disentangles confounders from accurate signals and retains the accurate ones as the low-dimensional latent variable without batch effect. After batch correction, clustering analysis is performed to identify cell clusters using methods, such as Louvain [9] and Leiden [10].

However, since biological information is mixed with batch information; removing batch effects separately may impair biological information in scRNA-seq data and affect the performance of the clustering analysis. These methods might wrongly deplete or enrich certain cell types or accumulate errors in the training, leading to unsatisfactory clustering results [11].

A few methods have been presented to remove batch effects jointly with clustering analysis. The scCAEs algorithm combines convolutional autoencoder and soft k-means clustering [12]. The iterative training via Kullback–Leibler Divergence can ensure convergence while balancing biological and technical differences between clusters. It can remove batch effects while achieving clustering analysis. Li et al. [11] proposed the SCDC algorithm with a similar autoencoder structure. SCDC extracts biological and batch information from the original scRNA-seq data using two encoders and uses a soft clustering method on the low-dimensional representations of biological information. Compared with the methods that perform batch correction and clustering separately, these methods can perform much better [11,12]. However, these methods require users to specify the number of cell types (or resolution parameters) or to infer the parameters by traversing multiple values based on some metrics. Arbitrarily specifying cluster numbers often misguides clustering results, while traveling for optimal cluster numbers often requires significant computing resources. The methods that support batch effect removal and highly efficient adaptive clustering jointly are, therefore, in great need.

In this study, a deep adaptive clustering with an adversarial learning method is proposed. DACAL uses the AE architecture. It contains three modules: the AE module for dimensionality reduction, the Dirichlet Process Mixture Model (DPMM) [13] module for clustering, and the Adversarial module for batch correction. To achieve accurate analysis results, all the parameters of these modules are jointly optimized. In this study, DACAL is introduced in detail, and the robustness of the hyperparameter change in DACAL is evaluated using a simulated dataset. The performance of DACAL and other widely used batch correction and clustering methods are also evaluated on human pancreas datasets from different sequencing platforms and mouse mammary datasets from different laboratories. The results demonstrated that DACAL can adaptively obtain batch correction and clustering results and perform better than other methods. Moreover, DACAL can obtain meaningful subtypes for the input of scRNA-seq datasets containing multiple batches.

2. Materials and Methods

2.1. The Architecture of DACAL

DACAL is a deep batch correction and adaptive clustering algorithm for scRNA-seq data. The input of DACAL consists of two parts: a gene expression matrix

X {\in N}^{N \times M}

, which contains

N

cells and

M

genes, and a vector

s {\in N}^{N}

, which is composed of the corresponding batch IDs of these cells. The output of DACAL is the cluster label

y {\in N}^{N}

.

The AE module is used to reduce the dimensionality of the input

X

. The latent variables

C

of the biological information and

B

of the batch information are obtained.

C

and

B

are used for inferring the parameter

Γ

of the distribution of

X

, while

B

is used for obtaining the parameter

Π

of the distribution of batch IDs. To achieve clustering analysis for the input data,

C

is used in the DPMM module. The predicted labels

y

and the prior distribution to constrain

C

are obtained. To remove the batch effect,

C

is also used in the Adversarial module for inferring the Categorical distribution of batch IDs, which can lead to adversarial learning with the generation of

C

in the AE module. The two processes are balanced via adversarial learning to achieve the distinction between

C

and

B

. The workflow of DACAL is shown in Figure 1.

2.1.1. The AE Module

The input of the AE module is

X

and

s

, where

x_{n} \in N^{M}

and

s_{n} \in N

are the

n

-th components of

X

and

s

, which shows the gene expression and batch ID of cell

n,

respectively.

s_{n}

is converted into the one-hot encoding

G

-dimensional vector

η_{n}

, where

G

is the number of batches in the dataset. The spliced

x_{n}

and

η_{n}

are converted into the low-dimensional representation

z_{n}

using the encoder network

f

which is parameterized by

ϕ

.

z_{n}

consists of two parts: the low-dimensional latent variable

c_{n} \in R^{D^{c}}

of the biological information and the low-dimensional variable

b_{n} \in R^{D^{b}}

of the batch information.

f^{c}

is the encoder network for

c_{n}

and

f^{b}

is the encoder network

b_{n}

(

f = {f^{c}, f^{b}}

). The decoder network

g^{θ}

, with parameter

θ

, predicts the distribution of

x_{n}

conditioned on

z_{n}

, which is assumed to be a Poisson distribution parameterized by

γ_{n}

(

Γ

= [

γ_{1}; γ_{2}; \dots; γ_{N}

]). The other decoder

g^{ξ}

parameterized by

ξ

can predict the distribution of

s_{n}

conditioned on

b_{n}

, which is assumed to be a Categorical distribution parameterized by

π_{n}

, (

Π = [π_{1}; π_{2}; \dots; π_{N}

]). The loss of AE

l_{A E}

is defined as the reconstruction loss of

x_{n}

and

s_{n}

. The loss consists of two parts: the negative log-likelihood of

γ_{n}

given

x_{n}

, and the negative log-likelihood of

π_{n}

given

s_{n}

, i.e.,:

\begin{matrix} - \ln p (x_{n}| γ_{n}) & = - \ln p (x_{n} | g^{θ} (z_{n}; θ)) \\ = - \ln P o i s s o n (x_{n}| g^{θ} (f (x_{n}, s_{n}; ϕ); θ) \\ - \ln p (s_{n} | π_{n}) & = - \ln p (s_{n} | g^{ξ} (b_{n}; ξ)) \\ = - \ln C a t e g o r i c a l (s_{n} | g^{ξ} (f^{b} (x_{n}, s_{n}; ϕ); ξ)) \\ - \ln p (x_{n}| γ_{n}) - \ln p (s_{n} | π_{n}) & ≜ l_{A E} (ϕ, θ, ξ; x_{n}, s_{n}) \end{matrix}

(1)

2.1.2. The DPMM Module

The input of the DPMM module is the latent variable

C

of the biological information, where

c_{n}

is the

n

-th component in

C

. The prior distribution provided by the DPMM module can constrain

c_{n}

and infer the clustering label

y_{n}

of cell

n

. In this module,

c_{n}

is assumed to be generated from the Dirichlet process mixture model, and the loss function of this module is defined as the negative log-likelihood of

c_{n}

and the distribution determined by DPMM [14]. The loss function

l_{D P M M}

is defined as the following:

\begin{matrix} - \ln p (c_{n}, G) & = - \ln p (f^{c} (x_{n}, s_{n}; ϕ), G) \\ ≜ l_{D P M M} (ϕ, G; x_{n}, s_{n}) \end{matrix}

(2)

where

G

is the set of latent variables in DPMM. The optimization of

l_{D P M M} (ϕ, G; x_{n}, s_{n})

maximizes the item

\ln p (c_{n}, G)

, which has been given in [15,16]. In this module, as shown in [16], there is a hyperparameter

α

in the Dirichlet process, and the larger

α

leads to the more uniform distribution of weights, which means the greater number of clusters.

2.1.3. The Adversarial Module

The latent variable

C

is the input of the Adversarial module. The module predicts the parameter

π_{n}^{D}

of the Categorical distribution of

s_{n}

. The discrimination of

s_{n}

using the Adversarial module and the generation of

c_{n}

using the AE module are adversarially learned, which can help to separate the information in low-dimensional representation into two parts: biological information in

c_{n}

and batch information in

b_{n}

. By using adversarial learning, the Adversarial module cannot determine the correct batch ID from

c_{n}

, which means that

c_{n}

does not contain batch information anymore, thus completing batch correction.

There are two loss functions in this module:

l_{d}

is the negative log-likelihood of

s_{n}

given

c_{n}

, which is used for updating the parameter

Ω

in the Adversarial module; the other is the log-likelihood of

s_{n}

given

c_{n}

, which is used for adversarial learning and updating the parameter

ϕ

in the AE module.

l_{d}

and

l_{a}

are as the following:

\begin{matrix} - \ln p_{Ω} (s_{n}| c_{n}) & = - \ln p_{Ω} (s_{n}| f^{c} (x_{n}, s_{n}; ϕ)) \\ ≜ l_{d} (Ω; x_{n}, s_{n}, ϕ) \\ \ln p_{Ω} (s_{n}| c_{n}) & = \ln p_{Ω} (s_{n}| f^{c} (x_{n}, s_{n}; ϕ)) \\ ≜ l_{a} (ϕ; x_{n}, s_{n}, Ω) \end{matrix}

(3)

2.1.4. Loss Function

The loss function of updating the parameters of encoders and decoders in the DACAL contains the loss of all the modules by using Equations (1)–(3), that is:

l (ϕ, θ, ξ, G; x_{n}, s_{n}, Ω) = l_{A E} (ϕ, θ, ξ; x_{n}, s_{n}) + l_{D P M M} (ϕ, G; x_{n}, s_{n}) + l_{a} (ϕ; x_{n}, s_{n}, Ω)

(4)

To minimize the loss function

l (ϕ, θ, ξ, G; x_{n}, s_{n}, Ω)

, an iterative training strategy is used to optimize the parameters of the AE, DPMM, and Adversarial modules alternatively. The training algorithm of DACAL is detailed in Algorithm 1.

Algorithm 1 The DACAL training algorithm

Input: A scRNA-seq dataset

{\{x_{n}, s_{n}\}}_{n \in N}

Output: The cluster label

y

1: for

t = 1,2, \dots, T

do:

2: Sample a mini-batch

{\{x_{n}, s_{n}\}}_{n \in N_{t}}

from the dataset, where

N_{t} \subset N

3: for Step

k = 1,2, \dots, K

do:

4:

Freeze ϕ

and update

Ω

with loss

\frac{1}{| N_{t} |} \sum_{N_{t} \subset N} l_{d} (Ω; x_{n}, s_{n}, ϕ)

7: end for

8 : Freeze Ω

and

G, update ϕ, θ, ξ

with loss

\frac{1}{| N_{t} |} \sum_{N_{t} \subset N} l (ϕ, θ, ξ, G; x_{n}, s_{n}, Ω)

9: if

t m o d S = 0

then:
10: Freeze

ϕ

and update

G

with loss

\frac{1}{|N_{t}|} \sum_{N_{t} \subset N} l_{D P M M} (ϕ, G; x_{n}, s_{n})

11: end if

12: end for

*

T

is the total number of iterations,

S

is the number of iterations in each epoch, and

K

is the number of updates for

Ω

in each iteration.

Algorithm 1 shows the training process of DACAL. The parameters of the AE, DPMM, and Adversarial modules are alternatively optimized to minimize the loss function

l (ϕ, θ, ξ, G; x_{n}, s_{n}, Ω)

. Specifically, first, the parameter of encoder network

f

in the AE module is frozen to update the parameter in the Adversarial module. Then, the parameters in the DPMM and Adversarial modules are frozen, and the parameters in the AE module are updated to remove the batch information in the low-dimensional representation. Finally, the parameters in the AE and Adversarial modules are frozen, and the parameters in the DPMM module are updated to obtain the clustering result.

2.2. Training the DACAL Model

The DACAL model is implemented using PyTorch [17] (v.1.12.0) in Python 3 (v.3.7.11). The BayesianGaussianMixture function using scikit-learn (v.1.0.2) is used in the DPMM module. The number of mixture components in the BayesianGaussianMixture function is set to 50, where 50 is the maximum number of cell types allowed in the model and far more than the true number of cell types. The parameter details of DACAL are as follows: the two hidden layer sizes of the encoder

f

are set to 512 and 128, the hidden layer size of the decoder

g^{θ}

to 128 and 512, the hidden layer sizes of the decoder

g^{ξ}

to 128 and 64, and the hidden layer sizes of the discriminator to 128 and 64. The dimensionality of

c_{n}

is set to 32 and

b_{n}

to 2. The number of updates used in

l_{d} (Ω; x_{n}, s_{n}, ϕ)

is 10. During the training process, the minibatch size is set to 512, and the AdamW optimizer [18] with a fixed learning rate of 1 × 10⁻⁴ is used for optimization. The hyperparameter

α

is set to 1 × 10⁻¹⁰ except for the experiment of the robustness of hyperparameter change. The AE and Adversarial modules are pre-trained during training until a stable low-dimensional representation is obtained; the entire model is then trained.

2.3. Datasets and Preprocessing

In order to demonstrate the performance of DACAL on multiple batches of scRNA-seq data, three datasets are used in this paper, including a simulated dataset and two real datasets. The simulated dataset is constructed as follows: assume that the data follows a Gaussian mixture distribution. The mean and log of variance are sampled from the Gaussian distribution, respectively. The weights are sampled from the Dirichlet distribution. The simulated dataset contains 32,000 cells from 4 batches with 10 cell types, available at https://github.com/omicshub/DACAL (accessed on 5 December 2023).

This article uses two real datasets from typical batch sources: the human pancreas cell dataset from 5 sequencing protocols and the mouse mammary cell dataset from 3 laboratories. The human pancreas cell dataset contains 14,890 cells from 8 batches with 13 cell types; the mouse mammary cell dataset contains 9288 cells from 3 batches including 3 cell types. These datasets can be obtained from https://figshare.com/articles/dataset/Batch_Alignment_of_single-cell_transcriptomics_data_using_Deep_Metric_Learning/20499630/2?file=36699252 (accessed on 5 December 2023) [19].

The Seurat package (v.4.1.0) [17] is used to preprocess the data. Specifically, UMI (Unique Multiple Index) count matrices are normalized and log-transformed using the NormalizeData function in Seurat; then, 4000 highly variable genes are selected using the FindVariableFeatures function; at last, the selected genes are scaled to obtain the gene expression matrix using the ScaleData function. In addition, batch information of the data needs to be extracted to obtain the batch IDs vector. Gene expression matrix and batch IDs vector are used together as different inputting methods.

2.4. Comparing Methods and Evaluation Metrics

DACAL is compared with commonly used batch correction and clustering methods, which includes combinations of four widely used batch correction methods and four commonly used cluster analysis methods, as well as a joint batch correction and clustering method scCAEs. The 4 batch correction methods are Harmony, LIGER, Seurat CCA, and Seurat RPCA. The input of these methods are gene expression matrix and batch IDs vector, which are the same as those of DACAL. The dimensionality of low-dimensional representation is set to 32 and implemented with the original code and the default hyperparameters in the experiments. The four clustering correction methods are the Louvain algorithm, the Louvain algorithm with multilevel refinement (MR Louvain), the Smart Local Moving (SLM) algorithm, and the Leiden algorithm. The input of these clustering methods is the output of the batch correction methods. In the experiments of this paper, these clustering methods are implemented using the Seurat (v.4.1.0) package with default parameters. The experiments on scCAEs use the original code and default hyperparameters; the hyperparameter of cluster numbers of scCAEs is set to the number of actual labels in each experiment.

In addition, PCA dimensionality reduction and four clustering methods are combined without batch correction method as a raw experiment to demonstrate the batch effect in the datasets.

Five metrics are used to evaluate the performance of these methods, i.e., four classical clustering performance indicators and one indicator defined by us, i.e., Normalized Mutual Information (NMI) [20], Adjusted Rand Coefficient (ARI) [21], F1 score [22], Silhouette Coefficient (SC) [23], and Deviation Ratio (DR). These classical indicators are implemented by using scikit-learn. As the metric scores of NMI, ARI, and F1 score range from 0 to 1, the SC score is scaled to

0 - 1

for calculating the average score with NMI and ARI, where 0 and 1 correspond to disappointing and ideal results of the method, respectively.

To evaluate the degree of deviation between the number of predicted labels and the number of true ones, a DR indicator is put forward, that is:

D R = \frac{h_{p r e d i c t e d} - h_{t r u e}}{h_{t r u e}}

(5)

where

h_{p r e d i c t e d}

is the number of predicted labels,

h_{t r u e}

is the number of true ones, respectively. The larger absolute value indicates a more significant deviation, and vice versa.

3. Results

3.1. DACAL Is Robust to Hyperparameter Changes on scRNA-seq Data

To verify the robustness of the hyperparameter

α

change in DACAL, the experiment by setting the hyperparameter

α

to 8 different values in the wide range of [1 × 10⁻²⁰⁰, 1 × 10⁻⁰] was designed. As shown in Figure 2a, the numbers of clusters obtained by DACAL are always very close to the numbers of the true one as parameter

α

increased, suggesting that DACAL can adaptively obtain accurate cluster numbers. As shown in Figure 2b, the NMI, ARI, SC, and F1 scores, and their average were also computed. The indicators are all greater than 0.7, and the average scores are greater than 0.8, suggesting the DACAL’s robustness of hyperparameter changes.

3.2. DACAL Can Jointly Remove Batch Effect and Cluster Adaptively on scRNA-seq Data

To verify the performance of the DACAL on the scRNA-seq data, DACAL and the other competing methods were applied to the human pancreatic cell dataset. The comparison methods include scCAES and the combinations of four widely used batch correction methods (Harmony, LIGER, Seurat CCA, and Seurat RPCA) and four commonly used cluster analysis methods (Louvain, MR Louvain, SLM, and Leiden). In addition, the combinations of PCA and four clustering methods are also demonstrated as the baseline.

The low-dimensional representations obtained by the above methods are visualized through UMAP and labeled with the true cell types, batch labels, and predicted labels, respectively. The UMAP plots of DACAL show that the cells in different batches mix well and the predicted labels accord the ground truth well (Figure 3a, left). Compared to DACAL, other competing methods either did not remove batch effects efficiently or did not predict cell-type labels accurately (Figure 3a, middle). The PCA dimensionality reduction result marked with the Batch label is depicted as a baseline, which shows that the batch effect is evident in this data set (Figure 3a, right). These results demonstrate that DACAL outperforms all the other batch correction and clustering methods.

In order to quantitatively measure the similarity degree between the cluster labels obtained by different methods and the true cell type labels, the indicators NMI, ARI, SC, and F1 scores and DR were calculated. Figure 3b shows the NMI, ARI, SC, and F1 scores and their average of different methods. DACAL obtained the highest scores in all four indicators, indicating that its clustering performance is better than the other 21 methods. The DR scores were also calculated to compare the deviation between the number of predicted labels from different methods and the number of true ones in this dataset. As shown in Figure 3c, the result of DACAL has the lowest DR score in different methods, which means that the predicted number of clusters is the closest to the number of cell types.

To show the relationship between the actual cell type label and the predicted label obtained by different methods, the Sankey plots are used to demonstrate the match between these labels. As shown in Supplementary Figure S1, the predicted labels by DACAL match the true label the best.

3.3. DACAL Can Provide Fine-Grained Clusters on scRNA-seq Data with Batch Effect

When comparing DACAL with different batch correction and clustering methods on the mouse mammary cell dataset, DACAL obtained far better results in NMI, ARI, SC, and F1 scores and DR than the competing methods (Figure 4a,b). It should be noted that the DR score of DACAL on this dataset is 2, which is a very high value. The high positive deviation indicates remarkably more clusters than the ground truth. The reasons for more predicted clusters by DACAL than true labels should be investigated.

To show the relationship between the actual cell type label and the predicted label obtained by DACAL, the Sankey plot demonstrated the match between these labels. As shown in Figure 4c, DACAL subdivides luminal progenitor and luminal mature into two clusters, respectively. Luminal mature cells are divided into cluster 2 and cluster 27. The literature shows that cells with a high expression of Prlr and S100a6 are luminal mature cells related to hormone changes [24]. The marker genes of these two clusters were analyzed to infer the meaning of the subtypes represented by cluster 2 and cluster 27. Figure 4d is a partial enlargement of the UMAP visualization of these two clusters marked by the genes of Prlr and S100a6. Compared to cluster 27, Prlr and S100a6 are specifically highly expressed in cluster 2. Cluster 27 is other luminal mature cells. The Sankey plots of other methods are shown in Supplementary Figure S2.

To determine why the luminal progenitor was divided into these two clusters, the marker genes of cluster 0 and cluster 6 are analyzed. Bhupinder et al. demonstrated that the early luminal progenitor subset can be marked by a high expression of the CD55 gene [25]. By calculating the average expression value of gene CD55 in cluster 0 and cluster 6, its expression level in cluster 0 is found to be twice that in cluster 6. Therefore, cluster 0 can be inferred as an early luminal progenitor subset through the specific high expression of CD55, and cluster 6 is a luminal progenitor subset at other stages. These results show that DACAL can achieve good batch correction and adaptive clustering results and adaptively identify accurate subtypes with biological meanings.

3.4. Comparisons of Runtime and Memory Usage of DACAL and Other Methods

The computing time and memory consumed by DACAL and 21 other approaches were compared on a SITONHOLY Linux server (Tianjin, China) with 72 CPU cores of two Intel Xeon Gold 6240 chips (Santa Clara, CA, USA), 503 GB of RAM, and NVIDIA RTX 2080 Ti GPUs (Santa Clara, CA, USA).

As shown in Figure 5a,b, the runtime of DACAL is higher than most other methods except scCAEs, where the mouse mammary cell dataset cannot provide the required number of genes for scCAEs, so the corresponding results are not obtained. DACAL and scCAEs are methods based on deep learning, which takes multiple epochs of training to learn stable model parameters and network parameters and usually takes longer than non-depth methods. However, when comparing the CPU and GPU requirements, DACAL needs less memory than all the comparing methods.

Notably, the clustering performance of DACAL is outstanding compared to these methods. Therefore, considering computing time, memory, and performance, DACAL is a very advantageous method.

4. Discussion

The joint batch effect correction and adaptive clustering methods are urgently needed and are still quite challenging in scRNA-seq data analysis. A deep DACAL model based on adversarial learning is proposed to solve this problem. The model consists of three modules: the AE module for nonlinear dimensionality reduction, the Adversarial module for adversarial learning to remove batches, and the DPMM module for clustering. Joint batch effect correction and adaptive clustering of multiple batches of scRNA-seq data are achieved by optimizing all modules. The comparisons of the two scRNA-seq dataset results show that DACAL outperforms other competing methods. In addition, DACAL can accurately identify fine-grained clusters corresponding well to biological meanings.

The ability to adaptively find accurate cluster numbers is novel and meaningful for batch effect correction and clustering methods. The results of commonly used methods are affected by the number of specified clusters, and arbitrarily setting cluster numbers often misguides results. DPMM is employed to realize the clustering function in the batch correction and clustering method. DPMM can automatically adjust the number of clusters based on data complexity. The results are obtained by using DPMM in the joint batch correction and clustering method without the need for cluster numbers. DACAL is the first method that jointly performs highly efficient batch correction and accurate adaptive clustering for scRNA-seq data.

As the transcriptome is not the sole determinant of cell type, state, and function, the analysis of single-cell sequencing data in different omics fields has provided possibilities for interpreting cellular heterogeneity from various perspectives. Moreover, the analysis of single-cell multi-omics sequencing data can help researchers obtain more comprehensive observational information and also aid in understanding the correspondence between different omics datasets. Our target in future studies is to extend the batch effect correction and adaptive clustering method to single-cell sequencing data from other omics or multi-omics.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math11244901/s1. Figure S1. The Sankey plots of different methods on the human pancreas cell dataset. Figure S2. The Sankey plots of different methods on the mouse mammary cell dataset.

Author Contributions

Conceptualization, Z.H.; performing the experiments, S.A. and J.S.; analysis of the data, S.A. and R.L.; reagents/materials/analysis tool contribution: J.W., S.H. and G.D.; writing—original draft preparation, S.A.; writing—reviewing and editing, X.Y. and Z.H.; project administration, X.Y.; funding acquisition, X.Y. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (Grant No. 2022YFF1202400 and No. 2017YFC0908300) and the National Natural Science Foundation of China (grant number 62303488).

Data Availability Statement

The source code of DACAL is available at https://github.com/omicshub/DACAL (accessed on 5 December 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

Bacher, R.; Kendziorski, C. Design and Computational Analysis of Single-Cell RNA-Sequencing Experiments. Genome Biol. 2016, 17, 63. [Google Scholar] [CrossRef] [PubMed]
Xu, Y.; Mizuno, T.; Sridharan, A.; Du, Y.; Guo, M.; Tang, J.; Wikenheiser-Brokamp, K.A.; Perl, A.-K.T.; Funari, V.A.; Gokey, J.J.; et al. Single-Cell RNA Sequencing Identifies Diverse Roles of Epithelial Cells in Idiopathic Pulmonary Fibrosis. JCI Insight 2016, 1, e90558. [Google Scholar] [CrossRef] [PubMed]
Briggs, J.A.; Weinreb, C.; Wagner, D.E.; Megason, S.; Peshkin, L.; Kirschner, M.W.; Klein, A.M. The Dynamics of Gene Expression in Vertebrate Embryogenesis at Single-Cell Resolution. Science 2018, 360, eaar5780. [Google Scholar] [CrossRef] [PubMed]
Stuart, T.; Butler, A.; Hoffman, P.; Hafemeister, C.; Papalexi, E.; Mauck, W.M.; Hao, Y.; Stoeckius, M.; Smibert, P.; Satija, R. Comprehensive Integration of Single-Cell Data. Cell 2019, 177, 1888–1902.e21. [Google Scholar] [CrossRef]
Welch, J.D.; Kozareva, V.; Ferreira, A.; Vanderburg, C.; Martin, C.; Macosko, E.Z. Single-Cell Multi-Omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 2019, 177, 1873–1887. [Google Scholar] [CrossRef] [PubMed]
Korsunsky, I.; Millard, N.; Fan, J.; Slowikowski, K.; Zhang, F.; Wei, K.; Baglaenko, Y.; Brenner, M.; Loh, P.-R.; Raychaudhuri, S. Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony. Nat. Methods 2019, 16, 1289–1296. [Google Scholar] [CrossRef]
Dincer, A.B.; Janizek, J.D.; Lee, S.-I. Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings. Bioinformatics 2020, 36, i573–i582. [Google Scholar] [CrossRef]
Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
Blondel, V.D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theory Exp. 2008, 2008, 10008. [Google Scholar] [CrossRef]
Traag, V.A.; Waltman, L.; van Eck, N.J. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef]
Li, Y.; Lin, Y.; Hu, P.; Peng, D.; Luo, H.; Peng, X. Single-Cell RNA-Seq Debiased Clustering via Batch Effect Disentanglement. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–11. [Google Scholar] [CrossRef]
Hu, H.; Li, Z.; Li, X.; Yu, M.; Pan, X. ScCAEs: Deep Clustering of Single-Cell RNA-Seq via Convolutional Autoencoder Embedding and Soft K-Means. Brief. Bioinform. 2022, 23, bbab321. [Google Scholar] [CrossRef]
Antoniak, C.E. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Ann. Stat. 1974, 2, 1152–1174. [Google Scholar] [CrossRef]
Zhao, T.; Wang, Z.; Masoomi, A.; Dy, J. Deep Bayesian Unsupervised Lifelong Learning. Neural Netw. 2022, 149, 95–106. [Google Scholar] [CrossRef] [PubMed]
Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
Blei, D.M.; Jordan, M.I. Variational Inference for Dirichlet Process Mixtures. Bayesian Anal. 2006, 1, 121–143. [Google Scholar] [CrossRef]
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Vancouver, BC, Canada, 2019; Volume 32. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization 2019. Available online: https://openreview.net/forum?id=Bkg6RiCqY7 (accessed on 5 December 2023).
Yu, X.; Xu, X.; Zhang, J.; Li, X. Batch Alignment of Single-Cell Transcriptomics Data Using Deep Metric Learning. Nat. Commun. 2023, 14, 960. [Google Scholar] [CrossRef] [PubMed]
Strehl, A.; Ghosh, J. Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
Hubert, L.; Arabie, P. Comparing Partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
Powers, D. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
Bach, K.; Pensa, S.; Grzelak, M.; Hadfield, J.; Adams, D.J.; Marioni, J.C.; Khaled, W.T. Differentiation Dynamics of Mammary Epithelial Cells Revealed by Single-Cell RNA Sequencing. Nat. Commun. 2017, 8, 2128. [Google Scholar] [CrossRef] [PubMed]
Construction of Developmental Lineage Relationships in the Mouse Mammary Gland by Single-Cell RNA Profiling—PubMed. Available online: https://pubmed.ncbi.nlm.nih.gov/29158510/ (accessed on 30 September 2023).

Figure 1. The overview of the DACAL framework. The DACAL model contains three parts: the AE module, the DPMM module, and the Adversarial module. All the parameters of these modules are jointly optimized. Using the gene expression matrix

X

and batch IDs

s

of the input data, the low-dimensional representation

C

is obtained using Encoder for

C

, and low-dimensional representation

B

using Encoder for

B

of the AE module, where

C

is the latent variables of the biological information, and

B

is the latent variables of the batch information.

C

and

B

are converted to the parameter

Γ

in the decoder for

Γ

, where

Γ

is the parameter of the distribution of

X

.

B

is converted to

π

in the decoder for

Π

.

C

is used in the Adversarial module. The loss

l_{d}

is used for updating the Adversarial module and

l_{a}

is used for updating the AE module.

C

is also used in the DPMM module. The loss of

l_{D P M M}

in DPMM is used to update the DPMM module and constrain the AE module.

Figure 1. The overview of the DACAL framework. The DACAL model contains three parts: the AE module, the DPMM module, and the Adversarial module. All the parameters of these modules are jointly optimized. Using the gene expression matrix

X

and batch IDs

s

of the input data, the low-dimensional representation

C

is obtained using Encoder for

C

, and low-dimensional representation

B

using Encoder for

B

of the AE module, where

C

is the latent variables of the biological information, and

B

is the latent variables of the batch information.

C

and

B

are converted to the parameter

Γ

in the decoder for

Γ

, where

Γ

is the parameter of the distribution of

X

.

B

is converted to

π

in the decoder for

Π

.

C

is used in the Adversarial module. The loss

l_{d}

is used for updating the Adversarial module and

l_{a}

is used for updating the AE module.

C

is also used in the DPMM module. The loss of

l_{D P M M}

in DPMM is used to update the DPMM module and constrain the AE module.

Figure 2. The performance of DACAL on the simulated dataset with the hyperparameter changes. (a) The comparison between the cluster numbers obtained by DACAL and true labels under different values of

α

. (b) The NMI, ARI, SC, and F1 scores and their average of DACAL under different values of

α

.

Figure 2. The performance of DACAL on the simulated dataset with the hyperparameter changes. (a) The comparison between the cluster numbers obtained by DACAL and true labels under different values of

α

. (b) The NMI, ARI, SC, and F1 scores and their average of DACAL under different values of

α

.

Figure 3. Batch correction and clustering results of different methods on the human pancreas cell dataset. (a). The UMAP visualization of low-dimensional representation obtained by different methods is marked with real cell type, batch, and predicted labels, respectively. Each column represents a method, and each row represents the results marked with a different label. The first row represents the results of real cell type label annotation, the second row represents the results of batch label annotation, and the third row (to the last row) describes the results of predicted label annotation obtained by the corresponding clustering method. The batch correction methods, such as Harmony, are combined with different clustering methods, so these methods contain four rows of predicted label annotation results obtained by different clustering methods. The low-dimensional representation of the scCAEs is the PCA dimensionality reduction result, which was suggested by the authors in the original code. (b). NMI, ARI, SC, and F1 scores and the mean scores of different methods. Each method is depicted in a row of the plot. The four circles and the rectangle represent the NMI, ARI, SC, and F1 scores and the average score, respectively, whose sizes represent the score, and color represents the score ranking. The methods are ranked in descending order of average score. (c). Bar plot of the Deviation Ratio of different methods. The horizontal axis represents different methods, and the vertical axis is the Deviation Ratio. Shorter bars indicate better results. scCAEs does not participate in DR comparison because its input contains the number of real cell types.

Figure 4. Batch correction and clustering results of different methods on the mouse mammary cell dataset. (a) NMI, ARI, SC, and F1 scores and the mean scores of different methods. Each method is depicted in a row of the plot. The four circles and the rectangle represent the NMI, ARI, SC, and F1 scores and the average score, respectively, whose sizes represent the score, and color represents the score ranking. The methods are ranked in descending order of average score. The methods ranked in descending order of average score. The dataset cannot provide the required number of genes for scCAEs, so the corresponding results are not depicted in the figure. (b). Bar plot of Deviation Ratio of different methods. The horizontal axis represents different methods, and the vertical axis is the Deviation Ratio. Shorter bars indicate better results. (c). The Sankey plot of DACAL shows the correspondence between the predicted labels and the ground truth. (d). The feature plots of Prlr and S100a6 in cluster 2 and cluster 27.

Figure 5. Comparison of the runtime and memory costs of different methods. (a) Runtime of different methods on the human pancreas cell dataset; (b) runtime of different methods on the mouse mammary cell dataset; (c) memory costs of different methods on the human pancreas cell dataset; (d) memory costs of different methods on the mouse mammary cell dataset.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

An, S.; Shi, J.; Liu, R.; Wang, J.; Hu, S.; Dong, G.; Ying, X.; He, Z. A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data. Mathematics 2023, 11, 4901. https://doi.org/10.3390/math11244901

AMA Style

An S, Shi J, Liu R, Wang J, Hu S, Dong G, Ying X, He Z. A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data. Mathematics. 2023; 11(24):4901. https://doi.org/10.3390/math11244901

Chicago/Turabian Style

An, Sijing, Jinhui Shi, Runyan Liu, Jing Wang, Shuofeng Hu, Guohua Dong, Xiaomin Ying, and Zhen He. 2023. "A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data" Mathematics 11, no. 24: 4901. https://doi.org/10.3390/math11244901

APA Style

An, S., Shi, J., Liu, R., Wang, J., Hu, S., Dong, G., Ying, X., & He, Z. (2023). A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data. Mathematics, 11(24), 4901. https://doi.org/10.3390/math11244901

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Joint Batch Correction and Adaptive Clustering Method of Single-Cell Transcriptomic Data

Abstract

1. Introduction

2. Materials and Methods

2.1. The Architecture of DACAL

2.1.1. The AE Module

2.1.2. The DPMM Module

2.1.3. The Adversarial Module

2.1.4. Loss Function

2.2. Training the DACAL Model

2.3. Datasets and Preprocessing

2.4. Comparing Methods and Evaluation Metrics

3. Results

3.1. DACAL Is Robust to Hyperparameter Changes on scRNA-seq Data

3.2. DACAL Can Jointly Remove Batch Effect and Cluster Adaptively on scRNA-seq Data

3.3. DACAL Can Provide Fine-Grained Clusters on scRNA-seq Data with Batch Effect

3.4. Comparisons of Runtime and Memory Usage of DACAL and Other Methods

4. Discussion

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI