Overview of the architecture of scBalance
scBalance provides an integrative deep learning framework for accurate, fast, and scalable cell-type annotation, especially of rare cell types (Fig. 1). Its architecture has two parts: a weight sampling technique that adapts to imbalanced scRNA-seq datasets, and a sparse neural network that efficiently annotates cell types.
First, unlike existing tools, scBalance uses a specially designed weight sampling technique to adaptively process imbalanced scRNA-seq datasets. Whereas existing methods rely on synthetic techniques33,34, our method incorporates balancing into the training batches, so it does not generate new data points, which saves memory and speeds up training. This design is particularly useful for atlas-scale datasets, where generating new data points is impractical. In scBalance, to retain as much information as possible while avoiding a large training time cost, we randomly over-sample the rare populations (minority classes) and under-sample the common cell types (majority classes) in each training batch (Fig. 1a, Step 1). Sampling is done with replacement, and the sampling ratio adapts to each reference dataset: it is defined from the cell-type proportions of the true labels provided by the reference set. This minimizes overfitting during oversampling and thus preserves the generalization ability of scBalance. Meanwhile, because the common populations carry a large amount of overlapping expression information, under-sampling the majority classes lets scBalance use a relatively small training set that is still rich in training information. With this design, scBalance learns the features of rare cell types well while retaining a strong ability to classify the major cell types, which also improves its overall annotation accuracy. To test our internal sampling method, we benchmarked it against widely used balancing techniques, namely simple oversampling, simple downsampling, and the Synthetic Minority Over-sampling Technique (SMOTE). The results show that our internal balancing method improves classification accuracy compared with simple over- and downsampling and also outperforms the synthetic method SMOTE (Fig. 1c and Supplementary Fig. 1). Notably, our method is also faster and more memory-efficient than commonly used balancing methods (Fig. 1d, Supplementary Fig. 2a, b, and Supplementary Data 1). Because it is coupled with the training process, it does not need to generate new data points, which saves time and memory. Additionally, scBalance provides an interface for users who would like to explore specific minor cell types at a finer granularity: it accepts datasets processed by external sampling methods such as scSynO34. In this case, only the scBalance classifier is used.
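The batch-level balancing described above can be illustrated with a minimal sketch. It assumes that each cell is weighted by the inverse of its class size, which is one common way to turn cell-type proportions into sampling ratios; the text does not specify the exact weighting, and the function names and toy labels are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting: each cell gets 1 / (number of cells in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_balanced_batch(labels, batch_size, rng):
    """Draw cell indices with replacement using the class-based weights.

    Rare classes are over-sampled and common classes under-sampled inside
    the batch; only indices of existing cells are returned, so no synthetic
    cells are created.
    """
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


# Hypothetical reference labels: 950 common cells and 50 rare cells.
labels = ["alpha"] * 950 + ["mast"] * 50
rng = random.Random(0)
batch = sample_balanced_batch(labels, batch_size=2000, rng=rng)
batch_counts = Counter(labels[i] for i in batch)
```

Under this weighting each class has the same total sampling mass, so the two classes appear in roughly equal numbers in the batch even though the rare class is only 5% of the reference.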
Moreover, the reference dataset and the prediction dataset can be generated by different sequencing platforms and protocols, such as the 10X and Smart-seq platforms, which naturally introduces different kinds of noise such as gene detection dropouts and random sequencing errors35. To address this issue, scBalance treats random noise as a source of overfitting and uses the dropout36 technique to mitigate it. Because dropout is effective at reducing overfitting, the dropout layer also improves how well scBalance learns the resampled minor cell types. Additionally, scBalance provides a network reuse option for atlas-scale training scenarios, so users can avoid the significant time cost of retraining the model on the same dataset (Fig. 1a, Step 3).
Taken together, scBalance uses a network with three hidden layers, each with batch normalization and dropout. The activation function is the exponential linear unit (ELU)37, and the output layer uses Softmax. In training mode (Fig. 1a, Step 2), units in the hidden layers are randomly disabled to reduce the influence of noise on the training process. In prediction mode, the network is fully connected, so all parameters are used in the forward pass. Model evaluation and backpropagation are based on the cross-entropy loss function and the Adam optimizer. To speed up training and prediction, scBalance also includes a graphics processing unit (GPU) mode, which reduces the running time of the classifier by 25–30%. Overall, scBalance is designed to handle different types of noise and imbalanced datasets while achieving high classification accuracy for both rare and major cell types.
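The difference between training and prediction mode can be sketched in plain Python for a single hidden layer. This is a simplified illustration, not the scBalance implementation: batch normalization is omitted for brevity, the dropout variant shown is the common "inverted" form (an assumption), and the weights, inputs, and function names are hypothetical.

```python
import math
import random


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha*(exp(x)-1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, rate, training, rng):
    """Inverted dropout: zero units at random in training, pass through otherwise."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def hidden_layer(inputs, weights, biases, rate, training, rng):
    """Linear map, then ELU, then dropout (batch normalization omitted here)."""
    pre = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return dropout([elu(p) for p in pre], rate, training, rng)


# Hypothetical toy inputs and parameters.
rng = random.Random(0)
x = [0.5, -1.0, 2.0]
W = [[0.2, -0.1, 0.4], [-0.3, 0.8, 0.1]]
b = [0.0, 0.1]
h_eval = hidden_layer(x, W, b, rate=0.5, training=False, rng=rng)
h_train = hidden_layer(x, W, b, rate=0.5, training=True, rng=rng)
probs = softmax(h_eval)  # toy step: treat the activations as class logits
```

In prediction mode every unit contributes to the forward pass, while in training mode each unit is either zeroed or rescaled, which is what makes the learned features less dependent on any single noisy input.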
scBalance accurately identifies rare cell populations in the intra-dataset labeling task
We first demonstrated the rare cell-type identification ability of scBalance in a baseline test. To evaluate performance, we used twelve scRNA-seq datasets with different degrees of imbalance and different cell numbers, each divided into a training set and a test set. To make the test more comprehensive, most of the datasets were generated on different sequencing platforms (see “Methods” and Table 1). The true labels of the test cells are used only to evaluate the prediction results. Here, we compared scBalance with seven methods that are widely used for scRNA-seq cell-type identification: SingleCellNet14, SingleR28, scVI29, scmap-cell27, scmap-cluster27, scPred30, and MARS31. Of these, scPred and MARS also report the ability to handle imbalanced single-cell datasets, and scVI and MARS are, like scBalance, deep learning-based. To keep the benchmark fair, we applied uniform preprocessing for every tool and left all parameters at their defaults. All experiments used fivefold cross-validation to quantify classification variability. The detailed protocol can be found in “Methods”. We used Cohen’s kappa score to quantitatively evaluate scBalance and the other seven methods (Fig. 2a). scBalance outperforms all other methods on most of the twelve datasets, achieving the highest Cohen’s kappa score. Notably, scBalance performs particularly well on large and complex datasets such as Campbell and Zillions. Its performance is also the most stable of all the methods compared, which gives it an advantage for atlas-scale reference training. Because Cohen’s kappa is sensitive to minority classes, the higher score gives preliminary evidence that scBalance has an advantage in rare population annotation.
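To make concrete why Cohen’s kappa is sensitive to minority classes, the following minimal sketch computes it from the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are hypothetical and are not taken from the benchmark.

```python
from collections import Counter


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa between true and predicted labels."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    # Observed agreement: fraction of cells labeled correctly.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    # Chance agreement: product of the marginal class frequencies.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one class occurs
    return (observed - expected) / (1.0 - expected)


# Hypothetical imbalanced toy data: 90 common cells, 10 rare cells.
truth = ["beta"] * 90 + ["mast"] * 10
majority_only = ["beta"] * 100  # a classifier that ignores the rare type
kappa_majority = cohen_kappa(truth, majority_only)
kappa_perfect = cohen_kappa(truth, truth)
```

In this toy case the majority-only classifier reaches 90% accuracy but a kappa of 0, because its agreement with the truth is no better than chance given the class proportions.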
To better demonstrate the ability of scBalance to accurately annotate minor cell populations, we further examined the accuracy for each cell type to determine whether the high overall performance indeed comes from improved identification of minor cell types (Fig. 2b, Supplementary Figs. 2–4, and Supplementary Data 2). We grouped the datasets into three classes: (1) large datasets with a simple cell composition, such as Baron Human, Lake, and Zillions; (2) small datasets with a simple cell background, such as Muraro, Baron Mouse, and Deng; and (3) datasets with complex cell structures, for example Zheng 68 K, which is composed primarily of T cells and their subtypes, so its cells are highly similar to one another. We first analyzed the performance of scBalance on the Baron Human dataset (Fig. 2b and Supplementary Data 3) and found that all methods perform well on large populations such as Beta cells and Alpha cells. However, on minor cell types such as Mast cells and Epsilon cells, the performance of scBalance remains stable and strong, while the other methods fail to recognize most of these rare cell types. These results demonstrate the ability of scBalance to annotate minor cell populations in regular datasets. Similar results were obtained on the small datasets (Supplementary Fig. 3). We were also interested in the performance of scBalance on a dataset with a complex cell background. On the Zheng 68 K dataset (Supplementary Fig. 4), scBalance is still the best method for identifying rare cell types while maintaining high accuracy on the other types. This result gives scBalance a practical advantage in real-world problems. In addition, to better understand the true positive detection sensitivity of scBalance for each cell type, we analyzed the precision of scBalance on these three datasets (Supplementary Tables 1–3). 
The results show that, compared with the other methods, scBalance is the most robust and sensitive at identifying minor cell types, especially against a complex cell background.
In summary, scBalance performs well on the baseline annotation task: it reliably identifies not only the major cell types but also the minor ones.
scBalance outperforms other methods in rare population identification in the inter-protocol annotation task
In a realistic scenario, users may train an annotation tool on a dataset generated with a different protocol from the one used for the query scRNA-seq profile. However, when different sequencing platforms are used, more noise can be introduced, which can affect the inter-dataset annotation task more than the intra-dataset annotation task38. To improve the generalization ability of scBalance in cross-protocol tasks, we used the dropout technique to make our model more robust to technical variation. We first compared scBalance with dropout against scBalance without dropout on the PBMCBench datasets from different sequencing platforms (Fig. 3a, Supplementary Fig. 5, and Supplementary Data 4) and on the Pancreatic datasets from different protocols used in a previous study39 (Supplementary Fig. 6 and Supplementary Data 5). The results show that dropout improves generalization and leads to better performance in the inter-dataset annotation task for all sets of datasets. Moreover, we demonstrated the robustness of scBalance to batch effects in cross-dataset annotation tasks. We compared the classification performance of scBalance with and without batch correction using Combat40, a commonly used batch correction tool, to evaluate whether batch correction can further improve scBalance (Supplementary Fig. 7 and Supplementary Data 6). The results indicate that scBalance’s performance is neither significantly degraded nor improved by batch correction, suggesting that the method itself is robust to batch effects.
To further evaluate the performance of scBalance under batch effects and its ability to identify rare cell types, we expanded our benchmarking to include other annotation methods on the inter-dataset annotation task. We used the PBMCbench datasets (see “Methods” and Table 1) to evaluate each method on every protocol pair, with Cohen’s kappa score as the evaluation metric. We were particularly interested in scBalance’s classification accuracy on minor cell populations, which we defined as cell types making up less than 5% of the total cell number, so we also quantified rare cell-type annotation ability alongside overall accuracy. scBalance achieved the highest average score across all experiments (Fig. 3b and Supplementary Data 7). Compared with the second-best method, scBalance raised the average score from 0.85 to 0.95. scBalance was also the best method on most of the test pairs, demonstrating its strength on the inter-dataset task. We also analyzed the rare-type classification accuracy of each method (Fig. 3c), and the results show that scBalance outperforms the other methods in accurately identifying minor populations on most of the test pairs. To further show the practicality and efficiency of scBalance, we ran additional benchmarking experiments on the inter-dataset annotation task in which the other methods were combined with batch correction (Supplementary Fig. 8 and Supplementary Data 8). Although most methods improved after batch-correction preprocessing relative to Fig. 3b (average improvements ranged from 1 to 4%), scBalance continued to outperform them on the inter-dataset annotation task. This indicates that scBalance remains one of the most efficient tools available for this task. 
Subsequently, to gain further insight into the classification results for the rare cell populations, we used Uniform Manifold Approximation and Projection (UMAP) to visualize the clustering results of the three highest-performing methods with either the predicted labels or the true labels (Fig. 3d). Compared with the true labels, SingleCellNet produced more incorrect annotations of Megakaryocytes and CD16+ monocytes than scBalance. Similarly, scVI produced more incorrect labels for Megakaryocytes and failed completely on the classification of CD16+ monocytes. In contrast, scBalance provided the most accurate annotation of all six cell types and successfully labeled the two rare cell populations, Megakaryocytes and CD16+ monocytes. Taken together, the results indicate that scBalance is more robust than existing methods in cross-platform annotation tasks and retains its ability to identify rare cell populations under the influence of technical variation.
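The rare-population criterion used in this section (cell types with less than 5% of the total cell number) can be expressed as a short sketch. How the rare-type accuracy is aggregated is not stated in the text, so pooling all rare cells into a single accuracy is an assumption here, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def rare_cell_types(true_labels, threshold=0.05):
    """Cell types whose share of all cells is below the threshold."""
    counts = Counter(true_labels)
    n = len(true_labels)
    return {cell_type for cell_type, k in counts.items() if k / n < threshold}


def rare_type_accuracy(true_labels, predicted_labels, threshold=0.05):
    """Accuracy restricted to cells whose true type is rare (pooled; assumed)."""
    rare = rare_cell_types(true_labels, threshold)
    pairs = [(t, p) for t, p in zip(true_labels, predicted_labels) if t in rare]
    if not pairs:
        return None  # no rare cell types in this dataset
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical toy query: Megakaryocytes are 4% of the cells, so they are rare.
truth = ["CD4 T"] * 60 + ["B"] * 36 + ["Megakaryocyte"] * 4
predicted = ["CD4 T"] * 60 + ["B"] * 36 + ["Megakaryocyte"] * 3 + ["B"]
rare = rare_cell_types(truth)
rare_acc = rare_type_accuracy(truth, predicted)
overall_acc = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
```

In this toy case a single misclassified Megakaryocyte barely changes the overall accuracy (0.99) but lowers the rare-type accuracy to 0.75, which is why the two quantities are reported separately.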
Fast and robust running speed enhances the scalability of scBalance
Running time is one of the most important considerations for an annotation tool in real single-cell analysis, and one of the greatest obstacles to scalability. To highlight the computational speed of scBalance, we compared six representative methods, each based on a different underlying machine-learning model (Fig. 4). Because scBalance can use the GPU, we report scBalance-CPU and scBalance-GPU separately so that the comparison is fair to methods without GPU computation. We first compared scBalance on the two processing units. scBalance-GPU is substantially faster, reducing the running time by more than 50% compared with scBalance-CPU (Fig. 4a). scBalance-GPU is also robust across datasets with different cell numbers: its running time stays relatively stable from 30k to 60k cells. This robustness gives scBalance the potential to annotate large-scale datasets quickly. We also compared scBalance-CPU with the other five methods. Even when all methods run on the CPU, scBalance remains fast. Notably, on datasets with more than 30k cells, scBalance reduces the running time to 10% of that of the other five methods. On the largest dataset, scBalance is more than 20 times faster than SingleR (Fig. 4b). This advantage in running time also makes scBalance well suited to large-scale dataset annotation.
Revealing the bronchoalveolar immune cell atlas in a COVID patient demonstrates the scalability of scBalance
As the size of cell atlases continues to increase, the scalability of annotation tools becomes more important. We therefore examined the ability of scBalance to learn rare cell types in million-scale scRNA-seq datasets. We first used intra-dataset annotation as a proof of concept to evaluate the annotation performance of scBalance on large-scale cell atlases. We collected two recently published cell atlases, the human heart cell atlas41 (487,106 cells) and the COVID-19 immune atlas17 (1,462,702 cells). Because no other existing method has reported annotation ability on million-scale scRNA-seq profiles, and R-based methods such as SingleCellNet and scmap have difficulty even loading such datasets, we compared scBalance with conventional machine-learning methods in Python: random forest (n_estimators=50, random_state=10), decision tree, SVM (kernel: rbf), and kNN (k = 3). As shown in Fig. 5a and Supplementary Data 9, scBalance significantly outperforms the other machine-learning methods on both cell atlases. In addition, scBalance runs up to 150 times faster than the other methods when training on and labeling the COVID cell atlas (Fig. 5b). Even with the threefold increase in cell number between the two datasets, scBalance remains the only method with a robust running speed, which gives it an advantage in scalability.
In addition to this basic evaluation of scalability, we used the COVID immune atlas as the reference dataset to illustrate that scBalance can effectively identify rare cell types when trained on a million-scale reference. As the query data, we collected the scRNA-seq profile of bronchoalveolar lavage fluid (BALF) cells from a patient with severe COVID (Fig. 5c). Although many publications discuss the PBMC landscape42,43,44,45 in different COVID patient samples, the BALF cell composition of COVID patients remains under-investigated. Yet, as the sample that most directly reflects the microenvironment of the lung alveoli, BALF cells are of great importance for understanding the association between disease severity and the dynamics of respiratory immune characteristics. Although Liao et al.32 revealed the landscape of bronchoalveolar immune cells in patients with COVID in 2020, their work, which was based on Seurat integration, identified cell groups only at low resolution. Here, we used scBalance to annotate the BALF scRNA-seq dataset. Using the COVID atlas as the reference, our method identified many more cell subtypes than the original study. Compared with the manual labeling used in the original analysis, scBalance significantly improved the annotation resolution for the BALF dataset. As shown in Fig. 5c, d and Supplementary Fig. 9, scBalance identified 64 subtypes of immune cells in the BALF sample. As expected, macrophages show the highest enrichment in the BALF sample, whereas B cells make up only a small part of the immune landscape. Notably, scBalance also identified rare subtypes in all cell groups. In the myeloid group, our method shows that monocytes, and not only macrophages, are present in the BALF. 
However, macrophages are still the major component, especially pro-inflammatory (M1) macrophages such as CCL3L1+ macrophages, which suggests a strong immune cell recruitment signal in the BALF of the severe patient. Meanwhile, in contrast to the analysis by Liao et al.32, our method reveals that the pro-inflammatory environment is produced not only by macrophages but also by CD14 monocytes (CCL3+). Furthermore, our method found a significant expansion of proliferative memory T cells (including MKI67-CCL4 (high) CD4 T cells and MKI67-CCL4 (low) CD4 T cells), which, compared with effector T cells, are enriched in the lung region. Together, by using the COVID cell atlas as the reference, our method successfully identified cell subtypes and provides a more comprehensive immune atlas of the BALF. It is worth noting that most of the cell types revealed by scBalance are rare in the COVID atlas, which further demonstrates the advantage of our method in identifying rare cell types in large-scale scRNA-seq datasets.