Blog - Cell type annotation of clusters in single-cell RNA-seq data

Cell type annotation of clusters in single-cell RNA-seq data

20 June 2023

In this article, we use CellKb to assign cell types to clusters in the dataset of Peripheral Blood Mononuclear Cells (PBMC) available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here. This dataset has also been analyzed in the Scanpy and Seurat clustering tutorials.

Clustering and differential expression analysis

We followed the Seurat tutorial to identify the clusters and differentially expressed genes in each cluster. You may use Scanpy instead of Seurat. The code is as follows:


# Seurat version 5.0.1
library(dplyr)
library(Seurat)
pbmc.data <- Read10X(data.dir="pbmc3k_filtered_gene_bc_matrices\\filtered_gene_bc_matrices\\hg19", gene.column=1)
pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-")
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method="vst", nfeatures=2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features=all.genes)
pbmc <- RunPCA(pbmc, features=VariableFeatures(object=pbmc))
pbmc <- FindNeighbors(pbmc, dims=1:10)
pbmc <- FindClusters(pbmc, resolution=0.5)
pbmc <- RunUMAP(pbmc, dims=1:10)
DimPlot(pbmc, reduction="umap")
Idents(pbmc) <- "seurat_clusters"
all.markers = FindAllMarkers(pbmc, only.pos=T)
all.markers = all.markers[all.markers$p_val_adj<=0.01 & all.markers$avg_log2FC >= 0.5,]
write.table(all.markers, "pbmc_cluster_markers.txt", sep="\t", row.names=TRUE, quote=FALSE, col.names=NA)

#scanpy version 1.11.1
import pandas as pd
import scanpy as sc

adata = sc.read_10x_mtx("filtered_gene_bc_matrices\\hg19", var_names="gene_ids") 
adata.var["mt"] = adata.var_names.str.startswith("MT-") # mitochondrial genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL")) # ribosomal genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]") # hemoglobin genes
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.scrublet(adata)
adata.layers["counts"] = adata.X.copy()

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, n_iterations=2)
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
genes_df = sc.get.rank_genes_groups_df(adata, group=None)
genes_df = genes_df[(genes_df['pvals_adj'] <= 0.1) & (genes_df['logfoldchanges'] >= 0.5)]
genes_df = genes_df.rename(columns={'group': 'Cluster', 'names': 'Gene'})
genes_df.to_csv("pbmc_cluster_markers_scanpy.txt", sep="\t")

Please note the above script uses all genes (and not just the highly variable genes) in the FindAllMarkers function. Also, we have used a cutoff on the adjusted p value and log fold change to select the genes.

Input the gene lists to CellKb

The cluster-specific gene lists can now be given as input on the CellKb search page, with "blood" in the filter criteria.

The CellKb results

The search completes in a few seconds and the summary page shows the cell types of the top matching hits in literature. In contrast to other methods, CellKb's search method compares each input gene list with every gene signature from every publication, without aggregating, or using a prediction model.

For each cluster, the CellKb results page gives the predicted cell type based on a heuristic after parsing the top 20 matches. We can choose the best matching cell type based on the predicted cell type and verifying with the heatmap and the detailed results.

Assigning the cell type identity to the Seurat object

We can now assign the cell type identity into our Seurat object and visualize the annotation.

  new.cluster.ids <- c("CD4-positive, alpha-beta T cell","CD14-positive monocyte", 
                    "Central memory CD4-positive, alpha-beta T cell", "Naive B cell", 
                    "CD8-positive, alpha-beta T cell",  "CD16-positive monocyte",
                    "Natural killer cell", "Dendritic cell", "Platelet" )
  names(new.cluster.ids) <- levels(pbmc)
  pbmc <- RenameIdents(pbmc, new.cluster.ids)
  DimPlot(pbmc, reduction="umap", label=TRUE, repel=T, pt.size=0.75)

  new_cluster_names = [
      "CD4-positive, alpha-beta T cell",
      "CD14-positive monocyte",
      "B cell",
      "CD8-positive, alpha-beta T cell",
      "CD16+ monocyte",
      "Natural killer cell",         
      "Conventional dendritic cell",    
      "Megakaryocyte",
  ]
  adata.rename_categories("leiden", new_cluster_names)
  sc.pl.umap(adata, color="leiden", legend_loc="on data", title="", frameon=False)

Comparing the results with the results from other methods

Let's see how CellKb's annotation for the pbmc3K dataset compares with the results from other methods.

Cluster	Annotation using CellKb	Annotation in Seurat Tutorial	Annotation in Scanpy Tutorial	Annotation using SingleR	Annotation in decoupler Tutorial (PanglaoDB)
0	CD4-positive, alpha-beta T cell	Memory CD4+	CD4 T	T_cell: CD4+_central_memory	T cells
1	CD14-positive monocyte	CD14+ Mono	CD14+ Monocytes	Monocyte:CD16-	Acinar cells
2	Central memory CD4-positive, alpha-beta T cell	Naive CD4+T	CD4 T	T cell: CD4+_central_memory	T cells
3	Naive B cell	B	B	B_cell:immature	B cells naive
4	CD8-positive, alpha-beta T cell	CD8+ T	CD8 T	T_cell:CD8+	NK cells
5	CD16-positive monocyte	FCGR3A+ Mono	FCGR3A+ Monocytes	Monocyte:CD16+	Macrophages
6	Natural killer cell	NK	NK	NK_cell	Gamma Delta T cells
7	Dendritic cell	DC	Dendritic	Monocyte:CD16-	Dendritic cells
8	Platelet	Platelet	Megakaryocytes	Monocyte:CD16-	Hepatic stellate cells

For the pbmc3K dataset, CellKb's annotation matches with that given in the Seurat/Scanpy tutorials. The Seurat/Scanpy tutorials require the annotation to be done manually by referring to canonical marker genes of known cell types. With SingleR, we did the annotation automatically using celldex (HumanPrimaryCellAtlasData) as a reference, but two clusters (dendritic cells and platelets) were not correctly annotated. With PanglaoDB, we took the annotations from the decoupler tutorial, where the annotation is done automatically based on canonical markers from PanglaoDB, but it does not match well.

Conclusion

Our benchmark result, which you can reproduce with the pbmc3K dataset show that with CellKb, you not only save the time and effort of manually looking up marker genes, but also annotate accurately using a comprehensive set of publications as a reference. CellKb uses standardized ontology terms in the results, making it easier to integrate in your pipelines. In this article, we have shown how you can annotate clusters accurately using CellKb by giving only the genelists as input, without requiring to submit the raw data.