Annotating clusters in the pbmc3K dataset
In this blog post, we use CellKb to assign cell types to clusters in the dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here. This is the same dataset analyzed in the Scanpy and Seurat clustering tutorials.
Clustering and differential expression analysis
We followed the Seurat tutorial to identify the clusters and differentially expressed genes in each cluster. You may use Scanpy instead of Seurat. The code is as follows:

  # Seurat version 5.0.1
  library(dplyr)
  library(Seurat)
  pbmc.data <- Read10X(data.dir="pbmc3k_filtered_gene_bc_matrices\\filtered_gene_bc_matrices\\hg19", gene.column=1)
  pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200)
  pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-")
  pbmc <- NormalizeData(pbmc)
  pbmc <- FindVariableFeatures(pbmc, selection.method="vst", nfeatures=2000)
  all.genes <- rownames(pbmc)
  pbmc <- ScaleData(pbmc, features=all.genes)
  pbmc <- RunPCA(pbmc, features=VariableFeatures(object=pbmc))
  pbmc <- FindNeighbors(pbmc, dims=1:10)
  pbmc <- FindClusters(pbmc, resolution=0.5)
  pbmc <- RunUMAP(pbmc, dims=1:10)
  DimPlot(pbmc, reduction="umap")
  Idents(pbmc) <- "seurat_clusters"
  all.markers = FindAllMarkers(pbmc, only.pos=T)
  all.markers = all.markers[all.markers$p_val_adj<=0.01 & all.markers$avg_log2FC >= 0.5,]
  write.table(all.markers, "pbmc_cluster_markers.txt", sep="\t", row.names=TRUE, quote=FALSE, col.names=NA) 

  #scanpy version 1.9.6
  import pandas as pd
  import scanpy as sc
  
  adata = sc.read_10x_mtx("pbmc3k_filtered_gene_bc_matrices\\filtered_gene_bc_matrices\\hg19", var_names="gene_ids") 
  sc.pp.filter_cells(adata, min_genes=200)
  sc.pp.filter_genes(adata, min_cells=3)
  adata.var["mt"] = adata.var_names.str.startswith("MT-")
  sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
  adata = adata[adata.obs.n_genes_by_counts < 2500, :]
  adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
  sc.pp.normalize_total(adata, target_sum=1e4)
  sc.pp.log1p(adata)
  sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
  adata.raw = adata
  adata = adata[:, adata.var.highly_variable]
  sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
  sc.pp.scale(adata, max_value=10)
  sc.tl.pca(adata, svd_solver="arpack")
  sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
  sc.tl.leiden(adata, resolution=0.9, random_state=0, n_iterations=2, directed=False)
  sc.tl.umap(adata)
  adata_all_genes = adata.raw.to_adata()
  sc.tl.rank_genes_groups(adata_all_genes, "leiden", method="wilcoxon")
  genes_df = sc.get.rank_genes_groups_df(adata_all_genes, group=None)
  genes_df = genes_df[(genes_df['pvals_adj'] <= 0.1) & (genes_df['logfoldchanges'] >= 0.5)]
  genes_df = genes_df.rename(columns={'group': 'Cluster', 'names': 'Gene'})
  genes_df.to_csv("pbmc_cluster_markers_scanpy.txt", sep="\t")
Please note the above script uses all genes (and not just the highly variable genes) in the FindAllMarkers function. Also, we have used a cutoff on the adjusted p value and log fold change to select the genes.
Input the gene lists to CellKb
The cluster-specific gene lists can now be given as input on the CellKb search page, with "blood" in the filter criteria.
Responsive image
The CellKb results
The search completes in a few seconds and the summary page shows the cell types of the top matching hits in literature. In contrast to other methods, CellKb's search method compares each input gene list with every gene signature from every publication, without aggregating, or using a prediction model.
Responsive image
For each cluster, the CellKb results page gives the predicted cell type based on a heuristic after parsing the top 20 matches. We can choose the best matching cell type based on the predicted cell type and verifying with the heatmap and the detailed results.
Responsive image
Responsive image
Assigning the cell type identity to the Seurat object
We can now assign the cell type identity into our Seurat object and visualize the annotation.
  new.cluster.ids <- c("CD4-positive, alpha-beta T cell","CD14-positive monocyte", 
                    "Central memory CD4-positive, alpha-beta T cell", "Naive B cell", 
                    "CD8-positive, alpha-beta T cell",  "CD16-positive monocyte",
                    "Natural killer cell", "Dendritic cell", "Platelet" )
  names(new.cluster.ids) <- levels(pbmc)
  pbmc <- RenameIdents(pbmc, new.cluster.ids)
  DimPlot(pbmc, reduction="umap", label=TRUE, repel=T, pt.size=0.75)
  
Responsive image
  new_cluster_names = [
      "CD4-positive, alpha-beta T cell",
      "CD14-positive monocyte",
      "B cell",
      "CD8-positive, alpha-beta T cell",
      "CD16+ monocyte",
      "Natural killer cell",         
      "Conventional dendritic cell",    
      "Megakaryocyte",
  ]
  adata.rename_categories("leiden", new_cluster_names)
  sc.pl.umap(adata, color="leiden", legend_loc="on data", title="", frameon=False)
Responsive image
Comparing the results with the results from other methods
Let's see how CellKb's annotation for the pbmc3K dataset compares with the results from other methods.

Cluster Annotation using CellKb Annotation in Seurat TutorialAnnotation in Scanpy Tutorial Annotation using SingleRAnnotation in decoupler Tutorial (PanglaoDB)
0 CD4-positive, alpha-beta T cell Memory CD4+CD4 T T_cell: CD4+_central_memoryT cells
1 CD14-positive monocyte CD14+ MonoCD14+ Monocytes Monocyte:CD16-Acinar cells
2 Central memory CD4-positive, alpha-beta T cell Naive CD4+TCD4 T T cell: CD4+_central_memoryT cells
3 Naive B cell BB B_cell:immatureB cells naive
4 CD8-positive, alpha-beta T cell CD8+ TCD8 T T_cell:CD8+NK cells
5 CD16-positive monocyte FCGR3A+ MonoFCGR3A+ Monocytes Monocyte:CD16+Macrophages
6 Natural killer cellNK NKNK_cell Gamma Delta T cells
7 Dendritic cell DCDendritic Monocyte:CD16-Dendritic cells
8 Platelet PlateletMegakaryocytes Monocyte:CD16-Hepatic stellate cells
For the pbmc3K dataset, CellKb's annotation matches with that given in the Seurat/Scanpy tutorials. The Seurat/Scanpy tutorials require the annotation to be done manually by referring to canonical marker genes of known cell types. With SingleR, we did the annotation automatically using celldex (HumanPrimaryCellAtlasData) as a reference, but two clusters (dendritic cells and platelets) were not correctly annotated. With PanglaoDB, we took the annotations from the decoupler tutorial, where the annotation is done automatically based on canonical markers from PanglaoDB, but it does not match well.
Conclusion
Our benchmark result, which you can reproduce with the pbmc3K dataset show that with CellKb, you not only save the time and effort of manually looking up marker genes, but also annotate accurately using a comprehensive set of publications as a reference. CellKb uses standardized ontology terms in the results, making it easier to integrate in your pipelines. In this blog post, we have shown how you can annotate clusters accurately using CellKb by giving only the genelists as input, without requiring to submit the raw data.