In this article, we use CellKb to assign cell types to clusters in the dataset of Peripheral Blood Mononuclear Cells (PBMC) available from 10X Genomics.
There are 2,700 single cells that were sequenced on the Illumina NextSeq 500.
The raw data can be found
here .
This dataset has also been analyzed in the Scanpy and Seurat clustering tutorials.
We followed the
Seurat tutorial to identify the clusters and differentially expressed genes in each cluster.
You may use
Scanpy instead of Seurat. The code is as follows:
R
Python
# Seurat version 5.0.1
library(dplyr)
library(Seurat)
pbmc.data <- Read10X(data.dir="pbmc3k_filtered_gene_bc_matrices\\filtered_gene_bc_matrices\\hg19", gene.column=1)
pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-")
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method="vst", nfeatures=2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features=all.genes)
pbmc <- RunPCA(pbmc, features=VariableFeatures(object=pbmc))
pbmc <- FindNeighbors(pbmc, dims=1:10)
pbmc <- FindClusters(pbmc, resolution=0.5)
pbmc <- RunUMAP(pbmc, dims=1:10)
DimPlot(pbmc, reduction="umap")
Idents(pbmc) <- "seurat_clusters"
all.markers = FindAllMarkers(pbmc, only.pos=T)
all.markers = all.markers[all.markers$p_val_adj<=0.01 & all.markers$avg_log2FC >= 0.5,]
write.table(all.markers, "pbmc_cluster_markers.txt", sep="\t", row.names=TRUE, quote=FALSE, col.names=NA)
#scanpy version 1.10.2
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx("pbmc3k_filtered_gene_bc_matrices\\filtered_gene_bc_matrices\\hg19", var_names="gene_ids")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.leiden(adata, resolution=0.9, random_state=0, n_iterations=2, directed=False)
sc.tl.umap(adata)
adata_all_genes = adata.raw.to_adata()
sc.tl.rank_genes_groups(adata_all_genes, "leiden", method="wilcoxon")
genes_df = sc.get.rank_genes_groups_df(adata_all_genes, group=None)
genes_df = genes_df[(genes_df['pvals_adj'] <= 0.1) & (genes_df['logfoldchanges'] >= 0.5)]
genes_df = genes_df.rename(columns={'group': 'Cluster', 'names': 'Gene'})
genes_df.to_csv("pbmc_cluster_markers_scanpy.txt", sep="\t")
Please note the above script uses all genes (and not just the highly variable genes) in the FindAllMarkers function. Also, we have used a cutoff on the adjusted p value and log fold change to select the genes.
Input the gene lists to CellKb
The cluster-specific gene lists can now be given as input on the CellKb search page, with "blood" in the filter criteria.
The CellKb results
The search completes in a few seconds and the summary page shows the cell types of the top matching hits in literature. In contrast to other methods, CellKb's search method compares each input gene list with every gene
signature from every publication, without aggregating, or using a prediction model.
For each cluster, the CellKb results page gives the predicted cell type based on a heuristic after parsing the top 20 matches.
We can choose the best matching cell type based on the predicted cell type and verifying with the heatmap and the detailed results.
Assigning the cell type identity to the Seurat object
We can now assign the cell type identity into our Seurat object and visualize the annotation.
R
Python
new.cluster.ids <- c("CD4-positive, alpha-beta T cell","CD14-positive monocyte",
"Central memory CD4-positive, alpha-beta T cell", "Naive B cell",
"CD8-positive, alpha-beta T cell", "CD16-positive monocyte",
"Natural killer cell", "Dendritic cell", "Platelet" )
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction="umap", label=TRUE, repel=T, pt.size=0.75)
new_cluster_names = [
"CD4-positive, alpha-beta T cell",
"CD14-positive monocyte",
"B cell",
"CD8-positive, alpha-beta T cell",
"CD16+ monocyte",
"Natural killer cell",
"Conventional dendritic cell",
"Megakaryocyte",
]
adata.rename_categories("leiden", new_cluster_names)
sc.pl.umap(adata, color="leiden", legend_loc="on data", title="", frameon=False)
Comparing the results with the results from other methods
Let's see how CellKb's annotation for the pbmc3K dataset compares with the results from other methods.
Cluster
Annotation using CellKb
Annotation in Seurat Tutorial Annotation in Scanpy Tutorial
Annotation using SingleR Annotation in decoupler Tutorial (PanglaoDB)
0
CD4-positive, alpha-beta T cell
Memory CD4+ CD4 T
T_cell: CD4+_central_memory T cells
1
CD14-positive monocyte
CD14+ Mono CD14+ Monocytes
Monocyte:CD16- Acinar cells
2
Central memory CD4-positive, alpha-beta T cell
Naive CD4+T CD4 T
T cell: CD4+_central_memory T cells
3
Naive B cell
B B
B_cell:immature B cells naive
4
CD8-positive, alpha-beta T cell
CD8+ T CD8 T
T_cell:CD8+ NK cells
5
CD16-positive monocyte
FCGR3A+ Mono FCGR3A+ Monocytes
Monocyte:CD16+ Macrophages
6
Natural killer cell NK
NK NK_cell
Gamma Delta T cells
7
Dendritic cell
DC Dendritic
Monocyte:CD16- Dendritic cells
8
Platelet
Platelet Megakaryocytes
Monocyte:CD16- Hepatic stellate cells
For the pbmc3K dataset, CellKb's annotation matches with that given in the Seurat/Scanpy tutorials.
The Seurat/Scanpy tutorials require the annotation to be done manually by referring to canonical marker genes of known cell types.
With SingleR, we did the annotation automatically using celldex (HumanPrimaryCellAtlasData) as a reference, but two clusters (dendritic cells and platelets) were not correctly annotated.
With PanglaoDB, we took the annotations from the decoupler tutorial, where the annotation is done automatically based on canonical markers from PanglaoDB, but it does not match well.
Conclusion
Our benchmark result, which you can reproduce with the pbmc3K dataset show that with CellKb, you not only save the time and effort of manually looking up marker genes, but also annotate accurately using a comprehensive set of publications as a reference.
CellKb uses standardized ontology terms in the results, making it easier to integrate in your pipelines.
In this article, we have shown how you can annotate clusters accurately using CellKb by giving only the genelists as input, without requiring to submit the raw data.