About CellKb
Single-cell RNA-seq has become a popular method for studying the transcriptional patterns of genes in individual cells. Despite recent advances in technology and computational methods, assigning cell types in single-cell datasets remains a bottleneck. Because no comprehensive reference database exists, researchers often spend considerable time and effort searching the scientific literature for marker genes associated with cell types.
CellKb is a knowledgebase of author-defined cell type marker gene sets that can be rapidly searched for matching cell types. Marker gene sets are collected directly from publications that describe mainly single-cell experiments, along with selected bulk RNA-seq and microarray experiments. CellKb contains extensive cell type, tissue and disease annotations for each cell type as described in the publication. Protein-protein interactions from HitPredict are integrated into CellKb.

Data Collection
The cell type markers in CellKb are taken primarily from single-cell RNA-seq experiments. CellKb extracts the cell type-specific marker genes that the authors defined based on a significant change in expression within a group of similar cells.
The marker gene sets are extracted from tables, figures or supplementary materials of publications describing these experiments. Marker genes are also extracted from selected bulk RNA-seq and microarray experiments in public databases. These include gene signatures from the Human Protein Atlas, SingleR and MSigDB.
All publications related to single-cell experiments are taken from PubMed and manually screened to select those identifying cell type specific gene expression patterns.
CellKb contains 24,132 marker gene sets from 475 publications.
Selection
Marker gene sets are selected from published experimental studies using the following criteria:
  • Availability of data for download
  • Type of experimental method
  • Number of cells studied
  • Computational methods used to normalize, filter and cluster cell types, along with identification of cluster-specific genes
  • Availability of associated values (e.g. average expression, fold change, statistical significance)
For version 2.0 of CellKb, 475 high-quality publications covering 9 species were selected after screening over 10,000 studies published between 2013 and 2020.
Curation
Each selected publication is read and information about the name, tissue, condition and any special characteristics of each cell type is manually extracted and associated with the corresponding gene marker set.
Genes from the selected marker sets are mapped to valid entries in the latest version of the Ensembl database. Only genes with valid identifiers and associated values are retained. Orthologous genes are identified across all selected species using the Ensembl database.
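The retention rule described above can be sketched as follows. This is a minimal illustration, not CellKb's actual pipeline: the function name `retain_valid_genes`, the row layout, and the lookup table with placeholder identifiers are all assumptions made for the example.

```python
def retain_valid_genes(marker_rows, symbol_to_id):
    """Keep rows whose gene maps to an identifier and has an associated value."""
    kept = []
    for row in marker_rows:
        gene_id = symbol_to_id.get(row["gene"])
        # Drop genes without a valid identifier or without an associated value.
        if gene_id is None or row.get("value") is None:
            continue
        kept.append({**row, "gene_id": gene_id})
    return kept


# Hypothetical lookup table; the identifiers are placeholders, not real Ensembl IDs.
symbol_to_id = {"CD3E": "ID_PLACEHOLDER_1", "CD19": "ID_PLACEHOLDER_2"}

rows = [
    {"gene": "CD3E", "value": 2.1},      # valid identifier and value: kept
    {"gene": "CD19", "value": None},     # no associated value: dropped
    {"gene": "NOTAGENE", "value": 1.0},  # no valid identifier: dropped
]

valid_rows = retain_valid_genes(rows, symbol_to_id)
```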
CellKb covers 255 organs/tissues and 1,977 cell types.
Annotation
Cell types, tissue names and disease conditions are assigned standardized ontology terms. Associated values given by authors with each signature are also stored in CellKb. These include, but are not limited to, rank, score, average expression, log fold change and corrected/uncorrected p-values.
The cell type specific marker genes are either directly taken as defined by the authors, or they are calculated based on the associated values provided. Finally, all marker genes and their annotations are stored in a standardized format that enables rapid searching of the data.
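When marker genes are calculated from associated values rather than taken directly, the calculation might look like the sketch below. The thresholds (`min_log_fc`, `max_adj_p`), the field names and the function `select_markers` are assumptions for illustration; the source does not state which cutoffs CellKb applies.

```python
def select_markers(gene_stats, min_log_fc=1.0, max_adj_p=0.05):
    """Return genes passing both thresholds, ordered by decreasing log fold change."""
    selected = [
        g for g in gene_stats
        if g["log_fc"] >= min_log_fc and g["adj_p"] <= max_adj_p
    ]
    selected.sort(key=lambda g: g["log_fc"], reverse=True)
    return [g["gene"] for g in selected]


# Made-up values for four hypothetical genes in one cluster.
cluster_stats = [
    {"gene": "GENE_A", "log_fc": 2.5, "adj_p": 0.001},
    {"gene": "GENE_B", "log_fc": 0.4, "adj_p": 0.0001},  # fold change too small
    {"gene": "GENE_C", "log_fc": 1.8, "adj_p": 0.2},     # not significant
    {"gene": "GENE_D", "log_fc": 3.0, "adj_p": 0.01},
]

markers = select_markers(cluster_stats)
```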
Reliability scores are assigned to each marker set based on its similarity to other gene sets of the same cell type.
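One simple way to express "similarity to other gene sets of the same cell type" is the mean Jaccard index against all other sets. The sketch below uses that measure purely as an assumption; the source does not specify how CellKb computes its reliability scores, and the study names and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two gene collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reliability_scores(marker_sets):
    """Mean Jaccard similarity of each set to every other set of the same cell type."""
    scores = {}
    for name, genes in marker_sets.items():
        others = [g for n, g in marker_sets.items() if n != name]
        if others:
            scores[name] = sum(jaccard(genes, o) for o in others) / len(others)
        else:
            scores[name] = 0.0  # nothing to compare against
    return scores


# Three hypothetical marker sets annotated with the same cell type.
t_cell_sets = {
    "study_1": ["CD3E", "CD3D", "CD2"],
    "study_2": ["CD3E", "CD3D", "IL7R"],
    "study_3": ["MS4A1", "CD79A"],
}

scores = reliability_scores(t_cell_sets)
```

In this toy example the set from `study_3` shares no genes with the other two and so receives the lowest score.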
Consensus signatures are calculated for every cell type ontology and Uberon ontology term by aggregating the marker genes from all associated cell type signatures in the database. Users can search the consensus signatures with a gene list and retrieve the consensus marker genes for all cell types.
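Aggregation of this kind can be illustrated by counting how many signatures contain each gene and keeping the frequently reported ones. The frequency rule, the `min_fraction` parameter and the function name are assumptions for this sketch, not CellKb's documented method.

```python
from collections import Counter


def consensus_signature(marker_sets, min_fraction=0.5):
    """Genes present in at least min_fraction of the sets, most frequent first."""
    # Count each gene once per signature, even if a list repeats it.
    counts = Counter(gene for genes in marker_sets for gene in set(genes))
    threshold = min_fraction * len(marker_sets)
    # Sort by decreasing count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, n in ranked if n >= threshold]


# Hypothetical signatures that share one ontology term.
signatures = [
    ["CD3E", "CD3D", "CD2"],
    ["CD3E", "CD3D", "IL7R"],
    ["CD3E", "TRAC"],
]

consensus = consensus_signature(signatures)
```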
Interaction networks for user genes are calculated from protein-protein interactions obtained from the HitPredict database. Interaction reliability scores are also taken from HitPredict.

Functionality
Search cell types using genes
Use one or more lists of marker genes to find matching cell type signatures published in the literature. When multiple gene lists are submitted, each must be assigned a cluster or cell type identifier. CellKb uses a rank-based method to identify the cell type marker gene sets that match the user's gene list. Other statistics, such as Fisher's exact test, Pearson's correlation coefficient and the Jaccard index, are also calculated for each cell type match. The ranks of the genes shared between the user's gene list and each matching cell type are provided.
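Two of the secondary statistics named above can be sketched with the standard library alone. This does not reproduce CellKb's rank-based method, which the text does not describe in detail; the function names, the one-sided form of the test and the `background_size` parameter (the number of genes assumed to be in the background) are choices made for the example.

```python
from math import comb


def jaccard_index(query, reference):
    """Size of the intersection divided by the size of the union."""
    query, reference = set(query), set(reference)
    union = query | reference
    return len(query & reference) / len(union) if union else 0.0


def fisher_enrichment_p(query, reference, background_size):
    """One-sided Fisher's exact test: probability of an overlap at least this large."""
    query, reference = set(query), set(reference)
    k = len(query & reference)             # observed overlap
    n, K = len(query), len(reference)
    N = background_size                    # assumed number of background genes
    total = comb(N, n)
    # Hypergeometric upper tail: sum over overlaps from k to the maximum possible.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Hypothetical query list and reference marker set.
user_genes = ["CD3E", "CD3D", "CD2", "GNLY"]
reference_markers = ["CD3E", "CD3D", "CD2", "IL7R", "TRAC"]

similarity = jaccard_index(user_genes, reference_markers)
p_value = fisher_enrichment_p(user_genes, reference_markers, background_size=20000)
```

A small p-value together with a high Jaccard index suggests that the overlap is unlikely to arise by chance under the assumed background.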
Users can also search through consensus marker gene sets using this option, to predict the consensus signature that matches best with the list of input genes.
Users can view interaction networks of the query genes based on protein-protein interactions experimentally identified for the species of interest. Marker genes of the matching cell type are indicated in the interaction network.
Search cell types with keywords
This functionality helps users find cell types in CellKb based on their annotations, e.g. the tissue in which they are found, the experimental conditions, the publication information or the disease information. It also allows users to find all cell types that have a specific gene in their marker list.
Search cell types by ontology
Users can navigate the entire contents of CellKb by species, publication, or cell type, organ/tissue and disease ontology. Each entry shows the experimental details and the cell type marker genes.
Search cell types across species
Use a ranked list of marker genes from one species to find matching cell type signatures in another species. This is done by identifying orthologous gene pairs between the two species, where available, in Ensembl. This is particularly useful because cell type identification studies are often concentrated in one species (e.g. mouse) rather than another (e.g. human).
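The ortholog step can be pictured as translating a ranked gene list through a table of gene pairs while preserving rank order. The sketch below assumes a simple one-to-one dictionary; the table contents, the unmapped gene `Gm_example` and the function name `map_to_orthologs` are illustrative, and real ortholog data can include one-to-many relationships that this sketch ignores.

```python
def map_to_orthologs(ranked_genes, ortholog_pairs):
    """Translate a ranked gene list to another species, keeping the rank order."""
    mapped = []
    for gene in ranked_genes:
        target = ortholog_pairs.get(gene)
        # Skip genes with no ortholog and avoid duplicate targets.
        if target is not None and target not in mapped:
            mapped.append(target)
    return mapped


# Hypothetical mouse-to-human lookup; a real table would be derived from Ensembl.
mouse_to_human = {"Cd3e": "CD3E", "Cd19": "CD19", "Ptprc": "PTPRC"}

# Ranked mouse query; "Gm_example" stands in for a gene without an ortholog.
mouse_query = ["Cd3e", "Gm_example", "Ptprc"]

human_query = map_to_orthologs(mouse_query, mouse_to_human)
```

The translated list can then be used to search cell type signatures in the target species in the same way as a native gene list.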

Licensing Options
A significant portion of the data in CellKb is free for academic users to search and browse through CellKb Immune. Commercial users can also sign up for a trial license to evaluate CellKb for a limited period.

Academic and commercial users need to purchase a paid subscription to CellKb to access and download all data. We also provide paid service options to customize CellKb or integrate it with other databases.
If you would like to partner with us or know more about how you can use CellKb, please feel free to contact us.

People
CellKb is developed by Ashwini Patil, PhD, with technical support from Ajay Patil. CellKb is licensed through Combinatics Inc., a Tokyo-based bioinformatics company. Prior to founding Combinatics, Ashwini was a Lecturer at the University of Tokyo.

References
1. CellKb Immune: a manually curated database of mammalian immune marker gene sets optimized for rapid cell type identification. bioRxiv, 2020 (preprint).