Towards Reproducible Cell Type Annotations

Why annotation quality in published single-cell studies falls short and what we can do about it

March 4, 2026 Dr. Ashwini Patil

In brief

Cell type annotations in published single-cell studies are frequently inconsistent: undefined abbreviations, non-standard terminology, clusters named after marker genes and missing ontology mappings are all common.
Poor annotations in reference datasets propagate errors into automated annotation pipelines and waste significant researcher time.
Initiatives like CellxGene enforce ontology-mapped annotations but cannot always capture the granularity of author-assigned labels.
Automated tools alone are unable to resolve ambiguous or non-standard labels and manual curation by domain experts remains essential.
Authors, journals and the broader community all have a role to play in raising cell type annotation standards.

Introduction

Single-cell and spatial transcriptomics have revolutionized our understanding of cellular diversity in health and disease. These datasets are not only valuable as primary research contributions but increasingly serve as reference resources for automated cell type annotation in new studies. However, the quality of cell type annotations in published single-cell studies remains a growing and underappreciated problem. Cluster labels are ambiguous, abbreviations go undefined, cell type names lack standardization and ontology mappings are almost always absent. For researchers who rely on published datasets as annotation references or who attempt to integrate findings across studies, these shortcomings create significant costs in time, accuracy and reproducibility.

Table 1: Examples of common cell type annotation problems in published single-cell studies

Problem	Example	Impact
Undefined abbreviation	ssFib	Cell type identity cannot be determined
Ambiguous abbreviation	NMP	Incorrect cell type assignment (neuromesodermal progenitor assigned as neutrophil-myeloid progenitor)
Unannotated cluster numbers	C1, C2, C3	Cell type names cannot be linked to underlying data
Marker gene as cell type label	Zscan4-high	Cannot be mapped to an ontology term
Non-standard terminology	Node	Cannot be mapped to existing nomenclature
Missing ontology mapping	Vein_shearstress	No Cell Ontology term exists
Insufficiently resolved annotation	T cell 1, T cell 2	Only mappable to broad T cell ontology term

Problem 1: Undefined abbreviations

One of the most pervasive annotation problems in published single-cell studies is the use of abbreviations that are never formally defined in the main text, figure legends or supplementary tables. Cluster labels appear on UMAPs and in data files without any accompanying glossary, leaving readers to infer their meaning from context.

This problem is compounded by the absence of field-wide naming conventions. Different research groups use different abbreviations for the same cell type and the same abbreviation can mean entirely different things across papers or tissue contexts. Without explicit definitions there is no reliable way to resolve these ambiguities from the data alone.
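As a minimal sketch of how such collisions can be surfaced once abbreviation definitions have been collected from several papers, the snippet below groups expansions by abbreviation and reports those with conflicting or missing definitions. The study names and the record layout are hypothetical; the NMP and ssFib labels are taken from Table 1.

```python
from collections import defaultdict

# Hypothetical (study, abbreviation, expansion) records; study names are placeholders.
records = [
    ("study_A", "NMP", "neuromesodermal progenitor"),
    ("study_B", "NMP", "neutrophil-myeloid progenitor"),
    ("study_A", "Fib", "fibroblast"),
    ("study_B", "Fib", "Fibroblast"),
    ("study_C", "ssFib", None),  # never defined by the authors
]

def audit_abbreviations(records):
    """Return (ambiguous, undefined) abbreviations across studies."""
    expansions = defaultdict(set)
    seen_without_definition = set()
    for _study, abbrev, expansion in records:
        if expansion is None:
            seen_without_definition.add(abbrev)
        else:
            # Normalize case and whitespace so trivial variants do not count as conflicts.
            expansions[abbrev].add(expansion.strip().lower())
    # Ambiguous: the same abbreviation has more than one distinct expansion.
    ambiguous = {a: sorted(e) for a, e in expansions.items() if len(e) > 1}
    # Undefined: no study in the collection provides an expansion.
    undefined = sorted(seen_without_definition - set(expansions))
    return ambiguous, undefined

ambiguous, undefined = audit_abbreviations(records)
print(ambiguous)
print(undefined)
```

A check like this can only flag a conflict; deciding which expansion applies in a given tissue context still requires the authors' definition or expert judgment.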

A particularly problematic variant of this issue occurs when cluster numbers are used in supplementary materials or the main text while cell type names are shown only in figures. When the mapping between cluster numbers and cell type names is not explicitly provided, or when figure labels are insufficiently detailed, it becomes impossible to reliably connect the data to the biology. In such cases the dataset cannot be meaningfully annotated or integrated with other studies, making it unusable as a reference resource.

Problem 2: Non-standard terminology

Non-standard terminology is a related but distinct challenge. In some cases, it is scientifically justified: when a study identifies a genuinely novel cell population, new terminology is necessary. However, the new name needs to be clearly defined, its relationship to existing cell types explained and its defining markers documented. Without this, even legitimate biological novelty becomes a source of confusion rather than insight.

More problematic is non-standard terminology that arises not from biological novelty but from a lack of familiarity with existing cell type nomenclature in the literature. Cell types that have been well characterized and consistently named across multiple studies are sometimes assigned new group-specific names, making cross-study integration difficult.

Problem 3: Marker gene names used as cell type labels

A related practice is naming clusters after a single defining marker gene rather than assigning a proper cell type name. A cluster named after a gene describes a transcriptional feature but does not identify what kind of cell it represents and cannot be mapped to any ontology term or integrated with annotations from other studies. The actual cell type may be described elsewhere in the paper but has to be searched for.

Where the identity is genuinely uncertain, labeling a cluster as "uncharacterized fibroblast" with its defining markers is more useful than a gene name. Reviewing the cell types described in the literature often reveals that the cluster corresponds to a population already described and named in related studies.

Problem 4: Missing or incorrect ontology mappings

Standardized cell type ontologies enable interoperability between datasets, studies and analytical tools, making annotations machine-readable and compatible with downstream AI applications. Despite the availability of well-maintained ontologies and tools to apply them, ontology mappings are absent from the majority of published single-cell studies. Where mappings are provided, overly broad parent terms are sometimes used where more specific terms exist, reducing annotation resolution.

Initiatives like CellxGene have made important progress by requiring ontology-mapped annotations as a condition of data submission. However, because the Cell Ontology does not always contain terms at the granularity of the original paper, CellxGene annotations sometimes default to the nearest available parent term, sacrificing biological precision. This highlights the need for ontology standards to keep pace with the biology.
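The parent-term fallback described above can be illustrated with a small sketch. The hierarchy and term names below are a toy example invented for illustration, not the real Cell Ontology graph or CellxGene's actual mapping procedure; the point is only to show how walking up to the nearest available term discards resolution, and why recording the number of levels lost is useful.

```python
# Toy child -> parent relationships (hypothetical, simplified names).
TOY_PARENT = {
    "exhausted CD8 T cell": "CD8 T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
}

# Terms assumed to exist in the toy ontology; the most specific author label is absent.
TOY_ONTOLOGY_TERMS = {"CD8 T cell", "T cell", "lymphocyte"}

def map_with_fallback(label):
    """Return (mapped_term, levels_lost), or (None, None) if no ancestor is in the ontology."""
    current, levels = label, 0
    while current is not None:
        if current in TOY_ONTOLOGY_TERMS:
            return current, levels
        # Step up to the parent; None ends the walk when no parent is known.
        current = TOY_PARENT.get(current)
        levels += 1
    return None, None

for author_label in ["exhausted CD8 T cell", "T cell", "Vein_shearstress"]:
    print(author_label, "->", map_with_fallback(author_label))
```

Keeping the original author label alongside the mapped term and the levels lost preserves the information that the mapping itself cannot express.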

The cost of poor annotations

Figure 1: Annotation propagation schematic. Ambiguous annotations in a reference dataset propagate errors into the automated annotation of a new dataset, shown in comparison with a well-annotated reference.

The consequences of inconsistent annotation extend well beyond individual inconvenience. When published datasets are used as references for automated cell type annotation, errors and ambiguities propagate directly into annotations assigned to new datasets, potentially misclassifying cells in ways that are difficult to detect.
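The propagation mechanism can be sketched with a deliberately simple label-transfer scheme. The example below uses nearest-centroid assignment on made-up two-gene expression values; it is not the method of any particular annotation tool, and all numbers and the function name are assumptions for illustration. Whatever label the reference carries is copied verbatim onto the new cells.

```python
import math

def centroid(cells):
    """Mean expression vector of a list of equal-length tuples."""
    n = len(cells)
    return tuple(sum(cell[i] for cell in cells) / n for i in range(len(cells[0])))

def transfer_labels(reference, query):
    """Assign each query cell the label of the nearest reference centroid."""
    centroids = {label: centroid(cells) for label, cells in reference.items()}
    return [
        min(centroids, key=lambda label: math.dist(centroids[label], cell))
        for cell in query
    ]

# Identical toy expression values, annotated two different ways.
group_1 = [(5.0, 0.2), (4.8, 0.1)]
group_2 = [(0.3, 4.9), (0.1, 5.2)]
well_annotated = {"neuromesodermal progenitor": group_1, "fibroblast": group_2}
ambiguous_ref = {"NMP": group_1, "C2": group_2}

query = [(4.9, 0.3), (0.2, 5.0)]
print(transfer_labels(well_annotated, query))
print(transfer_labels(ambiguous_ref, query))
```

The numerical assignment is identical in both runs; only the interpretability of the output differs, which is why such errors are hard to detect downstream.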

Researchers routinely spend substantial time attempting to resolve ambiguous labels, locate abbreviation definitions that were never provided and determine whether a cluster label refers to an established cell type or an author-specific finding. This effort is largely invisible but cumulatively represents an enormous drain on research productivity across the field.

Perhaps most significantly, datasets containing genuinely novel cell types are sometimes rendered unusable as reference resources due to poor annotation quality, leading to gaps in reference databases that could otherwise be more comprehensive and biologically informative. Without consistent, well-defined annotations, findings cannot be reliably reproduced, datasets cannot be meaningfully integrated and the cumulative knowledge embedded in published single-cell studies remains difficult to access and build upon.

Solutions and recommendations

The most impactful change individual authors can make is to provide a supplementary table mapping every cluster label to its full cell type name, defining markers and corresponding Cell Ontology term. Specific recommendations include:

Define all cluster label abbreviations explicitly in a supplementary glossary
Use established cell type names from the relevant literature wherever possible
Avoid using gene names as cluster labels and describe uncharacterized populations as such with their marker gene profiles
When coining new terminology for genuinely novel populations, clearly define the name, document defining markers and explain the relationship to existing cell types
Where Cell Ontology terms are unavailable for novel populations, use the most appropriate parent term and document the reasoning
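The supplementary mapping table recommended above could be checked before submission with a few lines of code. The sketch below is one possible layout, not a community standard: the column names, the `mapping_note` field and the `validate` helper are assumptions, and the Cell Ontology IDs shown should be verified against the current ontology release.

```python
import csv
import io
import re

# Cell Ontology identifiers have the form "CL:" followed by seven digits.
CL_ID = re.compile(r"^CL:\d{7}$")
REQUIRED = ("cluster_label", "cell_type_name", "marker_genes", "cell_ontology_id")
COLUMNS = REQUIRED + ("mapping_note",)

rows = [
    {"cluster_label": "C1", "cell_type_name": "T cell",
     "marker_genes": "CD3D;CD3E", "cell_ontology_id": "CL:0000084", "mapping_note": ""},
    {"cluster_label": "C2", "cell_type_name": "uncharacterized fibroblast",
     "marker_genes": "COL1A1;DCN", "cell_ontology_id": "CL:0000057",
     "mapping_note": "parent term used; no more specific term found"},
    # Deliberately incomplete row to show what the check reports.
    {"cluster_label": "C3", "cell_type_name": "",
     "marker_genes": "", "cell_ontology_id": "T cell", "mapping_note": ""},
]

def validate(rows):
    """Return a list of human-readable problems found in the annotation table."""
    problems = []
    for row in rows:
        label = row.get("cluster_label") or "<missing label>"
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append(f"{label}: missing {field}")
        cl_id = (row.get("cell_ontology_id") or "").strip()
        if cl_id and not CL_ID.match(cl_id):
            problems.append(f"{label}: malformed Cell Ontology ID {cl_id!r}")
    return problems

def to_tsv(rows):
    """Serialize the table as tab-separated text suitable for a supplementary file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

problems = validate(rows)
for problem in problems:
    print(problem)
print(to_tsv(rows[:2]))
```

A format check of this kind confirms only that the table is complete and well formed; whether each mapping is biologically correct still needs expert review.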

Journals are well positioned to drive improvement by incorporating annotation quality into peer review criteria and requiring standardized annotation tables as part of supplementary data for single-cell publications. Data sharing and code availability have become standard expectations and annotation quality deserves equivalent attention.

Researchers across the field can contribute by treating annotation quality as a shared responsibility. The cell type labels assigned to a dataset today will shape how it is interpreted and reused for years to come and consistent, well-defined annotations are a prerequisite for reproducibility and meaningful cross-study integration.

Conclusion

The single-cell and spatial transcriptomics field has produced an extraordinary body of knowledge but its full value cannot be realized if cell type annotations remain inconsistent and poorly standardized. The problems described in this post are not insurmountable. They require awareness, adherence to existing standards and a recognition that rigorous cell type annotation is as integral to a single-cell study as the sequencing and analysis that precede it.

Text mining and AI-based tools are improving rapidly in their ability to extract cell type information from published studies, reducing the time needed to process large volumes of literature. However, manual curation remains the most reliable way to resolve ambiguous labels, verify biological accuracy and ensure consistent ontology mapping at scale.

The CellKb database is built on this principle. It relies primarily on expert manual curation to select high-quality signatures, resolve non-standard annotations and harmonize cell type labels against standardized ontologies. The result is a comprehensive AI-ready reference that captures cell type identity across tissues, species and both disease and homeostatic conditions.

With greater attention from authors, journals and the community, and with continued investment in high-quality curated resources and tools bridging the automated and manual approaches, the goal of reproducible cell type annotation across single-cell and spatial transcriptomics studies is well within reach.