Input Data Format Specification#
This document describes the expected input data format for scXpand training and inference.
The framework expects single-cell RNA sequencing data in .h5ad format with specific annotation fields in the obs (observations) metadata and gene expression data in the X matrix.
Gene Expression Matrix (X)#
The X matrix contains the gene expression data for all cells:
Format: Dense or sparse matrix (CSR format recommended for memory efficiency)
Shape: (n_cells, n_genes), where rows are cells and columns are genes
Data Type: Integer counts (stored as int32, int64, or float32/float64 with integer values)
Values: Raw gene expression counts (non-negative integers)
Preprocessing: The framework expects raw counts and applies its own preprocessing pipeline
Raw counts are processed according to the data preprocessing configuration
Pre-normalized or log-transformed data is not supported
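The requirements above can be checked before handing data to the framework. The following is a minimal sketch, not part of scXpand itself: `check_raw_counts` is a hypothetical helper name, and the toy matrix stands in for the X matrix of a real dataset.

```python
import numpy as np
from scipy import sparse


def check_raw_counts(X):
    """Return X as a CSR matrix after checking that it looks like raw counts.

    Hypothetical helper for illustration; not a scXpand function.
    """
    if sparse.issparse(X):
        # Only stored values need checking; implicit zeros are valid counts.
        X = X.tocsr()
        values = X.data
    else:
        values = np.asarray(X).ravel()

    if values.size and values.min() < 0:
        raise ValueError("X contains negative values; expected raw counts")
    if not np.all(values == np.round(values)):
        raise ValueError(
            "X contains non-integer values; it may be normalized or log-transformed"
        )
    return sparse.csr_matrix(X)


# Toy matrix: 3 cells x 4 genes, float dtype but integer-valued.
X = np.array([[0, 2, 0, 1], [5, 0, 0, 0], [0, 0, 3, 0]], dtype=np.float32)
X_csr = check_raw_counts(X)
```

With an AnnData object loaded from an .h5ad file, the same check could be applied as `adata.X = check_raw_counts(adata.X)`.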
Gene Annotations (var)#
Gene information is stored in the var (variables) metadata:
Index: Gene identifiers (e.g., gene symbols, Ensembl IDs)
Requirement: Gene identifiers should be consistent across datasets if combining multiple studies
Note
Our entire pipeline is designed to work only with ensembl_ids for consistency across datasets. If your data uses gene symbols, please convert them to ensembl_ids before using our pre-trained models.
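One way to perform this conversion is sketched below. It assumes you already have a symbol-to-Ensembl lookup table; the `symbol_to_ensembl` dictionary and the `convert_symbols` helper are hypothetical, and the IDs shown are for illustration only and should be verified against the reference annotation that matches your data.

```python
import pandas as pd

# Hypothetical lookup table; in practice, build it from a reference
# annotation that matches the genome build of your data.
symbol_to_ensembl = {
    "CD3E": "ENSG00000198851",
    "CD8A": "ENSG00000153563",
    "CD4": "ENSG00000010610",
}


def convert_symbols(symbols, mapping):
    """Map gene symbols to Ensembl IDs.

    Returns the mapped IDs (NaN where unmapped) and a boolean mask that keeps
    genes with a mapping, dropping repeated Ensembl IDs after the first.
    """
    ids = pd.Series(list(symbols)).map(mapping)
    keep = (ids.notna() & ~ids.duplicated()).to_numpy()
    return ids, keep


symbols = ["CD3E", "CD8A", "NOT_A_GENE", "CD4"]
ids, keep = convert_symbols(symbols, symbol_to_ensembl)
ensembl_ids = ids[keep].tolist()
```

For an AnnData object, the mask can be used to subset the genes (`adata = adata[:, keep].copy()`) before assigning `adata.var_names = ensembl_ids`.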
Cell Annotations (obs)#
Required for Training#
| Field | Type | Description | Usage |
|---|---|---|---|
| | str | Study identifier for data splitting | Training only - Used for patient-level train/validation splits |
| | str | Unique patient identifier within study | Training only - Used for patient-level train/validation splits |
| | str | Cancer type annotation | Training only - Used for stratified splitting |
| | str | Sample identifier within patient | Training only - Used for data organization |
| expansion | str | Expansion label ("expanded" or "non-expanded") | Training only - Target variable for model training |
| clone_id_size | int | Number of cells with this clone_id in the current sample | Training only - Used for soft label computation |
| median_clone_size | int | Median clone size in the sample | Training only - Used for soft label computation |
| tissue_type | str | Tissue type annotation | Training only - Used for evaluation stratification |
| imputed_labels | str | Cell type labels | Training only - Used for evaluation stratification |
Note
tissue_type and imputed_labels can also be used as auxiliary labels in some training configurations, in addition to their use for computing stratified evaluation metrics. They are not required for the actual inference/prediction process.
Expansion Definition#
The expansion field should contain string values:
"expanded" - for cells considered part of expanded clones
"non-expanded" - for cells not considered part of expanded clones
The framework uses a 1.5× median clone size threshold: a cell is considered expanded if its clone_id_size > 1.5 × median_clone_size for that sample.
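As an illustration of this rule, the sketch below derives the expansion column from clone_id_size and median_clone_size with pandas. The toy values are made up for the example; the comparison is strict, so a cell whose clone size equals exactly 1.5 × the median is labeled "non-expanded".

```python
import numpy as np
import pandas as pd

# Toy obs table; in real data these columns come from the .h5ad obs metadata.
obs = pd.DataFrame(
    {
        "clone_id_size": [1, 4, 10, 3],
        "median_clone_size": [2, 2, 4, 2],
    }
)

# A cell is expanded if its clone is strictly larger than 1.5 x the sample median.
is_expanded = obs["clone_id_size"] > 1.5 * obs["median_clone_size"]
obs["expansion"] = np.where(is_expanded, "expanded", "non-expanded")
```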
Required for Inference#
Applying our pre-trained models only for inference purposes requires:
Filtration of the gene expression matrix to include only T cells
Gene representation using ensembl_ids
The platform handles missing genes, different gene orders, and additional genes not used by the model.
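To make the three cases concrete, here is a minimal sketch of how a count matrix could be aligned to a model's gene list. It is not the framework's actual implementation: `align_genes` is a hypothetical helper, the gene IDs are placeholders, and filling missing genes with zeros is an assumption made for this example.

```python
import numpy as np
from scipy import sparse


def align_genes(X, data_genes, model_genes):
    """Return X with columns reordered to match model_genes.

    Genes absent from the data become all-zero columns (an assumption of this
    sketch); genes the model does not use are dropped.
    """
    X = sparse.csc_matrix(X)  # CSC makes column slicing cheap
    col_of = {gene: j for j, gene in enumerate(data_genes)}
    n_cells = X.shape[0]
    columns = [
        X[:, col_of[g]]
        if g in col_of
        else sparse.csc_matrix((n_cells, 1), dtype=X.dtype)
        for g in model_genes
    ]
    return sparse.hstack(columns, format="csr")


# Placeholder IDs: the data has a different order and one extra gene,
# and it lacks one gene that the model expects.
data_genes = ["ENSG_B", "ENSG_EXTRA", "ENSG_A"]
model_genes = ["ENSG_A", "ENSG_B", "ENSG_C"]
X = np.array([[1, 9, 2], [3, 9, 4]])

aligned = align_genes(X, data_genes, model_genes)
```

In this toy example the aligned matrix has one column per model gene: the ENSG_A and ENSG_B columns are swapped into model order, ENSG_EXTRA is dropped, and ENSG_C is all zeros.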