scxpand.data_util.data_format

Data format specification and preprocessing parameters for scXpand models.

Functions

load_data_format(load_path)

Load a DataFormat object from saved files.

Classes

DataFormat(**data)

Data format specification and preprocessing parameters for scXpand models.

class scxpand.data_util.data_format.DataFormat(**data)

Data format specification and preprocessing parameters for scXpand models.

Contains all metadata and parameters needed to consistently preprocess single-cell expression data. Stores gene information, normalization parameters, and preprocessing settings used during model training.

This class ensures that inference data is processed identically to training data by preserving gene ordering, normalization statistics, and preprocessing pipeline configuration.

Variables:
  • n_genes – Number of genes in the dataset.

  • gene_names – Ordered list of gene names.

  • genes_mu – Per-gene means for z-score normalization.

  • genes_sigma – Per-gene standard deviations for z-score normalization.

  • use_log_transform – Whether to apply log1p transformation.

  • use_zscore_norm – Whether to apply z-score normalization.

  • target_sum – Target sum for row normalization (typically 10,000).

  • aux_categorical_types – Tuple of categorical feature names to include as auxiliary targets.

  • aux_categorical_mappings – Dictionary mapping categorical features to their integer encodings.
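As a rough illustration of how these parameters could fit together, the sketch below applies row normalization, log1p, and z-scoring to a dense count matrix with NumPy. It is not the scXpand implementation: the function name `preprocess_counts`, the `eps` guard, and the order of the steps are assumptions made for this example.

```python
import numpy as np


def preprocess_counts(counts, genes_mu, genes_sigma, *, target_sum=10_000.0,
                      use_log_transform=True, use_zscore_norm=True, eps=1e-8):
    """Hypothetical helper: row-normalize, then log1p, then z-score."""
    x = np.asarray(counts, dtype=np.float32)
    # Scale each cell (row) so its counts sum to target_sum; all-zero rows stay zero.
    row_sums = x.sum(axis=1, keepdims=True)
    x = np.divide(x * target_sum, row_sums, out=np.zeros_like(x), where=row_sums > 0)
    if use_log_transform:
        x = np.log1p(x)
    if use_zscore_norm:
        # eps is an assumed guard against zero standard deviations.
        x = (x - genes_mu) / (genes_sigma + eps)
    return x
```

Passing the same `genes_mu`, `genes_sigma`, and flags at training and inference time is what keeps the two code paths consistent.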

convert_genes_expression_matrix(adata)

Reorder genes in AnnData to match the data format gene order.

This function:

  • Reorders genes to match self.gene_names order

  • Adds missing genes as zero columns

  • Removes genes not in the data format

  • Converts X matrix to CSR format with float32 dtype

Parameters:

adata (AnnData) – AnnData object to reorder

Return type:

AnnData

Returns:

New AnnData object with genes reordered to match data format
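The column alignment described above can be sketched on a plain dense matrix. This is only an illustration under stated assumptions: `reorder_columns` is a hypothetical helper, it works on NumPy arrays rather than AnnData, and it skips the CSR conversion that the real method performs.

```python
import numpy as np


def reorder_columns(x, source_genes, target_genes):
    """Hypothetical helper: return a float32 matrix whose columns follow target_genes."""
    x = np.asarray(x, dtype=np.float32)
    source_index = {gene: i for i, gene in enumerate(source_genes)}
    out = np.zeros((x.shape[0], len(target_genes)), dtype=np.float32)
    for j, gene in enumerate(target_genes):
        i = source_index.get(gene)
        if i is not None:
            out[:, j] = x[:, i]  # gene present in the input: copy its column
        # Genes missing from the input keep their zero column.
    # Input genes absent from target_genes are never copied, so they are dropped.
    return out


aligned = reorder_columns(
    [[1, 2, 3], [4, 5, 6]],
    source_genes=["A", "B", "C"],
    target_genes=["C", "X", "A"],
)
```

Here column "B" is dropped, "X" is filled with zeros, and "C" and "A" swap positions.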

create_data_format(data_path, adata, row_inds_train, batch_size=500000)

Create a DataFormat object based on the training set rows.

This sets up the data format metadata including gene names, means, stds, and categorical mappings.

Parameters:
  • data_path (str | Path) – Path to the AnnData file. Required for efficient mean/std calculation.

  • adata (AnnData) – AnnData object with the data.

  • row_inds_train (ndarray) – Indices of the training set rows (preferably sorted, for faster runtime).

  • batch_size (int (default: 500000)) – The batch size to use for computing gene means and stds.

Return type:

None
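To show why a batch size matters here, the following sketch accumulates per-gene sums and sums of squares over the training rows one batch at a time, so the full training matrix never has to be held in memory. It is an assumption-laden illustration, not the library's code: `batched_mean_std` is a hypothetical name, the input is an in-memory NumPy array instead of a file-backed AnnData, and the population standard deviation (ddof=0) is assumed.

```python
import numpy as np


def batched_mean_std(x, row_inds, batch_size=500_000):
    """Hypothetical helper: per-column mean and std over selected rows, in batches."""
    row_inds = np.asarray(row_inds)
    n = len(row_inds)
    total = np.zeros(x.shape[1], dtype=np.float64)
    total_sq = np.zeros(x.shape[1], dtype=np.float64)
    for start in range(0, n, batch_size):
        # Load only one batch of training rows at a time.
        batch = np.asarray(x[row_inds[start:start + batch_size]], dtype=np.float64)
        total += batch.sum(axis=0)
        total_sq += (batch ** 2).sum(axis=0)
    mean = total / n
    # Clamp tiny negative values caused by floating-point cancellation.
    var = np.maximum(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)
```

Sorted row indices tend to help in a file-backed setting because each batch then reads a more contiguous region of the file.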

prepare_adata_for_training(adata, *, reorder_genes=False)

Prepare AnnData object for training.

Parameters:
  • adata (AnnData) – AnnData object to prepare

  • reorder_genes (bool (default: False)) – If True, reorders genes to match the data format. If False, returns the AnnData unchanged.

Return type:

AnnData

Returns:

AnnData object, optionally with genes reordered to match data format.

Note

  • Gene reordering is typically only needed for inference with new data

  • During training, gene ordering is handled efficiently at batch level

  • Preprocessing (normalization, log transform, z-score) is performed on-the-fly during batch loading for memory efficiency

reorder_genes_to_match_format(adata)

Reorder genes in AnnData to match the data format gene order.

This function reorders genes to match self.gene_names, adds missing genes as zero columns, and removes extra genes not in the data format.

Parameters:

adata (AnnData) – AnnData object to reorder

Return type:

AnnData

Returns:

New AnnData object with genes reordered to match data format

save(save_path)

Save the DataFormat object to a JSON file and its NumPy arrays to a .npz file.

Return type:

None
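The JSON-plus-NPZ split can be illustrated with a small round trip. This is a sketch only: the helper names `save_format` and `load_format`, the JSON keys, and the array names inside the .npz file are assumptions for this example, and the actual on-disk layout used by scXpand may differ.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_format(save_path, settings, genes_mu, genes_sigma):
    """Hypothetical helper: scalars go to JSON, arrays to an .npz with the same basename."""
    save_path = Path(save_path)
    save_path.write_text(json.dumps(settings, indent=2))
    np.savez(save_path.with_suffix(".npz"), genes_mu=genes_mu, genes_sigma=genes_sigma)


def load_format(load_path):
    """Hypothetical helper: read back the JSON settings and the paired .npz arrays."""
    load_path = Path(load_path)
    settings = json.loads(load_path.read_text())
    with np.load(load_path.with_suffix(".npz")) as arrays:
        return settings, arrays["genes_mu"], arrays["genes_sigma"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_format.json"
    save_format(path, {"n_genes": 2, "target_sum": 10000.0},
                np.array([0.1, 0.2]), np.array([1.0, 2.0]))
    settings, mu, sigma = load_format(path)
```

Keeping the arrays out of the JSON file avoids serializing long float lists as text and preserves their dtype exactly.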

aux_categorical_mappings: dict[str, dict[str, int]]
aux_categorical_types: tuple[str, ...]
gene_names: list[str]
genes_mu: ndarray
genes_sigma: ndarray
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

n_genes: int
target_sum: float
use_log_transform: bool
use_zscore_norm: bool

scxpand.data_util.data_format.load_data_format(load_path)

Load a DataFormat object from saved files.

Loads normalization parameters and gene metadata from JSON and NPZ files created during model training.

Parameters:

load_path (Path) – Path to the JSON file (e.g., ‘data_format.json’). Expects a corresponding NPZ file with the same basename.

Return type:

DataFormat

Returns:

DataFormat object containing preprocessing parameters and gene statistics.

Example

>>> data_format = load_data_format(Path("results/data_format.json"))
>>> print(f"Loaded {data_format.n_genes} genes")