scxpand.data_util.data_format
Data format specification and preprocessing parameters for scXpand models.
Functions
load_data_format(load_path) | Load a DataFormat object from saved files.
Classes
DataFormat(**data) | Data format specification and preprocessing parameters for scXpand models.
- class scxpand.data_util.data_format.DataFormat(**data)
Data format specification and preprocessing parameters for scXpand models.
Contains all metadata and parameters needed to consistently preprocess single-cell expression data. Stores gene information, normalization parameters, and preprocessing settings used during model training.
This class ensures that inference data is processed identically to training data by preserving gene ordering, normalization statistics, and preprocessing pipeline configuration.
- Variables:
n_genes – Number of genes in the dataset.
gene_names – Ordered list of gene names.
genes_mu – Per-gene means for z-score normalization.
genes_sigma – Per-gene standard deviations for z-score normalization.
use_log_transform – Whether to apply log1p transformation.
use_zscore_norm – Whether to apply z-score normalization.
target_sum – Target sum for row normalization (typically 10,000).
aux_categorical_types – Tuple of categorical feature names to include as auxiliary targets.
aux_categorical_mappings – Dictionary mapping categorical features to their integer encodings.
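The fields above suggest a three-stage pipeline: row normalization to target_sum, an optional log1p transform, and an optional per-gene z-score. The following is a minimal NumPy sketch of that idea, not the scXpand implementation. The helper name preprocess_batch, the eps guard, and the exact order of the steps are assumptions made for illustration.

```python
import numpy as np


def preprocess_batch(X, genes_mu, genes_sigma, target_sum=1e4,
                     use_log_transform=True, use_zscore_norm=True, eps=1e-8):
    """Hypothetical helper (not part of scXpand): normalize one dense batch."""
    X = np.asarray(X, dtype=np.float32)
    # Row normalization: scale each cell so its counts sum to target_sum.
    row_sums = X.sum(axis=1, keepdims=True)
    X = X / np.maximum(row_sums, eps) * target_sum
    if use_log_transform:
        # log1p keeps zero counts at zero.
        X = np.log1p(X)
    if use_zscore_norm:
        # Per-gene z-score using statistics stored with the data format.
        X = (X - genes_mu) / np.maximum(genes_sigma, eps)
    return X


counts = np.array([[1.0, 3.0], [0.0, 4.0]])
scaled = preprocess_batch(counts, genes_mu=np.zeros(2), genes_sigma=np.ones(2),
                          use_log_transform=False)
full = preprocess_batch(counts, genes_mu=np.zeros(2), genes_sigma=np.ones(2))
```

With the log transform disabled and mu = 0, sigma = 1, each row of scaled sums to target_sum, which makes the row-normalization step easy to check in isolation.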
- convert_genes_expression_matrix(adata)
Reorder genes in AnnData to match the data format gene order.
This function:
- Reorders genes to match self.gene_names order
- Adds missing genes as zero columns
- Removes genes not in the data format
- Converts X matrix to CSR format with float32 dtype
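The column alignment described above can be sketched on a plain dense array. This is an illustration under stated assumptions rather than the library's code: reorder_columns is a hypothetical helper, it works on a NumPy array instead of an AnnData object, and it returns a dense float32 matrix (the documented method converts X to CSR).

```python
import numpy as np


def reorder_columns(X, source_genes, target_genes):
    """Hypothetical helper: align the columns of X to target_genes."""
    source_index = {gene: i for i, gene in enumerate(source_genes)}
    # Start from zeros so genes missing from the source stay zero columns.
    out = np.zeros((X.shape[0], len(target_genes)), dtype=np.float32)
    for j, gene in enumerate(target_genes):
        i = source_index.get(gene)
        if i is not None:
            out[:, j] = X[:, i]
    # Source genes absent from target_genes are never copied, so they are dropped.
    return out


X = np.array([[1, 2, 3], [4, 5, 6]])
aligned = reorder_columns(X, ["A", "B", "C"], ["C", "X", "A"])
```

Here gene "B" is dropped, gene "X" becomes a zero column, and "C" and "A" swap positions to follow the target order.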
- create_data_format(data_path, adata, row_inds_train, batch_size=500000)
Create a DataFormat object based on the training set rows.
This sets up the data format metadata including gene names, means, stds, and categorical mappings.
- Parameters:
data_path (str | Path) – Path to the AnnData file. Required for efficient mean/std calculation.
adata (AnnData) – AnnData object with the data.
row_inds_train (ndarray) – Indices of the training set rows (preferably sorted, for faster runtime).
batch_size (int (default: 500000)) – The batch size to use for computing gene means and stds.
- Return type:
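One way to compute per-gene means and standard deviations over the training rows without loading them all at once is to accumulate sums and sums of squares batch by batch. The sketch below shows that technique only. The helper batched_gene_stats is hypothetical, it operates on an in-memory NumPy array rather than a file-backed AnnData, and it uses the population standard deviation; whether create_data_format computes its statistics this way, or on raw versus normalized values, is not stated in this page.

```python
import numpy as np


def batched_gene_stats(X, row_inds, batch_size=500_000):
    """Hypothetical helper: per-column mean/std over selected rows, in batches."""
    n = len(row_inds)
    total = np.zeros(X.shape[1], dtype=np.float64)
    total_sq = np.zeros(X.shape[1], dtype=np.float64)
    for start in range(0, n, batch_size):
        # Sorted indices keep each batch's reads close together.
        batch_rows = row_inds[start:start + batch_size]
        batch = np.asarray(X[batch_rows], dtype=np.float64)
        total += batch.sum(axis=0)
        total_sq += (batch ** 2).sum(axis=0)
    mu = total / n
    # Var = E[x^2] - E[x]^2; clip tiny negatives caused by rounding.
    var = np.maximum(total_sq / n - mu ** 2, 0.0)
    return mu, np.sqrt(var)


rng = np.random.default_rng(0)
data = rng.random((6, 3))
train_rows = np.array([0, 2, 3, 5])
mu, sigma = batched_gene_stats(data, train_rows, batch_size=2)
```

The result matches data[train_rows].mean(axis=0) and data[train_rows].std(axis=0) up to floating-point rounding, regardless of batch_size.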
- prepare_adata_for_training(adata, *, reorder_genes=False)
Prepare AnnData object for training.
- Parameters:
- Return type:
AnnData
- Returns:
AnnData object, optionally with genes reordered to match data format.
Note
- Gene reordering is typically only needed for inference with new data.
- During training, gene ordering is handled efficiently at batch level.
- Preprocessing (normalization, log transform, z-score) is performed on-the-fly during batch loading for memory efficiency.
- reorder_genes_to_match_format(adata)
Reorder genes in AnnData to match the data format gene order.
This function reorders genes to match self.gene_names, adds missing genes as zero columns, and removes extra genes not in the data format.
- save(save_path)
Save the DataFormat object to a JSON file and numpy arrays to a .npz file.
- Return type:
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- scxpand.data_util.data_format.load_data_format(load_path)
Load a DataFormat object from saved files.
Loads normalization parameters and gene metadata from JSON and NPZ files created during model training.
- Parameters:
load_path (Path) – Path to the JSON file (e.g., 'data_format.json'). Expects a corresponding NPZ file with the same basename.
- Return type:
DataFormat
- Returns:
DataFormat object containing preprocessing parameters and gene statistics.
Example
>>> data_format = load_data_format(Path("results/data_format.json"))
>>> print(f"Loaded {data_format.n_genes} genes")
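The JSON-plus-NPZ pairing described for save and load_data_format can be illustrated with a small round trip. This is a sketch of the general pattern only: save_format and load_format are hypothetical stand-ins, and the JSON keys and array names used here are assumptions, not the actual on-disk schema of scXpand.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_format(save_path, meta, arrays):
    """Hypothetical stand-in: write metadata to JSON, arrays to a sibling .npz."""
    save_path = Path(save_path)
    save_path.write_text(json.dumps(meta))
    np.savez(save_path.with_suffix(".npz"), **arrays)


def load_format(load_path):
    """Hypothetical stand-in: read the JSON file and the .npz with the same basename."""
    load_path = Path(load_path)
    meta = json.loads(load_path.read_text())
    with np.load(load_path.with_suffix(".npz")) as npz:
        arrays = {name: npz[name] for name in npz.files}
    return meta, arrays


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_format.json"
    save_format(
        path,
        {"n_genes": 2, "gene_names": ["A", "B"]},
        {"genes_mu": np.array([0.5, 1.5]), "genes_sigma": np.array([1.0, 2.0])},
    )
    meta, arrays = load_format(path)
```

Keeping small scalar settings in JSON and the per-gene vectors in NPZ leaves the metadata human-readable while storing the numeric arrays in binary form.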