scxpand.data_util.prepare_data_for_train

Functions

prepare_data_for_training(data_path[, ...])

Prepare single-cell RNA sequencing data for model training.

Classes

TrainingDataBundle(adata, row_inds_train, ...)

Container for all data components needed for model training.

class scxpand.data_util.prepare_data_for_train.TrainingDataBundle(adata, row_inds_train, row_inds_dev, data_format, save_path)

Container for all data components needed for model training.

Variables:
  • adata – Preprocessed AnnData object ready for training.

  • row_inds_train – Sorted array of row indices for the training set.

  • row_inds_dev – Sorted array of row indices for the validation set.

  • data_format – DataFormat object containing preprocessing parameters.

  • save_path – Path to the directory where training artifacts are saved.

__init__(adata, row_inds_train, row_inds_dev, data_format, save_path)

adata: AnnData
data_format: DataFormat
row_inds_dev: ndarray
row_inds_train: ndarray
save_path: Path
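
The fields above can be illustrated with a minimal stand-in. This is only a sketch: BundleSketch and check_split are hypothetical names, plain Python lists replace the ndarray fields, and the check assumes the training and validation index sets are meant to be disjoint, which the documentation implies but does not state.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Sequence


@dataclass
class BundleSketch:
    """Hypothetical stand-in mirroring the documented TrainingDataBundle fields."""

    adata: Any                     # AnnData in the real class
    row_inds_train: Sequence[int]  # ndarray in the real class
    row_inds_dev: Sequence[int]    # ndarray in the real class
    data_format: Any               # DataFormat in the real class
    save_path: Path


def check_split(bundle):
    """Return True if both index sets are sorted and do not overlap."""
    train = list(bundle.row_inds_train)
    dev = list(bundle.row_inds_dev)
    is_sorted = train == sorted(train) and dev == sorted(dev)
    is_disjoint = not (set(train) & set(dev))
    return is_sorted and is_disjoint


bundle = BundleSketch(
    adata=None,                      # placeholder, no AnnData needed for the check
    row_inds_train=[0, 1, 3, 4],
    row_inds_dev=[2, 5],
    data_format=None,                # placeholder
    save_path=Path("results/temp"),
)
print(check_split(bundle))
```

A check like this could be run on a real bundle before training, since row_inds_train and row_inds_dev are both documented as sorted index arrays.
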
scxpand.data_util.prepare_data_for_train.prepare_data_for_training(data_path, aux_categorical_types=(), use_log_transform=True, use_zscore_norm=True, save_dir='results/temp', dev_ratio=0.2, rand_seed=42, resume=False, batch_size=500000)

Prepare single-cell RNA sequencing data for model training.

Parameters:
  • data_path (str | Path) – Path to the input AnnData file (.h5ad format).

  • aux_categorical_types (tuple[str, ...] (default: ())) – Names of the auxiliary categorical features to include.

  • use_log_transform (bool (default: True)) – Whether to apply log1p transformation.

  • use_zscore_norm (bool (default: True)) – Whether to apply z-score normalization per gene.

  • save_dir (str | Path (default: 'results/temp')) – Directory to save data format files and split information.

  • dev_ratio (float (default: 0.2)) – Proportion of data to use for validation (0.0 to 1.0).

  • rand_seed (int (default: 42)) – Random seed for reproducible data splitting.

  • resume (bool (default: False)) – If True, load existing data format and splits from save_dir.

  • batch_size (int (default: 500000)) – Batch size for computing gene statistics.
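
To make use_log_transform, use_zscore_norm, and batch_size concrete, the sketch below accumulates per-gene statistics over row batches and then z-scores each row. It is not the library's implementation: gene_stats_in_batches and zscore_row are hypothetical helpers, the data are plain nested lists, and the use of a population standard deviation and a small eps are assumptions.

```python
import math


def gene_stats_in_batches(rows, batch_size, use_log_transform=True):
    """Accumulate per-gene mean and standard deviation over row batches."""
    n_genes = len(rows[0])
    total = [0.0] * n_genes
    total_sq = [0.0] * n_genes
    # Only one batch of rows is processed at a time, so a real implementation
    # could read each batch from disk instead of holding the full matrix.
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            for g, value in enumerate(row):
                x = math.log1p(value) if use_log_transform else value
                total[g] += x
                total_sq[g] += x * x
    n = len(rows)
    means = [s / n for s in total]
    # Population variance (assumption); clamp at zero against rounding error.
    stds = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(total_sq, means)]
    return means, stds


def zscore_row(row, means, stds, use_log_transform=True, eps=1e-8):
    """Apply the optional log1p transform, then z-score each gene value."""
    out = []
    for value, m, s in zip(row, means, stds):
        x = math.log1p(value) if use_log_transform else value
        out.append((x - m) / (s + eps))  # eps avoids division by zero
    return out


counts = [[0, 1], [2, 3], [4, 5], [6, 7]]  # 4 cells x 2 genes, toy data
means, stds = gene_stats_in_batches(counts, batch_size=2)
normalized = [zscore_row(row, means, stds) for row in counts]
print(normalized[0])
```

In this sketch the batch size changes only how many rows are handled per step, not the resulting statistics.
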

Return type:

TrainingDataBundle

Returns:

TrainingDataBundle containing all data components needed for training.
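
A call using the documented signature might look like the sketch below. run_preparation is a hypothetical wrapper, and the code needs scxpand installed plus a real .h5ad file to actually run; the keyword arguments simply restate the documented defaults.

```python
from pathlib import Path


def run_preparation(data_path, save_dir="results/temp"):
    """Call prepare_data_for_training with the documented defaults made explicit."""
    # Imported lazily so this module can be loaded without scxpand installed.
    from scxpand.data_util.prepare_data_for_train import prepare_data_for_training

    bundle = prepare_data_for_training(
        data_path=Path(data_path),
        use_log_transform=True,
        use_zscore_norm=True,
        save_dir=save_dir,
        dev_ratio=0.2,
        rand_seed=42,
        resume=False,
    )
    # These attributes are the fields documented on TrainingDataBundle.
    print("train rows:", len(bundle.row_inds_train))
    print("dev rows:", len(bundle.row_inds_dev))
    print("artifacts in:", bundle.save_path)
    return bundle
```

For example, run_preparation("data.h5ad") would prepare a file at that (hypothetical) path and print the size of each split.
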

Raises:
  • FileNotFoundError – If data_path doesn’t exist.

  • ValueError – If dev_ratio is not between 0.0 and 1.0.

  • KeyError – If required metadata columns are missing from AnnData.obs.
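
The dev_ratio check and the role of rand_seed can be sketched as follows. split_rows is a hypothetical helper, not the library's splitting logic: the documentation does not say how rows are assigned (the real split may be grouped or stratified), and treating 0.0 and 1.0 as valid bounds is an assumption.

```python
import random


def split_rows(n_rows, dev_ratio=0.2, rand_seed=42):
    """Return sorted (train, dev) row indices from a seeded shuffle."""
    if not 0.0 <= dev_ratio <= 1.0:
        raise ValueError(f"dev_ratio must be between 0.0 and 1.0, got {dev_ratio}")
    rng = random.Random(rand_seed)  # local generator, so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_dev = int(round(n_rows * dev_ratio))
    dev = sorted(indices[:n_dev])
    train = sorted(indices[n_dev:])
    return train, dev


train, dev = split_rows(10, dev_ratio=0.2, rand_seed=42)
print(len(train), len(dev))

try:
    split_rows(10, dev_ratio=1.5)
except ValueError as err:
    print("rejected:", err)
```

Using the same seed yields the same split on every call, which is the property rand_seed is documented to provide.
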