scxpand.data_util.prepare_data_for_train

Functions

prepare_data_for_training(data_path[, ...])

Prepare single-cell RNA sequencing data for model training.

Classes

TrainingDataBundle(adata, row_inds_train, ...)

Container for all data components needed for model training.

class scxpand.data_util.prepare_data_for_train.TrainingDataBundle(adata, row_inds_train, row_inds_dev, data_format, save_path)

Container for all data components needed for model training.

Variables:

adata – Preprocessed AnnData object ready for training.
row_inds_train – Sorted array of row indices for the training set.
row_inds_dev – Sorted array of row indices for the validation set.
data_format – DataFormat object containing preprocessing parameters.
save_path – Path to the directory where training artifacts are saved.
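The two index arrays are described as sorted and cover a training and a validation set. A minimal sketch of how such a pair could be produced is shown here, assuming a plain random split over rows controlled by a ratio and a seed. The helper name split_row_indices is hypothetical and is not part of scxpand, and the library's actual splitting strategy (for example grouping or stratifying by metadata) may differ.

```python
import random


def split_row_indices(n_rows, dev_ratio=0.2, rand_seed=42):
    """Hypothetical helper: split row indices into sorted train/dev lists."""
    if not 0.0 <= dev_ratio <= 1.0:
        raise ValueError("dev_ratio must be between 0.0 and 1.0")

    # Shuffle with a dedicated generator so the same seed gives the same split.
    indices = list(range(n_rows))
    random.Random(rand_seed).shuffle(indices)

    # The first n_dev shuffled rows go to validation, the rest to training.
    n_dev = int(round(n_rows * dev_ratio))
    row_inds_dev = sorted(indices[:n_dev])
    row_inds_train = sorted(indices[n_dev:])
    return row_inds_train, row_inds_dev


if __name__ == "__main__":
    train, dev = split_row_indices(10)
    print("train rows:", train)
    print("dev rows:", dev)
```

Keeping both lists sorted makes them convenient for sequential reads from a backed AnnData file, which is one plausible reason for the sorted order noted above.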

__init__(adata, row_inds_train, row_inds_dev, data_format, save_path)

adata: AnnData

data_format: DataFormat

row_inds_dev: ndarray

row_inds_train: ndarray

save_path: Path

scxpand.data_util.prepare_data_for_train.prepare_data_for_training(data_path, aux_categorical_types=(), use_log_transform=True, use_zscore_norm=True, save_dir='results/temp', dev_ratio=0.2, rand_seed=42, resume=False, batch_size=500000)

Prepare single-cell RNA sequencing data for model training.

Parameters:

data_path (str | Path) – Path to the input AnnData file (.h5ad format).
aux_categorical_types (tuple[str, ...] (default: ())) – Tuple of categorical feature names to include.
use_log_transform (bool (default: True)) – Whether to apply log1p transformation.
use_zscore_norm (bool (default: True)) – Whether to apply z-score normalization per gene.
save_dir (str | Path (default: 'results/temp')) – Directory to save data format files and split information.
dev_ratio (float (default: 0.2)) – Proportion of data to use for validation (0.0 to 1.0).
rand_seed (int (default: 42)) – Random seed for reproducible data splitting.
resume (bool (default: False)) – If True, load existing data format and splits from save_dir.
batch_size (int (default: 500000)) – Batch size for computing gene statistics.
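To illustrate what use_log_transform, use_zscore_norm and batch_size control, the sketch below accumulates per-gene mean and standard deviation batch by batch and then applies log1p followed by z-scoring to a single row. It is a pure-Python illustration, not scxpand's implementation: the function names are hypothetical, the order of operations (statistics computed on log-transformed values) is an assumption, and the eps term used to guard against zero variance is also an assumption.

```python
import math


def batched_gene_stats(rows, batch_size, use_log_transform=True):
    """Hypothetical helper: per-gene mean and std accumulated in row batches."""
    n_genes = len(rows[0])
    total = [0.0] * n_genes
    total_sq = [0.0] * n_genes

    # Only one batch of rows is visited at a time, so memory stays bounded.
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            for g, value in enumerate(row):
                x = math.log1p(value) if use_log_transform else float(value)
                total[g] += x
                total_sq[g] += x * x

    n = len(rows)
    means = [s / n for s in total]
    # Population variance E[x^2] - E[x]^2, clipped at zero against rounding error.
    stds = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(total_sq, means)]
    return means, stds


def normalize_row(row, means, stds, use_log_transform=True, eps=1e-8):
    """Apply the optional log1p, then z-score each gene with the given statistics."""
    out = []
    for value, m, s in zip(row, means, stds):
        x = math.log1p(value) if use_log_transform else float(value)
        out.append((x - m) / (s + eps))
    return out


if __name__ == "__main__":
    counts = [[0, 1], [2, 3], [4, 5]]
    means, stds = batched_gene_stats(counts, batch_size=2)
    print(normalize_row(counts[0], means, stds))
```

The resulting statistics do not depend on the batch size; in this sketch batch_size only limits how many rows are processed at once.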

Return type:

TrainingDataBundle

Returns:

TrainingDataBundle containing all data components needed for training.

Raises:

FileNotFoundError – If data_path doesn’t exist.
ValueError – If dev_ratio is not between 0.0 and 1.0.
KeyError – If required metadata columns are missing from AnnData.obs.
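A short usage sketch follows, calling the function with every documented default written out and reading back fields listed under TrainingDataBundle. The wrapper name build_training_bundle is hypothetical, and the input file is assumed to be an .h5ad file that carries whatever metadata columns the library requires in AnnData.obs.

```python
from pathlib import Path


def build_training_bundle(data_path, save_dir="results/temp"):
    """Hypothetical wrapper around prepare_data_for_training with explicit defaults."""
    # Imported inside the function so this sketch can be loaded without scxpand installed.
    from scxpand.data_util.prepare_data_for_train import prepare_data_for_training

    bundle = prepare_data_for_training(
        data_path=Path(data_path),
        aux_categorical_types=(),
        use_log_transform=True,
        use_zscore_norm=True,
        save_dir=Path(save_dir),
        dev_ratio=0.2,
        rand_seed=42,
        resume=False,
        batch_size=500000,
    )

    # Fields documented on TrainingDataBundle.
    print("train rows:", len(bundle.row_inds_train))
    print("dev rows:", len(bundle.row_inds_dev))
    print("artifacts saved under:", bundle.save_path)
    return bundle
```

Per the parameter description, passing resume=True on a later run loads the existing data format and splits from save_dir, so pointing both runs at the same save_dir matters.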
