scxpand.data_util.prepare_data_for_train
Functions
| prepare_data_for_training | Prepare single-cell RNA sequencing data for model training. |
Classes
| TrainingDataBundle | Container for all data components needed for model training. |
- class scxpand.data_util.prepare_data_for_train.TrainingDataBundle(adata, row_inds_train, row_inds_dev, data_format, save_path)
Container for all data components needed for model training.
- Variables:
adata – Preprocessed AnnData object ready for training.
row_inds_train – Sorted array of row indices for the training set.
row_inds_dev – Sorted array of row indices for the validation set.
data_format – DataFormat object containing preprocessing parameters.
save_path – Path to the directory where training artifacts are saved.
- __init__(adata, row_inds_train, row_inds_dev, data_format, save_path)
- data_format: DataFormat
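As a rough illustration of how the bundle's fields fit together, the sketch below uses a hypothetical stand-in class (`BundleSketch`, not part of scxpand) with the same five field names, and plain Python lists in place of the AnnData object and the index arrays. The `check_split` helper is also hypothetical. It assumes the training and validation indices are disjoint, which the documentation above does not state explicitly.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Sequence


@dataclass
class BundleSketch:
    """Hypothetical stand-in mirroring the documented TrainingDataBundle fields."""

    adata: Any                     # stands in for the preprocessed AnnData object
    row_inds_train: Sequence[int]  # sorted row indices of the training set
    row_inds_dev: Sequence[int]    # sorted row indices of the validation set
    data_format: Any               # stands in for the DataFormat object
    save_path: Path                # directory holding training artifacts


def check_split(bundle: BundleSketch, n_rows: int) -> None:
    """Sanity-check the split indices (disjointness is an assumption)."""
    train = list(bundle.row_inds_train)
    dev = list(bundle.row_inds_dev)
    if train != sorted(train) or dev != sorted(dev):
        raise ValueError("row indices are expected to be sorted")
    if set(train) & set(dev):
        raise ValueError("training and validation indices overlap")
    if any(i < 0 or i >= n_rows for i in train + dev):
        raise ValueError("row index out of range")


# Toy data: five "cells", four for training and one for validation.
rows = [f"cell_{i}" for i in range(5)]
bundle = BundleSketch(
    adata=rows,
    row_inds_train=[0, 1, 3, 4],
    row_inds_dev=[2],
    data_format=None,
    save_path=Path("results/temp"),
)
check_split(bundle, n_rows=len(rows))

# Select the rows of each split by index.
train_rows = [bundle.adata[i] for i in bundle.row_inds_train]
dev_rows = [bundle.adata[i] for i in bundle.row_inds_dev]
```

With a real bundle, the same index arrays would presumably be used to select rows of `adata` for each split.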
- scxpand.data_util.prepare_data_for_train.prepare_data_for_training(data_path, aux_categorical_types=(), use_log_transform=True, use_zscore_norm=True, save_dir='results/temp', dev_ratio=0.2, rand_seed=42, resume=False, batch_size=500000)
Prepare single-cell RNA sequencing data for model training.
- Parameters:
data_path (str | Path) – Path to the input AnnData file (.h5ad format).
aux_categorical_types (tuple[str, ...], default: ()) – Tuple of categorical feature names to include.
use_log_transform (bool, default: True) – Whether to apply log1p transformation.
use_zscore_norm (bool, default: True) – Whether to apply z-score normalization per gene.
save_dir (str | Path, default: 'results/temp') – Directory to save data format files and split information.
dev_ratio (float, default: 0.2) – Proportion of data to use for validation (0.0 to 1.0).
rand_seed (int, default: 42) – Random seed for reproducible data splitting.
resume (bool, default: False) – If True, load existing data format and splits from save_dir.
batch_size (int, default: 500000) – Batch size for computing gene statistics.
- Return type:
TrainingDataBundle
- Returns:
TrainingDataBundle containing all data components needed for training.
- Raises:
FileNotFoundError – If data_path doesn’t exist.
ValueError – If dev_ratio is not between 0.0 and 1.0.
KeyError – If required metadata columns are missing from AnnData.obs.
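A minimal usage sketch follows, assuming scxpand is installed and that the signature documented above is accurate. The wrapper name `load_training_bundle` is hypothetical, as is any input file name passed to it. The wrapper checks the input path before importing scxpand, so a missing file fails early. The library itself is documented to raise FileNotFoundError, ValueError, and KeyError in the cases listed above.

```python
from pathlib import Path


def load_training_bundle(data_path, save_dir="results/temp", dev_ratio=0.2):
    """Hypothetical wrapper around the documented prepare_data_for_training call."""
    # Fail early on a missing input file, before importing scxpand.
    if not Path(data_path).exists():
        raise FileNotFoundError(f"input file not found: {data_path}")

    # Lazy import: scxpand is only needed once the path check has passed.
    from scxpand.data_util.prepare_data_for_train import prepare_data_for_training

    # Keyword names and defaults are taken from the signature documented above.
    return prepare_data_for_training(
        data_path=data_path,
        aux_categorical_types=(),
        use_log_transform=True,
        use_zscore_norm=True,
        save_dir=save_dir,
        dev_ratio=dev_ratio,
        rand_seed=42,
        resume=False,
        batch_size=500000,
    )
```

The returned bundle exposes the documented fields, for example `bundle.row_inds_train`, `bundle.row_inds_dev`, and `bundle.save_path`. Setting `resume=True` on a later call should, per the parameter description, reuse the data format and splits already stored in `save_dir`.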