scxpand.data_util.dataset
Functions
- apply_post_normalization_augmentations: Apply post-normalization augmentations to input tensor.
- apply_pre_normalization_augmentations: Apply pre-normalization augmentations to input tensor.
- cells_collate_fn: Collate function to efficiently create batches from the dataset using the new transformation system.
- compute_categorical_targets_from_batch_obs: Compute categorical targets directly from observation data using vectorized operations.
- compute_soft_labels: Compute soft labels for the training data.
- encode_categorical_features_batch: Encode categorical features into one-hot vectors.
- encode_categorical_value: Encode a single categorical value to an index in the mapping.
- get_dataloader_kwargs: Get common DataLoader keyword arguments.
Classes
- CellsDataset: PyTorch Dataset for single-cell expression data with preprocessing pipeline.
- class scxpand.data_util.dataset.CellsDataset(data_format, row_inds=None, dataset_params=None, is_train=True, data_path=None, include_row_normalized_gene_counts=False, adata=None)
- __init__(data_format, row_inds=None, dataset_params=None, is_train=True, data_path=None, include_row_normalized_gene_counts=False, adata=None)
PyTorch Dataset for single-cell expression data with preprocessing pipeline.
Provides efficient batch loading with on-the-fly preprocessing including normalization, log transformation, and z-score standardization. Supports both file-based and in-memory data access.
- Parameters:
data_format (DataFormat) – DataFormat object containing preprocessing parameters.
row_inds (ndarray | None, default: None) – Cell indices to include. If None, includes all cells.
dataset_params (DataAugmentParams | None, default: None) – Data augmentation parameters. Only used during training.
is_train (bool, default: True) – Whether this is training data (enables augmentation).
data_path (str | Path | None, default: None) – Path to H5AD file. Required unless adata is provided.
include_row_normalized_gene_counts (bool, default: False) – Include row-normalized gene counts in batches (useful for autoencoder training).
adata (AnnData | None, default: None) – In-memory AnnData object. Alternative to data_path.
- open_adata(indices)
Context manager to yield (AnnData object, indices) for batch access.
Uses the utility function for multiprocessing-safe file opening.
- transform_batch_data(X_raw, in_place=True)
Transform raw batch data according to data format requirements.
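The preprocessing steps named in the class description (normalization, log transformation, z-score standardization) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the scxpand implementation: the function name sketch_transform_batch, the target_sum and eps parameters, and the precomputed gene_means and gene_stds are hypothetical and chosen only for the example.

```python
import numpy as np


def sketch_transform_batch(X_raw, gene_means, gene_stds, target_sum=1e4, eps=1e-8):
    """Illustrative only: row-normalize, log-transform, then z-score a batch."""
    X = np.asarray(X_raw, dtype=np.float64)
    # Row normalization: scale every cell (row) to the same total count.
    row_sums = X.sum(axis=1, keepdims=True)
    X = X / np.maximum(row_sums, eps) * target_sum
    # Log transformation: log(1 + x) compresses the dynamic range.
    X = np.log1p(X)
    # Z-score standardization per gene (column) with precomputed statistics.
    return (X - gene_means) / (gene_stds + eps)


# Toy batch: 2 cells x 3 genes (values are made up).
X_raw = np.array([[1.0, 3.0, 0.0], [4.0, 2.0, 2.0]])

# Assumption: per-gene statistics are computed here from the same toy batch;
# a real pipeline would more likely take them from the training data.
X_log = np.log1p(X_raw / X_raw.sum(axis=1, keepdims=True) * 1e4)
out = sketch_transform_batch(X_raw, X_log.mean(axis=0), X_log.std(axis=0))
print(out.shape)
```

After standardization each gene column has approximately zero mean, which is what the z-score step is meant to achieve.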
- scxpand.data_util.dataset.apply_post_normalization_augmentations(X, noise_std=0.0, generator=None)
Apply post-normalization augmentations to input tensor.
These augmentations add controlled noise to normalized data.
- Parameters:
- Return type:
- Returns:
Augmented tensor with added noise
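As a rough illustration of noise augmentation on normalized data, the sketch below adds zero-mean Gaussian noise scaled by noise_std. The Gaussian distribution, the NumPy arrays (instead of tensors), the rng argument standing in for generator, and the name sketch_add_noise are all assumptions; the documentation above only states that controlled noise is added.

```python
import numpy as np


def sketch_add_noise(X, noise_std=0.0, rng=None):
    """Illustrative only: add zero-mean Gaussian noise with std noise_std."""
    if noise_std <= 0.0:
        return X  # augmentation disabled, input returned unchanged
    rng = np.random.default_rng() if rng is None else rng
    return X + rng.normal(loc=0.0, scale=noise_std, size=X.shape)


X = np.zeros((4, 5))
noisy = sketch_add_noise(X, noise_std=0.1, rng=np.random.default_rng(0))
unchanged = sketch_add_noise(X, noise_std=0.0)
```

Passing a seeded random generator makes the augmentation reproducible, which is presumably the purpose of the generator parameter.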
- scxpand.data_util.dataset.apply_pre_normalization_augmentations(X, mask_rate=0.0, generator=None)
Apply pre-normalization augmentations to input tensor.
These augmentations simulate missing data and should be applied to raw counts.
- Parameters:
- Return type:
- Returns:
Augmented tensor with masked values
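One common way to simulate missing data on raw counts is to zero out each entry independently with probability mask_rate. The sketch below shows that idea with NumPy; whether scxpand masks entries this way is an assumption, and sketch_mask_counts and its rng argument are hypothetical names.

```python
import numpy as np


def sketch_mask_counts(X, mask_rate=0.0, rng=None):
    """Illustrative only: zero each entry independently with probability mask_rate."""
    if mask_rate <= 0.0:
        return X  # augmentation disabled, input returned unchanged
    rng = np.random.default_rng() if rng is None else rng
    # keep is True where the entry survives, False where it is masked.
    keep = rng.random(X.shape) >= mask_rate
    return X * keep


counts = np.full((100, 50), 3.0)
masked = sketch_mask_counts(counts, mask_rate=0.2, rng=np.random.default_rng(0))
zero_fraction = float((masked == 0).mean())
```

With mask_rate=0.2, roughly one fifth of the entries end up as zeros and the rest keep their original count.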
- scxpand.data_util.dataset.cells_collate_fn(batch_indices, dataset)
Collate function to efficiently create batches from the dataset using the new transformation system.
- scxpand.data_util.dataset.compute_categorical_targets_from_batch_obs(dataset, batch_obs)
Compute categorical targets directly from observation data using vectorized operations.
- scxpand.data_util.dataset.compute_soft_labels(obs_df, dataset_params)
Compute soft labels for the training data.
- Parameters:
obs_df (DataFrame) – DataFrame containing observation data
dataset_params (DataAugmentParams) – Data augmentation parameters containing soft_loss_beta
prm – Param object with the model parameters
- Returns:
A NumPy array of soft labels in the range [0, 1]. A cell with y_soft > 0.5 is considered expanded.
- Return type:
y_soft
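The threshold rule stated above can be applied to a soft-label array as in this small sketch. The y_soft values are made up for illustration; how compute_soft_labels derives them from soft_loss_beta is not shown here.

```python
import numpy as np

# Made-up soft labels in [0, 1]; real values would come from compute_soft_labels.
y_soft = np.array([0.05, 0.5, 0.51, 0.97])

# A cell counts as expanded only when its soft label is strictly above 0.5.
is_expanded = y_soft > 0.5
print(is_expanded.tolist())
```

Because the comparison is strict, a soft label of exactly 0.5 is not treated as expanded.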
- scxpand.data_util.dataset.encode_categorical_features_batch(obs_df, categorical_features_types, categorical_mappings)
Encode categorical features into one-hot vectors.
- Parameters:
- Return type:
- Returns:
2D numpy array of shape (batch_size, total_categorical_vector_length)
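The output shape (batch_size, total_categorical_vector_length) follows from concatenating one one-hot block per feature. The sketch below illustrates this with plain dictionaries in place of a DataFrame. The helper sketch_one_hot_batch, the feature names, and the choice to encode unseen values as an all-zero block are assumptions, not the library's documented behavior.

```python
import numpy as np


def sketch_one_hot_batch(columns, mappings):
    """Illustrative only.

    columns:  {feature: list of values, one per cell}
    mappings: {feature: {value: index}}
    """
    batch_size = len(next(iter(columns.values())))
    blocks = []
    for feature, mapping in mappings.items():
        block = np.zeros((batch_size, len(mapping)), dtype=np.float32)
        # Look up each value; -1 marks a value missing from the mapping.
        idx = np.array([mapping.get(v, -1) for v in columns[feature]])
        known = idx >= 0
        # Set a single 1 per row for known values; unknown rows stay all zero.
        block[np.nonzero(known)[0], idx[known]] = 1.0
        blocks.append(block)
    # Concatenate the per-feature blocks along the feature axis.
    return np.concatenate(blocks, axis=1)


# Hypothetical features and mappings.
columns = {"tissue": ["blood", "tumor", "blood"], "sex": ["F", "M", "unknown"]}
mappings = {"tissue": {"blood": 0, "tumor": 1}, "sex": {"F": 0, "M": 1}}
encoded = sketch_one_hot_batch(columns, mappings)
print(encoded.shape)
```

Here the total vector length is 4 (two categories per feature), so three cells yield a 3 x 4 array.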
- scxpand.data_util.dataset.encode_categorical_value(value, mapping)
Encode a single categorical value to an index in the mapping.
- scxpand.data_util.dataset.get_dataloader_kwargs(num_workers, dataset)
Get common DataLoader keyword arguments.
- Parameters:
num_workers (int) – Number of worker processes
dataset (CellsDataset) – Dataset to create loader for
- Return type:
- Returns:
Dictionary of common DataLoader arguments
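A helper of this kind typically exists because some PyTorch DataLoader options are only valid when worker processes are used. The sketch below shows one plausible shape for such a dictionary. The function sketch_dataloader_kwargs and the particular keys it returns are assumptions; the actual dictionary returned by get_dataloader_kwargs may differ, and the dataset argument is omitted here.

```python
def sketch_dataloader_kwargs(num_workers):
    """Illustrative only: DataLoader keyword arguments that depend on num_workers."""
    use_workers = num_workers > 0
    kwargs = {
        "num_workers": num_workers,
        # DataLoader rejects persistent_workers=True when num_workers == 0.
        "persistent_workers": use_workers,
    }
    if use_workers:
        # Only set prefetch_factor when worker processes exist; recent
        # PyTorch versions raise a ValueError if it is set without workers.
        kwargs["prefetch_factor"] = 2
    return kwargs


single = sketch_dataloader_kwargs(0)
multi = sketch_dataloader_kwargs(4)
```

The resulting dictionary could then be unpacked into a DataLoader call alongside the dataset and a collate function.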