scxpand.data_util.transforms#

Core expression data transformation functions.

This module provides unified transformation functions that work with NumPy arrays, PyTorch tensors, and sparse matrices for single-cell RNA expression data preprocessing.

Functions

`apply_inverse_log_transform`(X)	Apply inverse log transform (expm1) to recover original scale data.
`apply_inverse_zscore_normalization`(X, ...[, eps])	Apply inverse z-score normalization to recover original scale data.
`apply_log_transform`(X[, in_place])	Apply log1p transformation to expression data (log(x + 1)).
`apply_row_normalization`(X[, target_sum])	Normalize each cell's total expression to target_sum (always in-place).
`apply_zscore_normalization`(X, genes_mu, ...)	Apply robust z-score normalization using precomputed gene statistics.
`extract_is_expanded`(obs)	Extract binary expansion labels from observation data.
`load_and_preprocess_data_numpy`(data_path, ...)	Load and preprocess single-cell data using the full normalization pipeline.
`preprocess_expression_data`(X, data_format[, eps])	Apply complete preprocessing pipeline to expression data.

scxpand.data_util.transforms.apply_inverse_log_transform(X)#

Apply inverse log transform (expm1) to recover original scale data.

Return type:

Tensor
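The relationship between the forward and inverse log transforms can be illustrated with plain NumPy. This is a minimal sketch of the underlying math only, not the library's implementation; the documented return type is Tensor, and `torch.expm1` plays the same role for PyTorch inputs. The matrix values are made up for illustration.

```python
import numpy as np

# Hypothetical normalized expression values for 2 cells x 3 genes.
X = np.array([[0.0, 1.0, 9.0],
              [2.5, 0.0, 100.0]])

X_log = np.log1p(X)        # forward transform: log(x + 1)
X_back = np.expm1(X_log)   # inverse transform: exp(y) - 1

# The round trip recovers the input up to floating-point error.
round_trip_ok = np.allclose(X_back, X)
```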

scxpand.data_util.transforms.apply_inverse_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10)#

Apply inverse z-score normalization to recover original scale data.

Return type:

Tensor
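As a rough illustration of what inverting a z-score involves, the sketch below assumes the forward transform has the form `(x - mu) / (sigma + eps)`. That form, and the helper names `zscore` and `inverse_zscore`, are assumptions made for this example; the library may place `eps` differently, and any outlier clipping applied in the forward direction cannot be undone.

```python
import numpy as np

def zscore(X, mu, sigma, eps=1e-10):
    # Assumed forward form: eps is added to sigma to avoid division by zero.
    return (X - mu) / (sigma + eps)

def inverse_zscore(Z, mu, sigma, eps=1e-10):
    # Algebraic inverse of the assumed forward form above.
    return Z * (sigma + eps) + mu

# Hypothetical data: the second gene is constant, so its sigma is 0.
X = np.array([[1.0, 5.0],
              [3.0, 5.0]])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 0.0])

Z = zscore(X, mu, sigma)
X_back = inverse_zscore(Z, mu, sigma)
recovered = np.allclose(X_back, X)
```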

scxpand.data_util.transforms.apply_log_transform(X, in_place=True)#

Apply log1p transformation to expression data (log(x + 1)).

Parameters:

X (ndarray | Tensor | spmatrix) – Row-normalized expression matrix [n_cells, n_genes] with non-negative values
in_place (bool (default: True)) – Whether to modify X in place

Return type:

ndarray | Tensor | spmatrix

Returns:

Log-transformed expression matrix (sparse matrices remain sparse)
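The difference between in-place and out-of-place log1p can be sketched with NumPy alone. This is an illustration of the idea rather than the library's code. It also shows why sparse inputs can stay sparse: log1p maps 0 to 0, so no new non-zero entries appear.

```python
import numpy as np

X = np.array([[0.0, 3.0],
              [1.0, 0.0]])

# Out-of-place: X is left untouched and a new array is returned.
X_copy = np.log1p(X)

# In-place: the result is written back into the same buffer.
X_inplace = X.copy()
np.log1p(X_inplace, out=X_inplace)

# Zeros map to zeros, so the sparsity pattern is unchanged.
zeros_preserved = np.array_equal(X_copy == 0, X == 0)
```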

scxpand.data_util.transforms.apply_row_normalization(X, target_sum=10000.0)#

Normalize each cell’s total expression to target_sum (always in-place).

Parameters:

X (ndarray | Tensor | spmatrix) – Expression matrix [n_cells, n_genes] - modified in place
target_sum (float (default: 10000.0)) – Target sum for each cell after normalization

Return type:

ndarray | Tensor | spmatrix

Returns:

Normalized expression matrix (same object as input, modified in-place)
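Row normalization of a dense float array can be sketched as follows. The helper name `row_normalize_` and the handling of all-zero cells (left as zeros) are assumptions for this example, not documented behavior of the library.

```python
import numpy as np

def row_normalize_(X, target_sum=1e4):
    """Scale each row of a dense float array in place so it sums to target_sum."""
    totals = X.sum(axis=1, keepdims=True)
    # Assumption: all-zero cells are left as zeros instead of dividing by zero.
    totals[totals == 0] = 1.0
    X *= target_sum / totals
    return X

# Hypothetical counts for 3 cells x 2 genes; the second cell is empty.
X = np.array([[1.0, 3.0],
              [0.0, 0.0],
              [2.0, 2.0]])
out = row_normalize_(X)
row_sums = out.sum(axis=1)
```

Because the operation is in place, `out` and `X` refer to the same array; copy the input first if the raw counts are still needed.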

scxpand.data_util.transforms.apply_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10, in_place=True, sigma_clip_factor=6.0)#

Apply robust z-score normalization using precomputed gene statistics.

Uses variance stabilization and outlier-resistant normalization following numerical computing best practices for genomics data.

Parameters:

X (ndarray | Tensor | spmatrix) – Expression matrix [n_cells, n_genes]
genes_mu (ndarray | Tensor | spmatrix) – Per-gene means [n_genes]
genes_sigma (ndarray | Tensor | spmatrix) – Per-gene standard deviations [n_genes]
eps (float (default: 1e-10)) – Small constant for numerical stability
in_place (bool (default: True)) – Whether to modify X in place (ignored for sparse matrices)
sigma_clip_factor (float (default: 6.0)) – Factor for robust outlier clipping (default DEFAULT_SIGMA_CLIP_FACTOR)

Return type:

ndarray | Tensor | spmatrix

Returns:

Z-score normalized expression matrix (always dense due to mean subtraction)
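A minimal dense sketch of z-scoring with clipping is shown below. It assumes that clipping means bounding the standardized values to plus or minus `sigma_clip_factor`, and that `eps` is added to the standard deviation; both are assumptions for illustration, and the library's robust procedure may differ. The helper name `zscore_clip` is hypothetical.

```python
import numpy as np

def zscore_clip(X, genes_mu, genes_sigma, eps=1e-10, sigma_clip_factor=6.0):
    # Standardize each gene (column) with precomputed statistics.
    Z = (X - genes_mu) / (genes_sigma + eps)
    # Assumed clipping rule: bound values to [-factor, +factor].
    return np.clip(Z, -sigma_clip_factor, sigma_clip_factor)

# Hypothetical log-transformed values for 2 cells x 2 genes.
X = np.array([[1.0, 100.0],
              [0.0, 0.0]])
genes_mu = np.array([2.0, 0.0])
genes_sigma = np.array([1.0, 1.0])

Z = zscore_clip(X, genes_mu, genes_sigma)
```

Note that the zero entry in the first gene becomes -2 after mean subtraction, which is why the result is dense even when the input is sparse.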

scxpand.data_util.transforms.extract_is_expanded(obs)#

Extract binary expansion labels from observation data.

Converts expansion status to binary labels (1 for expanded, 0 for not expanded). Looks for an ‘expansion’ column containing ‘expanded’ values.

Parameters:

obs (DataFrame | Series | dict[str, Series]) – Observation data containing expansion information. Can be a DataFrame with an ‘expansion’ column, a Series of expansion values, or a dict with an ‘expansion’ key.

Return type:

ndarray[int]

Returns:

Binary array where 1 indicates expanded cells and 0 indicates non-expanded cells.

Raises:

KeyError – If the ‘expansion’ column/key is not found in the data.

Example

>>> labels = extract_is_expanded(adata.obs)
>>> print(f"Found {labels.sum()} expanded cells out of {len(labels)}")
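The labeling rule described above can be approximated in a few lines. This is a sketch under assumptions: the helper name `is_expanded_sketch` and the value 'not_expanded' are hypothetical, and only mapping-style inputs (a dict or DataFrame with an 'expansion' key) are handled, not a bare Series.

```python
import numpy as np

def is_expanded_sketch(obs):
    # Indexing raises KeyError if 'expansion' is missing, matching the docs.
    values = np.asarray(obs["expansion"])
    # 1 where the value equals 'expanded', 0 otherwise.
    return (values == "expanded").astype(int)

# Hypothetical observation data for three cells.
obs = {"expansion": ["expanded", "not_expanded", "expanded"]}
labels = is_expanded_sketch(obs)

# A missing key surfaces as KeyError.
try:
    is_expanded_sketch({"other": []})
    raised = False
except KeyError:
    raised = True
```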

scxpand.data_util.transforms.load_and_preprocess_data_numpy(data_path, data_format, row_indices=None, gene_subset=None)#

Load and preprocess single-cell data using the full normalization pipeline.

Efficiently loads data from disk and applies the complete preprocessing pipeline: row normalization → log transform → z-score normalization. This function is optimized for memory efficiency with large datasets.

Parameters:

data_path (str | Path) – Path to H5AD file containing single-cell expression data.
data_format (DataFormat) – DataFormat object with preprocessing parameters and gene statistics.
row_indices (ndarray | None (default: None)) – Specific cell indices to load. If None, loads all cells.
gene_subset (list[str] | list[int] | ndarray | None (default: None)) – Specific genes to subset after preprocessing. Can be gene names or, per the accepted types, integer indices.

Return type:

ndarray

Returns:

Preprocessed expression matrix [n_cells, n_genes] as dense numpy array. Gene order matches data_format.gene_names (or gene_subset if provided).

Example

>>> data_format = load_data_format("results/data_format.json")
>>> X = load_and_preprocess_data_numpy("data.h5ad", data_format)
>>> print(f"Loaded {X.shape[0]} cells, {X.shape[1]} genes")

scxpand.data_util.transforms.preprocess_expression_data(X, data_format, eps=1e-10)#

Apply complete preprocessing pipeline to expression data.

Pipeline: Row normalization → Log transform (optional) → Z-score normalization (optional)

Returns the same type as input when possible (tensor in, tensor out).

Parameters:

X (ndarray | Tensor | spmatrix) – Raw expression matrix [n_cells, n_genes] with non-negative values
data_format (DataFormat) – DataFormat containing preprocessing parameters
eps (float (default: 1e-10)) – Small constant for numerical stability in z-score normalization

Return type:

ndarray | Tensor

Returns:

Fully preprocessed expression matrix. Returns the same type as the input when possible:

torch.Tensor input → torch.Tensor output
numpy/sparse input → numpy.ndarray output
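The three-stage pipeline can be sketched end to end for dense NumPy input. Everything here is illustrative: the function `preprocess_sketch`, its `use_log` and `use_zscore` flags, and the clipping rule are assumptions standing in for whatever the DataFormat object actually configures.

```python
import numpy as np

def preprocess_sketch(X, genes_mu, genes_sigma, target_sum=1e4,
                      use_log=True, use_zscore=True, eps=1e-10, clip=6.0):
    # Work on a float copy so the caller's raw counts are preserved.
    X = np.asarray(X, dtype=np.float64).copy()

    # Step 1: row normalization (all-zero cells are left as zeros).
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    X *= target_sum / totals

    # Step 2 (optional): log1p transform.
    if use_log:
        np.log1p(X, out=X)

    # Step 3 (optional): z-score with assumed symmetric clipping.
    if use_zscore:
        X = np.clip((X - genes_mu) / (genes_sigma + eps), -clip, clip)
    return X

# Hypothetical raw counts and gene statistics.
raw = np.array([[1, 1],
                [0, 4]])
mu = np.zeros(2)
sigma = np.ones(2)

out_log = preprocess_sketch(raw, mu, sigma, use_zscore=False)
out_full = preprocess_sketch(raw, mu, sigma)
```

The order matters: the gene statistics used for z-scoring are meaningful only if they were computed on data that went through the same row normalization and log transform.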
