scxpand.data_util.transforms

Core expression data transformation functions.

This module provides unified transformation functions that work with NumPy arrays, PyTorch tensors, and sparse matrices for single-cell RNA expression data preprocessing.

Functions

apply_inverse_log_transform(X)

Apply inverse log transform (expm1) to recover original scale data.

apply_inverse_zscore_normalization(X, ...[, eps])

Apply inverse z-score normalization to recover original scale data.

apply_log_transform(X[, in_place])

Apply log1p transformation to expression data (log(x + 1)).

apply_row_normalization(X[, target_sum])

Normalize each cell's total expression to target_sum (always in-place).

apply_zscore_normalization(X, genes_mu, ...)

Apply robust z-score normalization using precomputed gene statistics.

extract_is_expanded(obs)

Extract binary expansion labels from observation data.

load_and_preprocess_data_numpy(data_path, ...)

Load and preprocess single-cell data using the full normalization pipeline.

preprocess_expression_data(X, data_format[, eps])

Apply complete preprocessing pipeline to expression data.

scxpand.data_util.transforms.apply_inverse_log_transform(X)

Apply inverse log transform (expm1) to recover original scale data.

Return type:

Tensor
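
The relationship between the forward and inverse log transforms can be illustrated with a minimal NumPy sketch. It does not call scxpand itself: the documented function returns a Tensor, and plain `np.log1p` / `np.expm1` are used here only as stand-ins (`torch.log1p` / `torch.expm1` are the PyTorch analogues). The matrix values are made up.

```python
import numpy as np

# Small non-negative matrix standing in for row-normalized expression values.
X = np.array([[0.0, 1.0, 9.0],
              [2.5, 0.0, 100.0]])

X_log = np.log1p(X)        # forward transform: log(x + 1)
X_back = np.expm1(X_log)   # inverse transform: exp(y) - 1

# The round trip recovers the original values up to floating-point error.
print(np.allclose(X_back, X))
```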

scxpand.data_util.transforms.apply_inverse_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10)

Apply inverse z-score normalization to recover original scale data.

Return type:

Tensor
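
As a rough illustration of what inverting a z-score involves, the sketch below assumes the forward transform is `z = (x - mu) / (sigma + eps)`. That formula, and the helper name `inverse_zscore_sketch`, are assumptions made for this example; the actual scxpand implementation may differ, and any values clipped during forward normalization could not be recovered this way.

```python
import numpy as np

def inverse_zscore_sketch(Z, genes_mu, genes_sigma, eps=1e-10):
    """Hypothetical stand-in: undo z = (x - mu) / (sigma + eps) per gene."""
    return Z * (genes_sigma + eps) + genes_mu

# Made-up per-gene statistics and data.
genes_mu = np.array([1.0, 4.0])
genes_sigma = np.array([0.5, 2.0])
X = np.array([[1.5, 2.0],
              [0.0, 8.0]])

Z = (X - genes_mu) / (genes_sigma + 1e-10)            # assumed forward transform
X_rec = inverse_zscore_sketch(Z, genes_mu, genes_sigma)
print(np.allclose(X_rec, X))
```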

scxpand.data_util.transforms.apply_log_transform(X, in_place=True)

Apply log1p transformation to expression data (log(x + 1)).

Parameters:
  • X (ndarray | Tensor | spmatrix) – Row-normalized expression matrix [n_cells, n_genes] with non-negative values

  • in_place (bool (default: True)) – Whether to modify X in place

Return type:

ndarray | Tensor | spmatrix

Returns:

Log-transformed expression matrix (sparse matrices remain sparse)
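
The effect of an `in_place` switch can be sketched for the dense NumPy case as follows. The helper `log_transform_sketch` is hypothetical and only mimics the documented behavior; it does not handle tensors or sparse matrices, and it assumes a floating-point input array.

```python
import numpy as np

def log_transform_sketch(X, in_place=True):
    """Hypothetical dense-only stand-in for a log1p transform with an in_place flag."""
    # With out=X the result overwrites X; with out=None a new array is allocated.
    return np.log1p(X, out=X if in_place else None)

X = np.array([[0.0, 1.0],
              [3.0, 7.0]])

X_copy = log_transform_sketch(X, in_place=False)  # X is left untouched
X_same = log_transform_sketch(X, in_place=True)   # X is overwritten and returned
print(X_same is X)
```

Passing `in_place=False` costs one extra allocation of the matrix, which may matter for large datasets.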

scxpand.data_util.transforms.apply_row_normalization(X, target_sum=10000.0)

Normalize each cell’s total expression to target_sum (always in-place).

Parameters:
  • X (ndarray | Tensor | spmatrix) – Expression matrix [n_cells, n_genes] - modified in place

  • target_sum (float (default: 10000.0)) – Target sum for each cell after normalization

Return type:

ndarray | Tensor | spmatrix

Returns:

Normalized expression matrix (same object as input, modified in-place)
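
Row normalization to a fixed total can be sketched for a dense float array as below. The helper name `row_normalize_sketch` and the handling of all-zero cells (left as zeros) are assumptions for illustration, not the library's confirmed behavior.

```python
import numpy as np

def row_normalize_sketch(X, target_sum=10000.0):
    """Hypothetical dense-only stand-in: scale each row to sum to target_sum, in place."""
    row_sums = X.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts get a scale of 0 and stay all-zero.
    scale = np.divide(target_sum, row_sums,
                      out=np.zeros_like(row_sums), where=row_sums > 0)
    X *= scale  # modifies the caller's array
    return X

X = np.array([[1.0, 3.0],
              [0.0, 0.0],
              [2.0, 2.0]])
out = row_normalize_sketch(X)
print(out is X, X.sum(axis=1))
```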

scxpand.data_util.transforms.apply_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10, in_place=True, sigma_clip_factor=6.0)

Apply robust z-score normalization using precomputed gene statistics.

Uses variance stabilization and outlier-resistant normalization following numerical computing best practices for genomics data.

Parameters:
  • X (ndarray | Tensor | spmatrix) – Expression matrix [n_cells, n_genes]

  • genes_mu (ndarray | Tensor | spmatrix) – Per-gene means [n_genes]

  • genes_sigma (ndarray | Tensor | spmatrix) – Per-gene standard deviations [n_genes]

  • eps (float (default: 1e-10)) – Small constant for numerical stability

  • in_place (bool (default: True)) – Whether to modify X in place (ignored for sparse matrices)

  • sigma_clip_factor (float (default: 6.0)) – Factor for robust outlier clipping (defaults to DEFAULT_SIGMA_CLIP_FACTOR)

Return type:

ndarray | Tensor | spmatrix

Returns:

Z-score normalized expression matrix (always dense due to mean subtraction)
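
One plausible reading of the parameters above is sketched here for dense NumPy input. It assumes the z-score is `(x - mu) / (sigma + eps)` and that `sigma_clip_factor` bounds the result to the range ±`sigma_clip_factor`; both points are assumptions, since the reference does not spell out the formula, and `zscore_sketch` is a hypothetical name.

```python
import numpy as np

def zscore_sketch(X, genes_mu, genes_sigma, eps=1e-10, sigma_clip_factor=6.0):
    """Hypothetical stand-in: per-gene z-score followed by symmetric clipping."""
    Z = (X - genes_mu) / (genes_sigma + eps)   # broadcasts over cells
    # Assumption: outliers are clipped to +/- sigma_clip_factor.
    np.clip(Z, -sigma_clip_factor, sigma_clip_factor, out=Z)
    return Z

# Made-up per-gene statistics and data; the last entry is an extreme outlier.
genes_mu = np.array([1.0, 2.0])
genes_sigma = np.array([0.5, 0.1])
X = np.array([[1.5, 2.0],
              [1.0, 10.0]])

Z = zscore_sketch(X, genes_mu, genes_sigma)
print(Z)
```

The outlier in the second gene would have a raw z-score of about 80 and is capped at 6.0 under this assumption.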

scxpand.data_util.transforms.extract_is_expanded(obs)

Extract binary expansion labels from observation data.

Converts expansion status to binary labels (1 for expanded, 0 for not expanded). Looks for ‘expansion’ column containing ‘expanded’ values.

Parameters:

obs (DataFrame | Series | dict[str, Series]) – Observation data containing expansion information. Can be DataFrame with ‘expansion’ column, Series of expansion values, or dict with ‘expansion’ key.

Return type:

ndarray[int]

Returns:

Binary array where 1 indicates expanded cells, 0 indicates non-expanded.

Raises:

KeyError – If ‘expansion’ column/key is not found in the data.

Example

>>> labels = extract_is_expanded(adata.obs)
>>> print(f"Found {labels.sum()} expanded cells out of {len(labels)}")
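
The label conversion described above can be approximated without scxpand as follows. This is only a sketch: `extract_is_expanded_sketch` is a hypothetical helper that handles mapping-like input (a dict or DataFrame with an 'expansion' key) but not a bare Series, and the 'not_expanded' string is made-up example data.

```python
import numpy as np

def extract_is_expanded_sketch(obs):
    """Hypothetical stand-in: 1 where the 'expansion' value equals 'expanded', else 0."""
    values = np.asarray(obs["expansion"])  # raises KeyError if the key is missing
    return (values == "expanded").astype(int)

obs = {"expansion": np.array(["expanded", "not_expanded", "expanded"])}
labels = extract_is_expanded_sketch(obs)
print(labels)  # [1 0 1]
```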

scxpand.data_util.transforms.load_and_preprocess_data_numpy(data_path, data_format, row_indices=None, gene_subset=None)

Load and preprocess single-cell data using the full normalization pipeline.

Efficiently loads data from disk and applies the complete preprocessing pipeline: row normalization → log transform → z-score normalization. This function is optimized for memory efficiency with large datasets.

Parameters:
  • data_path (str | Path) – Path to H5AD file containing single-cell expression data.

  • data_format (DataFormat) – DataFormat object with preprocessing parameters and gene statistics.

  • row_indices (ndarray | None (default: None)) – Specific cell indices to load. If None, loads all cells.

  • gene_subset (list[str] | list[int] | ndarray | None (default: None)) – Specific genes to subset after preprocessing. Can be gene names (list[str]); the type signature also accepts integer lists and arrays.

Return type:

ndarray

Returns:

Preprocessed expression matrix [n_cells, n_genes] as dense numpy array. Gene order matches data_format.gene_names (or gene_subset if provided).

Example

>>> data_format = load_data_format("results/data_format.json")
>>> X = load_and_preprocess_data_numpy("data.h5ad", data_format)
>>> print(f"Loaded {X.shape[0]} cells, {X.shape[1]} genes")
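
Subsetting genes by name after preprocessing, with the output columns following the order of `gene_subset`, might look like the sketch below. The gene names and matrix are made up, and the name-to-column lookup is an illustrative assumption rather than the library's actual mechanism.

```python
import numpy as np

# Made-up stand-in for data_format.gene_names and a preprocessed matrix.
gene_names = ["CD3E", "CD8A", "GZMB", "MKI67"]
X = np.arange(8, dtype=float).reshape(2, 4)

gene_subset = ["GZMB", "CD3E"]

# Map each requested gene name to its column, keeping the requested order.
name_to_col = {name: i for i, name in enumerate(gene_names)}
cols = [name_to_col[g] for g in gene_subset]
X_sub = X[:, cols]
print(X_sub.shape)  # (2, 2)
```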

scxpand.data_util.transforms.preprocess_expression_data(X, data_format, eps=1e-10)

Apply complete preprocessing pipeline to expression data.

Pipeline: Row normalization → Log transform (optional) → Z-score normalization (optional)

Returns the same type as input when possible (tensor in, tensor out).

Parameters:
  • X (ndarray | Tensor | spmatrix) – Raw expression matrix [n_cells, n_genes] with non-negative values

  • data_format (DataFormat) – DataFormat containing preprocessing parameters

  • eps (float (default: 1e-10)) – Small constant for numerical stability in z-score normalization

Return type:

ndarray | Tensor

Returns:

Fully preprocessed expression matrix. Returns the same type as input when possible:

  • torch.Tensor input → torch.Tensor output

  • numpy/sparse input → numpy.ndarray output
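
The order of the pipeline stages can be sketched end to end for dense NumPy input. Everything here is illustrative: `preprocess_sketch` and its keyword arguments (`use_log_transform`, `use_zscore_norm`, `target_sum`) are hypothetical names standing in for whatever the DataFormat object actually stores, and the z-score formula is an assumption.

```python
import numpy as np

def preprocess_sketch(X, genes_mu, genes_sigma, target_sum=10000.0,
                      use_log_transform=True, use_zscore_norm=True, eps=1e-10):
    """Hypothetical stand-in: row normalization, optional log1p, optional z-score."""
    X = np.asarray(X, dtype=np.float64).copy()   # work on a copy of the raw counts
    row_sums = X.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                # assumption: avoid dividing by zero
    X = X / row_sums * target_sum
    if use_log_transform:
        X = np.log1p(X)
    if use_zscore_norm:
        X = (X - genes_mu) / (genes_sigma + eps)
    return X

# Made-up raw counts and trivial gene statistics (mean 0, std 1).
counts = np.array([[1, 1],
                   [3, 1]])
out = preprocess_sketch(counts, genes_mu=np.zeros(2), genes_sigma=np.ones(2),
                        target_sum=10.0)
print(out.shape)  # (2, 2)
```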