scxpand.data_util.transforms
Core expression data transformation functions.
This module provides unified transformation functions that work with NumPy arrays, PyTorch tensors, and sparse matrices for single-cell RNA expression data preprocessing.
Functions
apply_inverse_log_transform | Apply inverse log transform (expm1) to recover original scale data.
apply_inverse_zscore_normalization | Apply inverse z-score normalization to recover original scale data.
apply_log_transform | Apply log1p transformation to expression data (log(x + 1)).
apply_row_normalization | Normalize each cell's total expression to target_sum (always in-place).
apply_zscore_normalization | Apply robust z-score normalization using precomputed gene statistics.
extract_is_expanded | Extract binary expansion labels from observation data.
load_and_preprocess_data_numpy | Load and preprocess single-cell data using the full normalization pipeline.
preprocess_expression_data | Apply complete preprocessing pipeline to expression data.
- scxpand.data_util.transforms.apply_inverse_log_transform(X)
Apply inverse log transform (expm1) to recover original scale data.
- Return type:
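As a rough illustration of what "inverse log transform" means here, the sketch below uses plain NumPy rather than the scxpand function itself. It assumes the forward step is `log1p`, as the summary states, and checks that `expm1` undoes it. The toy matrix is made up for the example.

```python
import numpy as np

# Toy counts: 2 cells x 3 genes (values are illustrative only).
counts = np.array([[0.0, 1.0, 9.0],
                   [3.0, 0.0, 99.0]])

logged = np.log1p(counts)      # forward: log(x + 1)
recovered = np.expm1(logged)   # inverse: exp(y) - 1

# The round trip recovers the original values up to floating-point error.
round_trip_ok = bool(np.allclose(recovered, counts))
```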
- scxpand.data_util.transforms.apply_inverse_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10)
Apply inverse z-score normalization to recover original scale data.
- Return type:
- scxpand.data_util.transforms.apply_log_transform(X, in_place=True)
Apply log1p transformation to expression data (log(x + 1)).
- Parameters:
- Return type:
- Returns:
Log-transformed expression matrix (sparse matrices remain sparse)
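One reason a log1p result can stay sparse is that log(0 + 1) = 0, so zero entries remain zero. The sketch below shows this with a dense NumPy array and an in-place write via `out=`. It is not the library's implementation, only a demonstration of the property.

```python
import numpy as np

X = np.array([[0.0, 4.0, 0.0],
              [1.0, 0.0, 7.0]])
zeros_before = (X == 0)

# In-place variant: write the result back into the same buffer.
np.log1p(X, out=X)

# Every entry that was zero is still zero, so the sparsity pattern is unchanged.
zeros_preserved = bool(np.array_equal(zeros_before, X == 0))
```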
- scxpand.data_util.transforms.apply_row_normalization(X, target_sum=10000.0)
Normalize each cell’s total expression to target_sum (always in-place).
- Parameters:
- Return type:
- Returns:
Normalized expression matrix (same object as input, modified in-place)
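A minimal NumPy sketch of per-cell (row) normalization follows. The helper name `row_normalize_sketch` is hypothetical, and the handling of cells with zero total counts is an assumption, since this page does not describe that case.

```python
import numpy as np

def row_normalize_sketch(X, target_sum=1e4):
    """Scale each row of a float array in place so that it sums to target_sum."""
    totals = X.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts are left as all zeros.
    totals[totals == 0] = 1.0
    X *= target_sum / totals
    return X

X = np.array([[1.0, 3.0],
              [0.0, 0.0],
              [2.0, 2.0]])
out = row_normalize_sketch(X)  # `out` is the same object as `X`
```

Because the input array itself is modified, callers who still need the raw counts would have to copy them before normalizing.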
- scxpand.data_util.transforms.apply_zscore_normalization(X, genes_mu, genes_sigma, eps=1e-10, in_place=True, sigma_clip_factor=6.0)
Apply robust z-score normalization using precomputed gene statistics.
Uses variance stabilization and outlier-resistant normalization following numerical computing best practices for genomics data.
- Parameters:
X (ndarray | Tensor | spmatrix) – Expression matrix [n_cells, n_genes]
genes_mu (ndarray | Tensor | spmatrix) – Per-gene means [n_genes]
genes_sigma (ndarray | Tensor | spmatrix) – Per-gene standard deviations [n_genes]
eps (float, default: 1e-10) – Small constant for numerical stability
in_place (bool, default: True) – Whether to modify X in place (ignored for sparse matrices)
sigma_clip_factor (float, default: 6.0) – Factor for robust outlier clipping (default DEFAULT_SIGMA_CLIP_FACTOR)
- Return type:
- Returns:
Z-score normalized expression matrix (always dense due to mean subtraction)
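The exact formula used by the library is not given on this page. The sketch below assumes the common form (X - mu) / (sigma + eps), followed by clipping to ±sigma_clip_factor, and an inverse of Z * (sigma + eps) + mu. Both helper names are hypothetical. It also shows why the output is dense: zero entries become -mu / sigma after mean subtraction.

```python
import numpy as np

def zscore_sketch(X, mu, sigma, eps=1e-10, clip=6.0):
    # Assumption: standardize per gene, then clip to +/- clip.
    Z = (X - mu) / (sigma + eps)
    return np.clip(Z, -clip, clip)

def inverse_zscore_sketch(Z, mu, sigma, eps=1e-10):
    # Assumption: the inverse simply undoes the scaling and the shift.
    return Z * (sigma + eps) + mu

X = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [2.0, 100.0]])   # the 100.0 is a deliberate outlier
mu = np.array([2.0, 10.0])
sigma = np.array([1.0, 5.0])

Z = zscore_sketch(X, mu, sigma)
X_back = inverse_zscore_sketch(Z, mu, sigma)
```

Under these assumptions the round trip is exact only for values that were not clipped. The outlier is mapped to 6.0 and comes back as roughly 40 rather than 100, so the inverse should be read as approximate wherever clipping was active.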
- scxpand.data_util.transforms.extract_is_expanded(obs)
Extract binary expansion labels from observation data.
Converts expansion status to binary labels (1 for expanded, 0 for not expanded). Looks for an ‘expansion’ column (or key) containing ‘expanded’ values.
- Parameters:
obs (DataFrame | Series | dict[str, Series]) – Observation data containing expansion information. Can be a DataFrame with an ‘expansion’ column, a Series of expansion values, or a dict with an ‘expansion’ key.
- Return type:
- Returns:
Binary array where 1 indicates expanded cells, 0 indicates non-expanded.
- Raises:
KeyError – If ‘expansion’ column/key is not found in the data.
Example
>>> labels = extract_is_expanded(adata.obs)
>>> print(f"Found {labels.sum()} expanded cells out of {len(labels)}")
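To make the label mapping concrete without loading an AnnData object, here is a NumPy-only sketch of the dict case. `is_expanded_sketch` is a hypothetical stand-in, the string "non-expanded" is an invented example value, and exact string equality with "expanded" is an assumption about how the real function matches values.

```python
import numpy as np

def is_expanded_sketch(obs):
    """Map an 'expansion' field to 0/1 labels (hypothetical stand-in)."""
    if "expansion" not in obs:
        raise KeyError("'expansion' not found in observation data")
    values = np.asarray(obs["expansion"], dtype=str)
    # Assumption: a cell counts as expanded only on an exact string match.
    return (values == "expanded").astype(int)

obs = {"expansion": ["expanded", "non-expanded", "expanded"]}
labels = is_expanded_sketch(obs)

# A mapping without the key triggers the documented KeyError.
try:
    is_expanded_sketch({"other": []})
    raised = False
except KeyError:
    raised = True
```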
- scxpand.data_util.transforms.load_and_preprocess_data_numpy(data_path, data_format, row_indices=None, gene_subset=None)
Load and preprocess single-cell data using the full normalization pipeline.
Efficiently loads data from disk and applies the complete preprocessing pipeline: row normalization → log transform → z-score normalization. This function is optimized for memory efficiency with large datasets.
- Parameters:
data_path (str | Path) – Path to H5AD file containing single-cell expression data.
data_format (DataFormat) – DataFormat object with preprocessing parameters and gene statistics.
row_indices (ndarray | None, default: None) – Specific cell indices to load. If None, loads all cells.
gene_subset (list[str] | list[int] | ndarray | None, default: None) – Specific genes to subset after preprocessing. Can be gene names.
- Return type:
- Returns:
Preprocessed expression matrix [n_cells, n_genes] as dense numpy array. Gene order matches data_format.gene_names (or gene_subset if provided).
Example
>>> data_format = load_data_format("results/data_format.json")
>>> X = load_and_preprocess_data_numpy("data.h5ad", data_format)
>>> print(f"Loaded {X.shape[0]} cells, {X.shape[1]} genes")
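The parameter description says gene_subset is applied after preprocessing. The toy NumPy example below suggests why that order matters for row normalization: totals computed over all genes differ from totals computed over the subset only. The `normalize_rows` helper and the index list `keep` are hypothetical.

```python
import numpy as np

X = np.array([[2.0, 2.0, 6.0],
              [1.0, 1.0, 0.0]])
keep = [0, 1]  # hypothetical gene subset, given as column indices

def normalize_rows(A, target_sum=1e4):
    # Returns a new array; assumes no row sums to zero.
    return A * (target_sum / A.sum(axis=1, keepdims=True))

subset_after = normalize_rows(X)[:, keep]    # normalize on all genes, then subset
subset_before = normalize_rows(X[:, keep])   # subset first, then normalize
```

For the first cell, subsetting afterwards gives 2000 per gene, while subsetting first gives 5000 per gene, so the two orders are not interchangeable.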
- scxpand.data_util.transforms.preprocess_expression_data(X, data_format, eps=1e-10)
Apply complete preprocessing pipeline to expression data.
Pipeline: Row normalization → Log transform (optional) → Z-score normalization (optional)
Returns the same type as input when possible (tensor in, tensor out).
- Parameters:
X (ndarray | Tensor | spmatrix) – Raw expression matrix [n_cells, n_genes] with non-negative values
data_format (DataFormat) – DataFormat containing preprocessing parameters
eps (float, default: 1e-10) – Small constant for numerical stability in z-score normalization
- Returns:
Fully preprocessed expression matrix, returned as the same type as the input when possible:
torch.Tensor input → torch.Tensor output
numpy/sparse input → numpy.ndarray output
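The pipeline order described above (row normalization, then optional log transform, then optional z-score) can be sketched in NumPy as follows. `ToyFormat` is a hypothetical stand-in for DataFormat. Its field names are assumptions made for this example and are not taken from the library, and sigma clipping is left out for brevity.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ToyFormat:
    # Hypothetical stand-in for DataFormat; the field names are assumptions.
    target_sum: float
    use_log: bool
    use_zscore: bool
    genes_mu: np.ndarray
    genes_sigma: np.ndarray

def preprocess_sketch(X, fmt, eps=1e-10):
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # assumption: leave empty cells as zeros
    X = X * (fmt.target_sum / totals)                     # 1. row normalization
    if fmt.use_log:
        X = np.log1p(X)                                   # 2. optional log transform
    if fmt.use_zscore:
        X = (X - fmt.genes_mu) / (fmt.genes_sigma + eps)  # 3. optional z-score
    return X

fmt = ToyFormat(1e4, True, False, np.zeros(2), np.ones(2))
out = preprocess_sketch([[1, 3], [2, 2]], fmt)
```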