Data Pipeline & Normalization#
Note
This document provides an overview of the scXpand data processing pipeline, from raw single-cell RNA sequencing data to model-ready normalized expression matrices.
Overview#
The scXpand data pipeline transforms raw single-cell gene expression data through a series of preprocessing steps to prepare it for machine learning models. The pipeline ensures consistent data processing between training and inference while maintaining computational efficiency for large datasets.
Pipeline Architecture#
The data pipeline consists of three main components:
Data Format Management: Central configuration for preprocessing parameters
Preprocessing Pipeline: Sequential normalization and transformation steps
Data Loading System: Efficient batch processing for training and inference
Core Components#
DataFormat Class#
The DataFormat class serves as the central configuration hub for all data preprocessing operations. It ensures consistency between training and inference by storing:
Gene Information: Gene names and ordering
Normalization Statistics: Per-gene means and standard deviations for z-score normalization
Preprocessing Parameters: Log transformation and normalization settings
Metadata: Categorical feature mappings
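As a rough illustration of what such a configuration object might hold, the sketch below uses a plain dataclass. It is not the actual scXpand class: `DataFormatSketch`, `use_log_transform`, `use_zscore_norm`, and `categorical_mappings` are hypothetical names chosen for this example, while `gene_names`, `genes_mu`, `genes_sigma`, and `target_sum` follow names used elsewhere in this document.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class DataFormatSketch:
    """Hypothetical stand-in for DataFormat; not the real scXpand class."""

    gene_names: list            # gene names in training order
    genes_mu: np.ndarray        # per-gene means from the training set
    genes_sigma: np.ndarray     # per-gene standard deviations from the training set
    target_sum: float = 10_000.0
    use_log_transform: bool = True   # hypothetical flag name
    use_zscore_norm: bool = True     # hypothetical flag name
    categorical_mappings: dict = field(default_factory=dict)  # hypothetical field

    def __post_init__(self):
        # Store statistics as float arrays and check they line up with the genes.
        self.genes_mu = np.asarray(self.genes_mu, dtype=float)
        self.genes_sigma = np.asarray(self.genes_sigma, dtype=float)
        if not (len(self.gene_names) == self.genes_mu.size == self.genes_sigma.size):
            raise ValueError("gene_names, genes_mu and genes_sigma must have equal length")


fmt = DataFormatSketch(
    gene_names=["GENE_A", "GENE_B"],
    genes_mu=[1.0, 2.0],
    genes_sigma=[0.5, 0.25],
)
```

Keeping gene order, statistics, and flags in one object means training and inference can read the same settings instead of passing them around separately.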
Preprocessing Pipeline#
The preprocessing pipeline applies three sequential transformations to raw gene expression data:
Notation: Throughout this section, \(X_{raw}[i,j]\) represents the raw unique molecular identifier (UMI) count for cell i and gene j from the single-cell RNA sequencing experiment.
- Step 1: Row Normalization
Normalizes each cell’s total gene expression to a target sum (default: 10,000). This accounts for differences in sequencing depth between cells.
\[X_{norm}[i,j] = X_{raw}[i,j] \times \frac{\text{target\_sum}}{\sum_k X_{raw}[i,k]}\]
- Step 2: Log Transformation (Optional)
Applies log1p transformation to reduce the impact of highly expressed genes and stabilize variance.
\[X_{log}[i,j] = \log(1 + X_{norm}[i,j])\]
- Step 3: Z-Score Normalization (Optional)
Standardizes gene expression using per-gene statistics computed from the training set.
\[X_{zscore}[i,j] = \frac{X_{log}[i,j] - \mu_j}{\sigma_j + \epsilon}\]
Where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of gene j computed from training data, and \(\epsilon\) is a small constant added for numerical stability.
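The three steps can be sketched in NumPy as follows. This is a minimal illustration for dense arrays, not the scXpand implementation: the function name `preprocess`, the flag names, the value of `eps`, and the guard for cells with zero total counts are all assumptions made for this example.

```python
import numpy as np


def preprocess(x_raw, genes_mu, genes_sigma, target_sum=10_000.0,
               use_log=True, use_zscore=True, eps=1e-6):
    """Row-normalize, then optionally log1p-transform and z-score a dense matrix."""
    x = np.asarray(x_raw, dtype=np.float64)

    # Step 1: scale every cell (row) so its counts sum to target_sum.
    row_sums = x.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)  # assumption: leave empty cells at zero
    x = x * (target_sum / safe_sums)

    # Step 2 (optional): log(1 + x).
    if use_log:
        x = np.log1p(x)

    # Step 3 (optional): per-gene z-score with training-set statistics.
    if use_zscore:
        x = (x - genes_mu) / (genes_sigma + eps)
    return x


x_raw = np.array([[1.0, 3.0], [0.0, 0.0]])
x_norm_only = preprocess(x_raw, genes_mu=np.zeros(2), genes_sigma=np.ones(2),
                         use_log=False, use_zscore=False)
# First cell becomes [2500, 7500]; the empty second cell stays [0, 0].
```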
Data Input Modes#
scXpand supports two data input modes for different use cases:
File-Based Mode (Memory Efficient)#
When to use: Large datasets that don’t fit in memory (>10GB)
How it works: Data is loaded in batches directly from disk using AnnData’s backed mode. Only the required cells and genes are loaded into memory at any given time.
- Advantages:
Memory efficient for very large datasets
Scales to datasets with millions of cells
Automatic memory management
- Considerations:
Slower than in-memory mode due to disk I/O
Requires data to be stored in an HDF5-based format (such as AnnData's .h5ad files)
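The idea of reading one batch at a time from disk can be illustrated without AnnData. The sketch below deliberately swaps in `numpy.memmap` as a stand-in for AnnData's backed mode, so it shows the access pattern only; the function name `iter_batches_from_disk` and the raw binary file layout are assumptions of this example.

```python
import os
import tempfile

import numpy as np


def iter_batches_from_disk(path, shape, batch_size, dtype=np.float32):
    """Yield (start_row, batch) pairs, copying only one batch of rows into memory."""
    # np.memmap maps the file lazily; rows are read from disk when they are sliced.
    data = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    for start in range(0, shape[0], batch_size):
        # np.array(...) copies just this slice into RAM.
        yield start, np.array(data[start:start + batch_size])


# Write a small demo matrix (10 cells x 4 genes) to a temporary file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "counts.dat")
demo = np.arange(40, dtype=np.float32).reshape(10, 4)
demo.tofile(demo_path)

batches = [batch for _, batch in
           iter_batches_from_disk(demo_path, demo.shape, batch_size=4)]
batch_shapes = [b.shape for b in batches]
```

Peak memory here depends on `batch_size` rather than on the number of cells, which is the property the file-based mode relies on.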
In-Memory Mode (High Performance)#
When to use: Smaller datasets that fit comfortably in RAM (<5GB)
How it works: The entire dataset is loaded into memory once, enabling faster batch access during training or inference.
- Advantages:
Faster data access during training/inference
No disk I/O bottlenecks
Better for iterative model development
- Considerations:
Memory usage scales with dataset size
May cause out-of-memory errors with large datasets
Normalization Details#
Row Normalization#
Row normalization addresses the technical variability in sequencing depth between cells. Without normalization, cells with higher total read counts would appear to have higher expression across all genes.
Log Transformation#
Log transformation helps with:
Reducing the impact of highly expressed genes
Stabilizing variance across the expression range
Making the data more suitable for downstream analysis
Z-Score Normalization#
Z-score normalization standardizes each gene’s expression across cells using training set statistics. This step:
Centers each gene’s expression around zero
Scales each gene to unit variance
Uses robust clipping to handle outliers (±3σ by default)
Adds small epsilon for numerical stability
- Gene Statistics Computation:
The per-gene means (μ) and standard deviations (σ) are computed once from the training set using the same preprocessing steps (row normalization and optional log transformation) but without masking or noise augmentation. These statistics are then saved in DataFormat and used for all future processing.
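A minimal sketch of this computation for a dense training matrix is shown below. The function names, the use of the population standard deviation, the `eps` value, and the choice to clip the z-scored values to ±3 (equivalent to ±3σ around the mean) are assumptions for illustration; how scXpand applies clipping internally is not specified here.

```python
import numpy as np


def compute_gene_stats(x_train_raw, target_sum=10_000.0, use_log=True):
    """Per-gene mean and std after row normalization and optional log1p."""
    x = np.asarray(x_train_raw, dtype=np.float64)
    row_sums = x.sum(axis=1, keepdims=True)
    x = x * (target_sum / np.where(row_sums > 0, row_sums, 1.0))
    if use_log:
        x = np.log1p(x)
    # No masking or noise is applied: statistics come from clean data.
    return x.mean(axis=0), x.std(axis=0)


def zscore_with_clipping(x, genes_mu, genes_sigma, eps=1e-6, clip=3.0):
    """Z-score with training statistics, then clip to [-clip, clip]."""
    z = (np.asarray(x, dtype=np.float64) - genes_mu) / (genes_sigma + eps)
    return np.clip(z, -clip, clip)


x_train = np.array([[1.0, 1.0], [1.0, 3.0]])
genes_mu, genes_sigma = compute_gene_stats(x_train, use_log=False)
# Row-normalized cells are [5000, 5000] and [2500, 7500],
# so genes_mu is [3750, 6250] and genes_sigma is [1250, 1250].
```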
Gene Format Handling#
scXpand automatically handles cases where inference data has different gene ordering or subsets compared to training data.
- Gene Reordering Process:
Compare gene names between datasets
Create mapping from inference to training gene order
Reorder expression matrix columns
Handle missing genes by zero-padding
Gene Subsetting: For inference on specific gene subsets, the system automatically filters to only include genes present in the training data.
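The reordering steps above can be sketched for a dense matrix as follows. The function name `align_genes` is hypothetical, and a sparse matrix would need a different column-assignment strategy.

```python
import numpy as np


def align_genes(x, inference_genes, training_genes):
    """Return x with columns in training order; missing genes become zero columns."""
    x = np.asarray(x, dtype=np.float64)
    # Map each inference gene name to its column index.
    col_of = {gene: j for j, gene in enumerate(inference_genes)}
    aligned = np.zeros((x.shape[0], len(training_genes)), dtype=x.dtype)
    for target_col, gene in enumerate(training_genes):
        source_col = col_of.get(gene)
        if source_col is not None:  # gene present in the inference data
            aligned[:, target_col] = x[:, source_col]
        # Otherwise the column stays zero (zero-padding for a missing gene).
    return aligned  # genes absent from training_genes are never copied


x_inference = np.array([[7.0, 9.0, 5.0]])
aligned = align_genes(x_inference,
                      inference_genes=["G3", "EXTRA", "G1"],
                      training_genes=["G1", "G2", "G3"])
# aligned is [[5.0, 0.0, 7.0]]: G1 and G3 reordered, G2 zero-filled, EXTRA dropped.
```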
Data Augmentation#
Data augmentation is applied only during training. It is used for the neural network models (MLP and autoencoder) and the linear models (logistic regression and SVM) to improve generalization and robustness.
- Training Pipeline Sequence:
Load raw expression data from AnnData file
Apply pre-normalization augmentations (gene masking)
Apply core preprocessing pipeline:
Row normalization (target_sum = 10,000)
Log transformation (if enabled)
Z-score normalization (if enabled) using pre-computed training statistics
Apply post-normalization augmentations (Gaussian noise addition)
Augmentation Types:
Gene Masking (Pre-normalization):
Randomly sets genes to zero before any normalization steps
Simulates technical dropouts in single-cell data
Gaussian Noise (Post-normalization):
Adds small amounts of Gaussian noise to fully normalized expression data
Uses a small standard deviation (typically 1e-4) appropriate for normalized data scale
Helps prevent overfitting and improves generalization
Soft Labels:
Uses continuous labels in [0,1] instead of binary {0,1} labels
Computed from clone size ratios using sigmoid scaling
Formula: sigmoid(soft_loss_beta * (clone_size_ratio - 1.5))
Helps with label noise and improves model calibration
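The three augmentation types can be sketched as small NumPy helpers. These are illustrations under stated assumptions, not scXpand's code: the function names and the `mask_rate` parameter are hypothetical, masking is assumed to be drawn independently per cell and gene, and `soft_loss_beta` is passed explicitly because its default is not given here.

```python
import numpy as np


def mask_genes(x_raw, mask_rate, rng):
    """Pre-normalization: zero each entry with probability mask_rate."""
    x = np.asarray(x_raw, dtype=np.float64)
    keep = rng.random(x.shape) >= mask_rate  # True where the value is kept
    return x * keep


def add_gaussian_noise(x_normalized, rng, noise_std=1e-4):
    """Post-normalization: add zero-mean Gaussian noise with a small std."""
    x = np.asarray(x_normalized, dtype=np.float64)
    return x + rng.normal(0.0, noise_std, size=x.shape)


def soft_label(clone_size_ratio, soft_loss_beta):
    """sigmoid(soft_loss_beta * (clone_size_ratio - 1.5)), a value in (0, 1)."""
    z = soft_loss_beta * (np.asarray(clone_size_ratio, dtype=np.float64) - 1.5)
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
counts = np.array([[4.0, 0.0, 2.0], [1.0, 5.0, 3.0]])
masked = mask_genes(counts, mask_rate=0.1, rng=rng)
noisy = add_gaussian_noise(np.zeros((2, 3)), rng)
midpoint = soft_label(1.5, soft_loss_beta=2.0)  # the sigmoid is centered at a ratio of 1.5
```

In the training sequence described above, `mask_genes` would run before row normalization and `add_gaussian_noise` after z-score normalization.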
Important Notes:
During inference, no augmentations are applied; only the core preprocessing pipeline runs
Gene statistics (μ, σ) for z-score normalization are precomputed once from clean training data (without masking or noise) and reused for all inference
Genes from training that are missing in inference data are filled with zeros and normalized using their training statistics
Genes in inference data that were not in training are discarded (only training genes are processed)
Inference Data Format Handling#
The scXpand inference pipeline is designed to handle test data with different formats, gene sets, and structures than the training data while maintaining consistency with the training preprocessing pipeline.
Gene Format Standardization Process:
Gene Mapping and Reordering: All inference data goes through automatic gene format standardization:
- Genes are reordered to match data_format.gene_names
- Missing genes are added as zero columns at the correct positions
- Extra genes are removed
- Final gene count matches the training format exactly
Preprocessing Pipeline: The same preprocessing pipeline as training is applied:
- Row normalization: Each cell sums to target_sum (typically 10,000)
- Log transformation: log1p() for variance stabilization
- Z-score normalization: Per-gene normalization using precomputed genes_mu[i] and genes_sigma[i]
Example: Complex Gene Mismatch Handling
Training Data Format:
training_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
genes_mu = [100.0, 10.0, 50.0, 5.0]
genes_sigma = [20.0, 100.0, 30.0, 200.0]
Test Data (Complex Mismatch):
test_genes = ["GENE_C", "GENE_A", "EXTRA_1", "GENE_E", "EXTRA_2"]
# Missing: GENE_B, GENE_D
# Extra: EXTRA_1, EXTRA_2, GENE_E
# Reordered: GENE_C, GENE_A
Transformation Process:
Gene mapping: GENE_A → position 0, GENE_C → position 2
Missing genes: GENE_B (position 1), GENE_D (position 3) filled with zeros
Extra genes: EXTRA_1, EXTRA_2, GENE_E ignored
Result: [100.0, 0.0, 50.0, 0.0] (missing genes filled with zeros)
Preprocessing: Row norm → log → z-score using training statistics
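The final preprocessing step of this example can be worked through numerically. The sketch below takes the aligned row and the training statistics from the example, and assumes log transformation is enabled, `target_sum` is 10,000, and `eps` is 1e-6 (the `eps` value is an assumption). It shows that a zero-filled gene does not stay at zero after z-score normalization but maps to roughly -mu / sigma.

```python
import numpy as np

# Values taken from the example above.
genes_mu = np.array([100.0, 10.0, 50.0, 5.0])
genes_sigma = np.array([20.0, 100.0, 30.0, 200.0])
aligned = np.array([[100.0, 0.0, 50.0, 0.0]])  # GENE_B and GENE_D zero-filled

target_sum = 10_000.0
eps = 1e-6  # assumed value

# Row normalization: the cell total (150) is rescaled to target_sum.
x_norm = aligned * (target_sum / aligned.sum(axis=1, keepdims=True))
# Log transformation; zeros stay at zero because log1p(0) = 0.
x_log = np.log1p(x_norm)
# Z-score with training statistics.
x_z = (x_log - genes_mu) / (genes_sigma + eps)

# Zero-filled genes end up at about -mu / sigma:
# GENE_B -> about -0.1, GENE_D -> about -0.025.
```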