Model Architectures Guide#

Note

This guide explains all model architectures available in scXpand and their configuration options.

Overview#

scXpand provides five model architectures for T-cell expansion prediction, grouped into neural networks, gradient boosting, and linear models. Each suits different use cases and data characteristics, so you can compare several approaches on the same task.

Available Model Architectures#

Neural Network Models:

Autoencoder: Deep count autoencoder with reconstruction and classification heads
MLP: Multi-layer perceptron for direct expansion prediction

Gradient Boosting:

LightGBM: Gradient boosted decision trees optimized for tabular data

Linear Models:

Logistic Regression: Linear classifier with logistic loss
SVM: Support vector machine with hinge loss

Autoencoder-Based Models#

Architecture Overview#

scXpand’s autoencoder architecture is inspired by the Deep Count Autoencoder (DCA) approach introduced by Eraslan et al. (2019) for single-cell RNA-seq data denoising. Our implementation extends this concept by combining reconstruction learning with expansion classification.

from scxpand.autoencoders.ae_models import AutoencoderModel
from scxpand.autoencoders.ae_params import AutoEncoderParams

# Create autoencoder model
params = AutoEncoderParams(
    model_type="standard",           # or "fork"
    loss_type="zinb",               # "mse", "nb", or "zinb"
    latent_dim=32,
    encoder_hidden_dims=(128, 64),
    decoder_hidden_dims=(64, 128)
)

Scientific Foundation#

The autoencoder approach addresses several key challenges in single-cell data analysis:

Count Data Distribution: Single-cell RNA-seq data follows count distributions (Negative Binomial, Zero-Inflated Negative Binomial) rather than the Gaussian distribution assumed by many traditional methods.
Zero-Inflation: Single-cell data is highly sparse, so the model must distinguish true biological zeros from technical dropouts.
Overdispersion: Gene expression counts have variance greater than their mean, which a Poisson model cannot capture.

As described in the DCA paper, the autoencoder learns to map noisy observations back to an underlying “clean” data manifold, effectively denoising the expression while preserving biological signal.
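The three properties listed above can be made concrete with the textbook Negative Binomial formulas. The sketch below is illustrative and is not scXpand code: the function names and the example values of μ, θ, and π are assumptions chosen for the demonstration.

```python
def nb_variance(mu: float, theta: float) -> float:
    """Variance of a Negative Binomial with mean mu and dispersion theta."""
    return mu + mu ** 2 / theta


def nb_zero_prob(mu: float, theta: float) -> float:
    """Probability of a zero count under NB(mu, theta)."""
    return (theta / (theta + mu)) ** theta


def zinb_zero_prob(mu: float, theta: float, pi: float) -> float:
    """Zero probability once a dropout component with weight pi is mixed in."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, theta)


# Hypothetical parameters for a single gene.
mu, theta, pi = 2.0, 0.5, 0.3
print(f"mean={mu}, variance={nb_variance(mu, theta)}")       # variance exceeds the mean
print(f"P(0) under NB:   {nb_zero_prob(mu, theta):.3f}")
print(f"P(0) under ZINB: {zinb_zero_prob(mu, theta, pi):.3f}")  # extra zeros from dropout
```

A smaller θ means stronger overdispersion; as θ grows, the variance approaches the mean and the distribution approaches a Poisson.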

Architecture Variants#

scXpand provides two distinct autoencoder architectures that differ fundamentally in how they handle the decoder pathway for reconstruction tasks.

Standard Autoencoder

Uses a shared decoder pathway with multiple output heads:

Input (genes) → Encoder → Latent → Shared Decoder → Mean Head (μ)
                            ↓                    → Pi Head (π)
                        Classifier              → Theta Head (θ)
                            ↓
                    Expansion Prediction

Fork Autoencoder

Uses separate decoder pathways for each reconstruction parameter:

Input (genes) → Encoder → Latent → Mean Decoder → Mean Head (μ)
                            ↓   → Pi Decoder → Pi Head (π)
                        Classifier → Theta Decoder → Theta Head (θ)
                            ↓
                    Expansion Prediction
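To make the difference between the two variants concrete, the sketch below counts decoder parameters for each layout. It is an illustration only: the helper names are hypothetical, `n_genes=2000` is an arbitrary example, and it assumes the fork variant duplicates the full hidden stack for each of the three heads, which this guide does not spell out.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def stack_params(dims) -> int:
    """Parameters of consecutive dense layers dims[0] -> dims[1] -> ..."""
    return sum(dense_params(a, b) for a, b in zip(dims, dims[1:]))


def standard_decoder_params(latent, hidden, n_genes, n_heads=3) -> int:
    # One shared hidden stack, then one output head per parameter (mu, pi, theta).
    shared = stack_params([latent, *hidden])
    heads = n_heads * dense_params(hidden[-1], n_genes)
    return shared + heads


def fork_decoder_params(latent, hidden, n_genes, n_heads=3) -> int:
    # Assumption: every head owns a private copy of the hidden stack.
    per_branch = stack_params([latent, *hidden]) + dense_params(hidden[-1], n_genes)
    return n_heads * per_branch


latent, hidden, n_genes = 32, (64, 128), 2000  # n_genes is a made-up example
print("standard:", standard_decoder_params(latent, hidden, n_genes))
print("fork:    ", fork_decoder_params(latent, hidden, n_genes))
```

Under these assumptions the fork layout adds two extra copies of the hidden stack, trading more parameters for decoders that can specialize per output.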

Loss Functions#

scXpand supports three loss functions for the reconstruction component:

Mean Squared Error (MSE)

Traditional L2 loss:

\[\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{G} (x_{ij} - \mu_{ij})^2\]

Negative Binomial (NB)

Accounts for overdispersion in count data:

\[\mathcal{L}_{NB} = -\sum_{i=1}^{N} \sum_{j=1}^{G} \log \text{NB}(x_{ij}; \mu_{ij}, \theta_{ij})\]

Zero-Inflated Negative Binomial (ZINB)

Handles both overdispersion and zero-inflation:

\[\mathcal{L}_{ZINB} = -\sum_{i=1}^{N} \sum_{j=1}^{G} \log \text{ZINB}(x_{ij}; \mu_{ij}, \theta_{ij}, \pi_{ij})\]

Where:

\(\mu_{ij}\): Mean expression for gene j in cell i
\(\theta_{ij}\): Dispersion parameter
\(\pi_{ij}\): Zero-inflation probability
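The NB and ZINB losses above can be written out with the standard mean/dispersion parameterization. The following is a minimal pure-Python sketch of those formulas, not scXpand's implementation (which may add numerical safeguards or scaling); all function names and the toy matrices are assumptions.

```python
import math


def nb_log_prob(x: int, mu: float, theta: float) -> float:
    """log NB(x; mu, theta) with mean mu and dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_prob(x: int, mu: float, theta: float, pi: float) -> float:
    """log ZINB(x; mu, theta, pi): a point mass at zero mixed with an NB."""
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_prob(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_prob(x, mu, theta)


def reconstruction_nll(x, mu, theta, pi=None) -> float:
    """Sum the NB (pi is None) or ZINB negative log-likelihood over cells and genes."""
    total = 0.0
    for i, row in enumerate(x):
        for j, count in enumerate(row):
            if pi is None:
                total -= nb_log_prob(count, mu[i][j], theta[i][j])
            else:
                total -= zinb_log_prob(count, mu[i][j], theta[i][j], pi[i][j])
    return total


# Toy 2-cell x 2-gene example with made-up parameters.
x = [[0, 3], [1, 0]]
mu = [[0.5, 2.0], [1.0, 0.2]]
theta = [[1.0, 1.0], [1.0, 1.0]]
pi = [[0.2, 0.2], [0.2, 0.2]]
print("NB loss:  ", reconstruction_nll(x, mu, theta))
print("ZINB loss:", reconstruction_nll(x, mu, theta, pi))
```

Setting π to zero reduces the ZINB likelihood to the NB likelihood, which is a quick sanity check for any implementation.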

Multi-Layer Perceptron (MLP)#

Architecture Design#

The MLP model provides a direct approach to expansion prediction without reconstruction learning. It uses fully connected layers with dropout regularization and optional auxiliary classification heads.

from scxpand.mlp.mlp_params import MLPParam
from scxpand.mlp.mlp_model import MLPModel

# Configure MLP
mlp_params = MLPParam(
    layer_units=[512, 256, 128, 64],    # Hidden layer sizes
    dropout_rate=0.3,
    learning_rate=1e-3,
    n_epochs=30
)

Architecture Flow:

Input (genes) → FC Layer 1 → Dropout → ReLU
              → FC Layer 2 → Dropout → ReLU
              → ...
              → Output Layer → Sigmoid → Expansion Probability
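The flow above can be traced with a small pure-Python forward pass. This is a sketch that follows the layer order shown in the diagram; it is not scXpand's `MLPModel`, and the helper names and toy weights are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def dropout(x, rate, training, rng):
    """Inverted dropout: active during training only, identity at inference."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_forward(x, hidden_layers, output_layer, dropout_rate=0.3, training=False, rng=None):
    """FC -> Dropout -> ReLU for each hidden layer, then a sigmoid output."""
    rng = rng or random.Random(0)
    for weights, bias in hidden_layers:
        x = relu(dropout(dense(x, weights, bias), dropout_rate, training, rng))
    out_weights, out_bias = output_layer
    return sigmoid(dense(x, [out_weights], [out_bias])[0])


# Toy network: 3 genes -> 2 hidden units -> 1 probability (weights are made up).
hidden = [([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.05]], [0.0, 0.1])]
output = ([1.0, -1.0], -0.7)
print(mlp_forward([1.0, 0.0, 2.0], hidden, output))
```

Passing `training=True` randomly zeroes hidden units, so repeated calls differ; at inference the output is deterministic.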

Configuration Options#

mlp_config = {
    # Architecture
    "layer_units": [1024, 512, 256, 128],  # Layer sizes
    "dropout_rate": 0.25,                  # Regularization

    # Training
    "learning_rate": 5e-4,
    "weight_decay": 1e-4,
    "n_epochs": 25,
    "batch_size": 2048,

    # Data augmentation
    "mask_rate": 0.1,                      # Gene masking
    "noise_std": 1e-4,                     # Gaussian noise

    # Loss function
    "positives_weight": 2.0,               # Class imbalance handling
    "use_soft_loss": True                  # Soft vs hard labels
}
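The augmentation and class-weighting options can be pictured with the sketch below. It shows one plausible reading of `mask_rate`, `noise_std`, and `positives_weight`; the function names are hypothetical, the exact order and form scXpand uses are assumptions, and `use_soft_loss` is not modeled here.

```python
import math
import random


def augment(expr, mask_rate=0.1, noise_std=1e-4, rng=None):
    """Zero out a random subset of genes, then add Gaussian noise to every gene."""
    rng = rng or random.Random()
    out = []
    for value in expr:
        if rng.random() < mask_rate:
            value = 0.0  # gene masking
        out.append(value + rng.gauss(0.0, noise_std))
    return out


def weighted_bce(probs, labels, positives_weight=2.0, eps=1e-7):
    """Binary cross-entropy with the positive-class term scaled by positives_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= positives_weight * y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


rng = random.Random(42)
print(augment([1.0, 0.0, 3.5, 2.0], rng=rng))
print(weighted_bce([0.9, 0.2, 0.6], [1, 0, 1]))
```

With `positives_weight=2.0`, a misclassified expanded cell costs twice as much as a misclassified non-expanded cell, which counteracts a minority positive class.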

LightGBM Models#

Gradient Boosting Approach#

LightGBM provides a non-neural approach using gradient boosted decision trees. This method excels on tabular data and often serves as a strong baseline for genomics applications.

from scxpand.lightgbm.lightgbm_params import LightGBMParams

# Configure LightGBM
lgbm_params = LightGBMParams(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=8,
    num_leaves=64,
    class_weight="balanced"
)

Configuration Parameters#

lightgbm_config = {
    # Tree structure
    "n_estimators": 150,               # Number of trees
    "max_depth": 10,                   # Maximum tree depth
    "num_leaves": 31,                  # Maximum leaves per tree

    # Learning
    "learning_rate": 0.05,             # Shrinkage rate
    "feature_fraction": 0.8,           # Feature sampling
    "bagging_fraction": 0.8,           # Row sampling

    # Regularization
    "reg_alpha": 0.1,                  # L1 regularization
    "reg_lambda": 0.1,                 # L2 regularization
    "min_child_samples": 20,           # Minimum samples per leaf

    # Class imbalance
    "class_weight": "balanced",        # Auto weight adjustment
    "is_unbalance": True
}
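`max_depth` and `num_leaves` interact: a binary tree of depth d holds at most 2^d leaves, and LightGBM treats `max_depth <= 0` as unlimited. The helper below is a hypothetical pre-flight check (not part of scXpand or LightGBM) that flags configurations where `num_leaves` can never be reached.

```python
def leaf_capacity(max_depth: int):
    """Maximum leaves a binary tree of this depth can hold; None means unlimited."""
    return None if max_depth <= 0 else 2 ** max_depth


def depth_limits_leaves(num_leaves: int, max_depth: int) -> bool:
    """True when max_depth, not num_leaves, is the binding constraint."""
    capacity = leaf_capacity(max_depth)
    return capacity is not None and num_leaves > capacity


# The two configurations shown in this section, plus one deliberately inconsistent one.
configs = {
    "params example": {"num_leaves": 64, "max_depth": 8},
    "config example": {"num_leaves": 31, "max_depth": 10},
    "too shallow": {"num_leaves": 64, "max_depth": 5},
}
for name, cfg in configs.items():
    flagged = depth_limits_leaves(cfg["num_leaves"], cfg["max_depth"])
    print(f"{name}: capacity={leaf_capacity(cfg['max_depth'])}, depth-limited={flagged}")
```

When the check returns True, raising `num_leaves` further has no effect, so tuning effort is better spent on `max_depth` or the regularization parameters.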

Linear Models#

Logistic Regression#

A classic linear model that uses the logistic loss for binary classification. It provides interpretable coefficients and trains quickly.

from scxpand.linear.linear_params import LinearClassifierParam

# Configure logistic regression
logistic_params = LinearClassifierParam(
    model_type="logistic",                 # 'logistic' or 'svm'
    alpha=0.0001,                          # Regularization strength
    penalty="l2",                          # "l1", "l2", or "elasticnet"
    n_epochs=1000,
    class_weight="balanced"
)

Mathematical Model:

\[P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \sum_{j=1}^{p} \beta_j x_j)}}\]
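The formula above translates directly into code. The sketch below evaluates it for made-up coefficients and shows why the coefficients are interpretable: raising feature j by one unit multiplies the odds by exp(β_j). The function names and numbers are illustrative assumptions.

```python
import math


def predict_proba(x, coef, intercept):
    """P(y=1 | x) for a logistic regression model."""
    z = intercept + sum(b * v for b, v in zip(coef, x))
    if z >= 0:  # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def odds(p):
    return p / (1.0 - p)


coef, intercept = [0.8, -0.5], -0.2   # hypothetical fitted values
x = [1.0, 2.0]
x_plus = [2.0, 2.0]                   # feature 0 increased by one unit

p, p_plus = predict_proba(x, coef, intercept), predict_proba(x_plus, coef, intercept)
print(f"P(y=1|x) = {p:.3f}")
print(f"odds ratio = {odds(p_plus) / odds(p):.3f}, exp(beta_0) = {math.exp(coef[0]):.3f}")
```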

Support Vector Machine (SVM)#

Linear SVM using hinge loss, optimized for maximum margin classification.

# Configure SVM
svm_params = LinearClassifierParam(
    model_type="svm",                      # 'svm' or 'logistic'
    alpha=0.0001,                          # Regularization strength
    penalty="l2",                          # Regularization type
    class_weight="balanced"
)

Mathematical Objective:

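\[\min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))\]

Where: \(y_i \in \{-1, +1\}\) is the class label and \(C\) is the weight of the hinge penalty relative to the margin term.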
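The objective can be evaluated directly for a candidate weight vector. The sketch below is a plain-Python transcription of the formula with labels in {-1, +1}; the function name and data are made up, and how `C` maps to the `alpha` parameter in the configuration above is not specified in this guide.

```python
def hinge_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses, for labels y_i in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge_total += max(0.0, 1.0 - yi * score)  # zero once the margin is >= 1
    return margin_term + C * hinge_total


w, b = [1.0, 0.0], 0.0
X_separated = [[2.0, 0.0], [-2.0, 0.0]]      # both points beyond the margin
X_violating = [[0.5, 0.0], [-2.0, 0.0]]      # first point inside the margin
y = [1, -1]

print(hinge_objective(w, b, X_separated, y))
print(hinge_objective(w, b, X_violating, y))
```

Points classified correctly with a margin of at least 1 contribute nothing, so only margin violations and the norm of `w` drive the objective.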

Multi-task Learning#

Both the autoencoder and MLP models support auxiliary classification tasks that predict cell type or tissue type alongside expansion:

# Enable auxiliary classification
params = AutoEncoderParams(
    aux_categorical_types=("tissue_type", "imputed_labels"),
    cat_loss_weight=0.5                # Weight for auxiliary losses
)
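One common way to combine a main loss with auxiliary classification losses is a weighted sum, sketched below. This is an assumption about how `cat_loss_weight` is applied, not a description of scXpand's training loop; the function names and the example probabilities are hypothetical, and any reconstruction term is omitted for brevity.

```python
import math


def cross_entropy(class_probs, target_index):
    """Categorical cross-entropy for a single cell."""
    return -math.log(class_probs[target_index])


def total_loss(expansion_loss, aux_losses, cat_loss_weight=0.5):
    """Main loss plus the weighted sum of the auxiliary classification losses."""
    return expansion_loss + cat_loss_weight * sum(aux_losses.values())


# Made-up predicted class probabilities for one cell.
aux_losses = {
    "tissue_type": cross_entropy([0.7, 0.2, 0.1], target_index=0),
    "imputed_labels": cross_entropy([0.25, 0.25, 0.5], target_index=2),
}
expansion_loss = 0.35  # e.g. binary cross-entropy on the expansion label
print(total_loss(expansion_loss, aux_losses, cat_loss_weight=0.5))
```

A larger `cat_loss_weight` pushes the shared representation toward separating the auxiliary categories; setting it to zero recovers single-task training.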
