Model Training

Learn how to train scXpand models for T-cell expansion prediction.

Overview

scXpand supports five model architectures for training:

  • Neural Networks: Autoencoder (with reconstruction + classification) and MLP (direct prediction)

  • Gradient Boosting: LightGBM (optimized for tabular data)

  • Linear Models: Logistic Regression and SVM (linear classifiers)

For detailed architecture descriptions, configuration options, and scientific foundations, see the Model Architectures Guide.

Training with the Command Line

You can train any of the supported models directly from the command line.

# Autoencoder training
python -m scxpand.main train --model_type autoencoder --data_path data/example_data.h5ad --n_epochs 100

# MLP training
python -m scxpand.main train --model_type mlp --data_path data/example_data.h5ad --n_epochs 50

# LightGBM training (no epochs needed)
python -m scxpand.main train --model_type lightgbm --data_path data/example_data.h5ad

# Logistic Regression training
python -m scxpand.main train --model_type logistic --data_path data/example_data.h5ad

# SVM training
python -m scxpand.main train --model_type svm --data_path data/example_data.h5ad
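To launch several of these runs from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of scXpand: the helper names `build_train_command` and `run_all` are hypothetical, and only the flags shown above (`--model_type`, `--data_path`, `--n_epochs`) are used.

```python
import subprocess
import sys

# Model types listed in this guide.
MODEL_TYPES = ["autoencoder", "mlp", "lightgbm", "logistic", "svm"]


def build_train_command(model_type, data_path, n_epochs=None):
    """Return the argument list for one training run (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"Unknown model type: {model_type!r}")
    cmd = [
        sys.executable, "-m", "scxpand.main", "train",
        "--model_type", model_type,
        "--data_path", data_path,
    ]
    # Only pass --n_epochs when a value is given (e.g. for neural networks).
    if n_epochs is not None:
        cmd += ["--n_epochs", str(n_epochs)]
    return cmd


def run_all(commands):
    """Run each command in sequence, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = [build_train_command(m, "data/example_data.h5ad") for m in MODEL_TYPES]
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all(commands)` would then start the runs one after another; it assumes scXpand is installed in the same Python environment.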

Using a Configuration File

For more complex configurations, you can use a JSON configuration file to specify parameters for a training run. This is useful for keeping track of different experimental setups.

Pass the file using the --config_path argument:

python -m scxpand.main train --model_type autoencoder --config_path my_ae_config.json

Parameter Precedence: Parameters are loaded in the following order (last one wins):

  1. Default parameters in the model’s Param class.

  2. Parameters from your JSON configuration file.

  3. Keyword arguments passed directly on the command line (e.g., --n_epochs 100).

This means a command-line argument will always override a setting in your config file.
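The "last one wins" rule can be illustrated with a small, self-contained sketch. This is not scXpand's actual loading code: `ExampleParams` and `resolve_params` are hypothetical stand-ins, and the default values are made up for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class ExampleParams:
    """Hypothetical stand-in for a model's Param class (defaults are made up)."""
    n_epochs: int = 30
    init_learning_rate: float = 1e-3
    dropout_rate: float = 0.1


def resolve_params(config_path=None, **cli_kwargs):
    """Merge defaults, then the JSON file, then command-line style overrides."""
    merged = asdict(ExampleParams())  # 1. class defaults
    if config_path is not None:
        merged.update(json.loads(Path(config_path).read_text()))  # 2. JSON file
    merged.update(cli_kwargs)  # 3. explicit overrides win
    unknown = set(merged) - {f.name for f in fields(ExampleParams)}
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return ExampleParams(**merged)


# The config file sets n_epochs and dropout_rate; the override replaces n_epochs.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "example_config.json"
    cfg.write_text(json.dumps({"n_epochs": 100, "dropout_rate": 0.2}))
    params = resolve_params(cfg, n_epochs=5)

print(params)
```

Here `n_epochs` comes from the override, `dropout_rate` from the file, and `init_learning_rate` from the class default.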

Example `my_ae_config.json`:

{
    "latent_dim": 64,
    "n_epochs": 100,
    "init_learning_rate": 1e-4,
    "encoder_hidden_dims": [128, 64],
    "decoder_hidden_dims": [64, 128],
    "dropout_rate": 0.2
}
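When tracking several experimental setups, it can be convenient to generate such files from a script. The sketch below writes the example configuration with Python's standard `json` module and reads it back as a sanity check; the keys are copied from the example above, and nothing here checks them against scXpand's parameter classes.

```python
import json
from pathlib import Path

# Same values as the example my_ae_config.json in this guide.
ae_config = {
    "latent_dim": 64,
    "n_epochs": 100,
    "init_learning_rate": 1e-4,
    "encoder_hidden_dims": [128, 64],
    "decoder_hidden_dims": [64, 128],
    "dropout_rate": 0.2,
}

config_file = Path("my_ae_config.json")
config_file.write_text(json.dumps(ae_config, indent=4))

# Round-trip check: the file parses and matches what was written.
loaded = json.loads(config_file.read_text())
assert loaded == ae_config
```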

Training API Reference

Train a single model with specific configuration:

scxpand.main.train(model_type, data_path='data/example_data.h5ad', save_dir=None, config_path=None, resume=False, num_workers=4, **kwargs)

Train a single model.

Parameters:
  • model_type (ModelType | str) – Type of model to train (autoencoder, mlp, lightgbm, logistic, svm).

  • data_path (str (default: 'data/example_data.h5ad')) – Path to input data file.

  • save_dir (str | None (default: None)) – Directory to save results (if None, uses default for model type).

  • config_path (str | None (default: None)) – Path to configuration file.

  • resume (bool (default: False)) – Whether to resume from existing checkpoint.

  • num_workers (int (default: 4)) – Number of workers for data loading.

  • **kwargs (Any) – Additional parameters to override config.

Return type:

None

Returns:

None.

Examples

# Autoencoder training
python -m scxpand.main train --model_type autoencoder --data_path data/example_data.h5ad --n_epochs 100

# MLP training
python -m scxpand.main train --model_type mlp --data_path data/example_data.h5ad --n_epochs 50

# LightGBM training (no epochs needed)
python -m scxpand.main train --model_type lightgbm --data_path data/example_data.h5ad

# Logistic Regression training
python -m scxpand.main train --model_type logistic --data_path data/example_data.h5ad

# SVM training with custom config
python -m scxpand.main train --model_type svm --data_path data/example_data.h5ad --config_path config/svm_config.json
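The same function can presumably be called from Python. The sketch below follows the signature documented above; the `save_dir` value is a hypothetical path, and passing `n_epochs` as a keyword override is an assumption based on the description of `**kwargs`. The import is deferred so the snippet can be read and checked without scXpand installed.

```python
# Keyword arguments mirroring the documented signature of scxpand.main.train.
train_kwargs = {
    "model_type": "autoencoder",
    "data_path": "data/example_data.h5ad",
    "save_dir": "results/my_autoencoder_run",  # hypothetical output directory
    "config_path": None,
    "resume": False,
    "num_workers": 4,
    "n_epochs": 100,  # assumed to be forwarded via **kwargs as a config override
}


def run_training(kwargs):
    """Call scxpand.main.train with the given keyword arguments."""
    # Imported lazily so defining this helper does not require scXpand.
    from scxpand.main import train

    train(**kwargs)
```

Running `run_training(train_kwargs)` starts the training run; the function returns None, as documented.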

Training Monitoring

Monitor training progress with TensorBoard:

# Start TensorBoard (view all training runs)
tensorboard --logdir=results/

# Or view a specific run directory
tensorboard --logdir=results/pan_cancer_autoencoder_v_0/

# Access dashboard at http://localhost:6006

For detailed CLI examples and usage, see the scxpand.main module documentation.