scxpand

scXpand: Pan-cancer detection of T-cell clonal expansion from single-cell RNA sequencing.

A framework for predicting T-cell clonal expansion from single-cell RNA sequencing data using multiple machine learning approaches including autoencoders, MLPs, LightGBM, and linear models.

class scxpand.ModelType(*values)

Enumeration of supported model types.

AUTOENCODER = 'autoencoder'
LIGHTGBM = 'lightgbm'
LOGISTIC = 'logistic'
MLP = 'mlp'
SVM = 'svm'
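
Each member above pairs a name with a lowercase string value. As a minimal sketch, assuming the enumeration behaves like a standard string-valued Python `Enum`, a model type read from a config file or command line could be converted by value lookup. The class below is a stand-in that mirrors the documented members, not the scXpand source, and `parse_model_type` is a hypothetical helper.

```python
from enum import Enum


class ModelType(str, Enum):
    """Stand-in mirroring the documented members (not the scXpand source)."""

    AUTOENCODER = "autoencoder"
    LIGHTGBM = "lightgbm"
    LOGISTIC = "logistic"
    MLP = "mlp"
    SVM = "svm"


def parse_model_type(raw: str) -> ModelType:
    """Hypothetical helper: normalize a user-supplied string to a member."""
    # Enum lookup by value raises ValueError for unknown strings.
    return ModelType(raw.strip().lower())


selected = parse_model_type(" MLP ")
all_values = [member.value for member in ModelType]
```

Looking up by value rather than by name keeps the accepted strings identical to what the members store.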
scxpand.download_pretrained_model(model_name=None, model_url=None, cache_dir=None)

Download a pre-trained model and return the path to the extracted model.

Uses Pooch for robust caching, automatic hash verification, and extraction. Pooch automatically computes SHA256 hashes on first download and verifies them on subsequent accesses for integrity checking. When a model is updated (different hash), Pooch automatically downloads the new version to a fresh cache directory, ensuring version updates work seamlessly. Supports both registry models and direct URLs, including DOI URLs.

By default, downloads to a .scxpand_cache directory in the current working directory, making it easy for users to manage and clean up downloaded models.
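
The hash-then-verify idea described above can be illustrated with the standard library alone. This is a sketch of the general technique, not Pooch's internals: `sha256_of` and `is_intact` are hypothetical helpers, and the "model" is a placeholder file written to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_hash: str) -> bool:
    """Compare the file's current digest with one recorded earlier."""
    return sha256_of(path) == recorded_hash


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "model.zip"  # placeholder file, not a real model
    archive.write_bytes(b"fake model bytes")
    recorded = sha256_of(archive)  # digest recorded on "first download"
    ok_before = is_intact(archive, recorded)
    archive.write_bytes(b"different bytes")  # simulate corruption or a new version
    ok_after = is_intact(archive, recorded)
```

A mismatch is the signal that the cached copy no longer corresponds to what was originally recorded.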

Parameters:
  • model_name (str | None (default: None)) – Name of pre-trained model from registry (alternative to model_url)

  • model_url (str | None (default: None)) – Direct URL to model file (alternative to model_name). Supports HTTP/HTTPS URLs for direct downloads.

  • cache_dir (Path | None (default: None)) – Custom cache directory (uses .scxpand_cache in current dir if None)

Return type:

Path

Returns:

Path to the extracted model directory or file

Raises:

ValueError – If neither model_name nor model_url is provided, or if both are provided

Examples

>>> from pathlib import Path
>>> from scxpand import download_pretrained_model
>>>
>>> # Registry model (downloads to ./.scxpand_cache/)
>>> model_path = download_pretrained_model(
...     model_name="pan_cancer_autoencoder"
... )
>>>
>>> # Direct URL (downloads to ./.scxpand_cache/)
>>> model_path = download_pretrained_model(
...     model_url="https://your-platform.com/model.zip"
... )
>>>
>>> # Custom cache directory
>>> model_path = download_pretrained_model(
...     model_url="https://figshare.com/ndownloader/files/model.zip",
...     cache_dir=Path("/my/custom/cache"),
... )
scxpand.get_pretrained_model_info(model_name)

Get information about a pre-trained model.

Parameters:

model_name (str) – Name of the pre-trained model

Return type:

PretrainedModelInfo

Returns:

PretrainedModelInfo object containing model metadata

Raises:

ValueError – If model_name is not found in registry

scxpand.list_pretrained_models()

List all available pre-trained models with their information.

Return type:

None

scxpand.run_inference(data_path=None, adata=None, model_path=None, model_name=None, model_url=None, save_path=None, batch_size=1024, num_workers=4, eval_row_inds=None)

Main public API for running inference with scXpand models.

This is the primary entry point for running inference with any type of scXpand model. It automatically detects the model source and routes to the appropriate inference pipeline. Supports local models, registry models, and external models via URL. Metrics are automatically computed when ground truth labels are available in the data.

Parameters:
  • data_path (str | Path | None (default: None)) – Path to input data file (h5ad format). Alternative to adata.

  • adata (AnnData | None (default: None)) – In-memory AnnData object. Alternative to data_path.

  • model_path (str | Path | None (default: None)) – Path to local trained model directory (for local models).

  • model_name (str | None (default: None)) – Name of pre-trained model from registry (for registry models).

  • model_url (str | None (default: None)) – Direct URL to model ZIP file (for any external model).

  • save_path (str | Path | None (default: None)) – Directory to save prediction results (None to skip saving, just return results).

  • batch_size (int (default: 1024)) – Batch size for inference.

  • num_workers (int (default: 4)) – Number of workers for data loading.

  • eval_row_inds (default: None) – Specific cell indices to evaluate (None for all cells, only supported for local models).

Return type:

InferenceResults

Returns:

Structured results containing predictions, metrics (if available), and model info.

Raises:
  • ValueError – If model source is not specified or multiple sources are specified.

  • ValueError – If neither data_path nor adata is provided.

  • FileNotFoundError – If specified files do not exist.
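
The rule that exactly one model source must be given, and that the call is then routed by source, might look roughly like the sketch below. `resolve_model_source` and the labels `"local"`, `"registry"`, and `"url"` are hypothetical names chosen for illustration; the library's actual routing code may differ.

```python
from pathlib import Path
from typing import Optional, Tuple, Union


def resolve_model_source(
    model_path: Optional[Union[str, Path]] = None,
    model_name: Optional[str] = None,
    model_url: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: return (source_kind, value) for the one source given."""
    candidates = {
        "local": model_path,
        "registry": model_name,
        "url": model_url,
    }
    provided = [
        (kind, str(value)) for kind, value in candidates.items() if value is not None
    ]
    # Zero sources and multiple sources are both rejected, as documented above.
    if len(provided) != 1:
        raise ValueError(
            "Specify exactly one of model_path, model_name, or model_url "
            f"(got {len(provided)})."
        )
    return provided[0]


kind, value = resolve_model_source(model_name="pan_cancer_autoencoder")
```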

Examples

>>> import scxpand
>>> # Local model inference
>>> results = scxpand.run_inference(
...     data_path="my_data.h5ad", model_path="results/mlp"
... )
>>> print(f"Generated {len(results.predictions)} predictions")
>>> if results.has_metrics:
...     print(f"AUROC: {results.get_auroc():.3f}")
>>> # Registry model inference
>>> results = scxpand.run_inference(
...     data_path="my_data.h5ad", model_name="pan_cancer_autoencoder"
... )
>>> if results.has_model_info:
...     print(f"Model type: {results.model_info.model_type}")
>>> # Direct URL inference (seamless model sharing!)
>>> results = scxpand.run_inference(
...     data_path="my_data.h5ad",
...     model_url="https://your-platform.com/model.zip",
... )
>>> # In-memory inference with any model type (no saving)
>>> import scanpy as sc
>>> adata = sc.read_h5ad("my_data.h5ad")
>>> results = scxpand.run_inference(
...     adata=adata, model_name="pan_cancer_autoencoder", save_path=None
... )
>>> # Results are returned but not saved to disk
scxpand.run_prediction_pipeline(model_path, model_type=None, adata=None, data_path=None, save_path=None, batch_size=1024, num_workers=0, eval_row_inds=None)

Complete prediction pipeline from model loading to evaluation.

This is the main orchestration function that coordinates the entire prediction workflow. It follows the dependency inversion principle by depending on abstractions (interfaces) rather than concrete implementations.

Parameters:
  • model_path (str | Path) – Path to directory containing the trained model.

  • model_type (ModelType | str | None (default: None)) – Type of model to use for prediction. If None, automatically detected from model_type.txt file in model_path.

  • adata (AnnData | None (default: None)) – In-memory AnnData object (alternative to data_path).

  • data_path (str | Path | None (default: None)) – Path to data file (alternative to adata).

  • save_path (str | Path | None (default: None)) – Directory to save prediction results.

  • batch_size (int (default: 1024)) – Batch size for inference.

  • num_workers (int (default: 0)) – Number of workers for data loading.

  • eval_row_inds (ndarray | None (default: None)) – Specific cell indices to evaluate (None for all).
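
Automatic detection from model_type.txt, as described for model_type above, could work along these lines. This is a sketch under a stated assumption: the file is taken to hold a single ModelType value such as `mlp`, which the documentation does not confirm. `detect_model_type` is a hypothetical helper, not part of the scXpand API.

```python
import tempfile
from pathlib import Path

# Values of the documented ModelType members.
SUPPORTED = {"autoencoder", "lightgbm", "logistic", "mlp", "svm"}


def detect_model_type(model_path: Path) -> str:
    """Hypothetical helper: read the model type recorded in model_type.txt."""
    marker = model_path / "model_type.txt"
    if not marker.is_file():
        raise FileNotFoundError(f"No model_type.txt in {model_path}")
    # Assumption: the file contains one value, possibly with trailing whitespace.
    value = marker.read_text(encoding="utf-8").strip().lower()
    if value not in SUPPORTED:
        raise ValueError(f"Unsupported model type: {value!r}")
    return value


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)  # stands in for a trained model directory
    (model_dir / "model_type.txt").write_text("mlp\n", encoding="utf-8")
    detected = detect_model_type(model_dir)
```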

Return type:

InferenceResults

Returns:

Structured results containing predictions and metrics (if available).

Raises:

Modules

autoencoders

core

data_util

Data utilities for scXpand.

hyperopt

lightgbm

linear

main

Single entry point for all scXpand operations.

mlp

pretrained

Pre-trained model management for scXpand.

util