Hyperparameter Optimization Guide#
Overview#
scXpand provides automated hyperparameter optimization using Optuna, an open-source hyperparameter optimization framework.
Key Features:
- Bayesian Optimization: Parameter space exploration using the Tree-structured Parzen Estimator (TPE)
- Early Pruning: Automatic termination of unpromising trials
- Resume Capability: Continue optimization from previous studies
- Model-Specific Grids: Tailored parameter spaces for each architecture
Reproducibility and Data Splitting#
The hyperparameter optimization system ensures reproducible results through a fixed random seed approach:
Fixed Random Seed System:
- All trials use the same base random seed (default: 42)
- The seed is set globally using set_seed()
Consistent Train/Validation Splits:
- The same random seed ensures identical train/validation splits across all trials
- Data splitting is performed at the patient level
- Split indices are saved and reused when resuming optimization
- This allows fair comparison between different hyperparameter combinations
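To make the idea of a seeded, patient-level split concrete, here is a minimal sketch using only the standard library. It is not scXpand's implementation: the function name `split_by_patient`, the `dev_fraction` argument, and the toy patient IDs are all hypothetical. It only illustrates that grouping by patient before shuffling keeps every patient's cells on one side of the split, and that a fixed seed reproduces the same split.

```python
import random


def split_by_patient(patient_ids, dev_fraction=0.2, seed=42):
    """Split cell indices into train/dev sets so no patient spans both."""
    # Sort the unique patients first so the shuffle depends only on the seed,
    # not on the order in which cells happen to appear.
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_dev = max(1, round(len(patients) * dev_fraction))
    dev_patients = set(patients[:n_dev])

    train_idx = [i for i, p in enumerate(patient_ids) if p not in dev_patients]
    dev_idx = [i for i, p in enumerate(patient_ids) if p in dev_patients]
    return train_idx, dev_idx


# Toy data: 10 cells from 5 hypothetical patients.
cell_patients = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5"]
train_idx, dev_idx = split_by_patient(cell_patients, dev_fraction=0.4, seed=42)
```

Calling the function again with the same seed returns the same indices, which is what lets every trial be scored on identical data.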
Quick Start#
Basic Optimization#
Optimize a single model with default settings:
# Optimize MLP model
python -m scxpand.main optimize \
--model_type mlp \
--data_path data/example_data.h5ad \
--n_trials 100 \
--study_name "mlp_optimization"
# Optimize autoencoder with custom configuration
python -m scxpand.main optimize \
--model_type autoencoder \
--data_path data/example_data.h5ad \
--n_trials 200 \
--study_name "autoencoder_deep_search" \
--num_workers 4
Optimization of Multiple Models#
Compare all available architectures:
# Optimize all model types (one study per model type)
python -m scxpand.main optimize-all \
--data_path data/example_data.h5ad \
--n_trials 50 \
--num_workers 4
This command will create separate optimization studies for each model type (autoencoder, mlp, lightgbm, logistic, svm) and run them sequentially.
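The sequential behaviour described above is roughly equivalent to running one `optimize` command per model type. The sketch below only builds and prints those commands. The helper `build_optimize_commands` is hypothetical; the flags are the ones shown in this guide, and `--study_name` is omitted because it defaults to the model type.

```python
import shlex

MODEL_TYPES = ["autoencoder", "mlp", "lightgbm", "logistic", "svm"]


def build_optimize_commands(data_path, n_trials, model_types=MODEL_TYPES):
    """Return one `optimize` command (as an argument list) per model type."""
    commands = []
    for model_type in model_types:
        commands.append([
            "python", "-m", "scxpand.main", "optimize",
            "--model_type", model_type,
            "--data_path", data_path,
            "--n_trials", str(n_trials),
        ])
    return commands


commands = build_optimize_commands("data/example_data.h5ad", n_trials=50)
for cmd in commands:
    # Print instead of executing. Passing `cmd` to subprocess.run(cmd, check=True)
    # would launch the studies one after another.
    print(shlex.join(cmd))
```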
Results Analysis#
Study Outputs and Structure#
Each optimization study creates outputs for analysis:
results/optuna_studies/study_name/
├── optuna.db # SQLite database with all trials
├── info.json # Study summary and best trial info
├── trials.log # Detailed trial execution logs
├── trial_0/ # Individual trial results
│ ├── parameters.json # Trial hyperparameters
│ ├── results_dev.txt # Evaluation metrics (text format)
│ ├── results_table_dev.csv # Per-cell predictions and metadata
│ ├── summary_info.json # Trial summary and run metadata
│ ├── data_format.json # Gene names and preprocessing statistics
│ ├── data_format.npz # Numerical normalization statistics
│ ├── data_splits.npz # Train/dev/test splits
│ ├── model.joblib # Trained model (sklearn: LightGBM, Logistic, SVM)
│ ├── best_ckpt.pt # Best checkpoint (neural networks: MLP, Autoencoder)
│ ├── last_ckpt.pt # Latest checkpoint (neural networks: MLP, Autoencoder)
│ └── plots/
│ └── roc_curve_dev.png # ROC curve visualization
└── trial_N/... # Additional trials
Key Files:
info.json: Contains study metadata, trial counts, best trial parameters, and complete results
parameters.json: Complete hyperparameter configuration for each trial
results_dev.txt: Hierarchical evaluation metrics (AUROC, accuracy, etc.) by category
results_table_dev.csv: Per-cell predictions with metadata for detailed analysis
summary_info.json: Trial summary, best epoch information, and run metadata
data_format.json: Gene names and data preprocessing statistics
data_format.npz: Numerical data normalization statistics for consistent preprocessing
data_splits.npz: Train/dev/test splits for reproducible evaluation
optuna.db: SQLite database with complete trial history for programmatic access
model.joblib: Trained model file (sklearn models: LightGBM, Logistic, SVM)
best_ckpt.pt: Best model checkpoint (neural networks: MLP, Autoencoder)
last_ckpt.pt: Latest checkpoint (neural networks: MLP, Autoencoder)
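Because every trial directory carries its own `parameters.json`, the layout above can be walked with the standard library alone. The following sketch gathers the hyperparameters of all trials into one dictionary. The helper name `collect_trial_parameters` is hypothetical, and the demo uses a throwaway directory with made-up values rather than real study output.

```python
import json
import tempfile
from pathlib import Path


def collect_trial_parameters(study_dir):
    """Map trial number -> contents of that trial's parameters.json."""
    params_by_trial = {}
    for trial_dir in Path(study_dir).glob("trial_*"):
        params_file = trial_dir / "parameters.json"
        if not params_file.is_file():
            continue  # skip trials that never wrote their parameters
        trial_number = int(trial_dir.name.split("_", 1)[1])
        params_by_trial[trial_number] = json.loads(params_file.read_text())
    return dict(sorted(params_by_trial.items()))


# Demonstrate on a temporary directory that mimics the study layout.
with tempfile.TemporaryDirectory() as tmp:
    for number, lr in [(0, 1e-3), (1, 1e-4)]:
        trial_dir = Path(tmp) / f"trial_{number}"
        trial_dir.mkdir()
        (trial_dir / "parameters.json").write_text(json.dumps({"learning_rate": lr}))
    (Path(tmp) / "trial_2").mkdir()  # incomplete trial without parameters.json
    collected = collect_trial_parameters(tmp)

print(collected)
```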
Inspecting Optimization Results#
scXpand provides multiple ways to inspect and analyze your hyperparameter optimization results:
Summary Files
Each study automatically generates summary files for quick inspection:
- Study Summary (info.json): Contains best trial information, parameter values, and overall study statistics
- Trial Logs (trials.log): Detailed execution logs showing progress of each trial
- Best Trial Results (trial_N/results_dev.txt): Complete evaluation metrics for the best performing trial
These files are located in your study directory: results/optuna_studies/{study_name}/
Optuna Dashboard
For interactive visualization and analysis, use the Optuna Dashboard:
# Install Optuna Dashboard
pip install optuna-dashboard
# Launch dashboard for your study
optuna-dashboard sqlite:///results/optuna_studies/mlp_optimization/optuna.db
# Access the dashboard at http://localhost:8080
The Optuna Dashboard provides rich visualizations including parameter importance plots, optimization history, and interactive parameter relationships. For more details, see the Optuna Dashboard documentation.
Loading and Analyzing Results#
Access study results programmatically:
import json
import optuna
import pandas as pd
from pathlib import Path
# Load study from database
study = optuna.load_study(
study_name="mlp_optimization",
storage="sqlite:///results/optuna_studies/mlp_optimization/optuna.db"
)
# Load study summary
with open("results/optuna_studies/mlp_optimization/info.json") as f:
    study_info = json.load(f)
print(f"Best trial: {study_info['best_trial_number']}")
print(f"Best value: {study_info['best_value']:.4f}")
print(f"Completed trials: {study_info['completed_trials']}")
# Load best trial detailed results
best_trial_dir = f"results/optuna_studies/mlp_optimization/trial_{study_info['best_trial_number']}"
# Load per-cell predictions for analysis
predictions_df = pd.read_csv(f"{best_trial_dir}/results_table_dev.csv")
# Load complete evaluation metrics
with open(f"{best_trial_dir}/results_dev.txt") as f:
    detailed_metrics = f.read()
Training with Best Parameters#
Use the optimized parameters to train a new model:
Fresh Training (Default)
# Train a new model from scratch using the best parameters
python -m scxpand.main train \
--model_type mlp \
--data_path data/example_data.h5ad \
--config_path results/optuna_studies/mlp_optimization/trial_42/parameters.json \
--save_dir results/final_model/ \
--resume false
Resume Training
# Resume training from existing checkpoint with optimal parameters
python -m scxpand.main train \
--model_type mlp \
--data_path data/example_data.h5ad \
--config_path results/optuna_studies/mlp_optimization/trial_42/parameters.json \
--save_dir results/final_model/ \
--resume true
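The commands above hard-code `trial_42` as the best trial. A small helper can look the number up instead, using the `best_trial_number` field of `info.json` that appears in the loading example earlier. This is a sketch: the function name `best_parameters_path` is hypothetical and the demo builds a fake study directory.

```python
import json
import tempfile
from pathlib import Path


def best_parameters_path(study_dir):
    """Return the parameters.json path of the study's best trial."""
    study_dir = Path(study_dir)
    info = json.loads((study_dir / "info.json").read_text())
    best_number = info["best_trial_number"]
    params_path = study_dir / f"trial_{best_number}" / "parameters.json"
    if not params_path.is_file():
        raise FileNotFoundError(f"Missing parameters file: {params_path}")
    return params_path


# Demo on a temporary directory standing in for a real study.
with tempfile.TemporaryDirectory() as tmp:
    study = Path(tmp)
    (study / "info.json").write_text(json.dumps({"best_trial_number": 7}))
    (study / "trial_7").mkdir()
    (study / "trial_7" / "parameters.json").write_text("{}")
    relative = best_parameters_path(study).relative_to(study)

print(relative.as_posix())  # trial_7/parameters.json
```

The returned path can then be passed to `--config_path` in the train command.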
Inference with Trained Model#
Use the already trained model from optimization for predictions:
# Use the trained model directly for inference on new data
python -m scxpand.main inference \
--model_path results/optuna_studies/mlp_optimization/trial_42 \
--data_path new_data.h5ad \
--save_path predictions/
Custom Configuration#
Custom Parameter Overrides#
Override specific parameters while optimizing others:
# Fix batch size while optimizing other parameters
python -m scxpand.main optimize \
--model_type mlp \
--data_path data.h5ad \
--n_trials 100 \
--batch_size 4096 \
--use_log_transform true
You can override any parameter that appears in the model’s parameter class.
Configuration Files#
Use JSON configuration files for complex parameter sets:
{
  "use_log_transform": true,
  "use_zscore_norm": true,
  "n_epochs": 50,
  "early_stopping_patience": 10,
  "learning_rate": 1e-4,
  "target_sum": 10000
}
# Use configuration file
python -m scxpand.main optimize \
--model_type autoencoder \
--data_path data.h5ad \
--config_path config/autoencoder_config.json \
--n_trials 200
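The API reference describes the configuration file as supplying base parameters and the extra keyword arguments as overriding it. One way to picture that layering is the sketch below. It is an illustration only: `merge_config` and the default values are hypothetical, and how scXpand combines these sources internally is not shown in this guide.

```python
import json


def merge_config(defaults, config_file_values, cli_overrides):
    """Layer settings: defaults < config file < command-line overrides."""
    merged = dict(defaults)
    merged.update(config_file_values)
    # Skip overrides left unset (None) so they do not erase earlier values.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


defaults = {"batch_size": 2048, "n_epochs": 100, "learning_rate": 1e-3}  # made-up values
config_file_values = json.loads('{"n_epochs": 50, "learning_rate": 1e-4}')
cli_overrides = {"batch_size": 4096, "n_epochs": None}

merged = merge_config(defaults, config_file_values, cli_overrides)
print(merged)
```

Here `batch_size` comes from the command line, while `n_epochs` and `learning_rate` come from the file because the command line left them unset.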
Study Resumption#
scXpand provides resume functionality controlled by the --resume flag.
Resuming a Study (`--resume True`, Default):
By default, scXpand will automatically detect and continue a study if it finds an existing one with the same name.
# This will resume the study "existing_mlp_study" if it exists,
# or create it if it doesn't.
python -m scxpand.main optimize \
--model_type mlp \
--data_path data/example_data.h5ad \
--study_name "existing_mlp_study" \
--n_trials 50 # Run 50 additional trials
Starting a Fresh Study (`--resume False`):
To avoid accidentally overwriting results, scXpand never replaces an existing study. If you pass --resume False and a study with the same name already has trials, the program stops with an error and provides instructions for deleting the old study manually.
# This will fail if "mlp_fresh_study" already has trials.
python -m scxpand.main optimize \
--model_type mlp \
--data_path data/example_data.h5ad \
--study_name "mlp_fresh_study" \
--resume False \
--n_trials 100
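The resume rules can be summarised as a small guard function. The sketch below is not scXpand's code: `check_resume` is a hypothetical name, and it detects existing trials by looking for `trial_N/` directories, whereas the real implementation may query the Optuna database instead. The `ValueError` matches the exception listed in the API reference.

```python
import tempfile
from pathlib import Path


def check_resume(study_dir, resume):
    """Return the number of existing trials; refuse a fresh start over them."""
    study_dir = Path(study_dir)
    # Assumption: existing trials are detected by their trial_N/ directories.
    existing = sorted(p for p in study_dir.glob("trial_*") if p.is_dir())
    if existing and not resume:
        raise ValueError(
            f"Study at {study_dir} already has {len(existing)} trial(s). "
            "Delete it manually or rerun with --resume True."
        )
    return len(existing)


with tempfile.TemporaryDirectory() as tmp:
    fresh = check_resume(tmp, resume=False)   # empty directory: allowed
    (Path(tmp) / "trial_0").mkdir()
    resumed = check_resume(tmp, resume=True)  # continues the existing study
    try:
        check_resume(tmp, resume=False)
        blocked = False
    except ValueError:
        blocked = True                        # fresh start is refused
```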
Optimization System Architecture#
Study Management#
Each optimization run creates a persistent study stored in SQLite:
results/optuna_studies/
├── mlp_optimization/
│ ├── optuna.db # Trial database
│ ├── info.json # Study summary
│ ├── trials.log # Detailed trial logs
│ └── trial_0/, trial_1/... # Individual trial results
└── autoencoder_study/
└── ...
Study Components:
- Database: Persistent storage of all trials and results
- Metadata: Study configuration and best trial information
- Trial Artifacts: Complete model outputs for each trial
- Logs: Detailed execution logs for debugging
Optimization Algorithm#
scXpand uses Optuna’s TPE (Tree-structured Parzen Estimator) sampler with automated pruning:
from scxpand.hyperopt.hyperopt_optimizer import HyperparameterOptimizer
# Create optimizer with custom configuration
optimizer = HyperparameterOptimizer(
model_type="autoencoder",
data_path="data.h5ad",
study_name="custom_ae_opt",
score_metric="harmonic_avg/AUROC", # Optimization target
seed_base=42, # Reproducibility
num_workers=4, # Parallel trials
resume=True, # Resume existing study
fail_fast=False # Continue on errors
)
API Reference#
Hyperparameter Optimization Functions#
Single Model Optimization#
- scxpand.main.optimize(model_type, data_path='data/example_data.h5ad', n_trials=100, study_name=None, storage_path='results/optuna_studies', score_metric='harmonic_avg/AUROC', resume=True, seed_base=42, num_workers=4, config_path=None, fail_fast=False, **kwargs)#
Run hyperparameter optimization for a specified model type.
- Parameters:
model_type (ModelType | str) – Type of model to optimize (autoencoder, mlp, lightgbm, logistic, svm).
data_path (str, default: 'data/example_data.h5ad') – Path to the input data file (h5ad format).
n_trials (int, default: 100) – Number of optimization trials to run.
study_name (str | None, default: None) – Name of the optimization study (defaults to model_type).
storage_path (str, default: 'results/optuna_studies') – Directory to store optimization results.
score_metric (str, default: 'harmonic_avg/AUROC') – Metric to optimize (e.g., “harmonic_avg/AUROC”, “AUROC”, “AUPRC”).
resume (bool, default: True) – Whether to resume from existing study (False = start fresh).
seed_base (int, default: 42) – Base seed for reproducibility across trials.
num_workers (int, default: 4) – Number of workers for parallel processing.
config_path (str | None, default: None) – Path to configuration file for base parameters.
fail_fast (bool, default: False) – Whether to fail immediately on any exception (for testing).
**kwargs (Any) – Additional parameters to override config.
- Raises:
ValueError – If model_type is not supported for optimization.
FileNotFoundError – If data_path does not exist.
ValueError – If study already exists and resume=False (with instructions to delete manually).
- Returns:
None.
Examples
>>> # Single model optimization
>>> python -m scxpand.main optimize --model_type autoencoder --n_trials 100 --data_path data/example_data.h5ad
>>> python -m scxpand.main optimize --model_type mlp --n_trials 100 --data_path data/example_data.h5ad --n_epochs 10
Multi-Model Optimization#
- scxpand.main.optimize_all(data_path='data/example_data.h5ad', n_trials=100, storage_path='results/optuna_studies', score_metric='harmonic_avg/AUROC', resume=True, num_workers=4, model_types=None, **kwargs)#
Run hyperparameter optimization for all supported model types or a specified subset.
- Parameters:
data_path (str, default: 'data/example_data.h5ad') – Path to the input data file (h5ad format).
n_trials (int, default: 100) – Number of optimization trials per model type.
storage_path (str, default: 'results/optuna_studies') – Directory to store optimization results.
score_metric (str, default: 'harmonic_avg/AUROC') – Metric to optimize (e.g., “harmonic_avg/AUROC”, “AUROC”, “AUPRC”).
resume (bool, default: True) – Whether to resume existing studies (False = start fresh for all models).
num_workers (int, default: 4) – Number of workers for parallel processing.
model_types (list[ModelType] | None, default: None) – List of model types to optimize in order. If None, optimizes all supported models. Supported types: [“autoencoder”, “mlp”, “lightgbm”, “logistic”, “svm”].
**kwargs (Any) – Additional parameters to override config for all models.
- Returns:
None.
Examples
>>> # Optimize all models (parallel processing)
>>> python -m scxpand.main optimize-all --n_trials 10 --data_path data/example_data.h5ad --num_workers 6
>>>
>>> # Optimize specific model types only
>>> python -m scxpand.main optimize-all --n_trials 100 --data_path data/example_data.h5ad --model_types mlp,autoencoder
Optimization Framework#
Hyperparameter optimization using Optuna with robust trial management.
- class scxpand.hyperopt.hyperopt_optimizer.HyperparameterOptimizer(model_type, data_path, study_name=None, storage_path='results/optuna_studies', score_metric='harmonic_avg/AUROC', seed_base=42, num_workers=0, config_path=None, resume=True, fail_fast=False, **param_overrides)#
Bases: object
Robust hyperparameter optimizer using Optuna with enhanced trial management.
Features:
- Automatic cleanup of incomplete trials
- Proper exception handling and categorization
- Resume capability with duplicate prevention
- Comprehensive logging and monitoring
- __init__(model_type, data_path, study_name=None, storage_path='results/optuna_studies', score_metric='harmonic_avg/AUROC', seed_base=42, num_workers=0, config_path=None, resume=True, fail_fast=False, **param_overrides)#
Initialize the hyperparameter optimizer.
- Parameters:
model_type (ModelType | str) – Type of model to optimize (MLP, SVM, etc.).
study_name (str | None, default: None) – Name for the Optuna study (auto-generated if None).
storage_path (str | Path, default: 'results/optuna_studies') – Directory to store study results.
score_metric (str, default: 'harmonic_avg/AUROC') – Metric to optimize (e.g., “harmonic_avg/AUROC”).
seed_base (int, default: 42) – Base seed for reproducibility.
num_workers (int, default: 0) – Number of parallel workers (0 for single-threaded).
config_path (str | None, default: None) – Path to configuration file for parameter overrides.
resume (bool, default: True) – Whether to resume existing study (False = start fresh).
fail_fast (bool, default: False) – Whether to fail immediately on any exception (for testing).
**param_overrides – Additional parameter overrides.
- create_study()#
Create or load an Optuna study based on the resume setting.
- Return type:
- Returns:
The Optuna study object.
- Raises:
ValueError – If study exists but resume=False.
- objective(trial)#
Objective function for Optuna trials.
- print_results(study=None)#
Print optimization results.
Hyperparameter Search Ranges#
All hyperparameter definitions and ranges are maintained in scxpand/hyperopt/param_grids.py.