Output Format Specification
This document describes the output files generated by scXpand during training, optimization, and inference.
Overview
Training writes its output files to the results directory. All models produce a common core set of files; additional files depend on the model architecture and training approach.
Common Files (All Models)
These files are generated for every model type:
| File | Description |
|---|---|
| parameters.json | Model configuration and hyperparameters |
| model_type.txt | Model type identifier (e.g., “mlp”, “autoencoder”) |
| data_format.json | Input data preprocessing configuration |
| results_dev.txt | Evaluation metrics in text format |
| | Detailed per-cell predictions and metadata |
| | Patient IDs used for training |
| | Patient IDs used for validation |
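
As a minimal sketch of how these files might be consumed, the snippet below reads the two JSON configuration files from a results directory. The file names parameters.json and data_format.json are taken from the directory structure example in this document; the helper name load_run_config is hypothetical and not part of scXpand, and the contents of each file are assumed to be a single JSON document.

```python
import json
from pathlib import Path


def load_run_config(results_dir):
    """Read the JSON configuration files found in a results directory.

    Hypothetical helper, not part of scXpand. Files that do not exist
    are reported as None instead of raising an error.
    """
    results_dir = Path(results_dir)
    config = {}
    for name in ("parameters.json", "data_format.json"):
        path = results_dir / name
        if path.is_file():
            # Assumption: each file holds exactly one JSON document.
            config[name] = json.loads(path.read_text(encoding="utf-8"))
        else:
            config[name] = None
    return config
```

Calling load_run_config("results/autoencoder_v1") would return a dictionary keyed by file name, which keeps the caller independent of the exact keys stored inside each file.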
Model Files (Model-Specific)
Different model types save their trained models in different formats:
| Model Type | File | Description |
|---|---|---|
| PyTorch Models (autoencoder, mlp) | model.pt | Trained model state dictionary |
| Scikit-learn Models (logistic, svm, lightgbm) | | Serialized model using joblib |
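
The two storage formats call for different loaders. The sketch below shows one possible way to dispatch on the model type; load_trained_model is a hypothetical helper, not an scXpand function, and the caller supplies the file path. It relies only on torch.load and joblib.load, which are imported lazily so that an environment with only one of the two libraries can still use the other branch.

```python
from pathlib import Path

# Grouping taken from the table above.
PYTORCH_TYPES = {"autoencoder", "mlp"}
JOBLIB_TYPES = {"logistic", "svm", "lightgbm"}


def load_trained_model(model_path, model_type):
    """Load a saved model artifact according to its model type.

    Hypothetical helper. For PyTorch types the return value is the state
    dictionary, which still has to be loaded into a model instance that
    the caller constructs separately.
    """
    model_path = Path(model_path)
    if model_type in PYTORCH_TYPES:
        import torch  # lazy import: not needed for joblib-based models

        return torch.load(model_path, map_location="cpu")
    if model_type in JOBLIB_TYPES:
        import joblib  # lazy import: not needed for PyTorch models

        return joblib.load(model_path)
    raise ValueError(f"Unknown model type: {model_type!r}")
```

For a PyTorch model, the returned state dictionary would typically be passed to the load_state_dict method of a model built from the saved configuration.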
Visualization Files
| File | Location | Description |
|---|---|---|
| roc_curve_dev.png | plots/ | ROC curve visualization for validation set |
PyTorch Model Files (autoencoder, mlp)
PyTorch models (autoencoder and mlp) use the TrainLogger and create additional training monitoring files:
| File | Description |
|---|---|
| | Training run metadata and configuration |
| | Training summary including final metrics |
| | Configuration and metrics of best performing model |
Autoencoder and MLP-Specific Files
Only the autoencoder and the MLP models create checkpoint files for training resumption:
| File | Description |
|---|---|
| | Best model checkpoint during training |
| | Final model checkpoint |
| | Detailed metrics for the best model |
Linear Model Files (logistic, svm)
Logistic regression and SVM models use a simplified logger and create:
| File | Description |
|---|---|
| | Best hyperparameters and performance metrics |
| | Training summary including final metrics |
Hyperparameter Optimization Files
When using the optimize or optimize-all commands, additional files are created in the study directories:
| File | Location | Description |
|---|---|---|
| optuna.db | | SQLite database storing all trial data |
| info.json | | Study summary with best trial results |
| trials.log | | Detailed log of all trial executions |
| trial_{N}/ | | Individual trial directories with full model outputs |
Each trial_{N}/ directory contains the complete output files for that trial (model files, metrics, plots, etc.) as described in the sections above.
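
To walk the per-trial outputs in order, the trial directories can be enumerated and sorted by their numeric suffix. The following is a small sketch using only the standard library; list_trial_dirs is a hypothetical helper, and it assumes that trial directories are named exactly trial_ followed by digits, as in the trial_{N}/ pattern above.

```python
import re
from pathlib import Path

# Matches directory names such as trial_0 or trial_12.
TRIAL_DIR = re.compile(r"trial_(\d+)")


def list_trial_dirs(study_dir):
    """Return (trial_number, path) pairs for trial_{N}/ subdirectories.

    Hypothetical helper. Results are sorted numerically by N, so
    trial_2 comes before trial_10.
    """
    trials = []
    for child in Path(study_dir).iterdir():
        match = TRIAL_DIR.fullmatch(child.name)
        if match and child.is_dir():
            trials.append((int(match.group(1)), child))
    return sorted(trials, key=lambda item: item[0])
```

Sorting on the parsed integer avoids the lexicographic ordering that a plain sort on directory names would give.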
Model Type File
The model_type.txt file is automatically generated during training and contains a simple text identifier for the model type. This enables automatic model type detection during inference.
File Format:
The file contains just the model type as plain text:
mlp
Supported Values:
- autoencoder: Autoencoder models
- mlp: Multi-layer perceptron models
- lightgbm: LightGBM gradient boosting models
- logistic: Logistic regression models
- svm: Support vector machine models
Usage:
This file is automatically used by the inference pipeline to detect the model type without manual specification. If the file is missing, users can create it manually or specify the model type explicitly in their inference code.
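
The detection logic described here can be approximated as follows. This is an illustrative sketch, not the scXpand implementation: detect_model_type and its explicit_type parameter are hypothetical, and letting an explicitly passed type take precedence over the file is an assumption made for this example.

```python
from pathlib import Path

# Values listed under "Supported Values" above.
SUPPORTED_MODEL_TYPES = {"autoencoder", "mlp", "lightgbm", "logistic", "svm"}


def detect_model_type(results_dir, explicit_type=None):
    """Return the model type for a results directory.

    Hypothetical helper. An explicitly passed type is used as-is
    (assumption); otherwise the value is read from model_type.txt.
    """
    if explicit_type is not None:
        model_type = explicit_type
    else:
        type_file = Path(results_dir) / "model_type.txt"
        if not type_file.is_file():
            raise FileNotFoundError(
                f"{type_file} is missing; create it or pass explicit_type"
            )
        # The file holds just the identifier, so strip the trailing newline.
        model_type = type_file.read_text(encoding="utf-8").strip()
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"Unsupported model type: {model_type!r}")
    return model_type
```

Validating against the supported values catches a hand-written model_type.txt that contains a typo before any model loading is attempted.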
Directory Structure Example
results/
├── autoencoder_v1/                  # Main model results
│   ├── model.pt                     # Trained model
│   ├── parameters.json              # Model configuration
│   ├── model_type.txt               # Model type identifier
│   ├── data_format.json             # Data preprocessing config
│   ├── results_dev.txt              # Evaluation metrics
│   ├── plots/
│   │   └── roc_curve_dev.png        # ROC curve
│   └── ...
└── optuna_studies/                  # Hyperparameter optimization
    └── autoencoder_study/
        ├── optuna.db                # Trial database
        ├── info.json                # Study summary
        ├── trials.log               # Trial logs
        └── trial_0/, trial_1/, ...  # Individual trials
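
A quick sanity check of a results directory against the layout above might look like the sketch below. The helper missing_files is hypothetical, and the default list contains only file names that appear in the example tree; files elided by the "..." entry are not covered.

```python
from pathlib import Path

# File names copied from the example tree above; other runs may differ.
EXPECTED_FILES = (
    "parameters.json",
    "model_type.txt",
    "data_format.json",
    "results_dev.txt",
)


def missing_files(results_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from results_dir."""
    results_dir = Path(results_dir)
    return [name for name in expected if not (results_dir / name).is_file()]
```

An empty list means every file in the default set was found.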