Output Format Specification

This document describes the output files generated by scXpand during training, optimization, and inference.

Overview

Training writes its output files to the results directory. Every model produces a core set of common files; additional files depend on the model type, architecture, and training approach.

Common Files (All Models)

These files are generated for every model type:

| File | Description |
| --- | --- |
| parameters.json | Model configuration and hyperparameters |
| model_type.txt | Model type identifier (e.g., "mlp", "autoencoder") |
| data_format.json | Input data preprocessing configuration |
| results_dev.txt | Evaluation metrics in text format |
| results_table_dev.csv | Detailed per-cell predictions and metadata |
| train_patient_ids.csv | Patient IDs used for training |
| dev_patient_ids.csv | Patient IDs used for validation |
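
As a quick sanity check after training, the common files listed above can be verified with a few lines of Python. This is a minimal sketch, not part of scXpand: the helper name `missing_common_files` is hypothetical, and the file names are copied from the table above.

```python
from pathlib import Path

# File names copied from the "Common Files" table; this list is not read from scXpand.
COMMON_FILES = [
    "parameters.json",
    "model_type.txt",
    "data_format.json",
    "results_dev.txt",
    "results_table_dev.csv",
    "train_patient_ids.csv",
    "dev_patient_ids.csv",
]


def missing_common_files(results_dir):
    """Return the common output files that are absent from results_dir (hypothetical helper)."""
    root = Path(results_dir)
    return [name for name in COMMON_FILES if not (root / name).is_file()]
```

An empty return value means every common file is present in the given directory.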

Model Files (Model-Specific)

Different model types save their trained models in different formats:

| Model Type | File | Description |
| --- | --- | --- |
| PyTorch Models (autoencoder, mlp) | model.pt | Trained model state dictionary |
| Scikit-learn Models (logistic, svm, lightgbm) | model.joblib | Serialized model using joblib |
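
The mapping from model type to model file can be written down directly. The sketch below only resolves the expected path; the function name `model_file_path` is hypothetical, and the grouping of model types follows the table above.

```python
from pathlib import Path

# Groupings taken from the "Model Files" table above.
PYTORCH_MODEL_TYPES = {"autoencoder", "mlp"}
JOBLIB_MODEL_TYPES = {"logistic", "svm", "lightgbm"}


def model_file_path(results_dir, model_type):
    """Return the expected trained-model path for a model type (hypothetical helper)."""
    if model_type in PYTORCH_MODEL_TYPES:
        return Path(results_dir) / "model.pt"
    if model_type in JOBLIB_MODEL_TYPES:
        return Path(results_dir) / "model.joblib"
    raise ValueError(f"Unknown model type: {model_type!r}")
```

The returned path would typically be passed to `torch.load` for `model.pt` or `joblib.load` for `model.joblib`; how scXpand itself loads these files is not covered here.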

Visualization Files

| File | Location | Description |
| --- | --- | --- |
| roc_curve_dev.png | plots/ subdirectory | ROC curve visualization for validation set |

PyTorch Model Files (autoencoder, mlp)

PyTorch models (autoencoder and mlp) use the TrainLogger and create additional training monitoring files:

| File | Description |
| --- | --- |
| run_info.json | Training run metadata and configuration |
| summary_info.json | Training summary including final metrics |
| best_model_info.json | Configuration and metrics of best performing model |

Autoencoder and MLP-Specific Files

Only the autoencoder and MLP models create checkpoint files for training resumption, along with a detailed metrics file for the best model:

| File | Description |
| --- | --- |
| best_ckpt.pt | Best model checkpoint during training |
| last_ckpt.pt | Final model checkpoint |
| best_model_dev_set_metrics.json | Detailed metrics for the best model |

Linear Model Files (logistic, svm)

Logistic regression and SVM models use a simplified logger and create:

| File | Description |
| --- | --- |
| best_model_info.json | Best hyperparameters and performance metrics |
| summary_info.json | Training summary including final metrics |

Hyperparameter Optimization Files

When the optimize or optimize-all commands are used, additional files are created in the study directories:

| File | Location | Description |
| --- | --- | --- |
| optuna.db | results/optuna_studies/{study_name}/ | SQLite database storing all trial data |
| info.json | results/optuna_studies/{study_name}/ | Study summary with best trial results |
| trials.log | results/optuna_studies/{study_name}/ | Detailed log of all trial executions |
| trial_{N}/ | results/optuna_studies/{study_name}/ | Individual trial directories with full model outputs |

Each trial_{N}/ directory contains the complete output files for that trial (model files, metrics, plots, etc.) as described in the sections above.
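
To walk the per-trial outputs of a study, the `trial_{N}/` directories can be collected and ordered by trial number. The following is a minimal sketch that assumes only the naming scheme shown above; `list_trial_dirs` is a hypothetical helper, not an scXpand function.

```python
import re
from pathlib import Path

# Matches directory names such as "trial_0" or "trial_12".
TRIAL_DIR_PATTERN = re.compile(r"trial_(\d+)")


def list_trial_dirs(study_dir):
    """Return trial directories under study_dir, sorted by trial number (hypothetical helper)."""
    trials = []
    for entry in Path(study_dir).iterdir():
        match = TRIAL_DIR_PATTERN.fullmatch(entry.name)
        if match and entry.is_dir():
            trials.append((int(match.group(1)), entry))
    # Sort numerically so that trial_10 comes after trial_2.
    return [path for _, path in sorted(trials)]
```

Sorting on the parsed integer avoids the lexicographic ordering in which `trial_10` would precede `trial_2`.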

Model Type File

The model_type.txt file is automatically generated during training and contains a simple text identifier for the model type. This enables automatic model type detection during inference.

File Format:

The file contains just the model type as plain text:

mlp

Supported Values:

  • autoencoder - Autoencoder models

  • mlp - Multi-layer perceptron models

  • lightgbm - LightGBM gradient boosting models

  • logistic - Logistic regression models

  • svm - Support vector machine models

Usage:

This file is automatically used by the inference pipeline to detect the model type without manual specification. If the file is missing, users can create it manually or specify the model type explicitly in their inference code.
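
The detection logic described above can be approximated in a few lines. This sketch is an illustration of the behavior, not scXpand's actual implementation: `detect_model_type` and its `fallback` parameter are hypothetical names, and the supported values are copied from the list above.

```python
from pathlib import Path

# Supported values copied from the list above.
SUPPORTED_MODEL_TYPES = {"autoencoder", "mlp", "lightgbm", "logistic", "svm"}


def detect_model_type(results_dir, fallback=None):
    """Read model_type.txt, or use an explicitly supplied fallback if the file is missing."""
    type_file = Path(results_dir) / "model_type.txt"
    if type_file.is_file():
        # The file holds only the identifier; strip any trailing newline.
        model_type = type_file.read_text(encoding="utf-8").strip()
    elif fallback is not None:
        model_type = fallback
    else:
        raise FileNotFoundError(f"{type_file} is missing and no fallback model type was given")
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"Unsupported model type: {model_type!r}")
    return model_type
```

In this sketch the file takes precedence over the fallback, so an explicit model type is only consulted when `model_type.txt` is absent.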

Directory Structure Example

results/
├── autoencoder_v1/                    # Main model results
│   ├── model.pt                       # Trained model
│   ├── parameters.json                # Model configuration
│   ├── model_type.txt                 # Model type identifier
│   ├── data_format.json               # Data preprocessing config
│   ├── results_dev.txt                # Evaluation metrics
│   ├── plots/
│   │   └── roc_curve_dev.png          # ROC curve
│   └── ...
└── optuna_studies/                    # Hyperparameter optimization
    └── autoencoder_study/
        ├── optuna.db                  # Trial database
        ├── info.json                  # Study summary
        ├── trials.log                 # Trial logs
        └── trial_0/, trial_1/, ...    # Individual trials