Output Format Specification
============================

This document describes the output files generated by scXpand during training, optimization, and inference.

Overview
--------

Training writes its outputs to the results directory. Every model type produces a common set of files; additional files depend on the model architecture and training approach.

Common Files (All Models)
--------------------------

These files are generated for every model type:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - File
     - Description
   * - ``parameters.json``
     - Model configuration and hyperparameters
   * - ``model_type.txt``
     - Model type identifier (e.g., ``mlp``, ``autoencoder``)
   * - ``data_format.json``
     - Input data preprocessing configuration
   * - ``results_dev.txt``
     - Evaluation metrics on the validation (dev) set, in plain text
   * - ``results_table_dev.csv``
     - Detailed per-cell predictions and metadata
   * - ``train_patient_ids.csv``
     - Patient IDs used for training
   * - ``dev_patient_ids.csv``
     - Patient IDs used for validation
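
As a quick sanity check, the presence of these files can be verified with a few lines of Python. This is a minimal sketch rather than part of scXpand: the helper name ``missing_common_files`` is hypothetical, and the file names are copied from the table above.

.. code-block:: python

   from pathlib import Path

   # File names copied from the "Common Files" table.
   COMMON_FILES = (
       "parameters.json",
       "model_type.txt",
       "data_format.json",
       "results_dev.txt",
       "results_table_dev.csv",
       "train_patient_ids.csv",
       "dev_patient_ids.csv",
   )


   def missing_common_files(results_dir):
       """Return the common output files that are absent from results_dir.

       Hypothetical helper for illustration; not an scXpand function.
       """
       root = Path(results_dir)
       return [name for name in COMMON_FILES if not (root / name).is_file()]

An empty list means that every common file is present in the directory.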

Model Files (Model-Specific)
-----------------------------

Different model types save their trained models in different formats:

.. list-table::
   :header-rows: 1
   :widths: 40 20 40

   * - Model Type
     - File
     - Description
   * - PyTorch Models (autoencoder, mlp)
     - ``model.pt``
     - Trained model state dictionary
   * - Scikit-learn and LightGBM Models (logistic, svm, lightgbm)
     - ``model.joblib``
     - Serialized model using joblib
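
The mapping in the table can be expressed as a small lookup. The sketch below is illustrative only: ``model_file_path`` is a hypothetical helper, not an scXpand function, and it assumes the model type strings listed in this document.

.. code-block:: python

   from pathlib import Path

   # Mapping taken from the table above (model type -> saved model file).
   MODEL_FILE_BY_TYPE = {
       "autoencoder": "model.pt",
       "mlp": "model.pt",
       "logistic": "model.joblib",
       "svm": "model.joblib",
       "lightgbm": "model.joblib",
   }


   def model_file_path(results_dir, model_type):
       """Return the expected path of the saved model for a given model type.

       Hypothetical helper for illustration; not an scXpand function.
       """
       try:
           filename = MODEL_FILE_BY_TYPE[model_type]
       except KeyError:
           raise ValueError(f"Unknown model type: {model_type!r}") from None
       return Path(results_dir) / filename

The returned path can then be handed to the matching loader: ``joblib.load`` for ``model.joblib``, or ``torch.load`` followed by ``load_state_dict`` on a model instance for ``model.pt``. How scXpand itself constructs that model instance is not covered here.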

Visualization Files
-------------------

.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - File
     - Location
     - Description
   * - ``roc_curve_dev.png``
     - ``plots/`` subdirectory
     - ROC curve for the validation set

PyTorch Model Files (autoencoder, mlp)
---------------------------------------

PyTorch models (``autoencoder`` and ``mlp``) use ``TrainLogger`` and write additional training-monitoring files:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - File
     - Description
   * - ``run_info.json``
     - Training run metadata and configuration
   * - ``summary_info.json``
     - Training summary including final metrics
   * - ``best_model_info.json``
     - Configuration and metrics of the best-performing model

Autoencoder and MLP-Specific Files
----------------------------------

Only the autoencoder and MLP models write checkpoint files, which allow training to be resumed, along with detailed metrics for the best model:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - File
     - Description
   * - ``best_ckpt.pt``
     - Checkpoint of the best model seen during training
   * - ``last_ckpt.pt``
     - Final model checkpoint
   * - ``best_model_dev_set_metrics.json``
     - Detailed metrics for the best model

Linear Model Files (logistic, svm)
-----------------------------------

Logistic regression and SVM models use a simplified logger and create:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - File
     - Description
   * - ``best_model_info.json``
     - Best hyperparameters and performance metrics
   * - ``summary_info.json``
     - Training summary including final metrics

Hyperparameter Optimization Files
----------------------------------

When the ``optimize`` or ``optimize-all`` command is used, additional files are created in the study directories:

.. list-table::
   :header-rows: 1
   :widths: 25 35 40

   * - File
     - Location
     - Description
   * - ``optuna.db``
     - ``results/optuna_studies/{study_name}/``
     - SQLite database storing all trial data
   * - ``info.json``
     - ``results/optuna_studies/{study_name}/``
     - Study summary with best trial results
   * - ``trials.log``
     - ``results/optuna_studies/{study_name}/``
     - Detailed log of all trial executions
   * - ``trial_{N}/``
     - ``results/optuna_studies/{study_name}/``
     - Individual trial directories with full model outputs

Each ``trial_{N}/`` directory contains the complete output files for that trial (model files, metrics, plots, etc.) as described in the sections above.
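
To walk the per-trial outputs, the trial directories can be enumerated by name. The following is a minimal sketch under the layout described above; ``list_trial_dirs`` is a hypothetical helper, not an scXpand function.

.. code-block:: python

   import re
   from pathlib import Path

   # Matches directory names of the form trial_{N}, e.g. trial_0, trial_12.
   TRIAL_DIR_PATTERN = re.compile(r"^trial_(\d+)$")


   def list_trial_dirs(study_dir):
       """Return (trial_number, path) pairs sorted by trial number.

       Hypothetical helper for illustration; not an scXpand function.
       """
       trials = []
       for entry in Path(study_dir).iterdir():
           match = TRIAL_DIR_PATTERN.match(entry.name)
           if match and entry.is_dir():
               trials.append((int(match.group(1)), entry))
       return sorted(trials, key=lambda item: item[0])

Sorting on the parsed integer keeps ``trial_10`` after ``trial_2``, which a plain string sort would not.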

Model Type File
---------------

The ``model_type.txt`` file is automatically generated during training and contains a simple text identifier for the model type. This enables automatic model type detection during inference.

**File Format:**

The file contains just the model type as plain text:

.. code-block:: text

   mlp

**Supported Values:**

- ``autoencoder`` - Autoencoder models
- ``mlp`` - Multi-layer perceptron models
- ``lightgbm`` - LightGBM gradient boosting models
- ``logistic`` - Logistic regression models
- ``svm`` - Support vector machine models

**Usage:**

The inference pipeline reads this file to detect the model type, so the type does not need to be specified manually. If the file is missing, create it manually or pass the model type explicitly in the inference code.
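
The lookup described above can be sketched as follows. ``read_model_type`` and its ``fallback`` argument are hypothetical names, not scXpand API, and stripping surrounding whitespace is an assumption about how the file is written.

.. code-block:: python

   from pathlib import Path

   # Values listed under "Supported Values" above.
   SUPPORTED_MODEL_TYPES = {"autoencoder", "mlp", "lightgbm", "logistic", "svm"}


   def read_model_type(results_dir, fallback=None):
       """Read model_type.txt, or use an explicitly supplied fallback.

       Hypothetical helper for illustration; not an scXpand function.
       """
       path = Path(results_dir) / "model_type.txt"
       if path.is_file():
           # Assumption: the file may end with a newline, so strip whitespace.
           model_type = path.read_text(encoding="utf-8").strip()
       elif fallback is not None:
           model_type = fallback
       else:
           raise FileNotFoundError(f"{path} not found and no fallback given")
       if model_type not in SUPPORTED_MODEL_TYPES:
           raise ValueError(f"Unsupported model type: {model_type!r}")
       return model_type

Validating against the supported values catches a mistyped, manually created file early instead of failing later during model loading.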

Directory Structure Example
----------------------------

.. code-block:: text

   results/
   ├── autoencoder_v1/                    # Main model results
   │   ├── model.pt                       # Trained model
   │   ├── parameters.json                # Model configuration
   │   ├── model_type.txt                 # Model type identifier
   │   ├── data_format.json               # Data preprocessing config
   │   ├── results_dev.txt                # Evaluation metrics
   │   ├── plots/
   │   │   └── roc_curve_dev.png          # ROC curve
   │   └── ...
   └── optuna_studies/                    # Hyperparameter optimization
       └── autoencoder_study/
           ├── optuna.db                  # Trial database
           ├── info.json                  # Study summary
           ├── trials.log                 # Trial logs
           └── trial_0/, trial_1/, ...    # Individual trials
