scxpand.data_util.data_splitter

Functions

calculate_and_log_cancer_distribution(...)

Calculate and log cancer type distribution across train/validation splits.

calculate_and_log_category_distributions(...)

Calculate and log category distributions (imputed_labels, tissue_type) across train/validation splits.

get_patient_identifiers(obs_df)

Generate unique patient identifiers by combining study and patient columns.

log_dataset_overview(obs_df, patient_identifiers)

Log overview statistics of the dataset.

log_split_results(n_train_patients, ...)

Log the results of the data split.

save_patient_ids(save_path, ...)

Save patient IDs to CSV files.

split_data(adata, dev_ratio[, random_seed, ...])

Split the data into training and validation sets using patient-level stratification.

validate_patient_cancer_types(...)

Validate that each patient has exactly one cancer type and return cancer types per patient.

scxpand.data_util.data_splitter.calculate_and_log_cancer_distribution(train_patient_ids, dev_patient_ids, uniq_patient_ids, cancer_types_per_patient)

Calculate and log cancer type distribution across train/validation splits.

Parameters:
  • train_patient_ids (list[str]) – List of patient IDs in training set

  • dev_patient_ids (list[str]) – List of patient IDs in validation set

  • uniq_patient_ids (list[str]) – List of all unique patient IDs

  • cancer_types_per_patient (list[str]) – List of cancer types per patient

Return type:

None
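The comparison this function logs can be pictured with a small, dependency-free sketch. It is not the scxpand implementation: the name `cancer_distribution_sketch` is hypothetical, the two per-patient lists are assumed to be aligned element by element, and the counts are returned only so they can be inspected (the documented function returns None).

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def cancer_distribution_sketch(train_patient_ids, dev_patient_ids,
                               uniq_patient_ids, cancer_types_per_patient):
    """Count patients per cancer type in each split (illustrative only)."""
    # Assumption: uniq_patient_ids[i] has cancer type cancer_types_per_patient[i].
    cancer_of = dict(zip(uniq_patient_ids, cancer_types_per_patient))

    train_counts = Counter(cancer_of[p] for p in train_patient_ids)
    dev_counts = Counter(cancer_of[p] for p in dev_patient_ids)

    # One log line per cancer type, so imbalances between splits are easy to spot.
    for cancer_type in sorted(set(cancer_types_per_patient)):
        logger.info(
            "%s: train=%d, dev=%d",
            cancer_type, train_counts[cancer_type], dev_counts[cancer_type],
        )
    return train_counts, dev_counts
```

A cancer type that is missing from one split shows up with a count of 0, because `Counter` returns 0 for unseen keys.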

scxpand.data_util.data_splitter.calculate_and_log_category_distributions(train_obs_df, dev_obs_df)

Calculate and log category distributions (imputed_labels, tissue_type) across train/validation splits.

Parameters:
  • train_obs_df (DataFrame) – Training set observation DataFrame

  • dev_obs_df (DataFrame) – Validation set observation DataFrame

Return type:

None

scxpand.data_util.data_splitter.get_patient_identifiers(obs_df)

Generate unique patient identifiers by combining study and patient columns.

Creates composite identifiers in the format ‘study:patient’ to uniquely identify patients across different studies.

Parameters:

obs_df (DataFrame) – DataFrame containing ‘study’ and ‘patient’ columns.

Return type:

Series

Returns:

Series of unique patient identifiers.

Raises:

ValueError – If study or patient identifiers contain the separator character.
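The reason for the separator check can be seen in a minimal sketch. It assumes the separator is ':' (inferred from the 'study:patient' format above) and works on two plain lists instead of a DataFrame; `patient_identifiers_sketch` and `SEPARATOR` are hypothetical names, not part of scxpand.

```python
SEPARATOR = ":"  # assumed from the 'study:patient' format described above


def patient_identifiers_sketch(studies, patients):
    """Build 'study:patient' identifiers from two aligned lists."""
    identifiers = []
    for study, patient in zip(studies, patients):
        # Reject values that already contain the separator; otherwise two
        # different (study, patient) pairs could produce the same identifier.
        if SEPARATOR in str(study) or SEPARATOR in str(patient):
            raise ValueError(
                f"study/patient values must not contain {SEPARATOR!r}: "
                f"{study!r}, {patient!r}"
            )
        identifiers.append(f"{study}{SEPARATOR}{patient}")
    return identifiers
```

Without the check, the pairs ('a:b', 'c') and ('a', 'b:c') would both map to 'a:b:c', and two distinct patients would be merged.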

Example

>>> identifiers = get_patient_identifiers(adata.obs)
>>> print(identifiers.head())  # ['study1:patient1', 'study1:patient2', ...]

scxpand.data_util.data_splitter.log_dataset_overview(obs_df, patient_identifiers)

Log overview statistics of the dataset.

Parameters:
  • obs_df (DataFrame) – DataFrame containing observation data

  • patient_identifiers (Series) – Series of unique patient identifiers

Return type:

None

scxpand.data_util.data_splitter.log_split_results(n_train_patients, n_dev_patients, n_train_cells, n_dev_cells, n_total_cells)

Log the results of the data split.

Parameters:
  • n_train_patients (int) – Number of patients in training set

  • n_dev_patients (int) – Number of patients in validation set

  • n_train_cells (int) – Number of cells in training set

  • n_dev_cells (int) – Number of cells in validation set

  • n_total_cells (int) – Total number of cells

Return type:

None

scxpand.data_util.data_splitter.save_patient_ids(save_path, train_patient_ids, dev_patient_ids)

Save patient IDs to CSV files.

Parameters:
  • save_path (Path) – Directory path to save files

  • train_patient_ids (list[str]) – List of training patient IDs

  • dev_patient_ids (list[str]) – List of validation patient IDs

Return type:

None
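As a rough illustration of what saving the split might look like, the sketch below writes one single-column CSV per split using only the standard library. The file names and the `patient_id` header are hypothetical; this page does not state what the real function writes, so treat them as placeholders.

```python
import csv
from pathlib import Path


def save_patient_ids_sketch(save_path, train_patient_ids, dev_patient_ids):
    """Write one single-column CSV per split into save_path."""
    save_path = Path(save_path)
    save_path.mkdir(parents=True, exist_ok=True)

    # Hypothetical file names; the real function may name its files differently.
    outputs = {
        "train_patient_ids.csv": train_patient_ids,
        "dev_patient_ids.csv": dev_patient_ids,
    }
    for file_name, patient_ids in outputs.items():
        with open(save_path / file_name, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["patient_id"])  # hypothetical header
            writer.writerows([pid] for pid in patient_ids)
```

Storing the patient IDs rather than row indices keeps the split reusable if the cells are later filtered or reordered.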

scxpand.data_util.data_splitter.split_data(adata, dev_ratio, random_seed=None, save_path=None)

Split the data into training and validation sets using patient-level stratification.

This function performs stratified splitting at the patient level to ensure that:

  1. No patient appears in both training and validation sets
  2. Cancer type distributions are preserved across splits
  3. Detailed logging provides transparency into the split process

Parameters:
  • adata (AnnData) – AnnData object with the data

  • dev_ratio (float) – Fraction of the data assigned to the validation set, between 0.0 and 1.0

  • random_seed (int | None (default: None)) – Random seed for reproducible splits

  • save_path (Path | None (default: None)) – Optional path to save the split patient IDs as CSV files

Returns:

(row_inds_train, row_inds_dev): sorted row indices for the training and validation sets

Return type:

tuple
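The idea of patient-level stratification can be sketched without AnnData or any third-party library. This is an illustration under stated assumptions, not the scxpand implementation: `split_patients_sketch` is a hypothetical name, patients are shuffled within each cancer type and the first `round(n * dev_ratio)` go to validation (the real rounding and sampling rules are not documented here), and the inputs are a per-cell list of patient IDs plus a patient-to-cancer-type mapping.

```python
import random
from collections import defaultdict


def split_patients_sketch(patient_per_cell, cancer_type_per_patient,
                          dev_ratio, random_seed=None):
    """Return sorted row indices (train, dev) with no patient in both sets."""
    rng = random.Random(random_seed)

    # Group patients by cancer type so each type is split separately.
    patients_by_type = defaultdict(list)
    for patient, cancer_type in sorted(cancer_type_per_patient.items()):
        patients_by_type[cancer_type].append(patient)

    dev_patients = set()
    for cancer_type in sorted(patients_by_type):
        patients = patients_by_type[cancer_type]
        rng.shuffle(patients)
        # Assumed rule: validation share per type, rounded with Python's round().
        n_dev = round(len(patients) * dev_ratio)
        dev_patients.update(patients[:n_dev])

    # Every cell (row) follows its patient, so a patient never straddles splits.
    row_inds_train, row_inds_dev = [], []
    for row, patient in enumerate(patient_per_cell):
        (row_inds_dev if patient in dev_patients else row_inds_train).append(row)
    return row_inds_train, row_inds_dev
```

Because whole patients are assigned to one split, the resulting fraction of validation cells only approximates `dev_ratio` when patients contribute different numbers of cells.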

scxpand.data_util.data_splitter.validate_patient_cancer_types(uniq_patient_ids, patient_identifiers, obs_df)

Validate that each patient has exactly one cancer type and return cancer types per patient.

Parameters:
  • uniq_patient_ids (list[str]) – List of unique patient IDs

  • patient_identifiers (Series) – Series of patient identifiers

  • obs_df (DataFrame) – DataFrame containing observation data

Return type:

list[str]

Returns:

List of cancer types per patient

Raises:

ValueError – If any patient has multiple cancer types
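The validation described above amounts to collecting the cancer types seen for each patient and checking that there is exactly one. The sketch below is a simplified stand-in, not the library code: `validate_cancer_types_sketch` is a hypothetical name, and it takes a plain per-cell list of cancer types instead of `obs_df`, since the relevant column name is not given on this page.

```python
from collections import defaultdict


def validate_cancer_types_sketch(uniq_patient_ids, patient_identifiers, cancer_types):
    """Return one cancer type per unique patient, or raise ValueError."""
    # Collect the set of cancer types observed across each patient's cells.
    types_seen = defaultdict(set)
    for patient, cancer_type in zip(patient_identifiers, cancer_types):
        types_seen[patient].add(cancer_type)

    cancer_types_per_patient = []
    for patient in uniq_patient_ids:
        observed = types_seen[patient]
        # Exactly one type is required; this sketch also rejects a patient
        # with no cells at all (zero observed types).
        if len(observed) != 1:
            raise ValueError(
                f"Patient {patient!r} has {len(observed)} cancer types: "
                f"{sorted(observed)}"
            )
        cancer_types_per_patient.append(next(iter(observed)))
    return cancer_types_per_patient
```

The returned list follows the order of `uniq_patient_ids`, which is what makes it usable as the stratification label for a patient-level split.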