scxpand.data_util.data_splitter
Functions
| Function | Description |
|---|---|
| calculate_and_log_cancer_distribution | Calculate and log cancer type distribution across train/validation splits. |
| calculate_and_log_category_distributions | Calculate and log category distributions (imputed_labels, tissue_type) across train/validation splits. |
| get_patient_identifiers | Generate unique patient identifiers by combining study and patient columns. |
| log_dataset_overview | Log overview statistics of the dataset. |
| log_split_results | Log the results of the data split. |
| save_patient_ids | Save patient IDs to CSV files. |
| split_data | Split the data into training and validation sets using patient-level stratification. |
| validate_patient_cancer_types | Validate that each patient has exactly one cancer type and return cancer types per patient. |
- scxpand.data_util.data_splitter.calculate_and_log_cancer_distribution(train_patient_ids, dev_patient_ids, uniq_patient_ids, cancer_types_per_patient)
Calculate and log cancer type distribution across train/validation splits.
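The computation described here can be pictured as counting patients per cancer type in each split. The following is a minimal sketch, not the library's implementation: the name `cancer_distribution_sketch` is hypothetical, the log format is invented for illustration, and it assumes `cancer_types_per_patient` is aligned position by position with `uniq_patient_ids`.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def cancer_distribution_sketch(train_patient_ids, dev_patient_ids,
                               uniq_patient_ids, cancer_types_per_patient):
    """Hypothetical helper: per-split patient counts for each cancer type."""
    # Assumption: cancer_types_per_patient[i] belongs to uniq_patient_ids[i].
    type_of = dict(zip(uniq_patient_ids, cancer_types_per_patient))
    train_counts = Counter(type_of[p] for p in train_patient_ids)
    dev_counts = Counter(type_of[p] for p in dev_patient_ids)

    summary = {}
    for cancer_type in sorted(set(cancer_types_per_patient)):
        n_train = train_counts.get(cancer_type, 0)
        n_dev = dev_counts.get(cancer_type, 0)
        summary[cancer_type] = (n_train, n_dev)
        # The message layout is illustrative only.
        logger.info("%s: train=%d, dev=%d", cancer_type, n_train, n_dev)
    return summary


summary = cancer_distribution_sketch(
    train_patient_ids=["a", "b"],
    dev_patient_ids=["c"],
    uniq_patient_ids=["a", "b", "c"],
    cancer_types_per_patient=["lung", "breast", "lung"],
)
```

Returning the counts as well as logging them makes the sketch easy to check in a test; the documented function is only described as logging.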
- scxpand.data_util.data_splitter.calculate_and_log_category_distributions(train_obs_df, dev_obs_df)
Calculate and log category distributions (imputed_labels, tissue_type) across train/validation splits.
- scxpand.data_util.data_splitter.get_patient_identifiers(obs_df)
Generate unique patient identifiers by combining study and patient columns.
Creates composite identifiers in the format ‘study:patient’ to uniquely identify patients across different studies.
- Parameters:
obs_df (DataFrame) – DataFrame containing ‘study’ and ‘patient’ columns.
- Return type:
Series
- Returns:
Series of unique patient identifiers.
- Raises:
ValueError – If study or patient identifiers contain the separator character.
Example
>>> identifiers = get_patient_identifiers(adata.obs)
>>> print(identifiers.head())
# ['study1:patient1', 'study1:patient2', ...]
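To show why the separator check matters, here is a minimal sketch of how such composite identifiers could be built with pandas. It is an illustration under assumptions, not the library's code: `patient_identifiers_sketch` and `SEPARATOR` are hypothetical names, and the separator `:` is taken from the documented ‘study:patient’ format.

```python
import pandas as pd

SEPARATOR = ":"  # assumed from the documented 'study:patient' format


def patient_identifiers_sketch(obs_df: pd.DataFrame) -> pd.Series:
    """Hypothetical re-creation of the documented behaviour."""
    study = obs_df["study"].astype(str)
    patient = obs_df["patient"].astype(str)
    # A separator inside either part would make the composite ambiguous,
    # e.g. 'a:b' + ':' + 'c' could not be told apart from 'a' + ':' + 'b:c'.
    for name, column in (("study", study), ("patient", patient)):
        if column.str.contains(SEPARATOR, regex=False).any():
            raise ValueError(f"{name} values must not contain {SEPARATOR!r}")
    return study + SEPARATOR + patient


obs = pd.DataFrame({"study": ["s1", "s1", "s2"], "patient": ["p1", "p2", "p1"]})
ids = patient_identifiers_sketch(obs)
```

In this example the patient label `p1` occurs in two studies, and the composite identifier keeps the two patients distinct.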
- scxpand.data_util.data_splitter.log_dataset_overview(obs_df, patient_identifiers)
Log overview statistics of the dataset.
- scxpand.data_util.data_splitter.log_split_results(n_train_patients, n_dev_patients, n_train_cells, n_dev_cells, n_total_cells)
Log the results of the data split.
- scxpand.data_util.data_splitter.save_patient_ids(save_path, train_patient_ids, dev_patient_ids)
Save patient IDs to CSV files.
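The documentation does not state which files are written or how they are laid out. As a rough sketch only, one plausible arrangement is one CSV per split; the file names `train_patient_ids.csv` and `dev_patient_ids.csv`, the `patient_id` column header, and the helper name `save_patient_ids_sketch` are all assumptions made for this example.

```python
import csv
from pathlib import Path


def save_patient_ids_sketch(save_path, train_patient_ids, dev_patient_ids):
    """Hypothetical helper; the file and column names are assumptions."""
    out_dir = Path(save_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split_name, ids in (("train", train_patient_ids), ("dev", dev_patient_ids)):
        target = out_dir / f"{split_name}_patient_ids.csv"
        with target.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["patient_id"])          # header row
            writer.writerows([pid] for pid in ids)   # one identifier per row
        written.append(target)
    return written
```

Persisting the patient lists rather than row indices keeps the split reproducible even if the cell ordering of the dataset changes later.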
- scxpand.data_util.data_splitter.split_data(adata, dev_ratio, random_seed=None, save_path=None)
Split the data into training and validation sets using patient-level stratification.
This function performs stratified splitting at the patient level to ensure that:
1. No patient appears in both training and validation sets
2. Cancer type distributions are preserved across splits
3. Detailed logging provides transparency into the split process
- Returns:
(row_inds_train, row_inds_dev) - sorted indices for training and validation sets
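The idea of patient-level stratification can be sketched in a few lines. The following is a hand-rolled stand-in, not the library's implementation (which may rely on a different splitting routine): `split_rows_by_patient_sketch` is a hypothetical name, it takes plain per-row arrays instead of an AnnData object, and it assumes every patient has exactly one cancer type.

```python
import numpy as np


def split_rows_by_patient_sketch(patient_per_row, cancer_type_per_row,
                                 dev_ratio, random_seed=None):
    """Hypothetical sketch: returns sorted (train, dev) row indices."""
    patients = np.asarray(patient_per_row).tolist()
    cancer_types = np.asarray(cancer_type_per_row).tolist()
    rng = np.random.default_rng(random_seed)

    # Assumption: one cancer type per patient (validated beforehand).
    type_of = dict(zip(patients, cancer_types))

    dev_patients = set()
    for cancer_type in sorted(set(type_of.values())):
        group = sorted(p for p, t in type_of.items() if t == cancer_type)
        rng.shuffle(group)
        # Take roughly dev_ratio of the patients within each cancer type.
        n_dev = int(round(len(group) * dev_ratio))
        dev_patients.update(group[:n_dev])

    # Whole patients go to one side, so no patient can straddle the split.
    is_dev = np.array([p in dev_patients for p in patients], dtype=bool)
    return np.flatnonzero(~is_dev), np.flatnonzero(is_dev)


# Eight patients with two cells each; four lung and four breast patients.
patients = [f"p{i}" for i in range(8) for _ in range(2)]
types = ["lung" if int(p[1:]) < 4 else "breast" for p in patients]
train_idx, dev_idx = split_rows_by_patient_sketch(patients, types, 0.25, random_seed=0)
```

Because the assignment is made per patient and per cancer type, all cells of a patient land in the same split and each cancer type contributes about `dev_ratio` of its patients to validation. Very small groups may contribute no validation patients after rounding.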
- scxpand.data_util.data_splitter.validate_patient_cancer_types(uniq_patient_ids, patient_identifiers, obs_df)
Validate that each patient has exactly one cancer type and return cancer types per patient.
- Returns:
List of cancer types per patient
- Raises:
ValueError – If any patient has multiple cancer types
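The check described above can be sketched as follows. This is an illustration under assumptions rather than the library's code: the helper name `cancer_types_per_patient_sketch` is hypothetical, and the column name `cancer_type` is a guess, since the documentation does not name the column that holds the cancer type.

```python
import pandas as pd


def cancer_types_per_patient_sketch(uniq_patient_ids, patient_identifiers,
                                    obs_df, cancer_type_col="cancer_type"):
    """Hypothetical sketch; the column name 'cancer_type' is an assumption."""
    # Collect the set of cancer types observed for each patient identifier.
    seen = {}
    for pid, cancer_type in zip(patient_identifiers, obs_df[cancer_type_col]):
        seen.setdefault(pid, set()).add(cancer_type)

    conflicting = sorted(pid for pid, types in seen.items() if len(types) > 1)
    if conflicting:
        raise ValueError(f"Patients with multiple cancer types: {conflicting}")

    # One cancer type per patient, in the order of uniq_patient_ids.
    return [next(iter(seen[pid])) for pid in uniq_patient_ids]


obs = pd.DataFrame({"cancer_type": ["lung", "lung", "breast"]})
identifiers = pd.Series(["s1:p1", "s1:p1", "s2:p1"])
types_per_patient = cancer_types_per_patient_sketch(
    ["s1:p1", "s2:p1"], identifiers, obs
)
```

The returned list is aligned with `uniq_patient_ids`, which is the shape a stratified split needs as its stratification labels.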