tidymut.cleaners.protein_gym_custom_cleaners module

tidymut.cleaners.protein_gym_custom_cleaners.read_protein_gym_data(data_path: str | Path) → Tuple[pd.DataFrame, pd.DataFrame]

Read and combine multiple ProteinGym datasets from a directory or zip file.

ProteinGym datasets are stored as individual CSV files, one per protein. This function combines them into a single DataFrame for unified processing. Each file contains the columns mutant, mutated_sequence, and DMS_score, plus columns for various prediction methods.
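The combining step described above can be sketched roughly as follows. This is a minimal illustration, not the tidymut implementation: the helper name combine_csv_texts, the source_file column, and the layout of the failure table are assumptions made for the example, and the CSV contents are supplied in memory rather than read from disk. In particular, whether the real function records a file with missing columns in the failed DataFrame or raises ValueError is not settled by this page; the sketch records it.

```python
import io

import pandas as pd

# Columns the documentation says every ProteinGym file contains.
REQUIRED_COLUMNS = ("mutant", "mutated_sequence", "DMS_score")


def combine_csv_texts(csv_texts):
    """Combine per-protein CSV contents into one DataFrame.

    Hypothetical helper for illustration only (not part of tidymut).
    csv_texts maps a file name to the CSV text of that file.
    """
    frames, failures = [], []
    for file_name, text in csv_texts.items():
        df = pd.read_csv(io.StringIO(text))
        missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
        if missing:
            # Assumption: record the problem instead of raising.
            failures.append({"file": file_name, "error": f"missing columns: {missing}"})
            continue
        # Tag each row with its source file so proteins stay distinguishable.
        frames.append(df.assign(source_file=file_name))

    success_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    failed_df = pd.DataFrame(failures, columns=["file", "error"])
    return success_df, failed_df


example = {
    "protein_a.csv": "mutant,mutated_sequence,DMS_score\nA1G,GKL,0.5\nK2R,ARL,-1.2\n",
    "protein_b.csv": "mutant,mutated_sequence\nM1L,LST\n",  # DMS_score is missing
}
success_df, failed_df = combine_csv_texts(example)
```

Here success_df holds the two rows from protein_a.csv, and failed_df holds one row describing protein_b.csv.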

Parameters:

data_path (Union[str, Path]) – Path to a directory containing ProteinGym CSV files, or path to a zip file

Returns:

(success_dataframe, failed_dataframe) - the successfully processed data, and information about the files that failed to process

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Raises:
  • FileNotFoundError – If data_path does not exist

  • ValueError – If no CSV files are found or required columns are missing
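The path checks behind FileNotFoundError and the "no CSV files" case of ValueError might look something like the sketch below. It uses only the standard library; the helper name list_csv_sources is hypothetical, and the real function may order or perform its checks differently.

```python
import tempfile
import zipfile
from pathlib import Path


def list_csv_sources(data_path):
    """Return the CSV file names found in a directory or zip file.

    Hypothetical helper for illustration only (not part of tidymut).
    """
    path = Path(data_path)
    if not path.exists():
        raise FileNotFoundError(f"{path} does not exist")

    if path.is_dir():
        names = sorted(p.name for p in path.glob("*.csv"))
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = sorted(n for n in archive.namelist() if n.lower().endswith(".csv"))
    else:
        names = []

    if not names:
        raise ValueError(f"no CSV files found in {path}")
    return names


# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "DMS_demo"
    folder.mkdir()
    (folder / "protein_a.csv").write_text("mutant,mutated_sequence,DMS_score\n")
    found_in_dir = list_csv_sources(folder)

    archive_path = Path(tmp) / "DMS_demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.write(folder / "protein_a.csv", arcname="protein_a.csv")
    found_in_zip = list_csv_sources(archive_path)

    raised = []
    empty = Path(tmp) / "empty"
    empty.mkdir()
    for bad_path in (Path(tmp) / "missing", empty):
        try:
            list_csv_sources(bad_path)
        except (FileNotFoundError, ValueError) as exc:
            raised.append(type(exc).__name__)
```

A caller can catch the two exception types separately to distinguish a wrong path from a location that exists but holds no usable data.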

Examples

Process a directory of ProteinGym CSV files:

>>> success_df, failed_df = read_protein_gym_data("DMS_ProteinGym_substitutions/")

Process a zip file:

>>> success_df, failed_df = read_protein_gym_data("DMS_ProteinGym_substitutions.zip")