tidymut.cleaners.protein_gym_custom_cleaners module
- tidymut.cleaners.protein_gym_custom_cleaners.read_protein_gym_data(data_path: str | Path) -> Tuple[pd.DataFrame, pd.DataFrame]
Read and combine multiple ProteinGym datasets from a directory or zip file.
ProteinGym datasets are stored as individual CSV files, one per protein. This function combines them into a single DataFrame for unified processing. Each file contains the columns mutant, mutated_sequence, and DMS_score, plus columns for various prediction methods.
- Parameters:
data_path (Union[str, Path]) – Path to directory containing ProteinGym CSV files or path to zip file
- Returns:
(success_dataframe, failed_dataframe) - successfully processed data and failed file info
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- Raises:
FileNotFoundError – If data_path does not exist
ValueError – If no CSV files are found or required columns are missing
Examples
Process a directory of ProteinGym CSV files:
>>> success_df, failed_df = read_protein_gym_data("DMS_ProteinGym_substitutions/")
Process a zip file:
>>> success_df, failed_df = read_protein_gym_data("DMS_ProteinGym_substitutions.zip")
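The combine-and-report behaviour described above can be illustrated with a minimal standard-library sketch. This is not tidymut's implementation: the names `combine_protein_csvs`, `_list_csv_texts`, `REQUIRED_COLUMNS`, and the `source_file` and `error` keys are hypothetical, and plain lists of dicts stand in for the two DataFrames. It also assumes that a file with missing columns is recorded as a failure rather than raising; how the real function divides those cases is not specified here.

```python
import csv
import io
import zipfile
from pathlib import Path

# Columns named in the description above; treating them as required is an assumption.
REQUIRED_COLUMNS = {"mutant", "mutated_sequence", "DMS_score"}


def _list_csv_texts(data_path):
    """Return [(file_name, text), ...] for each CSV in a directory or zip file."""
    path = Path(data_path)
    if not path.exists():
        raise FileNotFoundError(f"{path} does not exist")
    if path.is_dir():
        return [(p.name, p.read_text(encoding="utf-8"))
                for p in sorted(path.glob("*.csv"))]
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return [(name, archive.read(name).decode("utf-8"))
                    for name in sorted(archive.namelist())
                    if name.lower().endswith(".csv")]
    raise ValueError(f"{path} is neither a directory nor a zip file")


def combine_protein_csvs(data_path):
    """Hypothetical stand-in: returns (success_rows, failed_rows) as lists of dicts."""
    files = _list_csv_texts(data_path)
    if not files:
        raise ValueError(f"No CSV files found in {data_path}")

    success_rows, failed_rows = [], []
    for name, text in files:
        reader = csv.DictReader(io.StringIO(text))
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            # Record the failure and keep going so one bad file does not stop the batch.
            failed_rows.append({"file": name,
                                "error": f"missing columns: {sorted(missing)}"})
            continue
        for row in reader:
            row["source_file"] = name  # hypothetical provenance column
            success_rows.append(row)
    return success_rows, failed_rows
```

Keeping failures in a second collection, rather than raising on the first bad file, mirrors the two-element return value documented above and lets the caller inspect which files were skipped.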