tidymut.cleaners.protein_gym_cleaner module
- class tidymut.cleaners.protein_gym_cleaner.ProteinGymCleanerConfig(pipeline_name: str = 'protein_gym_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: Dict[str, str] = <factory>, filters: Dict[str, Any] = <factory>, type_conversions: Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error', label_columns: List[str] = <factory>, primary_label_column: str = 'DMS_score')
Bases:
BaseCleanerConfig
Configuration class for ProteinGym dataset cleaner
Inherits from BaseCleanerConfig and adds ProteinGym-specific configuration options.
- column_mapping
Mapping from source to target column names
- Type:
Dict[str, str]
- filters
Filter conditions for data cleaning
- Type:
Dict[str, Any]
- type_conversions
Data type conversion specifications
- Type:
Dict[str, str]
- is_zero_based
Whether mutation positions are zero-based
- Type:
bool
- validation_workers
Number of workers for mutation validation
- Type:
int
- infer_wt_workers
Number of workers for wildtype sequence inference
- Type:
int
- handle_multiple_wt
Strategy for handling multiple wildtype sequences
- Type:
Literal["error", "first", "separate"]
- label_columns
List of score columns to process
- Type:
List[str]
- primary_label_column
Primary score column for the dataset
- Type:
str
- column_mapping: Dict[str, str]
- filters: Dict[str, Any]
- handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
- infer_wt_workers: int = 16
- label_columns: List[str]
- pipeline_name: str = 'protein_gym_cleaner'
- primary_label_column: str = 'DMS_score'
- type_conversions: Dict[str, str]
- validate() → None
Validate ProteinGym-specific configuration parameters
- Raises:
ValueError – If configuration is invalid
- validation_workers: int = 16
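The fields and the `validate()` contract above can be illustrated with a small stand-in dataclass. This is a minimal sketch and not tidymut's implementation: `SketchProteinGymConfig` is a hypothetical name, the default for `label_columns` is assumed (the documentation only shows `<factory>`), and the specific checks inside `validate()` are guesses at what "ProteinGym-specific" validation could cover.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal


# Hypothetical stand-in, NOT tidymut's class: it mirrors the documented field
# names and defaults so the example runs without tidymut installed.
@dataclass
class SketchProteinGymConfig:
    pipeline_name: str = "protein_gym_cleaner"
    num_workers: int = 16
    column_mapping: Dict[str, str] = field(default_factory=dict)
    filters: Dict[str, Any] = field(default_factory=dict)
    type_conversions: Dict[str, str] = field(default_factory=dict)
    validation_workers: int = 16
    infer_wt_workers: int = 16
    handle_multiple_wt: Literal["error", "first", "separate"] = "error"
    # Assumed default; the real factory value is not shown in the docs.
    label_columns: List[str] = field(default_factory=lambda: ["DMS_score"])
    primary_label_column: str = "DMS_score"

    def validate(self) -> None:
        """Raise ValueError if the configuration is invalid (assumed checks)."""
        allowed = ("error", "first", "separate")
        if self.handle_multiple_wt not in allowed:
            raise ValueError(
                f"handle_multiple_wt must be one of {allowed}, "
                f"got {self.handle_multiple_wt!r}"
            )
        for name in ("validation_workers", "infer_wt_workers"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        if self.primary_label_column not in self.label_columns:
            raise ValueError("primary_label_column must appear in label_columns")


# A valid configuration passes silently.
config = SketchProteinGymConfig(validation_workers=8, handle_multiple_wt="first")
config.validate()

# An unsupported strategy is rejected with ValueError.
try:
    SketchProteinGymConfig(handle_multiple_wt="last").validate()
except ValueError as exc:
    print(f"rejected: {exc}")
```

Mutable fields use `default_factory` so that each instance gets its own dictionary or list, which is consistent with the `<factory>` placeholders in the documented signature.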
- tidymut.cleaners.protein_gym_cleaner.clean_protein_gym_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset]
Clean ProteinGym dataset using configurable pipeline
- Parameters:
pipeline (Pipeline) – ProteinGym dataset cleaning pipeline
- Returns:
Pipeline: The cleaned pipeline
MutationDataset: The cleaned ProteinGym dataset
- Return type:
Tuple[Pipeline, MutationDataset]
- tidymut.cleaners.protein_gym_cleaner.create_protein_gym_cleaner(data_path: str | Path, config: ProteinGymCleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline
Create ProteinGym dataset cleaning pipeline
- Parameters:
data_path (Union[str, Path]) – Path to a directory containing ProteinGym CSV files, or path to a zip file
  - Download from: https://proteingym.org/download
  - File: DMS_ProteinGym_substitutions.zip
config (Optional[Union[ProteinGymCleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be:
  - ProteinGymCleanerConfig object
  - Dictionary with configuration parameters (merged with defaults)
  - Path to JSON configuration file (str or Path)
  - None (uses default configuration)
- Returns:
The cleaning pipeline
- Return type:
Pipeline
- Raises:
TypeError – If config has invalid type
ValueError – If configuration validation fails
Examples
Process directory of ProteinGym CSV files:
>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions/")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)

Process zip file:
>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions.zip")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)

Custom configuration:
>>> config = {
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... }
>>> pipeline = create_protein_gym_cleaner("data/", config=config)

Load configuration from file:
>>> pipeline = create_protein_gym_cleaner("data/", config="config.json")
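
The accepted `config` forms (None, dictionary, or path to a JSON file) can be sketched with the standard library alone. This is an illustration under stated assumptions, not tidymut's code: `resolve_config` and `DEFAULTS` are hypothetical names, only two documented defaults are reproduced, and merging a JSON file over the defaults is an assumption (the documentation states merging only for dictionaries).

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Union

# Hypothetical subset of defaults; both values come from the documented signature.
DEFAULTS: Dict[str, Any] = {"validation_workers": 16, "handle_multiple_wt": "error"}


def resolve_config(
    config: Optional[Union[Dict[str, Any], str, Path]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: normalize None, a dict, or a JSON path to a dict."""
    if config is None:
        # None means "use the default configuration".
        return dict(DEFAULTS)
    if isinstance(config, dict):
        # Dictionary values override the defaults key by key.
        return {**DEFAULTS, **config}
    if isinstance(config, (str, Path)):
        # Assumption: a JSON file is merged over the defaults like a dict.
        with open(config, encoding="utf-8") as handle:
            return {**DEFAULTS, **json.load(handle)}
    # Mirrors the documented TypeError for an invalid config type.
    raise TypeError(f"unsupported config type: {type(config).__name__}")


# Write a JSON configuration file, then load it back through the helper.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({"validation_workers": 8}), encoding="utf-8")
    from_file = resolve_config(config_path)

from_dict = resolve_config({"handle_multiple_wt": "first"})
print(from_file)
print(from_dict)
```

A file written this way could then be passed as `config="config.json"` in the last example above.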