tidymut.cleaners package
Submodules
- tidymut.cleaners.base_config module
BaseCleanerConfig
BaseCleanerConfig.pipeline_name
BaseCleanerConfig.strict_mode
BaseCleanerConfig.num_workers
BaseCleanerConfig.validate_config
BaseCleanerConfig.from_dict()
BaseCleanerConfig.from_json()
BaseCleanerConfig.get_summary()
BaseCleanerConfig.merge()
BaseCleanerConfig.num_workers
BaseCleanerConfig.pipeline_name
BaseCleanerConfig.strict_mode
BaseCleanerConfig.to_dict()
BaseCleanerConfig.to_json()
BaseCleanerConfig.validate()
BaseCleanerConfig.validate_config
- tidymut.cleaners.basic_cleaners module
- tidymut.cleaners.human_domainome_cleaner module
HumanDomainomeCleanerConfig
HumanDomainomeCleanerConfig.sequence_dict_path
HumanDomainomeCleanerConfig.header_parser
HumanDomainomeCleanerConfig.column_mapping
HumanDomainomeCleanerConfig.type_conversions
HumanDomainomeCleanerConfig.drop_na_columns
HumanDomainomeCleanerConfig.is_zero_based
HumanDomainomeCleanerConfig.process_workers
HumanDomainomeCleanerConfig.label_columns
HumanDomainomeCleanerConfig.primary_label_column
HumanDomainomeCleanerConfig.column_mapping
HumanDomainomeCleanerConfig.drop_na_columns
HumanDomainomeCleanerConfig.header_parser()
HumanDomainomeCleanerConfig.is_zero_based
HumanDomainomeCleanerConfig.label_columns
HumanDomainomeCleanerConfig.pipeline_name
HumanDomainomeCleanerConfig.primary_label_column
HumanDomainomeCleanerConfig.process_workers
HumanDomainomeCleanerConfig.sequence_dict_path
HumanDomainomeCleanerConfig.type_conversions
HumanDomainomeCleanerConfig.validate()
clean_human_domainome_dataset()
create_human_domainome_cleaner()
- tidymut.cleaners.human_domainome_custom_cleaners module
- tidymut.cleaners.k50_cleaner module
K50CleanerConfig
K50CleanerConfig.column_mapping
K50CleanerConfig.filters
K50CleanerConfig.type_conversions
K50CleanerConfig.validation_workers
K50CleanerConfig.infer_wt_workers
K50CleanerConfig.handle_multiple_wt
K50CleanerConfig.label_columns
K50CleanerConfig.primary_label_column
K50CleanerConfig.column_mapping
K50CleanerConfig.filters
K50CleanerConfig.handle_multiple_wt
K50CleanerConfig.infer_wt_workers
K50CleanerConfig.label_columns
K50CleanerConfig.pipeline_name
K50CleanerConfig.primary_label_column
K50CleanerConfig.type_conversions
K50CleanerConfig.validate()
K50CleanerConfig.validation_workers
clean_k50_dataset()
create_k50_cleaner()
- tidymut.cleaners.protein_gym_cleaner module
ProteinGymCleanerConfig
ProteinGymCleanerConfig.column_mapping
ProteinGymCleanerConfig.filters
ProteinGymCleanerConfig.type_conversions
ProteinGymCleanerConfig.is_zero_based
ProteinGymCleanerConfig.validation_workers
ProteinGymCleanerConfig.infer_wt_workers
ProteinGymCleanerConfig.handle_multiple_wt
ProteinGymCleanerConfig.label_columns
ProteinGymCleanerConfig.primary_label_column
ProteinGymCleanerConfig.column_mapping
ProteinGymCleanerConfig.filters
ProteinGymCleanerConfig.handle_multiple_wt
ProteinGymCleanerConfig.infer_wt_workers
ProteinGymCleanerConfig.label_columns
ProteinGymCleanerConfig.pipeline_name
ProteinGymCleanerConfig.primary_label_column
ProteinGymCleanerConfig.type_conversions
ProteinGymCleanerConfig.validate()
ProteinGymCleanerConfig.validation_workers
clean_protein_gym_dataset()
create_protein_gym_cleaner()
- tidymut.cleaners.protein_gym_custom_cleaners module
Module contents
- class tidymut.cleaners.HumanDomainomeCleanerConfig(num_workers: int = 16, validate_config: bool = True, *, pipeline_name: str = 'human_domainome_cleaner', sequence_dict_path: Union[str, Path], header_parser: Callable[[str], Tuple[str, Dict[str, str]]] = <function parse_uniprot_header>, column_mapping: Dict[str, str] = <factory>, type_conversions: Dict[str, str] = <factory>, drop_na_columns: List = <factory>, is_zero_based: bool = False, process_workers: int = 16, label_columns: List[str] = <factory>, primary_label_column: str = 'label_humanDomainome')[source]
Bases:
BaseCleanerConfig
Configuration class for HumanDomainome dataset cleaner
Inherits from BaseCleanerConfig and adds HumanDomainome-specific configuration options.
- sequence_dict_path
Path to the file containing UniProt ID to sequence mapping
- Type:
Union[str, Path]
- header_parser
Parse Header in fasta files and extract relevant information
- Type:
Callable[[str], Tuple[str, Dict[str, str]]]
- column_mapping
Mapping from source to target column names
- Type:
Dict[str, str]
- type_conversions
Data type conversion specifications
- Type:
Dict[str, str]
- drop_na_columns
List of column names where null values should be dropped
- Type:
List[str]
- is_zero_based
Whether mutation positions are zero-based
- Type:
bool
- process_workers
Number of workers for parallel processing
- Type:
int
- label_columns
List of score columns to process
- Type:
List[str]
- primary_label_column
Primary score column for the dataset
- Type:
str
- column_mapping: Dict[str, str]
- drop_na_columns: List
- header_parser() → Tuple[str, Dict[str, str]]
Parse UniProt FASTA header to extract ID and metadata
- Parameters:
header (str) – FASTA header line (without ‘>’)
- Returns:
(sequence_id, metadata_dict)
- Return type:
Tuple[str, Dict[str, str]]
Examples
>>> parse_uniprot_header("sp|P12345|PROT_HUMAN Protein description OS=Homo sapiens")
('P12345', {'db': 'sp', 'entry_name': 'PROT_HUMAN', 'description': 'Protein description OS=Homo sapiens'})
>>> parse_uniprot_header("P12345|PROT_HUMAN Description")
('P12345', {'entry_name': 'PROT_HUMAN', 'description': 'Description'})
>>> parse_uniprot_header("P12345")
('P12345', {})
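The behaviour shown above can be sketched as a minimal stand-alone parser. This is illustrative only; the packaged parse_uniprot_header may handle additional header variants:

```python
from typing import Dict, Tuple

def parse_uniprot_header(header: str) -> Tuple[str, Dict[str, str]]:
    """Sketch: split a UniProt FASTA header into (sequence_id, metadata)."""
    head, _, description = header.partition(" ")
    parts = head.split("|")
    meta: Dict[str, str] = {}
    if len(parts) == 3:
        # Full form: sp|P12345|PROT_HUMAN
        meta["db"] = parts[0]
        seq_id = parts[1]
        meta["entry_name"] = parts[2]
    elif len(parts) == 2:
        # Short form: P12345|PROT_HUMAN
        seq_id = parts[0]
        meta["entry_name"] = parts[1]
    else:
        # Bare accession: P12345
        seq_id = parts[0]
    if description:
        meta["description"] = description
    return seq_id, meta
```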
- is_zero_based: bool = False
- label_columns: List[str]
- pipeline_name: str = 'human_domainome_cleaner'
- primary_label_column: str = 'label_humanDomainome'
- process_workers: int = 16
- sequence_dict_path: str | Path
- type_conversions: Dict[str, str]
- class tidymut.cleaners.K50CleanerConfig(pipeline_name: str = 'k50_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: ~typing.Dict[str, str] = <factory>, filters: ~typing.Dict[str, ~typing.Callable] = <factory>, type_conversions: ~typing.Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: ~typing.Literal['error', 'first', 'separate'] = 'error', label_columns: ~typing.List[str] = <factory>, primary_label_column: str = 'ddG')[source]
Bases:
BaseCleanerConfig
Configuration class for K50 dataset cleaner
Inherits from BaseCleanerConfig and adds K50-specific configuration options.
- column_mapping
Mapping from source to target column names
- Type:
Dict[str, str]
- filters
Filter conditions for data cleaning
- Type:
Dict[str, Callable]
- type_conversions
Data type conversion specifications
- Type:
Dict[str, str]
- validation_workers
Number of workers for mutation validation, set to -1 to use all available CPUs
- Type:
int
- infer_wt_workers
Number of workers for wildtype sequence inference, set to -1 to use all available CPUs
- Type:
int
- handle_multiple_wt
Strategy for handling multiple wildtype sequences (‘error’, ‘first’, ‘separate’)
- Type:
Literal[“error”, “first”, “separate”], default=”error”
- label_columns
List of score columns to process
- Type:
List[str]
- primary_label_column
Primary score column for the dataset
- Type:
str
- column_mapping: Dict[str, str]
- filters: Dict[str, Callable]
- handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
- infer_wt_workers: int = 16
- label_columns: List[str]
- pipeline_name: str = 'k50_cleaner'
- primary_label_column: str = 'ddG'
- type_conversions: Dict[str, str]
- validate() → None [source]
Validate K50-specific configuration parameters
- Raises:
ValueError – If configuration is invalid
- validation_workers: int = 16
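The -1 convention for validation_workers and infer_wt_workers can be resolved as in the following illustrative helper (resolve_workers is a hypothetical name, not part of tidymut):

```python
import os

def resolve_workers(n: int) -> int:
    """Sketch: map -1 to all available CPUs, otherwise use n as given."""
    if n == -1:
        # os.cpu_count() can return None on exotic platforms; fall back to 1
        return os.cpu_count() or 1
    return n
```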
- class tidymut.cleaners.ProteinGymCleanerConfig(pipeline_name: str = 'protein_gym_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: Dict[str, str] = <factory>, filters: Dict[str, Any] = <factory>, type_conversions: Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error', label_columns: List[str] = <factory>, primary_label_column: str = 'DMS_score')[source]
Bases:
BaseCleanerConfig
Configuration class for ProteinGym dataset cleaner
Inherits from BaseCleanerConfig and adds ProteinGym-specific configuration options.
- column_mapping
Mapping from source to target column names
- Type:
Dict[str, str]
- filters
Filter conditions for data cleaning
- Type:
Dict[str, Any]
- type_conversions
Data type conversion specifications
- Type:
Dict[str, str]
- is_zero_based
Whether mutation positions are zero-based
- Type:
bool
- validation_workers
Number of workers for mutation validation
- Type:
int
- infer_wt_workers
Number of workers for wildtype sequence inference
- Type:
int
- handle_multiple_wt
Strategy for handling multiple wildtype sequences
- Type:
Literal[“error”, “first”, “separate”]
- label_columns
List of score columns to process
- Type:
List[str]
- primary_label_column
Primary score column for the dataset
- Type:
str
- column_mapping: Dict[str, str]
- filters: Dict[str, Any]
- handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
- infer_wt_workers: int = 16
- label_columns: List[str]
- pipeline_name: str = 'protein_gym_cleaner'
- primary_label_column: str = 'DMS_score'
- type_conversions: Dict[str, str]
- validate() → None [source]
Validate ProteinGym-specific configuration parameters
- Raises:
ValueError – If configuration is invalid
- validation_workers: int = 16
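The three handle_multiple_wt strategies ('error', 'first', 'separate') can be illustrated with a hypothetical helper (resolve_wildtypes is not part of tidymut; it only sketches the documented semantics):

```python
from typing import List

def resolve_wildtypes(wt_candidates: List[str], strategy: str) -> List[str]:
    """Sketch: apply a handle_multiple_wt strategy to inferred wildtype sequences."""
    unique = sorted(set(wt_candidates))
    if len(unique) <= 1:
        return unique  # nothing to disambiguate
    if strategy == "error":
        raise ValueError(f"multiple wildtype sequences inferred: {len(unique)}")
    if strategy == "first":
        return unique[:1]  # keep only the first candidate
    if strategy == "separate":
        return unique      # keep all; split into separate records downstream
    raise ValueError(f"unknown strategy: {strategy!r}")
```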
- tidymut.cleaners.clean_human_domainome_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset] [source]
Clean HumanDomainome dataset using configurable pipeline
- Parameters:
pipeline (Pipeline) – HumanDomainome dataset cleaning pipeline
- Returns:
Pipeline: The cleaned pipeline
MutationDataset: The cleaned HumanDomainome dataset
- Return type:
Tuple[Pipeline, MutationDataset]
- Raises:
RuntimeError – If pipeline execution fails
- tidymut.cleaners.clean_k50_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset] [source]
Clean K50 dataset using configurable pipeline
- Parameters:
pipeline (Pipeline) – K50 dataset cleaning pipeline
- Returns:
Pipeline: The cleaned pipeline
MutationDataset: The cleaned K50 dataset
- Return type:
Tuple[Pipeline, MutationDataset]
- tidymut.cleaners.clean_protein_gym_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset] [source]
Clean ProteinGym dataset using configurable pipeline
- Parameters:
pipeline (Pipeline) – ProteinGym dataset cleaning pipeline
- Returns:
Pipeline: The cleaned pipeline
MutationDataset: The cleaned ProteinGym dataset
- Return type:
Tuple[Pipeline, MutationDataset]
- tidymut.cleaners.create_human_domainome_cleaner(dataset_or_path: str | Path, sequence_dict_path: str | Path, config: HumanDomainomeCleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline [source]
Create HumanDomainome dataset cleaning pipeline
- Parameters:
dataset_or_path (Union[pd.DataFrame, str, Path]) – Raw HumanDomainome dataset DataFrame or file path to the HumanDomainome dataset - File: SupplementaryTable4.txt from the article 'Site-saturation mutagenesis of 500 human protein domains'
sequence_dict_path (Union[str, Path]) – Path to file containing UniProt ID to sequence mapping
config (Optional[Union[HumanDomainomeCleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be: - HumanDomainomeCleanerConfig object - Dictionary with configuration parameters (merged with defaults) - Path to JSON configuration file (str or Path) - None (uses default configuration)
- Returns:
The cleaning pipeline
- Return type:
Pipeline
- Raises:
FileNotFoundError – If data file or sequence dictionary file not found
TypeError – If config has invalid type
ValueError – If configuration validation fails
Examples
Basic usage:
>>> pipeline = create_human_domainome_cleaner(
...     "human_domainome.csv",
...     "uniprot_sequences.fasta"
... )
>>> pipeline, dataset = clean_human_domainome_dataset(pipeline)
Custom configuration:
>>> config = {
...     "process_workers": 8,
...     "type_conversions": {"label_humanDomainome": "float32"}
... }
>>> pipeline = create_human_domainome_cleaner(
...     "human_domainome.csv",
...     "sequences.csv",
...     config=config
... )
Load configuration from file:
>>> pipeline = create_human_domainome_cleaner(
...     "data.csv",
...     "sequences.fasta",
...     config="config.json"
... )
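The four accepted config forms (config object, dict, JSON file path, None) could be normalized roughly as below. load_config is a hypothetical helper, shown only to clarify the documented merge-with-defaults behaviour, not tidymut's actual implementation:

```python
import json
from pathlib import Path
from typing import Any, Dict, Union

def load_config(config: Union[Dict[str, Any], str, Path, None],
                defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Sketch: normalize the `config` argument accepted by the create_* helpers."""
    if config is None:
        return dict(defaults)                          # None: use defaults unchanged
    if isinstance(config, (str, Path)):
        config = json.loads(Path(config).read_text())  # str/Path: load JSON file
    return {**defaults, **config}                      # dict: merge over defaults
```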
- tidymut.cleaners.create_k50_cleaner(dataset_or_path: DataFrame | str | Path, config: K50CleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline [source]
Create K50 dataset cleaning pipeline
- Parameters:
dataset_or_path (Union[pd.DataFrame, str, Path]) – Raw K50 dataset DataFrame or file path to K50 dataset - Download from: https://zenodo.org/records/799292 - File: Tsuboyama2023_Dataset2_Dataset3_20230416.csv in Processed_K50_dG_datasets.zip
config (Optional[Union[K50CleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be: - K50CleanerConfig object - Dictionary with configuration parameters (merged with defaults) - Path to JSON configuration file (str or Path) - None (uses default configuration)
- Returns:
Pipeline: The cleaning pipeline used
- Return type:
Pipeline
- Raises:
TypeError – If config has invalid type
ValueError – If configuration validation fails
Examples
Use default configuration:
>>> pipeline = create_k50_cleaner(df)
>>> pipeline, dataset = clean_k50_dataset(pipeline)
Use partial configuration:
>>> pipeline = create_k50_cleaner(df, config={
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... })
>>> pipeline, dataset = clean_k50_dataset(pipeline)
Load configuration from file:
>>> pipeline = create_k50_cleaner(df, config="config.json")
>>> pipeline, dataset = clean_k50_dataset(pipeline)
- tidymut.cleaners.create_protein_gym_cleaner(data_path: str | Path, config: ProteinGymCleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline [source]
Create ProteinGym dataset cleaning pipeline
- Parameters:
data_path (Union[str, Path]) – Path to directory containing ProteinGym CSV files or path to zip file - Download from: https://proteingym.org/download - File: DMS_ProteinGym_substitutions.zip
config (Optional[Union[ProteinGymCleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be: - ProteinGymCleanerConfig object - Dictionary with configuration parameters (merged with defaults) - Path to JSON configuration file (str or Path) - None (uses default configuration)
- Returns:
The cleaning pipeline
- Return type:
Pipeline
- Raises:
TypeError – If config has invalid type
ValueError – If configuration validation fails
Examples
Process directory of ProteinGym CSV files:
>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions/")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)
Process zip file:
>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions.zip")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)
Custom configuration:
>>> config = {
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... }
>>> pipeline = create_protein_gym_cleaner("data/", config=config)
Load configuration from file:
>>> pipeline = create_protein_gym_cleaner("data/", config="config.json")