tidymut.cleaners package

Module contents

class tidymut.cleaners.HumanDomainomeCleanerConfig(num_workers: int = 16, validate_config: bool = True, *, pipeline_name: str = 'human_domainome_cleaner', sequence_dict_path: Union[str, Path], header_parser: Callable[[str], Tuple[str, Dict[str, str]]] = <function parse_uniprot_header>, column_mapping: Dict[str, str] = <factory>, type_conversions: Dict[str, str] = <factory>, drop_na_columns: List = <factory>, is_zero_based: bool = False, process_workers: int = 16, label_columns: List[str] = <factory>, primary_label_column: str = 'label_humanDomainome')[source]

Bases: BaseCleanerConfig

Configuration class for HumanDomainome dataset cleaner

Inherits from BaseCleanerConfig and adds HumanDomainome-specific configuration options.

sequence_dict_path

Path to the file containing UniProt ID to sequence mapping

Type:

Union[str, Path]

header_parser

Parse headers in FASTA files and extract relevant information

Type:

Callable[[str], Tuple[str, Dict[str, str]]]

column_mapping

Mapping from source to target column names

Type:

Dict[str, str]

type_conversions

Data type conversion specifications

Type:

Dict[str, str]

drop_na_columns

List of columns in which rows containing null values are dropped

Type:

List[str]

is_zero_based

Whether mutation positions are zero-based

Type:

bool

process_workers

Number of workers for parallel processing

Type:

int

label_columns

List of score columns to process

Type:

List[str]

primary_label_column

Primary score column for the dataset

Type:

str

column_mapping: Dict[str, str]
drop_na_columns: List
header_parser(header: str) → Tuple[str, Dict[str, str]]

Parse UniProt FASTA header to extract ID and metadata

Parameters:

header (str) – FASTA header line (without '>')

Returns:

(sequence_id, metadata_dict)

Return type:

Tuple[str, Dict[str, str]]

Examples

>>> parse_uniprot_header("sp|P12345|PROT_HUMAN Protein description OS=Homo sapiens")
('P12345', {'db': 'sp', 'entry_name': 'PROT_HUMAN', 'description': 'Protein description OS=Homo sapiens'})
>>> parse_uniprot_header("P12345|PROT_HUMAN Description")
('P12345', {'entry_name': 'PROT_HUMAN', 'description': 'Description'})
>>> parse_uniprot_header("P12345")
('P12345', {})
is_zero_based: bool = False
label_columns: List[str]
pipeline_name: str = 'human_domainome_cleaner'
primary_label_column: str = 'label_humanDomainome'
process_workers: int = 16
sequence_dict_path: str | Path
type_conversions: Dict[str, str]
validate() → None[source]

Validate HumanDomainome-specific configuration parameters

Raises:

ValueError – If configuration is invalid
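
A minimal construction sketch (field names follow the signature above; the FASTA path is a placeholder):

>>> from tidymut.cleaners import HumanDomainomeCleanerConfig
>>> config = HumanDomainomeCleanerConfig(
...     sequence_dict_path="uniprot_sequences.fasta",  # required, no default
...     process_workers=8,
...     is_zero_based=False,
... )
>>> config.validate()  # raises ValueError if any parameter is invalid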

class tidymut.cleaners.K50CleanerConfig(pipeline_name: str = 'k50_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: Dict[str, str] = <factory>, filters: Dict[str, Callable] = <factory>, type_conversions: Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error', label_columns: List[str] = <factory>, primary_label_column: str = 'ddG')[source]

Bases: BaseCleanerConfig

Configuration class for K50 dataset cleaner

Inherits from BaseCleanerConfig and adds K50-specific configuration options.

column_mapping

Mapping from source to target column names

Type:

Dict[str, str]

filters

Filter conditions for data cleaning

Type:

Dict[str, Callable]

type_conversions

Data type conversion specifications

Type:

Dict[str, str]

validation_workers

Number of workers for mutation validation; set to -1 to use all available CPUs

Type:

int

infer_wt_workers

Number of workers for wildtype sequence inference; set to -1 to use all available CPUs

Type:

int

handle_multiple_wt

Strategy for handling multiple wildtype sequences ('error', 'first', 'separate')

Type:

Literal["error", "first", "separate"], default="error"

label_columns

List of score columns to process

Type:

List[str]

primary_label_column

Primary score column for the dataset

Type:

str

column_mapping: Dict[str, str]
filters: Dict[str, Callable]
handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
infer_wt_workers: int = 16
label_columns: List[str]
pipeline_name: str = 'k50_cleaner'
primary_label_column: str = 'ddG'
type_conversions: Dict[str, str]
validate() → None[source]

Validate K50-specific configuration parameters

Raises:

ValueError – If configuration is invalid

validation_workers: int = 16
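
A minimal construction sketch (field names follow the signature above):

>>> from tidymut.cleaners import K50CleanerConfig
>>> config = K50CleanerConfig(
...     validation_workers=-1,       # -1 uses all available CPUs
...     infer_wt_workers=-1,
...     handle_multiple_wt="first",  # keep the first inferred wildtype
... )
>>> config.validate()
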
class tidymut.cleaners.ProteinGymCleanerConfig(pipeline_name: str = 'protein_gym_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: Dict[str, str] = <factory>, filters: Dict[str, Any] = <factory>, type_conversions: Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error', label_columns: List[str] = <factory>, primary_label_column: str = 'DMS_score')[source]

Bases: BaseCleanerConfig

Configuration class for ProteinGym dataset cleaner

Inherits from BaseCleanerConfig and adds ProteinGym-specific configuration options.

column_mapping

Mapping from source to target column names

Type:

Dict[str, str]

filters

Filter conditions for data cleaning

Type:

Dict[str, Any]

type_conversions

Data type conversion specifications

Type:

Dict[str, str]

is_zero_based

Whether mutation positions are zero-based

Type:

bool

validation_workers

Number of workers for mutation validation

Type:

int

infer_wt_workers

Number of workers for wildtype sequence inference

Type:

int

handle_multiple_wt

Strategy for handling multiple wildtype sequences

Type:

Literal["error", "first", "separate"]

label_columns

List of score columns to process

Type:

List[str]

primary_label_column

Primary score column for the dataset

Type:

str

column_mapping: Dict[str, str]
filters: Dict[str, Any]
handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
infer_wt_workers: int = 16
label_columns: List[str]
pipeline_name: str = 'protein_gym_cleaner'
primary_label_column: str = 'DMS_score'
type_conversions: Dict[str, str]
validate() → None[source]

Validate ProteinGym-specific configuration parameters

Raises:

ValueError – If configuration is invalid

validation_workers: int = 16
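
A minimal construction sketch (field names follow the signature above):

>>> from tidymut.cleaners import ProteinGymCleanerConfig
>>> config = ProteinGymCleanerConfig(
...     primary_label_column="DMS_score",
...     handle_multiple_wt="separate",  # keep each inferred wildtype separately
... )
>>> config.validate()
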
tidymut.cleaners.clean_human_domainome_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset][source]

Clean HumanDomainome dataset using configurable pipeline

Parameters:

pipeline (Pipeline) – HumanDomainome dataset cleaning pipeline

Returns:

  • Pipeline: The executed cleaning pipeline

  • MutationDataset: The cleaned HumanDomainome dataset

Return type:

Tuple[Pipeline, MutationDataset]

Raises:

RuntimeError – If pipeline execution fails
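
Since failures surface as RuntimeError, a guarded call is a reasonable pattern (paths are placeholders):

>>> pipeline = create_human_domainome_cleaner(
...     "human_domainome.csv",
...     "uniprot_sequences.fasta"
... )
>>> try:
...     pipeline, dataset = clean_human_domainome_dataset(pipeline)
... except RuntimeError as exc:
...     print(f"Pipeline execution failed: {exc}")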

tidymut.cleaners.clean_k50_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset][source]

Clean K50 dataset using configurable pipeline

Parameters:

pipeline (Pipeline) – K50 dataset cleaning pipeline

Returns:

  • Pipeline: The executed cleaning pipeline

  • MutationDataset: The cleaned K50 dataset

Return type:

Tuple[Pipeline, MutationDataset]

tidymut.cleaners.clean_protein_gym_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset][source]

Clean ProteinGym dataset using configurable pipeline

Parameters:

pipeline (Pipeline) – ProteinGym dataset cleaning pipeline

Returns:

  • Pipeline: The executed cleaning pipeline

  • MutationDataset: The cleaned ProteinGym dataset

Return type:

Tuple[Pipeline, MutationDataset]

tidymut.cleaners.create_human_domainome_cleaner(dataset_or_path: DataFrame | str | Path, sequence_dict_path: str | Path, config: HumanDomainomeCleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline[source]

Create HumanDomainome dataset cleaning pipeline

Parameters:
  • dataset_or_path (Union[pd.DataFrame, str, Path]) – Raw HumanDomainome dataset DataFrame or file path to the HumanDomainome dataset
    - File: SupplementaryTable4.txt from the article 'Site-saturation mutagenesis of 500 human protein domains'

  • sequence_dict_path (Union[str, Path]) – Path to file containing UniProt ID to sequence mapping

  • config (Optional[Union[HumanDomainomeCleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be:
    - HumanDomainomeCleanerConfig object
    - Dictionary with configuration parameters (merged with defaults)
    - Path to JSON configuration file (str or Path)
    - None (uses default configuration)

Returns:

The cleaning pipeline

Return type:

Pipeline

Raises:
  • FileNotFoundError – If data file or sequence dictionary file not found

  • TypeError – If config has invalid type

  • ValueError – If configuration validation fails

Examples

Basic usage:

>>> pipeline = create_human_domainome_cleaner(
...     "human_domainome.csv",
...     "uniprot_sequences.fasta"
... )
>>> pipeline, dataset = clean_human_domainome_dataset(pipeline)

Custom configuration:

>>> config = {
...     "process_workers": 8,
...     "type_conversions": {"label_humanDomainome": "float32"}
... }
>>> pipeline = create_human_domainome_cleaner(
...     "human_domainome.csv",
...     "sequences.csv",
...     config=config
... )

Load configuration from file:

>>> pipeline = create_human_domainome_cleaner(
...     "data.csv",
...     "sequences.fasta",
...     config="config.json"
... )

tidymut.cleaners.create_k50_cleaner(dataset_or_path: DataFrame | str | Path, config: K50CleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline[source]

Create K50 dataset cleaning pipeline

Parameters:
  • dataset_or_path (Union[pd.DataFrame, str, Path]) – Raw K50 dataset DataFrame or file path to the K50 dataset
    - Download from: https://zenodo.org/records/799292
    - File: Tsuboyama2023_Dataset2_Dataset3_20230416.csv in Processed_K50_dG_datasets.zip

  • config (Optional[Union[K50CleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be:
    - K50CleanerConfig object
    - Dictionary with configuration parameters (merged with defaults)
    - Path to JSON configuration file (str or Path)
    - None (uses default configuration)

Returns:

The cleaning pipeline

Return type:

Pipeline

Raises:
  • TypeError – If config has invalid type

  • ValueError – If configuration validation fails

Examples

Use default configuration:

>>> pipeline = create_k50_cleaner(df)
>>> pipeline, dataset = clean_k50_dataset(pipeline)

Use partial configuration:

>>> pipeline = create_k50_cleaner(df, config={
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... })
>>> pipeline, dataset = clean_k50_dataset(pipeline)

Load configuration from file:

>>> pipeline = create_k50_cleaner(df, config="config.json")
>>> pipeline, dataset = clean_k50_dataset(pipeline)

tidymut.cleaners.create_protein_gym_cleaner(data_path: str | Path, config: ProteinGymCleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline[source]

Create ProteinGym dataset cleaning pipeline

Parameters:
  • data_path (Union[str, Path]) – Path to directory containing ProteinGym CSV files or path to zip file
    - Download from: https://proteingym.org/download
    - File: DMS_ProteinGym_substitutions.zip

  • config (Optional[Union[ProteinGymCleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be:
    - ProteinGymCleanerConfig object
    - Dictionary with configuration parameters (merged with defaults)
    - Path to JSON configuration file (str or Path)
    - None (uses default configuration)

Returns:

The cleaning pipeline

Return type:

Pipeline

Raises:
  • TypeError – If config has invalid type

  • ValueError – If configuration validation fails

Examples

Process a directory of ProteinGym CSV files:

>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions/")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)

Process a zip file:

>>> pipeline = create_protein_gym_cleaner("DMS_ProteinGym_substitutions.zip")
>>> pipeline, dataset = clean_protein_gym_dataset(pipeline)

Custom configuration:

>>> config = {
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... }
>>> pipeline = create_protein_gym_cleaner("data/", config=config)

Load configuration from file:

>>> pipeline = create_protein_gym_cleaner("data/", config="config.json")
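
Since each create_* function accepts a path to a JSON configuration file, such a file can be produced from a plain dict whose keys mirror the dataclass fields above (the exact set of accepted keys is an assumption; unknown keys may be rejected during validation):

>>> import json
>>> config = {
...     "num_workers": 8,
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... }
>>> with open("config.json", "w") as f:
...     json.dump(config, f, indent=2)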