tidymut.cleaners.human_domainome_custom_cleaners module

tidymut.cleaners.human_domainome_custom_cleaners.add_sequences_to_dataset(dataset: pd.DataFrame, sequence_dict: Dict[str, str], name_column: str = 'name') → Tuple[pd.DataFrame, pd.DataFrame]

Add full wild-type sequences to the dataset from a sequence dictionary

This function maps sequences from a dictionary to the dataset. Records without matching sequences are separated into the failed dataset.

Parameters:
  • dataset (pd.DataFrame) – Dataset containing protein names

  • sequence_dict (Dict[str, str]) – Mapping from protein name to full wild-type sequence

  • name_column (str, default='name') – Column name containing protein identifiers

Returns:

(successful_dataset, failed_dataset) - datasets with and without sequences

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot2', 'prot3'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> seq_dict = {'prot1': 'AKCD', 'prot2': 'EFGH'}
>>> successful, failed = add_sequences_to_dataset(df, seq_dict)
>>> print(len(successful))  # Should be 2
2
>>> print(len(failed))  # Should be 1
1
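
The map-and-split behaviour described above can be sketched in plain pandas. This is only an illustration, not tidymut's implementation: the helper name split_by_sequence_lookup and the output column name 'sequence' are assumptions made for the sketch.

```python
from typing import Dict, Tuple

import pandas as pd


def split_by_sequence_lookup(
    dataset: pd.DataFrame,
    sequence_dict: Dict[str, str],
    name_column: str = "name",
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper (not part of tidymut): attach sequences and split rows."""
    # Series.map yields NaN for names that are absent from the dictionary.
    sequences = dataset[name_column].map(sequence_dict)
    found = sequences.notna()
    # Assumption: the looked-up sequence is stored in a column called 'sequence'.
    successful = dataset[found].assign(sequence=sequences[found])
    failed = dataset[~found].copy()
    return successful, failed


df = pd.DataFrame({"name": ["prot1", "prot2", "prot3"], "score": [1.0, 2.0, 3.0]})
seq_dict = {"prot1": "AKCD", "prot2": "EFGH"}
successful, failed = split_by_sequence_lookup(df, seq_dict)
```

Here prot3 has no entry in seq_dict, so it lands in the failed frame while the other two rows gain a sequence.
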
tidymut.cleaners.human_domainome_custom_cleaners.extract_domain_sequences(dataset: pd.DataFrame, sequence_column: str = 'sequence', start_pos_column: str = 'start_pos', end_pos_column: str = 'end_pos', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame]

Extract domain sequences from full sequences using position information

This function extracts domain subsequences based on start and end positions. Records with invalid positions or missing sequences are separated into the failed dataset.

Parameters:
  • dataset (pd.DataFrame) – Dataset with full sequences and position information

  • sequence_column (str, default='sequence') – Column containing full wild-type sequences

  • start_pos_column (str, default='start_pos') – Column containing domain start positions (0-based)

  • end_pos_column (str, default='end_pos') – Column containing domain end positions (0-based and exclusive, as the example's slice from 2 to 7 yielding 'CDEFG' indicates)

  • num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs

Returns:

(successful_dataset, failed_dataset) - datasets without and with extraction errors

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot2', 'prot3'],
...     'sequence': ['ABCDEFGHIJ', 'KLMNOPQRST', None],
...     'start_pos': [2, 0, 5],
...     'end_pos': [7, 4, 10]
... })
>>> successful, failed = extract_domain_sequences(df)
>>> print(successful['sequence'].tolist())
['CDEFG', 'KLMN']
>>> print(len(failed))  # Should be 1 (the None sequence)
1
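
The slicing shown in the example matches Python's half-open slice sequence[start:end]. A minimal sequential sketch follows, assuming that a row fails when the sequence is not a string or the positions fall outside it. The helper slice_domain and that validity rule are assumptions for illustration, and the parallel processing controlled by num_workers is not reproduced.

```python
from typing import Optional

import pandas as pd


def slice_domain(sequence, start: int, end: int) -> Optional[str]:
    """Hypothetical helper (not part of tidymut): return the domain or None."""
    if not isinstance(sequence, str):
        return None  # missing sequence
    # Assumed validity rule: 0-based start, exclusive end, both inside the sequence.
    if not (0 <= start < end <= len(sequence)):
        return None
    return sequence[start:end]


df = pd.DataFrame({
    "name": ["prot1", "prot2", "prot3"],
    "sequence": ["ABCDEFGHIJ", "KLMNOPQRST", None],
    "start_pos": [2, 0, 5],
    "end_pos": [7, 4, 10],
})

domains = pd.Series(
    [slice_domain(s, a, b) for s, a, b in zip(df["sequence"], df["start_pos"], df["end_pos"])],
    index=df.index,
)
ok = domains.notna()
successful = df[ok].assign(sequence=domains[ok])  # full sequence replaced by the domain
failed = df[~ok].copy()
```
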
tidymut.cleaners.human_domainome_custom_cleaners.process_domain_positions(dataset: pd.DataFrame) → Tuple[pd.DataFrame, pd.DataFrame]

Process domain position information from PFAM entries

This function extracts position information from PFAM entries and calculates relative mutation positions. It handles parsing errors by separating failed records.

Parameters:

dataset (pd.DataFrame) – Dataset with PFAM_entry column containing position information

Returns:

(successful_dataset, failed_dataset) - datasets without and with parsing errors

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'PFAM_entry': ['PF00001/10-100', 'PF00002/20-200', 'invalid_entry'],
...     'pos': [15, 25, 30],
...     'wt_aa': ['A', 'C', 'D'],
...     'mut_aa': ['K', 'Y', 'E']
... })
>>> successful, failed = process_domain_positions(df)
>>> print(len(successful))  # Should be 2
2
>>> print(len(failed))  # Should be 1
1
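
The parsing step can be sketched with a regular expression, assuming entries follow the 'PFxxxxx/start-end' shape used in the example above. The pattern and the helper parse_pfam_entry are assumptions for illustration, not tidymut's actual parser. The sketch stops at parsing, since the source does not state how the relative mutation position is indexed.

```python
import re
from typing import Optional, Tuple

# Assumed entry shape, taken from the example: accession, slash, start-end.
PFAM_PATTERN = re.compile(r"^(?P<pfam_id>PF\d+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_pfam_entry(entry) -> Optional[Tuple[str, int, int]]:
    """Hypothetical helper (not part of tidymut): parse one PFAM_entry value."""
    if not isinstance(entry, str):
        return None  # missing or non-text entry
    match = PFAM_PATTERN.match(entry)
    if match is None:
        return None  # entry does not follow the assumed shape
    return match["pfam_id"], int(match["start"]), int(match["end"])


entries = ["PF00001/10-100", "PF00002/20-200", "invalid_entry"]
parsed = [parse_pfam_entry(e) for e in entries]
n_failed = sum(p is None for p in parsed)
```

Entries for which the helper returns None correspond to the rows that would be routed to the failed dataset.
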