tidymut.cleaners.human_domainome_custom_cleaners module
- tidymut.cleaners.human_domainome_custom_cleaners.add_sequences_to_dataset(dataset: pd.DataFrame, sequence_dict: Dict[str, str], name_column: str = 'name') → Tuple[pd.DataFrame, pd.DataFrame]
Add full wild-type sequences to the dataset from a sequence dictionary.
This function maps sequences from a dictionary to the dataset. Records without matching sequences are separated into the failed dataset.
- Parameters:
dataset (pd.DataFrame) – Dataset containing protein names
sequence_dict (Dict[str, str]) – Mapping from protein name to full wild-type sequence
name_column (str, default='name') – Column name containing protein identifiers
- Returns:
(successful_dataset, failed_dataset) - datasets with and without sequences
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> from tidymut.cleaners.human_domainome_custom_cleaners import add_sequences_to_dataset
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot2', 'prot3'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> seq_dict = {'prot1': 'AKCD', 'prot2': 'EFGH'}
>>> successful, failed = add_sequences_to_dataset(df, seq_dict)
>>> print(len(successful))  # Should be 2
2
>>> print(len(failed))  # Should be 1
1
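The split described for add_sequences_to_dataset can be sketched with plain pandas. This is not the library's implementation: the helper name split_by_sequence_lookup and the output column name 'sequence' are assumptions made for illustration only.

```python
from typing import Dict, Tuple

import pandas as pd


def split_by_sequence_lookup(
    dataset: pd.DataFrame,
    sequence_dict: Dict[str, str],
    name_column: str = "name",
    sequence_column: str = "sequence",  # assumed output column name
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: attach sequences and split rows by lookup success."""
    result = dataset.copy()
    # Series.map with a dict yields NaN for names that are absent from the dict.
    result[sequence_column] = result[name_column].map(sequence_dict)
    has_sequence = result[sequence_column].notna()
    successful = result[has_sequence].reset_index(drop=True)
    # Failed rows are taken from the input, so they keep only the original columns.
    failed = dataset[~has_sequence].reset_index(drop=True)
    return successful, failed


df = pd.DataFrame({"name": ["prot1", "prot2", "prot3"], "score": [1.0, 2.0, 3.0]})
successful, failed = split_by_sequence_lookup(df, {"prot1": "AKCD", "prot2": "EFGH"})
print(successful["sequence"].tolist())  # ['AKCD', 'EFGH']
print(failed["name"].tolist())  # ['prot3']
```

Returning the unmatched rows instead of dropping them keeps the failures available for inspection, which matches the (successful_dataset, failed_dataset) contract described above.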
- tidymut.cleaners.human_domainome_custom_cleaners.extract_domain_sequences(dataset: pd.DataFrame, sequence_column: str = 'sequence', start_pos_column: str = 'start_pos', end_pos_column: str = 'end_pos', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame]
Extract domain sequences from full sequences using position information.
This function extracts domain subsequences based on start and end positions. Records with invalid positions or missing sequences are separated into the failed dataset.
- Parameters:
dataset (pd.DataFrame) – Dataset with full sequences and position information
sequence_column (str, default='sequence') – Column containing full wild-type sequences
start_pos_column (str, default='start_pos') – Column containing domain start positions (0-based)
end_pos_column (str, default='end_pos') – Column containing domain end positions (exclusive in the example, where start 2 and end 7 yield 'CDEFG')
num_workers (int, default=4) – Number of parallel workers for processing; set to -1 to use all available CPUs
- Returns:
(successful_dataset, failed_dataset) - datasets without and with extraction errors, respectively
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> from tidymut.cleaners.human_domainome_custom_cleaners import extract_domain_sequences
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot2', 'prot3'],
...     'sequence': ['ABCDEFGHIJ', 'KLMNOPQRST', None],
...     'start_pos': [2, 0, 5],
...     'end_pos': [7, 4, 10]
... })
>>> successful, failed = extract_domain_sequences(df)
>>> print(successful['sequence'].tolist())
['CDEFG', 'KLMN']
>>> print(len(failed))  # Should be 1 (the None sequence)
1
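The per-record check behind extract_domain_sequences might look like the following sketch. It assumes a 0-based start and an exclusive end, which is what the example output suggests; slice_domain is a hypothetical helper, not part of tidymut, and the real function may validate positions differently.

```python
from typing import Optional


def slice_domain(sequence: Optional[str], start: int, end: int) -> Optional[str]:
    """Hypothetical helper: return sequence[start:end], or None for invalid input."""
    if not isinstance(sequence, str) or not sequence:
        return None  # missing or empty sequence
    if not (0 <= start < end <= len(sequence)):
        return None  # out-of-range, empty, or inverted interval
    return sequence[start:end]


rows = [("ABCDEFGHIJ", 2, 7), ("KLMNOPQRST", 0, 4), (None, 5, 10)]
domains = [slice_domain(seq, start, end) for seq, start, end in rows]
print(domains)  # ['CDEFG', 'KLMN', None]
```

Rows for which the helper returns None would correspond to the failed dataset.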
- tidymut.cleaners.human_domainome_custom_cleaners.process_domain_positions(dataset: pd.DataFrame) → Tuple[pd.DataFrame, pd.DataFrame]
Process domain position information from PFAM entries.
This function extracts position information from PFAM entries and calculates relative mutation positions. It handles parsing errors by separating failed records.
- Parameters:
dataset (pd.DataFrame) – Dataset with PFAM_entry column containing position information
- Returns:
(successful_dataset, failed_dataset) - datasets without and with parsing errors, respectively
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> from tidymut.cleaners.human_domainome_custom_cleaners import process_domain_positions
>>> df = pd.DataFrame({
...     'PFAM_entry': ['PF00001/10-100', 'PF00002/20-200', 'invalid_entry'],
...     'pos': [15, 25, 30],
...     'wt_aa': ['A', 'C', 'D'],
...     'mut_aa': ['K', 'Y', 'E']
... })
>>> successful, failed = process_domain_positions(df)
>>> print(len(successful))  # Should be 2
2
>>> print(len(failed))  # Should be 1
1
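As an illustration of the parsing step, the sketch below splits an entry of the shape shown in the example ('<accession>/<start>-<end>') and derives a relative mutation position. Both the entry pattern and the offset convention (pos minus domain start) are assumptions; the library may use a different format or indexing base, and parse_pfam_entry and relative_position are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Assumed entry shape, taken from the example: "<accession>/<start>-<end>".
PFAM_ENTRY_RE = re.compile(r"^(?P<accession>[^/]+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_pfam_entry(entry: str) -> Optional[Tuple[str, int, int]]:
    """Hypothetical helper: return (accession, start, end), or None if unparseable."""
    match = PFAM_ENTRY_RE.match(entry)
    if match is None:
        return None
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        return None  # inverted range is treated as a parsing failure
    return match["accession"], start, end


def relative_position(pos: int, domain_start: int) -> int:
    """Assumed convention: offset of the mutation from the domain start."""
    return pos - domain_start


parsed = parse_pfam_entry("PF00001/10-100")
print(parsed)  # ('PF00001', 10, 100)
print(relative_position(15, parsed[1]))  # 5
print(parse_pfam_entry("invalid_entry"))  # None
```

Entries for which parsing returns None would end up in the failed dataset.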