tidymut.cleaners.basic_cleaners module
- tidymut.cleaners.basic_cleaners.apply_mutations_to_sequences(dataset: pd.DataFrame, sequence_column: str = 'sequence', name_column: str = 'name', mutation_column: str = 'mut_info', position_columns: Dict[str, str] | None = None, mutation_sep: str = ',', is_zero_based: bool = True, sequence_type: str = 'protein', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame] [source]
Apply mutations to sequences to generate mutated sequences.
This function takes mutation information and applies it to wild-type sequences to generate the corresponding mutated sequences. It supports parallel processing and can handle position-based sequence extraction.
- Parameters:
dataset (pd.DataFrame) – Input dataset containing mutation information and sequence data
sequence_column (str, default='sequence') – Column name containing wild-type sequences
name_column (str, default='name') – Column name containing protein identifiers
mutation_column (str, default='mut_info') – Column name containing mutation information
position_columns (Optional[Dict[str, str]], default=None) – Position column mapping {'start': 'start_col', 'end': 'end_col'}, used for extracting sequence regions
mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string
is_zero_based (bool, default=True) – Whether the original mutation positions are zero-based
sequence_type (str, default='protein') – Type of sequence (‘protein’, ‘dna’, ‘rna’)
num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs
- Returns:
(successful_dataset, failed_dataset) - datasets with and without errors
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'sequence': ['AKCDEF', 'AKCDEF', 'FEGHIS'],
...     'mut_info': ['A0K', 'C2D', 'E1F'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> successful, failed = apply_mutations_to_sequences(df)
>>> print(successful['mut_seq'].tolist())
['KKCDEF', 'AKDDEF', 'FFGHIS']
>>> print(len(failed))  # Should be 0 if all mutations are valid
0
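The substitution notation used above ('A0K' = wild-type residue, zero-based position, mutant residue) can be illustrated outside the library. A minimal sketch of the idea, not tidymut's actual implementation:

```python
def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply a single zero-based substitution like 'A0K' to a sequence."""
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[pos] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {sequence[pos]}")
    return sequence[:pos] + mut + sequence[pos + 1:]

# Multiple mutations in one string are applied sequentially
seq = "AKCDEF"
for m in "A0K,C2D".split(","):
    seq = apply_mutation(seq, m)
print(seq)  # KKDDEF
```

The wild-type check mirrors why the function returns a failed_dataset: rows whose stated wild-type residue disagrees with the sequence cannot be applied.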
- tidymut.cleaners.basic_cleaners.convert_data_types(dataset: pd.DataFrame, type_conversions: Dict[str, str | Type | np.dtype], handle_errors: str = 'coerce', optimize_memory: bool = True, use_batch_processing: bool = False, chunk_size: int = 10000) → pd.DataFrame [source]
Convert data types for specified columns.
This function provides unified data type conversion with error handling options. Supports pandas, numpy, and Python built-in types with memory optimization.
- Parameters:
dataset (pd.DataFrame) – Input dataset with columns to be converted
type_conversions (Dict[str, Union[str, Type, np.dtype]]) – Type conversion mapping in format {column_name: target_type}. Supported formats:
- String types: 'float', 'int', 'str', 'category', 'bool', 'datetime'
- Numpy types: np.float32, np.float64, np.int32, np.int64, etc.
- Pandas types: 'Int64', 'Float64', 'string', 'boolean'
- Python types: float, int, str, bool
handle_errors (str, default='coerce') – Error handling strategy: ‘raise’, ‘coerce’, or ‘ignore’
optimize_memory (bool, default=True) – Whether to automatically optimize memory usage by choosing smaller dtypes
use_batch_processing (bool, default=False) – Whether to use batch processing for large datasets
chunk_size (int, default=10000) – Chunk size when using batch processing
- Returns:
Dataset with converted data types
- Return type:
pd.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     'score': ['1.5', '2.3', '3.7'],
...     'count': ['10', '20', '30'],
...     'name': [123, 456, 789],
...     'flag': ['True', 'False', 'True']
... })
>>> conversions = {
...     'score': np.float32,
...     'count': 'Int64',
...     'name': 'string',
...     'flag': 'boolean'
... }
>>> result = convert_data_types(df, conversions)
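The 'coerce' strategy follows pandas' own semantics: values that cannot be converted become missing instead of raising. A small plain-pandas illustration of that behaviour (not tidymut's internals):

```python
import pandas as pd

s = pd.Series(['1.5', 'oops', '3.7'])
# errors='coerce': the unparseable 'oops' becomes NaN
coerced = pd.to_numeric(s, errors='coerce')
print(coerced.tolist())
# errors='raise' would instead raise a ValueError on 'oops';
# 'ignore' would leave the column unconverted
```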
- tidymut.cleaners.basic_cleaners.convert_to_mutation_dataset_format(df: pd.DataFrame, name_column: str = 'name', mutation_column: str = 'mut_info', sequence_column: str | None = None, mutated_sequence_column: str = 'mut_seq', sequence_type: Literal['protein', 'dna', 'rna'] = 'protein', label_column: str = 'score', include_wild_type: bool = False, mutation_set_prefix: str = 'set', is_zero_based: bool = False, additional_metadata: Dict[str, Any] | None = None) → Tuple[pd.DataFrame, Dict[str, str]] [source]
Convert a mutation DataFrame to the format required by MutationDataset.from_dataframe().
This function supports two input formats:
1. Format with WT rows: contains explicit 'WT' entries with wild-type sequences
2. Format with sequence column: each row contains the wild-type sequence
- Parameters:
df (pd.DataFrame) –
Input DataFrame. Supports two formats:
Format 1 (with WT rows):
- name: protein identifier
- mut_info: mutation info ('A0S') or 'WT' for wild-type
- mut_seq: mutated or wild-type sequence
- score: numerical score
Format 2 (with sequence column):
- name: protein identifier
- sequence: wild-type sequence
- mut_info: mutation info ('A0S')
- mut_seq: mutated sequence
- score: numerical score
name_column (str, default='name') – Column name containing protein/sequence identifiers.
mutation_column (str, default='mut_info') – Column name containing mutation information. Expected formats:
- 'A0S': amino acid mutation (wild_type + position + mutant_type)
- 'WT': wild-type sequence (only in Format 1)
sequence_column (Optional[str], default=None) – Column name containing wild-type sequences (Format 2 only). If provided, assumes Format 2. If None, assumes Format 1.
mutated_sequence_column (str, default='mut_seq') – Column name containing the mutated sequences.
label_column (str, default='score') – Column name containing scores or other numerical values.
include_wild_type (bool, default=False) – Whether to include wild-type (WT) entries in the output. Only applies to Format 1 with explicit WT rows.
mutation_set_prefix (str, default='set') – Prefix used for generating mutation set IDs.
is_zero_based (bool, default=False) – Whether mutation positions are zero-based.
additional_metadata (Optional[Dict[str, Any]], default=None) – Additional metadata to add to all mutation sets.
- Returns:
(converted_dataframe, reference_sequences_dict)
converted_dataframe: DataFrame in MutationDataset.from_dataframe() format
reference_sequences_dict: Dictionary mapping reference_id to wild-type sequences (extracted from WT rows in Format 1 or from the sequence column in Format 2)
- Return type:
Tuple[pd.DataFrame, Dict[str, str]]
- Raises:
ValueError – If required columns are missing or mutation strings cannot be parsed.
Examples
>>> import pandas as pd
Format 1: With WT rows and multi-mutations
>>> df1 = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot1', 'prot2', 'prot2'],
...     'mut_info': ['A0S,Q1D', 'C2D', 'WT', 'E0F', 'WT'],
...     'mut_seq': ['SDCDEF', 'AQDDEF', 'AQCDEF', 'FGHIGHK', 'EGHIGHK'],
...     'score': [1.5, 2.0, 0.0, 3.0, 0.0]
... })
>>> result_df1, ref_seqs1 = convert_to_mutation_dataset_format(df1)
>>> # Input has 5 rows but output has 6 rows (A0S,Q1D -> 2 rows)
Format 2: With sequence column and multi-mutations
>>> df2 = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'sequence': ['AKCDEF', 'AKCDEF', 'FEGHIS'],
...     'mut_info': ['A0K,C2D', 'Q1P', 'E1F'],
...     'score': [1.5, 2.0, 3.0],
...     'mut_seq': ['KKDDEF', 'APCDEF', 'FFGHIS']
... })
>>> result_df2, ref_seqs2 = convert_to_mutation_dataset_format(
...     df2, sequence_column='sequence'
... )
>>> print(ref_seqs2['prot1'])
AKCDEF
>>> # First row generates 2 output rows for A0K and C2D mutations
- tidymut.cleaners.basic_cleaners.extract_and_rename_columns(dataset: pd.DataFrame, column_mapping: Dict[str, str], required_columns: Sequence[str] | None = None) → pd.DataFrame [source]
Extract useful columns and rename them to standard format.
This function extracts specified columns from the input dataset and renames them according to the provided mapping. It helps standardize column names across different datasets.
- Parameters:
dataset (pd.DataFrame) – Input dataset containing the data to be processed
column_mapping (Dict[str, str]) – Column name mapping from original names to new names. Format: {original_column_name: new_column_name}
required_columns (Optional[Sequence[str]], default=None) – Required column names. If None, extracts all mapped columns
- Returns:
Dataset with extracted and renamed columns
- Return type:
pd.DataFrame
- Raises:
ValueError – If required columns are missing from the input dataset
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'uniprot_ID': ['P12345', 'Q67890'],
...     'mutation_type': ['A123B', 'C456D'],
...     'score_value': [1.5, -2.3],
...     'extra_col': ['x', 'y']
... })
>>> mapping = {
...     'uniprot_ID': 'name',
...     'mutation_type': 'mut_info',
...     'score_value': 'label'
... }
>>> result = extract_and_rename_columns(df, mapping)
>>> print(result.columns.tolist())
['name', 'mut_info', 'label']
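Setting the required-column validation aside, the behaviour is close to a select-then-rename in plain pandas; a rough sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'uniprot_ID': ['P12345', 'Q67890'],
    'score_value': [1.5, -2.3],
    'extra_col': ['x', 'y'],
})
mapping = {'uniprot_ID': 'name', 'score_value': 'label'}

# Keep only the mapped columns, then rename them; 'extra_col' is dropped
result = df[list(mapping)].rename(columns=mapping)
print(result.columns.tolist())  # ['name', 'label']
```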
- tidymut.cleaners.basic_cleaners.filter_and_clean_data(dataset: pd.DataFrame, filters: Dict[str, Any | Callable[[pd.Series], pd.Series]] | None = None, exclude_patterns: Dict[str, str | List[str]] | None = None, drop_na_columns: List[str] | None = None) → pd.DataFrame [source]
Filter and clean data based on specified conditions.
This function provides flexible data filtering and cleaning capabilities, including value-based filtering, pattern exclusion, and null value removal.
- Parameters:
dataset (pd.DataFrame) – Input dataset to be filtered and cleaned
filters (Optional[Dict[str, Union[Any, Callable[[pd.Series], pd.Series]]]], default=None) – Filter conditions in format {column_name: condition_value_or_function}. If the value is callable, it receives the column as a pd.Series and should return a boolean mask
exclude_patterns (Optional[Dict[str, Union[str, List[str]]]], default=None) – Exclusion patterns in format {column_name: regex_pattern_or_list} Rows matching these patterns will be excluded
drop_na_columns (Optional[List[str]], default=None) – List of column names where null values should be dropped
- Returns:
Filtered and cleaned dataset
- Return type:
pd.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'mut_type': ['A123B', 'wt', 'C456D', 'insert', 'E789F'],
...     'score': [1.5, 2.0, '-', 3.2, 4.1],
...     'quality': ['good', 'bad', 'good', 'good', None]
... })
>>> filters = {'score': lambda x: x != '-'}
>>> exclude_patterns = {'mut_type': ['wt', 'insert']}
>>> drop_na_columns = ['quality']
>>> result = filter_and_clean_data(df, filters, exclude_patterns, drop_na_columns)
>>> print(len(result))  # Only A123B passes all three conditions
1
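The three cleaning stages can be reproduced step by step in plain pandas; a sketch of the presumed semantics (callable filters receive the whole column and return a boolean mask):

```python
import pandas as pd

df = pd.DataFrame({
    'mut_type': ['A123B', 'wt', 'C456D', 'insert', 'E789F'],
    'score': [1.5, 2.0, '-', 3.2, 4.1],
    'quality': ['good', 'bad', 'good', 'good', None],
})
# 1. Callable filter: keep rows where the mask is True (drops the '-' row)
score_filter = lambda col: col != '-'
df = df[score_filter(df['score'])]
# 2. Exclusion patterns: drop rows matching any listed pattern
df = df[~df['mut_type'].str.contains('wt|insert')]
# 3. Drop rows with nulls in the listed columns (drops the E789F row)
df = df.dropna(subset=['quality'])
print(df['mut_type'].tolist())  # ['A123B']
```

Note that with these inputs the dropna step also removes the E789F row, since its quality is None.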
- tidymut.cleaners.basic_cleaners.infer_wildtype_sequences(dataset: pd.DataFrame, name_column: str = 'name', mutation_column: str = 'mut_info', sequence_column: str = 'mut_seq', label_columns: List[str] | None = None, wt_label: float = 0.0, mutation_sep: str = ',', is_zero_based: bool = False, sequence_type: Literal['protein', 'dna', 'rna'] = 'protein', handle_multiple_wt: Literal['error', 'separate', 'first'] = 'error', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame] [source]
Infer wild-type sequences from mutated sequences and add WT rows.
This function takes mutated sequences and their corresponding mutations to infer the original wild-type sequences. For each protein, it adds WT row(s) to the dataset with the inferred wild-type sequence.
- Parameters:
dataset (pd.DataFrame) – Input dataset containing mutated sequences and mutation information
name_column (str, default='name') – Column name containing protein identifiers
mutation_column (str, default='mut_info') – Column name containing mutation information
sequence_column (str, default='mut_seq') – Column name containing mutated sequences
label_columns (Optional[List[str]], default=None) – List of label column names to preserve
wt_label (float, default=0.0) – Wild type score for WT rows
mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string
is_zero_based (bool, default=False) – Whether the original mutation positions are zero-based
sequence_type (str, default='protein') – Type of sequence (‘protein’, ‘dna’, ‘rna’)
handle_multiple_wt (Literal["error", "separate", "first"], default='error') – How to handle multiple wild-type sequences: ‘separate’, ‘first’, or ‘error’
num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs
- Returns:
(successful_dataset, problematic_dataset) - datasets with added WT rows
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'mut_info': ['A0S', 'C2D', 'E0F'],
...     'mut_seq': ['SQCDEF', 'AQDDEF', 'FGHIGHK'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> success, failed = infer_wildtype_sequences(
...     df, label_columns=['score']
... )
>>> print(len(success))  # Should have original rows + WT rows
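Inference is the inverse of applying a mutation: 'A0S' on mutated sequence 'SQCDEF' implies the wild type carried 'A' at position 0. A minimal sketch of that inversion (zero-based positions, illustration only, not tidymut's implementation):

```python
def infer_wildtype(mut_seq: str, mutation: str) -> str:
    """Recover the wild type from a mutated sequence and a zero-based mutation."""
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    if mut_seq[pos] != mut:
        raise ValueError(f"expected mutant {mut} at position {pos}, found {mut_seq[pos]}")
    return mut_seq[:pos] + wt + mut_seq[pos + 1:]

# Both prot1 rows from the example reconstruct the same wild type
print(infer_wildtype("SQCDEF", "A0S"))  # AQCDEF
print(infer_wildtype("AQDDEF", "C2D"))  # AQCDEF
```

When different rows of the same protein disagree on the inferred wild type, handle_multiple_wt decides whether to raise an error, keep separate entries, or take the first.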
- tidymut.cleaners.basic_cleaners.merge_columns(dataset: pd.DataFrame, columns_to_merge: List[str], new_column_name: str, separator: str = '_', drop_original: bool = False, na_rep: str | None = None, prefix: str | None = None, suffix: str | None = None, custom_formatter: Callable[[pd.Series], str] | None = None) → pd.DataFrame [source]
Merge multiple columns into a single column using a separator
This function combines values from multiple columns into a new column, with flexible formatting options.
- Parameters:
dataset (pd.DataFrame) – Input dataset
columns_to_merge (List[str]) – List of column names to merge
new_column_name (str) – Name for the new merged column
separator (str, default='_') – Separator to use between values
drop_original (bool, default=False) – Whether to drop the original columns after merging
na_rep (Optional[str], default=None) – String representation of NaN values. If None, NaN values are skipped.
prefix (Optional[str], default=None) – Prefix to add to the merged value
suffix (Optional[str], default=None) – Suffix to add to the merged value
custom_formatter (Optional[Callable], default=None) – Custom function to format each row. Takes a pd.Series and returns a string. If provided, ignores separator, prefix, suffix parameters.
- Returns:
Dataset with the new merged column
- Return type:
pd.DataFrame
Examples
Basic usage:
>>> df = pd.DataFrame({
...     'gene': ['BRCA1', 'TP53', 'EGFR'],
...     'position': [100, 200, 300],
...     'mutation': ['A', 'T', 'G']
... })
>>> result = merge_columns(df, ['gene', 'position', 'mutation'], 'mutation_id', separator='_')
>>> print(result['mutation_id'])
0    BRCA1_100_A
1    TP53_200_T
2    EGFR_300_G
With prefix and suffix:
>>> result = merge_columns(
...     df, ['gene', 'position'], 'gene_pos',
...     separator=':', prefix='[', suffix=']'
... )
>>> print(result['gene_pos'])
0    [BRCA1:100]
1    [TP53:200]
2    [EGFR:300]
Handling NaN values:
>>> df_with_nan = pd.DataFrame({
...     'col1': ['A', 'B', None],
...     'col2': ['X', None, 'Z'],
...     'col3': [1, 2, 3]
... })
>>> result = merge_columns(
...     df_with_nan, ['col1', 'col2', 'col3'], 'merged',
...     separator='-', na_rep='NA'
... )
>>> print(result['merged'])
0    A-X-1
1    B-NA-2
2    NA-Z-3
Custom formatter:
>>> def format_mutation(row):
...     return f"{row['gene']}:{row['position']}{row['mutation']}"
>>> result = merge_columns(
...     df, ['gene', 'position', 'mutation'], 'hgvs',
...     custom_formatter=format_mutation
... )
>>> print(result['hgvs'])
0    BRCA1:100A
1    TP53:200T
2    EGFR:300G
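With default options and no missing values, the result resembles pandas' row-wise string aggregation; a rough plain-pandas equivalent:

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['BRCA1', 'TP53', 'EGFR'],
    'position': [100, 200, 300],
    'mutation': ['A', 'T', 'G'],
})
# Join each row's values with '_' (roughly merge_columns' default behaviour)
df['mutation_id'] = df[['gene', 'position', 'mutation']].astype(str).agg('_'.join, axis=1)
print(df['mutation_id'].tolist())  # ['BRCA1_100_A', 'TP53_200_T', 'EGFR_300_G']
```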
- tidymut.cleaners.basic_cleaners.read_dataset(file_path: str | Path, file_format: str | None = None, **kwargs) → pd.DataFrame [source]
Read dataset from specified file format and return as a pandas DataFrame.
- Parameters:
file_path (Union[str, Path]) – Path to the dataset file
file_format (Optional[str], default=None) – Format of the dataset file ('csv', 'tsv', 'xlsx', etc.). If None, the format is inferred from the file extension
kwargs (Dict[str, Any]) – Additional keyword arguments passed to the underlying reader
- Returns:
Dataset loaded from the specified file
- Return type:
pd.DataFrame
Example
>>> # Specify file_format explicitly
>>> df = read_dataset("data.csv", "csv")
>>>
>>> # Detect file format automatically
>>> df = read_dataset("data.csv")
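Automatic detection presumably keys off the file extension; a sketch of that dispatch pattern (the reader table here is a hypothetical illustration, not tidymut's actual mapping):

```python
from pathlib import Path
import pandas as pd

# Hypothetical suffix-to-reader table for illustration
READERS = {
    '.csv': pd.read_csv,
    '.tsv': lambda path, **kw: pd.read_csv(path, sep='\t', **kw),
    '.xlsx': pd.read_excel,
}

def detect_reader(file_path):
    """Pick a reader based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return READERS[suffix]

print(detect_reader('data.csv') is pd.read_csv)  # True
```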
- tidymut.cleaners.basic_cleaners.validate_mutations(dataset: pd.DataFrame, mutation_column: str = 'mut_info', format_mutations: bool = True, mutation_sep: str = ',', is_zero_based: bool = False, cache_results: bool = True, num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame] [source]
Validate and format mutation information.
This function validates mutation strings, optionally formats them to a standard representation, and separates valid and invalid mutations into different datasets. It supports caching for improved performance on datasets with repeated mutations.
- Parameters:
dataset (pd.DataFrame) – Input dataset containing mutation information
mutation_column (str, default='mut_info') – Name of the column containing mutation information
format_mutations (bool, default=True) – Whether to format mutations to standard representation
mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string (e.g., ‘A123B,C456D’)
is_zero_based (bool, default=False) – Whether the original mutation positions are zero-based
cache_results (bool, default=True) – Whether to cache formatting results for performance
num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs
- Returns:
(successful_dataset, failed_dataset) - datasets with valid and invalid mutations
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['protein1', 'protein1', 'protein2'],
...     'mut_info': ['A123S', 'C456D,E789F', 'InvalidMut'],
...     'score': [1.5, 2.3, 3.7]
... })
>>> successful, failed = validate_mutations(df, mutation_column='mut_info', mutation_sep=',')
>>> print(len(successful))  # Should be 2 (valid mutations)
2
>>> print(successful['mut_info'].tolist())  # Formatted mutations
['A123S', 'C456D,E789F']
>>> print(len(failed))  # Should be 1 (invalid mutation)
1
>>> print(failed['error_message'].iloc[0])  # Error message for the failed mutation
'ValueError: No valid mutations could be parsed...'
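The single-substitution notation can be checked with a regular expression; a hedged sketch of what "valid" means here (tidymut's real parser may accept more forms):

```python
import re

# wild-type residue + integer position + mutant residue, e.g. 'A123S'
MUTATION_RE = re.compile(r'[A-Z]\d+[A-Z]')

def is_valid(mut_info: str, sep: str = ',') -> bool:
    """Validate every mutation in a separator-joined string."""
    return all(MUTATION_RE.fullmatch(m.strip()) for m in mut_info.split(sep))

print(is_valid('C456D,E789F'))  # True
print(is_valid('InvalidMut'))   # False
```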