tidymut.cleaners.basic_cleaners module

tidymut.cleaners.basic_cleaners.apply_mutations_to_sequences(dataset: pd.DataFrame, sequence_column: str = 'sequence', name_column: str = 'name', mutation_column: str = 'mut_info', position_columns: Dict[str, str] | None = None, mutation_sep: str = ',', is_zero_based: bool = True, sequence_type: str = 'protein', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame][source]

Apply mutations to sequences to generate mutated sequences.

This function takes mutation information and applies it to wild-type sequences to generate the corresponding mutated sequences. It supports parallel processing and can handle position-based sequence extraction.

Parameters:
  • dataset (pd.DataFrame) – Input dataset containing mutation information and sequence data

  • sequence_column (str, default='sequence') – Column name containing wild-type sequences

  • name_column (str, default='name') – Column name containing protein identifiers

  • mutation_column (str, default='mut_info') – Column name containing mutation information

  • position_columns (Optional[Dict[str, str]], default=None) – Position column mapping {"start": "start_col", "end": "end_col"}, used for extracting sequence regions

  • mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string

  • is_zero_based (bool, default=True) – Whether the input mutation positions are zero-based

  • sequence_type (str, default='protein') – Type of sequence ('protein', 'dna', 'rna')

  • num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs

Returns:

(successful_dataset, failed_dataset) - datasets of rows processed without errors and rows that failed

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'sequence': ['AKCDEF', 'AKCDEF', 'FEGHIS'],
...     'mut_info': ['A0K', 'C2D', 'E1F'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> successful, failed = apply_mutations_to_sequences(df)
>>> print(successful['mut_seq'].tolist())
['KKCDEF', 'AKDDEF', 'FFGHIS']
>>> print(len(failed))  # Should be 0 if all mutations are valid
0
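The position_columns mapping can also be supplied. The call below is a hedged sketch that assumes the mapped start/end columns mark the region of the stored sequence that the mutation positions refer to; the 'start' and 'end' column names are illustrative, not required names.

>>> df_region = df.assign(start=[0, 0, 0], end=[6, 6, 6])
>>> successful, failed = apply_mutations_to_sequences(
...     df_region, position_columns={'start': 'start', 'end': 'end'}
... )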
tidymut.cleaners.basic_cleaners.convert_data_types(dataset: pd.DataFrame, type_conversions: Dict[str, str | Type | np.dtype], handle_errors: str = 'coerce', optimize_memory: bool = True, use_batch_processing: bool = False, chunk_size: int = 10000) → pd.DataFrame[source]

Convert data types for specified columns.

This function provides unified data type conversion with error handling options. Supports pandas, numpy, and Python built-in types with memory optimization.

Parameters:
  • dataset (pd.DataFrame) – Input dataset with columns to be converted

  • type_conversions (Dict[str, Union[str, Type, np.dtype]]) – Type conversion mapping in format {column_name: target_type}. Supported formats:

    - String types: 'float', 'int', 'str', 'category', 'bool', 'datetime'
    - Numpy types: np.float32, np.float64, np.int32, np.int64, etc.
    - Pandas types: 'Int64', 'Float64', 'string', 'boolean'
    - Python types: float, int, str, bool

  • handle_errors (str, default='coerce') – Error handling strategy: 'raise', 'coerce', or 'ignore'

  • optimize_memory (bool, default=True) – Whether to automatically optimize memory usage by choosing smaller dtypes

  • use_batch_processing (bool, default=False) – Whether to use batch processing for large datasets

  • chunk_size (int, default=10000) – Chunk size when using batch processing

Returns:

Dataset with converted data types

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     'score': ['1.5', '2.3', '3.7'],
...     'count': ['10', '20', '30'],
...     'name': [123, 456, 789],
...     'flag': ['True', 'False', 'True']
... })
>>> conversions = {
...     'score': np.float32,
...     'count': 'Int64',
...     'name': 'string',
...     'flag': 'boolean'
... }
>>> result = convert_data_types(df, conversions)
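A hedged sketch of the 'coerce' error strategy, assuming it follows the usual pandas convention of replacing unparseable values with missing values instead of raising:

>>> df_bad = pd.DataFrame({'score': ['1.5', 'oops', '3.7']})
>>> result_bad = convert_data_types(df_bad, {'score': 'float'}, handle_errors='coerce')
>>> # With handle_errors='raise', the same call would be expected to fail on 'oops'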
tidymut.cleaners.basic_cleaners.convert_to_mutation_dataset_format(df: pd.DataFrame, name_column: str = 'name', mutation_column: str = 'mut_info', sequence_column: str | None = None, mutated_sequence_column: str = 'mut_seq', sequence_type: Literal['protein', 'dna', 'rna'] = 'protein', label_column: str = 'score', include_wild_type: bool = False, mutation_set_prefix: str = 'set', is_zero_based: bool = False, additional_metadata: Dict[str, Any] | None = None) → Tuple[pd.DataFrame, Dict[str, str]][source]

Convert a mutation DataFrame to the format required by MutationDataset.from_dataframe().

This function supports two input formats:

1. Format with WT rows: contains explicit 'WT' entries with wild-type sequences

2. Format with sequence column: each row contains the wild-type sequence

Parameters:
  • df (pd.DataFrame) –

    Input DataFrame. Supports two formats:

    Format 1 (with WT rows):

    - name: protein identifier
    - mut_info: mutation info ('A0S') or 'WT' for wild-type
    - mut_seq: mutated or wild-type sequence
    - score: numerical score

    Format 2 (with sequence column):

    - name: protein identifier
    - sequence: wild-type sequence
    - mut_info: mutation info ('A0S')
    - mut_seq: mutated sequence
    - score: numerical score

  • name_column (str, default='name') – Column name containing protein/sequence identifiers.

  • mutation_column (str, default='mut_info') – Column name containing mutation information. Expected formats:

    - 'A0S': amino acid mutation (wild_type + position + mutant_type)
    - 'WT': wild-type sequence (only in Format 1)

  • sequence_column (Optional[str], default=None) – Column name containing wild-type sequences (Format 2 only). If provided, assumes Format 2. If None, assumes Format 1.

  • mutated_sequence_column (str, default='mut_seq') – Column name containing the mutated sequences.

  • sequence_type (Literal['protein', 'dna', 'rna'], default='protein') – Type of sequence ('protein', 'dna', 'rna').

  • label_column (str, default='score') – Column name containing scores or other numerical values.

  • include_wild_type (bool, default=False) – Whether to include wild-type (WT) entries in the output. Only applies to Format 1 with explicit WT rows.

  • mutation_set_prefix (str, default='set') – Prefix used for generating mutation set IDs.

  • is_zero_based (bool, default=False) – Whether mutation positions are zero-based.

  • additional_metadata (Optional[Dict[str, Any]], default=None) – Additional metadata to add to all mutation sets.

Returns:

(converted_dataframe, reference_sequences_dict)

converted_dataframe: DataFrame in MutationDataset.from_dataframe() format

reference_sequences_dict: Dictionary mapping reference_id to wild-type sequences (extracted from WT rows in Format 1 or from the sequence column in Format 2)

Return type:

Tuple[pd.DataFrame, Dict[str, str]]

Raises:

ValueError – If required columns are missing or mutation strings cannot be parsed.

Examples

>>> import pandas as pd

Format 1: With WT rows and multi-mutations

>>> df1 = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot1', 'prot2', 'prot2'],
...     'mut_info': ['A0S,Q1D', 'C2D', 'WT', 'E0F', 'WT'],
...     'mut_seq': ['SDCDEF', 'AQDDEF', 'AQCDEF', 'FGHIGHK', 'EGHIGHK'],
...     'score': [1.5, 2.0, 0.0, 3.0, 0.0]
... })
>>> result_df1, ref_seqs1 = convert_to_mutation_dataset_format(df1)
>>> # Input has 5 rows but output has 6 rows (A0S,Q1D -> 2 rows)

Format 2: With sequence column and multi-mutations

>>> df2 = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'sequence': ['AKCDEF', 'AKCDEF', 'FEGHIS'],
...     'mut_info': ['A0K,C2D', 'Q1P', 'E1F'],
...     'score': [1.5, 2.0, 3.0],
...     'mut_seq': ['KKDDEF', 'APCDEF', 'FFGHIS']
... })
>>> result_df2, ref_seqs2 = convert_to_mutation_dataset_format(
...     df2, sequence_column='sequence'
... )
>>> print(ref_seqs2['prot1'])
AKCDEF
>>> # First row generates 2 output rows for A0K and C2D mutations
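Keeping the explicit WT rows from Format 1 is controlled by include_wild_type; a minimal sketch (exact output layout not shown here):

>>> result_wt, ref_seqs_wt = convert_to_mutation_dataset_format(
...     df1, include_wild_type=True
... )
>>> # ref_seqs_wt is still expected to map 'prot1' and 'prot2' to their wild-type sequences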
tidymut.cleaners.basic_cleaners.extract_and_rename_columns(dataset: pd.DataFrame, column_mapping: Dict[str, str], required_columns: Sequence[str] | None = None) → pd.DataFrame[source]

Extract useful columns and rename them to standard format.

This function extracts specified columns from the input dataset and renames them according to the provided mapping. It helps standardize column names across different datasets.

Parameters:
  • dataset (pd.DataFrame) – Input dataset containing the data to be processed

  • column_mapping (Dict[str, str]) – Column name mapping from original names to new names, in format {original_column_name: new_column_name}

  • required_columns (Optional[Sequence[str]], default=None) – Required column names. If None, extracts all mapped columns

Returns:

Dataset with extracted and renamed columns

Return type:

pd.DataFrame

Raises:

ValueError – If required columns are missing from the input dataset

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'uniprot_ID': ['P12345', 'Q67890'],
...     'mutation_type': ['A123B', 'C456D'],
...     'score_value': [1.5, -2.3],
...     'extra_col': ['x', 'y']
... })
>>> mapping = {
...     'uniprot_ID': 'name',
...     'mutation_type': 'mut_info',
...     'score_value': 'label'
... }
>>> result = extract_and_rename_columns(df, mapping)
>>> print(result.columns.tolist())
['name', 'mut_info', 'label']
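A hedged sketch of the failure path: asking for a column the input does not contain (the 'pdb_id' name below is hypothetical) is expected to raise the ValueError documented above.

>>> try:
...     extract_and_rename_columns(df, {'pdb_id': 'structure'}, required_columns=['pdb_id'])
... except ValueError:
...     print('missing required column')
missing required column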
tidymut.cleaners.basic_cleaners.filter_and_clean_data(dataset: pd.DataFrame, filters: Dict[str, Any | Callable[[pd.Series], pd.Series]] | None = None, exclude_patterns: Dict[str, str | List[str]] | None = None, drop_na_columns: List[str] | None = None) → pd.DataFrame[source]

Filter and clean data based on specified conditions.

This function provides flexible data filtering and cleaning capabilities, including value-based filtering, pattern exclusion, and null value removal.

Parameters:
  • dataset (pd.DataFrame) – Input dataset to be filtered and cleaned

  • filters (Optional[Dict[str, Union[Any, Callable[[pd.Series], pd.Series]]]], default=None) – Filter conditions in format {column_name: condition_value_or_function}. If the value is callable, it is applied to the column to produce the filter condition

  • exclude_patterns (Optional[Dict[str, Union[str, List[str]]]], default=None) – Exclusion patterns in format {column_name: regex_pattern_or_list}. Rows matching these patterns will be excluded

  • drop_na_columns (Optional[List[str]], default=None) – List of column names where null values should be dropped

Returns:

Filtered and cleaned dataset

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'mut_type': ['A123B', 'wt', 'C456D', 'insert', 'E789F'],
...     'score': [1.5, 2.0, '-', 3.2, 4.1],
...     'quality': ['good', 'bad', 'good', 'good', None]
... })
>>> filters = {'score': lambda x: x != '-'}
>>> exclude_patterns = {'mut_type': ['wt', 'insert']}
>>> drop_na_columns = ['quality']
>>> result = filter_and_clean_data(df, filters, exclude_patterns, drop_na_columns)
>>> print(len(result))  # Only the A123B row passes all three steps (E789F is dropped for its null quality)
1
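Scalar filter values are also accepted; this sketch assumes (hedged) that a non-callable value is matched by equality against the column:

>>> result_eq = filter_and_clean_data(df, filters={'quality': 'good'})
>>> # Expected to keep only the rows whose quality equals 'good'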
tidymut.cleaners.basic_cleaners.infer_wildtype_sequences(dataset: pd.DataFrame, name_column: str = 'name', mutation_column: str = 'mut_info', sequence_column: str = 'mut_seq', label_columns: List[str] | None = None, wt_label: float = 0.0, mutation_sep: str = ',', is_zero_based: bool = False, sequence_type: Literal['protein', 'dna', 'rna'] = 'protein', handle_multiple_wt: Literal['error', 'separate', 'first'] = 'error', num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame][source]

Infer wild-type sequences from mutated sequences and add WT rows.

This function takes mutated sequences and their corresponding mutations to infer the original wild-type sequences. For each protein, it adds WT row(s) to the dataset with the inferred wild-type sequence.

Parameters:
  • dataset (pd.DataFrame) – Input dataset containing mutated sequences and mutation information

  • name_column (str, default='name') – Column name containing protein identifiers

  • mutation_column (str, default='mut_info') – Column name containing mutation information

  • sequence_column (str, default='mut_seq') – Column name containing mutated sequences

  • label_columns (Optional[List[str]], default=None) – List of label column names to preserve

  • wt_label (float, default=0.0) – Label value assigned to the added WT rows

  • mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string

  • is_zero_based (bool, default=False) – Whether the input mutation positions are zero-based

  • sequence_type (str, default='protein') – Type of sequence ('protein', 'dna', 'rna')

  • handle_multiple_wt (Literal["error", "separate", "first"], default='error') – How to handle multiple wild-type sequences: 'separate', 'first', or 'error'

  • num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs

Returns:

(successful_dataset, problematic_dataset) - dataset with added WT rows and dataset of entries that could not be processed

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['prot1', 'prot1', 'prot2'],
...     'mut_info': ['A0S', 'C2D', 'E0F'],
...     'mut_seq': ['SQCDEF', 'AQDDEF', 'FGHIGHK'],
...     'score': [1.0, 2.0, 3.0]
... })
>>> success, failed = infer_wildtype_sequences(
...     df, label_columns=['score']
... )
>>> print(len(success))  # Should have original rows + WT rows
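How conflicting inferences for the same protein are resolved is governed by handle_multiple_wt; a hedged sketch, since only the option names are documented above:

>>> success_first, failed_first = infer_wildtype_sequences(
...     df, label_columns=['score'], handle_multiple_wt='first'
... )
>>> # 'separate' would instead be expected to keep distinct inferred WT sequences apart,
>>> # while the default 'error' flags the conflict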
tidymut.cleaners.basic_cleaners.merge_columns(dataset: pd.DataFrame, columns_to_merge: List[str], new_column_name: str, separator: str = '_', drop_original: bool = False, na_rep: str | None = None, prefix: str | None = None, suffix: str | None = None, custom_formatter: Callable[[pd.Series], str] | None = None) → pd.DataFrame[source]

Merge multiple columns into a single column using a separator

This function combines values from multiple columns into a new column, with flexible formatting options.

Parameters:
  • dataset (pd.DataFrame) – Input dataset

  • columns_to_merge (List[str]) – List of column names to merge

  • new_column_name (str) – Name for the new merged column

  • separator (str, default='_') – Separator to use between values

  • drop_original (bool, default=False) – Whether to drop the original columns after merging

  • na_rep (Optional[str], default=None) – String representation of NaN values. If None, NaN values are skipped.

  • prefix (Optional[str], default=None) – Prefix to add to the merged value

  • suffix (Optional[str], default=None) – Suffix to add to the merged value

  • custom_formatter (Optional[Callable], default=None) – Custom function to format each row. Takes a pd.Series and returns a string. If provided, the separator, prefix, and suffix parameters are ignored.

Returns:

Dataset with the new merged column

Return type:

pd.DataFrame

Examples

Basic usage:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'gene': ['BRCA1', 'TP53', 'EGFR'],
...     'position': [100, 200, 300],
...     'mutation': ['A', 'T', 'G']
... })
>>> result = merge_columns(df, ['gene', 'position', 'mutation'], 'mutation_id', separator='_')
>>> print(result['mutation_id'])
0    BRCA1_100_A
1    TP53_200_T
2    EGFR_300_G

With prefix and suffix:

>>> result = merge_columns(
...     df, ['gene', 'position'], 'gene_pos',
...     separator=':', prefix='[', suffix=']'
... )
>>> print(result['gene_pos'])
0    [BRCA1:100]
1    [TP53:200]
2    [EGFR:300]

Handling NaN values:

>>> df_with_nan = pd.DataFrame({
...     'col1': ['A', 'B', None],
...     'col2': ['X', None, 'Z'],
...     'col3': [1, 2, 3]
... })
>>> result = merge_columns(
...     df_with_nan, ['col1', 'col2', 'col3'], 'merged',
...     separator='-', na_rep='NA'
... )
>>> print(result['merged'])
0    A-X-1
1    B-NA-2
2    NA-Z-3

Custom formatter:

>>> def format_mutation(row):
...     return f"{row['gene']}:{row['position']}{row['mutation']}"
>>> result = merge_columns(
...     df, ['gene', 'position', 'mutation'], 'hgvs',
...     custom_formatter=format_mutation
... )
>>> print(result['hgvs'])
0    BRCA1:100A
1    TP53:200T
2    EGFR:300G
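Dropping the source columns after merging, a minimal sketch of the documented drop_original flag:

>>> result = merge_columns(
...     df, ['gene', 'position'], 'gene_pos',
...     separator=':', drop_original=True
... )
>>> 'gene' in result.columns
False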

tidymut.cleaners.basic_cleaners.read_dataset(file_path: str | Path, file_format: str | None = None, **kwargs) → pd.DataFrame[source]

Read dataset from specified file format and return as a pandas DataFrame.

Parameters:
  • file_path (Union[str, Path]) – Path to the dataset file

  • file_format (Optional[str], default=None) – Format of the dataset file ("csv", "tsv", "xlsx", etc.). If None, the format is detected automatically

  • kwargs (Dict[str, Any]) – Additional keyword arguments passed to the underlying file reader

Returns:

Dataset loaded from the specified file

Return type:

pd.DataFrame

Example

>>> # Specify file_format parameter
>>> df = read_dataset("data.csv", "csv")
>>>
>>> # Detect file_format automatically
>>> df = read_dataset("data.csv")
tidymut.cleaners.basic_cleaners.validate_mutations(dataset: pd.DataFrame, mutation_column: str = 'mut_info', format_mutations: bool = True, mutation_sep: str = ',', is_zero_based: bool = False, cache_results: bool = True, num_workers: int = 4) → Tuple[pd.DataFrame, pd.DataFrame][source]

Validate and format mutation information.

This function validates mutation strings, optionally formats them to a standard representation, and separates valid and invalid mutations into different datasets. It supports caching for improved performance on datasets with repeated mutations.

Parameters:
  • dataset (pd.DataFrame) – Input dataset containing mutation information

  • mutation_column (str, default='mut_info') – Name of the column containing mutation information

  • format_mutations (bool, default=True) – Whether to format mutations to standard representation

  • mutation_sep (str, default=',') – Separator used to split multiple mutations in a single string (e.g., 'A123B,C456D')

  • is_zero_based (bool, default=False) – Whether the input mutation positions are zero-based

  • cache_results (bool, default=True) – Whether to cache formatting results for performance

  • num_workers (int, default=4) – Number of parallel workers for processing, set to -1 for all available CPUs

Returns:

(successful_dataset, failed_dataset) - datasets with valid and invalid mutations

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['protein1', 'protein1', 'protein2'],
...     'mut_info': ['A123S', 'C456D,E789F', 'InvalidMut'],
...     'score': [1.5, 2.3, 3.7]
... })
>>> successful, failed = validate_mutations(df, mutation_column='mut_info', mutation_sep=',')
>>> print(len(successful))  # Should be 2 (valid mutations)
2
>>> print(successful['mut_info'].tolist())  # Formatted mutations
['A123S', 'C456D,E789F']
>>> print(len(failed))  # Should be 1 (invalid mutation)
1
>>> print(failed['error_message'].iloc[0])  # Error message for the failed mutation
'ValueError: No valid mutations could be parsed...'
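Validation can also be run without rewriting the mutation strings; a minimal sketch assuming format_mutations=False leaves valid entries untouched:

>>> successful_raw, failed_raw = validate_mutations(df, format_mutations=False)
>>> # Valid rows keep their original 'mut_info' strings; invalid rows still end up in failed_raw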