tidymut.core package

Submodules

Module contents

Core functionality for sequence manipulation

class tidymut.core.AminoAcidMutationSet(mutations: Sequence[AminoAcidMutation], name: str | None = None, metadata: Dict[str, Any] | None = None)[source]

Bases: MutationSet[AminoAcidMutation]

Represents a set of amino acid mutations

count_by_effect_type() Dict[str, int][source]

Count mutations by effect type

get_missense_mutations() List[AminoAcidMutation][source]

Get all missense mutations

get_nonsense_mutations() List[AminoAcidMutation][source]

Get all nonsense mutations

get_synonymous_mutations() List[AminoAcidMutation][source]

Get all synonymous mutations

has_stop_codon_mutations() bool[source]

Check if any mutations introduce stop codons

class tidymut.core.CodonMutationSet(mutations: Sequence[CodonMutation], name: str | None = None, metadata: Dict[str, Any] | None = None)[source]

Bases: MutationSet[CodonMutation]

Represents a set of codon mutations

property seq_type: Literal['DNA', 'RNA', 'Both']

Get the sequence type (DNA, RNA, or Both) of the codon mutations

to_amino_acid_mutation_set(codon_table: CodonTable | None = None) AminoAcidMutationSet[source]

Convert all codon mutations to amino acid mutations

class tidymut.core.CodonTable(name: str, codon_map: Dict[str, str], start_codons: Collection[str] | None = None, stop_codons: Collection[str] | None = None)[source]

Bases: object

Codon table used to translate codons to amino acids

classmethod get_standard_table(seq_type: Literal['DNA', 'RNA'] = 'DNA') CodonTable[source]

Get the standard codon table (NCBI standard)

classmethod get_table_by_name(name: str, seq_type: Literal['DNA', 'RNA'] = 'DNA') CodonTable[source]

Get a codon table by name

is_start_codon(codon: str) bool[source]

Check whether a codon is a start codon

is_stop_codon(codon: str) bool[source]

Check whether a codon is a stop codon

translate_codon(codon: str) str[source]

Translate a single codon to its corresponding amino acid

class tidymut.core.DNAAlphabet(include_ambiguous: bool = False)[source]

Bases: BaseAlphabet

DNA alphabet (A, T, C, G)

class tidymut.core.DNASequence(sequence: str, alphabet: DNAAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]

Bases: BaseSequence

DNA sequence with nucleotide validation

reverse_complement() DNASequence[source]

Get reverse complement of DNA sequence

transcribe() RNASequence[source]

Transcribe DNA sequence into RNA sequence

translate(codon_table: CodonTable | None = None, start_at_first_met: bool = False, stop_at_stop_codon: bool = False, require_mod3: bool = True, start: int | None = None, end: int | None = None) ProteinSequence[source]

Translate DNA sequence into amino acid sequence using the given codon table.

Parameters:
  • codon_table (Optional[CodonTable], default=None) – Codon table to use for translation. If None, uses standard genetic code.

  • start_at_first_met (bool, default=False) – Start translation at the first start codon if found.

  • stop_at_stop_codon (bool, default=False) – Stop translation when a stop codon is encountered.

  • require_mod3 (bool, default=True) – Whether the sequence must be a multiple of 3 in length.

  • start (Optional[int], default=None) – Custom 0-based start position. Overrides start_at_first_met.

  • end (Optional[int], default=None) – Custom 0-based end position. Overrides stop_at_stop_codon.

Returns:

Translated amino acid sequence.

Return type:

ProteinSequence

class tidymut.core.MutationDataset(name: str | None = None)[source]

Bases: object

Dataset container for cleaned mutation data with multiple reference sequences.

All mutation sets must be linked to a reference sequence when added to the dataset. This ensures data integrity and enables proper validation and analysis.

add_mutation_set(mutation_set: MutationSet, reference_id: str, label: float | None = None)[source]

Add a mutation set to the dataset, linking to a reference sequence

add_mutation_sets(mutation_sets: Sequence[MutationSet], reference_ids: Sequence[str], labels: Sequence[float] | None = None)[source]

Add multiple mutation sets to the dataset

add_reference_sequence(sequence_id: str, sequence: BaseSequence)[source]

Add a reference sequence with a unique identifier

convert_codon_to_amino_acid_sets(convert_labels: bool = False) MutationDataset[source]

Convert all codon mutation sets to amino acid mutation sets

Parameters:

convert_labels – Whether to save the labels with the mutation sets (default: False)

filter_by_effect_type(effect_type: str) MutationDataset[source]

Filter dataset by amino acid mutation effect type (synonymous, missense, nonsense)

filter_by_mutation_type(mutation_type: Type[BaseMutation]) MutationDataset[source]

Filter dataset by mutation type

filter_by_reference(reference_id: str) MutationDataset[source]

Filter dataset to only include mutation sets from a specific reference sequence

classmethod from_dataframe(df: pd.DataFrame, reference_sequences: Dict[str, BaseSequence], name: str | None = None, specific_mutation_type: Type[BaseMutation] | None = None) MutationDataset[source]

Create a MutationDataset from a DataFrame containing mutation data.

This method reconstructs a MutationDataset from a flattened DataFrame representation, typically used for loading saved mutation datasets from files. The DataFrame should contain mutation information with each row representing a single mutation within mutation sets.

Parameters:
  • df (pd.DataFrame) –

    DataFrame containing mutation data with the following required columns:

    - ‘mutation_set_id’: Identifier for grouping mutations into sets
    - ‘reference_id’: Identifier for the reference sequence
    - ‘mutation_string’: String representation of the mutation
    - ‘position’: Position of the mutation in the sequence
    - ‘mutation_type’: Type of mutation (‘amino_acid’, ‘codon_dna’, ‘codon_rna’)

    Optional columns include:

    - ‘mutation_set_name’: Name of the mutation set
    - ‘label’: Label associated with the mutation set
    - ‘wild_amino_acid’: Wild-type amino acid (for amino acid mutations)
    - ‘mutant_amino_acid’: Mutant amino acid (for amino acid mutations)
    - ‘wild_codon’: Wild-type codon (for codon mutations)
    - ‘mutant_codon’: Mutant codon (for codon mutations)
    - ‘set_*’: Columns with ‘set_’ prefix for mutation set metadata
    - ‘mutation_*’: Columns with ‘mutation_’ prefix for individual mutation metadata

  • reference_sequences (Dict[str, BaseSequence]) – Dictionary mapping reference sequence IDs to their corresponding BaseSequence objects. Must contain all reference sequences referenced in the DataFrame.

  • name (Optional[str], default=None) – Optional name for the created MutationDataset.

  • specific_mutation_type (Optional[Type[BaseMutation]], default=None) – The type of mutations to create. If None, the mutation type is inferred from the data; it must be provided when the mutation type is neither ‘amino_acid’ nor any ‘codon_*’ type.

Returns:

A new MutationDataset instance populated with the mutation sets and reference sequences from the DataFrame.

Return type:

MutationDataset

Raises:

ValueError – If the DataFrame is empty, missing required columns, or references sequences not provided in reference_sequences dict.

Notes

  • Mutations are grouped by ‘mutation_set_id’ to reconstruct mutation sets

  • The method automatically determines the appropriate mutation set type (AminoAcidMutationSet, CodonMutationSet, or generic MutationSet) based on the mutation types within each set

  • Metadata is extracted from columns with ‘set_’ and ‘mutation_’ prefixes

  • Only reference sequences that are actually used in the DataFrame are added to the dataset

Examples

>>> import pandas as pd
>>> from tidymut.core import MutationDataset, ProteinSequence
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame({
...     'mutation_set_id': ['set1', 'set1', 'set2'],
...     'reference_id': ['prot1', 'prot1', 'prot2'],
...     'mutation_string': ['A1V', 'L2P', 'G5R'],
...     'position': [1, 2, 5],
...     'mutation_type': ['amino_acid', 'amino_acid', 'amino_acid'],
...     'wild_amino_acid': ['A', 'L', 'G'],
...     'mutant_amino_acid': ['V', 'P', 'R'],
...     'mutation_set_name': ['variant1', 'variant1', 'variant2'],
...     'label': ['pathogenic', 'pathogenic', 'benign']
... })
>>>
>>> # Define reference sequences
>>> ref_seqs = {
...     'prot1': ProteinSequence('ALDEFG', name='protein1'),
...     'prot2': ProteinSequence('MKGLRK', name='protein2')
... }
>>>
>>> # Create MutationDataset
>>> dataset = MutationDataset.from_dataframe(df, ref_seqs, name="my_dataset")
>>> print(len(dataset.mutation_sets))
2
get_mutation_set_label(mutation_set_index: int) Any[source]

Get the label for a specific mutation set

get_mutation_set_reference(mutation_set_index: int) str[source]

Get the reference sequence ID for a specific mutation set

get_position_coverage(reference_id: str | None = None) Dict[str, Any][source]

Get statistics about position coverage across reference sequences

get_reference_sequence(sequence_id: str) BaseSequence[source]

Get a reference sequence by ID

get_statistics() Dict[str, Any][source]

Get basic statistics about the dataset

list_reference_sequences() List[str][source]

Get list of all reference sequence IDs

classmethod load(filepath: str, load_type: str | None = None) MutationDataset[source]

Load a dataset from files.

Parameters:
  • filepath (str) – Base filepath (with or without extension)

  • load_type (Optional[str], default=None) – Type of load format (“tidymut”, “dataframe” or “pickle”). If None, auto-detect from file extension.

Return type:

MutationDataset instance

Example

>>> # Auto-detect from extension
>>> dataset = MutationDataset.load("my_study.csv")
>>> dataset = MutationDataset.load("my_study.pkl")
>>> # Explicit type
>>> dataset = MutationDataset.load("my_study", "dataframe")
classmethod load_by_reference(base_dir: str | Path, dataset_name: str | None = None, is_zero_based: bool = True) MutationDataset[source]

Load a dataset from tidymut reference-based format.

Parameters:
  • base_dir (Union[str, Path]) – Base directory containing reference folders

  • dataset_name (Optional[str], default=None) – Optional name for the loaded dataset

  • is_zero_based (bool, default=True) – Whether origin mutation positions are zero-based

Returns:

MutationDataset instance

Expected directory structure:

base_dir/
├── reference_id_1/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
├── reference_id_2/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
└── ...

remove_mutation_set(mutation_set_index: int)[source]

Remove a mutation set from the dataset

remove_reference_sequence(sequence_id: str)[source]

Remove a reference sequence

save(filepath: str, save_type: Literal['tidymut', 'pickle', 'dataframe'] | None = 'tidymut')[source]

Save the dataset to files.

Parameters:
  • filepath – Base filepath (without extension)

  • save_type – Type of save format (“tidymut”, “dataframe” or “pickle”)

For save_type=”dataframe”:
  • Saves mutations as {filepath}.csv

  • Saves reference sequences as {filepath}_refs.pkl

  • Saves metadata as {filepath}_meta.json

For save_type=”pickle”:
  • Saves entire dataset as {filepath}.pkl

Example

dataset.save("my_study", "dataframe")  # Creates: my_study.csv, my_study_refs.pkl, my_study_meta.json

save_by_reference(base_dir: str | Path) None[source]

Save dataset by reference_id, creating separate folders for each reference.

Parameters:

base_dir – Base directory to create reference folders in

For each reference_id, creates:
  • {base_dir}/{reference_id}/data.csv: mutation data with columns [mutation_name, mutated_sequence, label]

  • {base_dir}/{reference_id}/wt.fasta: wild-type reference sequence

  • {base_dir}/{reference_id}/metadata.json: statistics and metadata for this reference

set_mutation_set_label(mutation_set_index: int, label: float)[source]

Set the label for a specific mutation set

set_mutation_set_reference(mutation_set_index: int, reference_id: str)[source]

Set the reference sequence for a specific mutation set

to_dataframe() DataFrame[source]

Convert dataset to pandas DataFrame

validate_against_references() Dict[str, Any][source]

Validate mutations against their reference sequences

class tidymut.core.Pipeline(data: Any = None, name: str | None = None, logging_level: str = 'INFO')[source]

Bases: object

Pipeline for processing data with pandas-style method chaining

add_delayed_step(func: Callable, index: int | None = None, *args, **kwargs) Pipeline[source]

Add a delayed step before a specific position in the delayed execution queue.

Analogous to the list.insert() method.

Parameters:
  • func (Callable) – Function to add as delayed step

  • index (Optional[int]) – Position to insert the step. If None, appends to the end. Supports negative indexing.

  • *args – Arguments to pass to the function

  • **kwargs – Arguments to pass to the function

Returns:

Self for method chaining

Return type:

Pipeline

Examples

>>> # Add step at the beginning
>>> pipeline.add_delayed_step(func1, 0)
>>> # Add step at the end (same as delayed_then)
>>> pipeline.add_delayed_step(func2)
>>> # Insert step at position 2
>>> pipeline.add_delayed_step(func3, 2)
>>> # Insert step before the last one
>>> pipeline.add_delayed_step(func4, -1)
apply(func: Callable, *args, **kwargs) Pipeline[source]

Apply function and return new Pipeline (functional style)

property artifacts: Dict[str, Any]

Always return the artifacts dictionary.

This provides direct access to all stored artifacts from pipeline steps.

assign(**kwargs) Pipeline[source]

Add attributes or computed values to data

copy() Pipeline[source]

Create a deep copy of this pipeline

property data: Any

Always return the actual data, never PipelineOutput.

This ensures consistent user experience - pipeline.data can always be used with methods like .copy(), .append(), etc.

delayed_then(func: Callable, *args, **kwargs) Pipeline[source]

Add a function to the delayed execution queue without running it immediately

execute(steps: int | List[int] | None = None) Pipeline[source]

Execute delayed steps.

Parameters:

steps (Optional[Union[int, List[int]]]) – Which delayed steps to execute:

  • None: execute all delayed steps

  • int: execute the first N delayed steps

  • List[int]: execute specific delayed steps by index

Returns:

Self for method chaining

Return type:

Pipeline

filter(condition: Callable) Pipeline[source]

Filter data based on condition

get_all_artifacts() Dict[str, Any][source]

Get all stored artifacts

get_artifact(name: str) Any[source]

Get a specific artifact by name

get_data() Any[source]

Get current data (same as .data property).

Kept for backward compatibility.

get_delayed_steps_info() List[Dict[str, Any]][source]

Get information about delayed steps

get_execution_summary() Dict[str, Any][source]

Get summary of pipeline execution

get_step_result(step_index: int | str) Any[source]

Get result from a specific step by index or name

property has_pending_steps: bool

Check if there are delayed steps waiting to be executed

classmethod load(filepath: str, format: str = 'pickle', name: str | None = None) Pipeline[source]

Load data from file and create new pipeline

classmethod load_structured_data(filepath: str, format: str = 'pickle', name: str | None = None) Pipeline[source]

Load structured data from file and create new pipeline

peek(func: Callable | None = None, prefix: str = '') Pipeline[source]

Inspect data without modifying it (for debugging)

remove_delayed_step(index_or_name: int | str) Pipeline[source]

Remove a delayed step by index or name.

Parameters:

index_or_name (Union[int, str]) – Index or name of the delayed step to remove

Returns:

Self for method chaining

Return type:

Pipeline

Raises:

ValueError – If no delayed step is found with the specified index or name

save(filepath: str, format: str = 'pickle') Pipeline[source]

Save current data to file

save_artifacts(filepath: str, format: str = 'pickle') Pipeline[source]

Save all artifacts to file

save_structured_data(filepath: str, format: str = 'pickle') Pipeline[source]

Save structured data (data + artifacts) to file

store(name: str, extractor: Callable | None = None) Pipeline[source]

Store current data or extracted value as artifact

property structured_data: PipelineOutput

Return PipelineOutput object with both data and artifacts.

Use this when you need the complete pipeline state for serialization, passing to other systems, or when working with structured data flows.

then(func: Callable, *args, **kwargs) Pipeline[source]

Apply a function to the current data (pandas.pipe style)

transform(transformer: Callable, *args, **kwargs) Pipeline[source]

Alias of then, used to define format transformations.

validate(validator: Callable, error_msg: str = 'Validation failed') Pipeline[source]

Validate data and raise error if invalid

visualize_pipeline() str[source]

Generate a text visualization of the pipeline

class tidymut.core.ProteinAlphabet(include_stop: bool = True, include_ambiguous: bool = False)[source]

Bases: BaseAlphabet

Protein alphabet (20 standard amino acids + stop symbol)

get_one_letter_code(three_letter: str, strict: bool = True) str[source]

Convert three-letter to one-letter amino acid code

get_three_letter_code(one_letter: str, strict: bool = True) str[source]

Convert one-letter to three-letter amino acid code

class tidymut.core.ProteinSequence(sequence: str, alphabet: ProteinAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]

Bases: BaseSequence

Protein sequence with amino acid validation

find_motif(motif: str) List[int][source]

Find all positions where motif occurs (0-indexed)

get_residue(position: int) str[source]

Get amino acid at specific position (0-indexed)

class tidymut.core.RNAAlphabet(include_ambiguous: bool = False)[source]

Bases: BaseAlphabet

RNA alphabet (A, U, C, G)

class tidymut.core.RNASequence(sequence: str, alphabet: RNAAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]

Bases: BaseSequence

RNA sequence with nucleotide validation

back_transcribe() DNASequence[source]

Back-transcribe RNA sequence into DNA sequence

reverse_complement() RNASequence[source]

Get reverse complement of RNA sequence

translate(codon_table: CodonTable | None = None, start_at_first_met: bool = False, stop_at_stop_codon: bool = False, require_mod3: bool = True, start: int | None = None, end: int | None = None) ProteinSequence[source]

Translate RNA sequence into amino acid sequence using the given codon table.

Parameters:
  • codon_table (Optional[CodonTable], default=None) – Codon table to use for translation. If None, uses standard genetic code.

  • start_at_first_met (bool, default=False) – Start translation at the first start codon if found.

  • stop_at_stop_codon (bool, default=False) – Stop translation when a stop codon is encountered.

  • require_mod3 (bool, default=True) – Whether the sequence must be a multiple of 3 in length.

  • start (Optional[int], default=None) – Custom 0-based start position. Overrides start_at_first_met.

  • end (Optional[int], default=None) – Custom 0-based end position. Overrides stop_at_stop_codon.

Returns:

Translated amino acid sequence.

Return type:

ProteinSequence

tidymut.core.create_pipeline(data: Any, name: str | None = None, **kwargs) Pipeline[source]

Create a new pipeline with initial data

tidymut.core.multiout_step(**outputs: str)[source]

Decorator for multi-output pipeline functions.

Use this for functions that return multiple values where you want to name and access the outputs separately.

Parameters:

**outputs (str) – Named outputs. Use ‘main’ to specify which output is the main data flow. If ‘main’ is not specified, the first return value is treated as main.

Examples

>>> # Returns 3 values: main, stats, plot
>>> @multiout_step(stats="statistics", plot="visualization")
... def analyze_data(data):
...     ...
...     return processed_data, stats_dict, plot_object
>>> # Returns 3 values with explicit main designation
>>> @multiout_step(main="result", error="error_info", stats="statistics")
... def process_with_metadata(data):
...     ...
...     return result, error_info, stats

Note

With this decorator, side outputs are returned as a dictionary.

tidymut.core.pipeline_step(name: str | Callable[..., Any] | None = None)[source]

Decorator for single-output pipeline functions.

Use this for functions that return a single value (including tuples as single values). For multiple outputs, use @multiout_step instead.

Parameters:

name (Optional[str] or Callable) – Custom name for the step. If None, uses function name. When used as @pipeline_step (without parentheses), this will be the function.

Examples

>>> @pipeline_step
... def process(data):
...     return processed_data  # Single output
>>> @pipeline_step("process_data")
... def process(data):
...     return processed_data  # Single output
>>> @pipeline_step()
... def get_coordinates():
...     return (10, 20)  # Single tuple output