tidymut.core package
Submodules
- tidymut.core.alphabet module
- tidymut.core.codon module
- tidymut.core.constants module
- tidymut.core.dataset module
MutationDataset
MutationDataset.add_mutation_set()
MutationDataset.add_mutation_sets()
MutationDataset.add_reference_sequence()
MutationDataset.convert_codon_to_amino_acid_sets()
MutationDataset.filter_by_effect_type()
MutationDataset.filter_by_mutation_type()
MutationDataset.filter_by_reference()
MutationDataset.from_dataframe()
MutationDataset.get_mutation_set_label()
MutationDataset.get_mutation_set_reference()
MutationDataset.get_position_coverage()
MutationDataset.get_reference_sequence()
MutationDataset.get_statistics()
MutationDataset.list_reference_sequences()
MutationDataset.load()
MutationDataset.load_by_reference()
MutationDataset.remove_mutation_set()
MutationDataset.remove_reference_sequence()
MutationDataset.save()
MutationDataset.save_by_reference()
MutationDataset.set_mutation_set_label()
MutationDataset.set_mutation_set_reference()
MutationDataset.to_dataframe()
MutationDataset.validate_against_references()
- tidymut.core.mutation module
AminoAcidMutation
AminoAcidMutationSet
BaseMutation
CodonMutation
CodonMutationSet
MutationSet
MutationSet.add_mutation()
MutationSet.filter_by_category()
MutationSet.from_string()
MutationSet.get_mutation_at()
MutationSet.get_mutation_categories()
MutationSet.get_mutation_count()
MutationSet.get_positions()
MutationSet.get_positions_set()
MutationSet.get_sorted_by_position()
MutationSet.has_mutation_at()
MutationSet.is_multiple_mutations()
MutationSet.is_single_mutation()
MutationSet.mutation_subtype
MutationSet.remove_mutation()
MutationSet.sort_by_position()
MutationSet.validate_all()
- tidymut.core.pipeline module
Pipeline
Pipeline.add_delayed_step()
Pipeline.apply()
Pipeline.artifacts
Pipeline.assign()
Pipeline.copy()
Pipeline.data
Pipeline.delayed_then()
Pipeline.execute()
Pipeline.filter()
Pipeline.get_all_artifacts()
Pipeline.get_artifact()
Pipeline.get_data()
Pipeline.get_delayed_steps_info()
Pipeline.get_execution_summary()
Pipeline.get_step_result()
Pipeline.has_pending_steps
Pipeline.load()
Pipeline.load_structured_data()
Pipeline.peek()
Pipeline.remove_delayed_step()
Pipeline.save()
Pipeline.save_artifacts()
Pipeline.save_structured_data()
Pipeline.store()
Pipeline.structured_data
Pipeline.then()
Pipeline.transform()
Pipeline.validate()
Pipeline.visualize_pipeline()
create_pipeline()
multiout_step()
pipeline_step()
- tidymut.core.sequence module
- tidymut.core.types module
Module contents
Core functionality for sequence manipulation
- class tidymut.core.AminoAcidMutationSet(mutations: Sequence[AminoAcidMutation], name: str | None = None, metadata: Dict[str, Any] | None = None)[source]
Bases:
MutationSet[AminoAcidMutation]
Represents a set of amino acid mutations
- get_missense_mutations() List[AminoAcidMutation] [source]
Get all missense mutations
- get_nonsense_mutations() List[AminoAcidMutation] [source]
Get all nonsense mutations
- get_synonymous_mutations() List[AminoAcidMutation] [source]
Get all synonymous mutations
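A brief usage sketch. The AminoAcidMutation constructor arguments shown here (wild-type residue, position, mutant residue) are an assumption; this listing only documents the set-level API.
>>> from tidymut.core import AminoAcidMutationSet
>>> from tidymut.core.mutation import AminoAcidMutation
>>> # Hypothetical constructor arguments; not confirmed by this listing
>>> mutations = [AminoAcidMutation("A", 1, "V"), AminoAcidMutation("L", 2, "L")]
>>> aa_set = AminoAcidMutationSet(mutations, name="variant1")
>>> missense = aa_set.get_missense_mutations()      # e.g. A1V
>>> synonymous = aa_set.get_synonymous_mutations()  # e.g. L2L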
- class tidymut.core.CodonMutationSet(mutations: Sequence[CodonMutation], name: str | None = None, metadata: Dict[str, Any] | None = None)[source]
Bases:
MutationSet[CodonMutation]
Represents a set of codon mutations
- property seq_type: Literal['DNA', 'RNA', 'Both']
Get the sequence type (DNA, RNA, or Both) of the codon mutations
- to_amino_acid_mutation_set(codon_table: CodonTable | None = None) AminoAcidMutationSet [source]
Convert all codon mutations to amino acid mutations
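A short conversion sketch. The CodonMutation constructor arguments are likewise an assumption, while seq_type and to_amino_acid_mutation_set() follow the signatures above.
>>> from tidymut.core import CodonMutationSet, CodonTable
>>> from tidymut.core.mutation import CodonMutation
>>> # Hypothetical constructor arguments; not confirmed by this listing
>>> codon_set = CodonMutationSet([CodonMutation("GCC", 1, "GTC")], name="variant1")
>>> print(codon_set.seq_type)  # 'DNA', 'RNA', or 'Both'
>>> aa_set = codon_set.to_amino_acid_mutation_set(CodonTable.get_standard_table("DNA"))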
- class tidymut.core.CodonTable(name: str, codon_map: Dict[str, str], start_codons: Collection[str] | None = None, stop_codons: Collection[str] | None = None)[source]
Bases:
object
Codon table used to translate codons to amino acids
- classmethod get_standard_table(seq_type: Literal['DNA', 'RNA'] = 'DNA') CodonTable [source]
Get standard codon table (NCBI standard)
- classmethod get_table_by_name(name: str, seq_type: Literal['DNA', 'RNA'] = 'DNA') CodonTable [source]
Get codon table by name
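A minimal usage sketch; the name passed to get_table_by_name() is illustrative, since the available table names are not listed here.
>>> from tidymut.core import CodonTable
>>> standard_dna = CodonTable.get_standard_table(seq_type="DNA")
>>> standard_rna = CodonTable.get_standard_table(seq_type="RNA")
>>> # "standard" is a hypothetical table name used only for illustration
>>> table = CodonTable.get_table_by_name("standard", seq_type="DNA")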
- class tidymut.core.DNAAlphabet(include_ambiguous: bool = False)[source]
Bases:
BaseAlphabet
DNA alphabet (A, T, C, G)
- class tidymut.core.DNASequence(sequence: str, alphabet: DNAAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]
Bases:
BaseSequence
DNA sequence with nucleotide validation
- reverse_complement() DNASequence [source]
Get reverse complement of DNA sequence
- transcribe() RNASequence [source]
Transcribe DNA sequence into RNA sequence
- translate(codon_table: CodonTable | None = None, start_at_first_met: bool = False, stop_at_stop_codon: bool = False, require_mod3: bool = True, start: int | None = None, end: int | None = None) ProteinSequence [source]
Translate DNA sequence into amino acid sequence using this codon table.
- Parameters:
codon_table (Optional[CodonTable], default=None) – Codon table to use for translation. If None, uses standard genetic code.
start_at_first_met (bool, default=False) – Start translation at the first start codon if found.
stop_at_stop_codon (bool, default=False) – Stop translation when a stop codon is encountered.
require_mod3 (bool, default=True) – Whether the sequence must be a multiple of 3 in length.
start (Optional[int], default=None) – Custom 0-based start position. Overrides start_at_first_met.
end (Optional[int], default=None) – Custom 0-based end position. Overrides stop_at_stop_codon.
- Returns:
Translated amino acid sequence.
- Return type:
ProteinSequence
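A usage sketch based on the signatures above; the nucleotide string is arbitrary example data.
>>> from tidymut.core import DNASequence
>>> dna = DNASequence("ATGGCCGAAACCTAA", name="example_orf")
>>> rna = dna.transcribe()            # DNA -> RNA
>>> rev = dna.reverse_complement()    # reverse complement, still a DNASequence
>>> protein = dna.translate(start_at_first_met=True, stop_at_stop_codon=True)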
- class tidymut.core.MutationDataset(name: str | None = None)[source]
Bases:
object
Dataset container for cleaned mutation data with multiple reference sequences.
All mutation sets must be linked to a reference sequence when added to the dataset. This ensures data integrity and enables proper validation and analysis.
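A minimal sketch of assembling a dataset, assuming a mutation set (e.g. an AminoAcidMutationSet as sketched above) has already been built; the MutationDataset and ProteinSequence calls follow the signatures listed here.
>>> from tidymut.core import MutationDataset, ProteinSequence
>>> dataset = MutationDataset(name="demo_dataset")
>>> # Register the wild-type sequence under a unique identifier
>>> dataset.add_reference_sequence("prot1", ProteinSequence("MALDEFGHIK", name="protein1"))
>>> # aa_set is assumed to be an AminoAcidMutationSet built elsewhere;
>>> # every mutation set must be linked to a registered reference sequence
>>> dataset.add_mutation_set(aa_set, reference_id="prot1", label=0.8)
>>> stats = dataset.get_statistics()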
- add_mutation_set(mutation_set: MutationSet, reference_id: str, label: float | None = None)[source]
Add a mutation set to the dataset, linking to a reference sequence
- add_mutation_sets(mutation_sets: Sequence[MutationSet], reference_ids: Sequence[str], labels: Sequence[float] | None = None)[source]
Add multiple mutation sets to the dataset
- add_reference_sequence(sequence_id: str, sequence: BaseSequence)[source]
Add a reference sequence with a unique identifier
- convert_codon_to_amino_acid_sets(convert_labels: bool = False) MutationDataset [source]
Convert all codon mutation sets to amino acid mutation sets
- Parameters:
convert_labels (bool, default=False) – Whether to carry the labels over to the converted mutation sets
- filter_by_effect_type(effect_type: str) MutationDataset [source]
Filter dataset by amino acid mutation effect type (synonymous, missense, nonsense)
- filter_by_mutation_type(mutation_type: Type[BaseMutation]) MutationDataset [source]
Filter dataset by mutation type
- filter_by_reference(reference_id: str) MutationDataset [source]
Filter dataset to only include mutation sets from a specific reference sequence
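A sketch of the filter methods, assuming `dataset` was built as above; each call returns a new MutationDataset.
>>> from tidymut.core.mutation import AminoAcidMutation
>>> aa_only = dataset.filter_by_mutation_type(AminoAcidMutation)
>>> missense_only = dataset.filter_by_effect_type("missense")
>>> prot1_only = dataset.filter_by_reference("prot1")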
- classmethod from_dataframe(df: pd.DataFrame, reference_sequences: Dict[str, BaseSequence], name: str | None = None, specific_mutation_type: Type[BaseMutation] | None = None) MutationDataset [source]
Create a MutationDataset from a DataFrame containing mutation data.
This method reconstructs a MutationDataset from a flattened DataFrame representation, typically used for loading saved mutation datasets from files. The DataFrame should contain mutation information with each row representing a single mutation within mutation sets.
- Parameters:
df (pd.DataFrame) –
DataFrame containing mutation data with the following required columns:
- ‘mutation_set_id’: Identifier for grouping mutations into sets
- ‘reference_id’: Identifier for the reference sequence
- ‘mutation_string’: String representation of the mutation
- ‘position’: Position of the mutation in the sequence
- ‘mutation_type’: Type of mutation (‘amino_acid’, ‘codon_dna’, ‘codon_rna’)
Optional columns include:
- ‘mutation_set_name’: Name of the mutation set
- ‘label’: Label associated with the mutation set
- ‘wild_amino_acid’: Wild-type amino acid (for amino acid mutations)
- ‘mutant_amino_acid’: Mutant amino acid (for amino acid mutations)
- ‘wild_codon’: Wild-type codon (for codon mutations)
- ‘mutant_codon’: Mutant codon (for codon mutations)
- ‘set_*’: Columns with ‘set_’ prefix for mutation set metadata
- ‘mutation_*’: Columns with ‘mutation_’ prefix for individual mutation metadata
reference_sequences (Dict[str, BaseSequence]) – Dictionary mapping reference sequence IDs to their corresponding BaseSequence objects. Must contain all reference sequences referenced in the DataFrame.
name (Optional[str], default=None) – Optional name for the created MutationDataset.
specific_mutation_type (Optional[Type[BaseMutation]], default=None) – The type of mutations to create. If None, the type is inferred from the first mutation; must be provided when the mutation type is neither ‘amino_acid’ nor any ‘codon_*’ type.
- Returns:
A new MutationDataset instance populated with the mutation sets and reference sequences from the DataFrame.
- Return type:
MutationDataset
- Raises:
ValueError – If the DataFrame is empty, missing required columns, or references sequences not provided in reference_sequences dict.
Notes
- Mutations are grouped by ‘mutation_set_id’ to reconstruct mutation sets
- The method automatically determines the appropriate mutation set type (AminoAcidMutationSet, CodonMutationSet, or generic MutationSet) based on the mutation types within each set
- Metadata is extracted from columns with ‘set_’ and ‘mutation_’ prefixes
- Only reference sequences that are actually used in the DataFrame are added to the dataset
Examples
>>> import pandas as pd
>>> from tidymut.core import ProteinSequence
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame({
...     'mutation_set_id': ['set1', 'set1', 'set2'],
...     'reference_id': ['prot1', 'prot1', 'prot2'],
...     'mutation_string': ['A1V', 'L2P', 'G5R'],
...     'position': [1, 2, 5],
...     'mutation_type': ['amino_acid', 'amino_acid', 'amino_acid'],
...     'wild_amino_acid': ['A', 'L', 'G'],
...     'mutant_amino_acid': ['V', 'P', 'R'],
...     'mutation_set_name': ['variant1', 'variant1', 'variant2'],
...     'label': ['pathogenic', 'pathogenic', 'benign']
... })
>>>
>>> # Define reference sequences
>>> ref_seqs = {
...     'prot1': ProteinSequence('ALDEFG', name='protein1'),
...     'prot2': ProteinSequence('MKGLRK', name='protein2')
... }
>>>
>>> # Create MutationDataset
>>> dataset = MutationDataset.from_dataframe(df, ref_seqs, name="my_dataset")
>>> print(len(dataset.mutation_sets))
2
- get_mutation_set_label(mutation_set_index: int) Any [source]
Get the label for a specific mutation set
- get_mutation_set_reference(mutation_set_index: int) str [source]
Get the reference sequence ID for a specific mutation set
- get_position_coverage(reference_id: str | None = None) Dict[str, Any] [source]
Get statistics about position coverage across reference sequences
- get_reference_sequence(sequence_id: str) BaseSequence [source]
Get a reference sequence by ID
- classmethod load(filepath: str, load_type: str | None = None) MutationDataset [source]
Load a dataset from files.
- Parameters:
filepath (str) – Base filepath (with or without extension)
load_type (Optional[str], default=None) – Type of load format (“tidymut”, “dataframe” or “pickle”). If None, auto-detect from file extension.
- Return type:
MutationDataset instance
Example
>>> # Auto-detect from extension
>>> dataset = MutationDataset.load("my_study.csv")
>>> dataset = MutationDataset.load("my_study.pkl")
>>> # Explicit type
>>> dataset = MutationDataset.load("my_study", "dataframe")
- classmethod load_by_reference(base_dir: str | Path, dataset_name: str | None = None, is_zero_based: bool = True) MutationDataset [source]
Load a dataset from tidymut reference-based format.
- Parameters:
base_dir (Union[str, Path]) – Base directory containing reference folders
dataset_name (Optional[str], default=None) – Optional name for the loaded dataset
is_zero_based (bool, default=True) – Whether the original mutation positions are zero-based
- Returns:
MutationDataset instance
Expected directory structure:
base_dir/
├── reference_id_1/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
├── reference_id_2/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
└── …
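A usage example following the signature above; the directory path is illustrative.
>>> dataset = MutationDataset.load_by_reference("datasets/my_study", dataset_name="my_study")
>>> # If positions on disk are 1-based:
>>> dataset = MutationDataset.load_by_reference("datasets/my_study", is_zero_based=False)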
- save(filepath: str, save_type: Literal['tidymut', 'pickle', 'dataframe'] | None = 'tidymut')[source]
Save the dataset to files.
- Parameters:
filepath – Base filepath (without extension)
save_type – Type of save format (“tidymut”, “dataframe” or “pickle”)
- For save_type=”dataframe”:
Saves mutations as {filepath}.csv
Saves reference sequences as {filepath}_refs.pkl
Saves metadata as {filepath}_meta.json
- For save_type=”pickle”:
Saves entire dataset as {filepath}.pkl
Example
dataset.save("my_study", "dataframe")  # Creates: my_study.csv, my_study_refs.pkl, my_study_meta.json
- save_by_reference(base_dir: str | Path) None [source]
Save dataset by reference_id, creating separate folders for each reference.
- Parameters:
base_dir – Base directory to create reference folders in
- For each reference_id, creates:
{base_dir}/{reference_id}/data.csv: mutation data with columns [mutation_name, mutated_sequence, label]
{base_dir}/{reference_id}/wt.fasta: wild-type reference sequence
{base_dir}/{reference_id}/metadata.json: statistics and metadata for this reference
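Example (the output directory is illustrative):
>>> dataset.save_by_reference("output/by_reference")
>>> # Creates output/by_reference/<reference_id>/{data.csv, wt.fasta, metadata.json} for each reference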
- set_mutation_set_label(mutation_set_index: int, label: float)[source]
Set the label for a specific mutation set
- class tidymut.core.Pipeline(data: Any = None, name: str | None = None, logging_level: str = 'INFO')[source]
Bases:
object
Pipeline for processing data with pandas-style method chaining
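A minimal chaining sketch, assuming then() passes the current data as the first positional argument (pandas.pipe style); the step functions double and drop_small are hypothetical examples.
>>> from tidymut.core import create_pipeline
>>> def double(data):
...     return [x * 2 for x in data]
>>> def drop_small(data, threshold):
...     return [x for x in data if x >= threshold]
>>> pipeline = create_pipeline([1, 2, 3], name="demo").then(double).then(drop_small, 4)
>>> pipeline.data
[4, 6]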
- add_delayed_step(func: Callable, index: int | None = None, *args, **kwargs) Pipeline [source]
Add a delayed step before a specific position in the delayed execution queue.
Performs a similar action to the list.insert() method.
- Parameters:
func (Callable) – Function to add as delayed step
index (Optional[int]) – Position to insert the step. If None, appends to the end. Supports negative indexing.
*args – Arguments to pass to the function
**kwargs – Arguments to pass to the function
- Returns:
Self for method chaining
- Return type:
Pipeline
Examples
>>> # Add step at the beginning
>>> pipeline.add_delayed_step(func1, 0)
>>> # Add step at the end (same as delayed_then)
>>> pipeline.add_delayed_step(func2)
>>> # Insert step at position 2
>>> pipeline.add_delayed_step(func3, 2)
>>> # Insert step before the last one
>>> pipeline.add_delayed_step(func4, -1)
- apply(func: Callable, *args, **kwargs) Pipeline [source]
Apply function and return new Pipeline (functional style)
- property artifacts: Dict[str, Any]
Always return the artifacts dictionary.
This provides direct access to all stored artifacts from pipeline steps.
- property data: Any
Always return the actual data, never PipelineOutput.
This ensures consistent user experience - pipeline.data can always be used with methods like .copy(), .append(), etc.
- delayed_then(func: Callable, *args, **kwargs) Pipeline [source]
Add a function to the delayed execution queue without running it immediately
- execute(steps: int | List[int] | None = None) Pipeline [source]
Execute delayed steps.
- Parameters:
steps (Optional[Union[int, List[int]]]) – Which delayed steps to execute:
- None: execute all delayed steps
- int: execute the first N delayed steps
- List[int]: execute specific delayed steps by index
- Returns:
Self for method chaining
- Return type:
Pipeline
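A sketch of deferred execution, reusing the hypothetical step functions from the Pipeline example above.
>>> pipeline = create_pipeline([1, 2, 3], name="delayed_demo")
>>> pipeline = pipeline.delayed_then(double).delayed_then(drop_small, 4)
>>> pipeline.has_pending_steps
True
>>> pipeline = pipeline.execute()     # run all delayed steps
>>> # pipeline.execute(1) would run only the first delayed step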
- get_data() Any [source]
Get current data (same as .data property).
Kept for backward compatibility.
- get_step_result(step_index: int | str) Any [source]
Get result from a specific step by index or name
- property has_pending_steps: bool
Check if there are delayed steps waiting to be executed
- classmethod load(filepath: str, format: str = 'pickle', name: str | None = None) Pipeline [source]
Load data from file and create new pipeline
- classmethod load_structured_data(filepath: str, format: str = 'pickle', name: str | None = None) Pipeline [source]
Load structured data from file and create new pipeline
- peek(func: Callable | None = None, prefix: str = '') Pipeline [source]
Inspect data without modifying it (for debugging)
- remove_delayed_step(index_or_name: int | str) Pipeline [source]
Remove a delayed step by index or name.
- Parameters:
index_or_name (Union[int, str]) – Index or name of the delayed step to remove
- Returns:
Self for method chaining
- Return type:
Pipeline
- Raises:
ValueError – If no delayed step is found with the specified index or name
- save_structured_data(filepath: str, format: str = 'pickle') Pipeline [source]
Save structured data (data + artifacts) to file
- store(name: str, extractor: Callable | None = None) Pipeline [source]
Store current data or extracted value as artifact
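A sketch of artifact storage; the extractor shown (len) is an arbitrary example.
>>> pipeline = create_pipeline([1, 2, 3], name="artifact_demo")
>>> pipeline = pipeline.store("raw_input")               # store current data as-is
>>> pipeline = pipeline.store("n_items", extractor=len)  # store a derived value
>>> pipeline.artifacts["raw_input"]
[1, 2, 3]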
- property structured_data: PipelineOutput
Return PipelineOutput object with both data and artifacts.
Use this when you need the complete pipeline state for serialization, passing to other systems, or when working with structured data flows.
- then(func: Callable, *args, **kwargs) Pipeline [source]
Apply a function to the current data (pandas.pipe style)
- transform(transformer: Callable, *args, **kwargs) Pipeline [source]
Alias of then, used to define format transformations.
- class tidymut.core.ProteinAlphabet(include_stop: bool = True, include_ambiguous: bool = False)[source]
Bases:
BaseAlphabet
Protein alphabet (20 standard amino acids + stop codon)
- class tidymut.core.ProteinSequence(sequence: str, alphabet: ProteinAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]
Bases:
BaseSequence
Protein sequence with amino acid validation
- class tidymut.core.RNAAlphabet(include_ambiguous: bool = False)[source]
Bases:
BaseAlphabet
RNA alphabet (A, U, C, G)
- class tidymut.core.RNASequence(sequence: str, alphabet: RNAAlphabet | None = None, name: str | None = None, metadata: Dict | None = None)[source]
Bases:
BaseSequence
RNA sequence with nucleotide validation
- back_transcribe() DNASequence [source]
Back-transcribe RNA sequence into DNA sequence
- reverse_complement() RNASequence [source]
Get reverse complement of RNA sequence
- translate(codon_table: CodonTable | None = None, start_at_first_met: bool = False, stop_at_stop_codon: bool = False, require_mod3: bool = True, start: int | None = None, end: int | None = None) ProteinSequence [source]
Translate RNA sequence into amino acid sequence using this codon table.
- Parameters:
codon_table (Optional[CodonTable], default=None) – Codon table to use for translation. If None, uses standard genetic code.
start_at_first_met (bool, default=False) – Start translation at the first start codon if found.
stop_at_stop_codon (bool, default=False) – Stop translation when a stop codon is encountered.
require_mod3 (bool, default=True) – Whether the sequence must be a multiple of 3 in length.
start (Optional[int], default=None) – Custom 0-based start position. Overrides start_at_first_met.
end (Optional[int], default=None) – Custom 0-based end position. Overrides stop_at_stop_codon.
- Returns:
Translated amino acid sequence.
- Return type:
ProteinSequence
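A usage sketch mirroring the DNA example above; the nucleotide string is arbitrary example data.
>>> from tidymut.core import RNASequence
>>> rna = RNASequence("AUGGCCGAAACCUAA", name="example_transcript")
>>> dna = rna.back_transcribe()       # RNA -> DNA
>>> protein = rna.translate(start_at_first_met=True, stop_at_stop_codon=True)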
- tidymut.core.create_pipeline(data: Any, name: str | None = None, **kwargs) Pipeline [source]
Create a new pipeline with initial data
- tidymut.core.multiout_step(**outputs: str)[source]
Decorator for multi-output pipeline functions.
Use this for functions that return multiple values where you want to name and access the outputs separately.
- Parameters:
**outputs (str) – Named outputs. Use ‘main’ to specify which output is the main data flow. If ‘main’ is not specified, the first return value is treated as main.
Examples
>>> # Returns 3 values: main, stats, plot
>>> @multiout_step(stats="statistics", plot="visualization")
... def analyze_data(data):
...     ...
...     return processed_data, stats_dict, plot_object
>>> # Returns 3 values with explicit main designation
>>> @multiout_step(main="result", error="error_info", stats="statistics")
... def process_with_metadata(data):
...     ...
...     return result, error_info, stats
Note
With this decorator, side outputs are returned as a dictionary.
- tidymut.core.pipeline_step(name: str | Callable[..., Any] | None = None)[source]
Decorator for single-output pipeline functions.
Use this for functions that return a single value (including tuples as single values). For multiple outputs, use @multiout_step instead.
- Parameters:
name (Optional[str] or Callable) – Custom name for the step. If None, uses function name. When used as @pipeline_step (without parentheses), this will be the function.
Examples
>>> @pipeline_step
... def process(data):
...     return processed_data  # Single output
>>> @pipeline_step("process_data")
... def process(data):
...     return processed_data  # Single output
>>> @pipeline_step()
... def get_coordinates():
...     return (10, 20)  # Single tuple output