tidymut.core.dataset module

class tidymut.core.dataset.MutationDataset(name: str | None = None)[source]

Bases: object

Dataset container for cleaned mutation data with multiple reference sequences.

All mutation sets must be linked to a reference sequence when added to the dataset. This ensures data integrity and enables proper validation and analysis.

add_mutation_set(mutation_set: MutationSet, reference_id: str, label: float | None = None)[source]

Add a mutation set to the dataset, linking to a reference sequence

add_mutation_sets(mutation_sets: Sequence[MutationSet], reference_ids: Sequence[str], labels: Sequence[float] | None = None)[source]

Add multiple mutation sets to the dataset

add_reference_sequence(sequence_id: str, sequence: BaseSequence)[source]

Add a reference sequence with a unique identifier

convert_codon_to_amino_acid_sets(convert_labels: bool = False) MutationDataset[source]

Convert all codon mutation sets to amino acid mutation sets

Parameters:

convert_labels – Whether to save the labels with the mutation sets (default: False)
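
As a rough illustration of what a codon-to-amino-acid conversion involves, the sketch below translates the wild-type and mutant codons with the standard genetic code. It does not call tidymut: `translate_codon` and `codon_to_amino_acid_mutation` are hypothetical helpers, and the `A1V`-style output format is an assumption modeled on the mutation strings in the `from_dataframe` example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon: str) -> str:
    """Translate one DNA or RNA codon to a one-letter amino acid ('*' = stop)."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def codon_to_amino_acid_mutation(wild_codon: str, position: int, mutant_codon: str) -> str:
    """Hypothetical helper: build an amino acid mutation string such as 'A1V'."""
    return f"{translate_codon(wild_codon)}{position}{translate_codon(mutant_codon)}"


print(codon_to_amino_acid_mutation("GCT", 1, "GTT"))  # A1V
```

Several codon mutations can map to the same amino acid mutation, so a real conversion may also need to decide how labels are carried over, which is presumably what `convert_labels` controls.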

filter_by_effect_type(effect_type: str) MutationDataset[source]

Filter dataset by amino acid mutation effect type (synonymous, missense, nonsense)
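
The three effect types can be sketched with a simple comparison of the wild-type and mutant residues. This is a minimal illustration of the usual definitions, not tidymut's implementation: `classify_effect` is a hypothetical helper, and the use of `*` as the stop symbol is an assumption.

```python
def classify_effect(wild_aa: str, mutant_aa: str, stop_symbol: str = "*") -> str:
    """Classify an amino acid substitution by its effect type."""
    if wild_aa == mutant_aa:
        # The protein sequence is unchanged.
        return "synonymous"
    if mutant_aa == stop_symbol:
        # A residue is replaced by a premature stop.
        return "nonsense"
    # Any other residue change. Note that this simple rule also labels a
    # lost stop codon (wild_aa == stop_symbol) as missense.
    return "missense"


# Toy (wild, mutant) pairs, not taken from any real dataset.
pairs = [("A", "V"), ("L", "L"), ("Q", "*")]
missense_only = [pair for pair in pairs if classify_effect(*pair) == "missense"]
print(missense_only)  # [('A', 'V')]
```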

filter_by_mutation_type(mutation_type: Type[BaseMutation]) MutationDataset[source]

Filter dataset by mutation type

filter_by_reference(reference_id: str) MutationDataset[source]

Filter dataset to only include mutation sets from a specific reference sequence

classmethod from_dataframe(df: pd.DataFrame, reference_sequences: Dict[str, BaseSequence], name: str | None = None, specific_mutation_type: Type[BaseMutation] | None = None) MutationDataset[source]

Create a MutationDataset from a DataFrame containing mutation data.

This method reconstructs a MutationDataset from a flattened DataFrame representation, typically used for loading saved mutation datasets from files. The DataFrame should contain mutation information with each row representing a single mutation within mutation sets.

Parameters:
  • df (pd.DataFrame) –

    DataFrame containing mutation data with the following required columns:

      - ‘mutation_set_id’: Identifier for grouping mutations into sets
      - ‘reference_id’: Identifier for the reference sequence
      - ‘mutation_string’: String representation of the mutation
      - ‘position’: Position of the mutation in the sequence
      - ‘mutation_type’: Type of mutation (‘amino_acid’, ‘codon_dna’, ‘codon_rna’)

    Optional columns include:

      - ‘mutation_set_name’: Name of the mutation set
      - ‘label’: Label associated with the mutation set
      - ‘wild_amino_acid’: Wild-type amino acid (for amino acid mutations)
      - ‘mutant_amino_acid’: Mutant amino acid (for amino acid mutations)
      - ‘wild_codon’: Wild-type codon (for codon mutations)
      - ‘mutant_codon’: Mutant codon (for codon mutations)
      - ‘set_*’: Columns with ‘set_’ prefix for mutation set metadata
      - ‘mutation_*’: Columns with ‘mutation_’ prefix for individual mutation metadata

  • reference_sequences (Dict[str, BaseSequence]) – Dictionary mapping reference sequence IDs to their corresponding BaseSequence objects. Must contain all reference sequences referenced in the DataFrame.

  • name (Optional[str], default=None) – Optional name for the created MutationDataset.

  • specific_mutation_type (Optional[Type[BaseMutation]], default=None) – The type of mutations to create. If None, the type is inferred from the first mutation. Must be provided when the mutation type is neither ‘amino_acid’ nor any ‘codon_*’ type.

Returns:

A new MutationDataset instance populated with the mutation sets and reference sequences from the DataFrame.

Return type:

MutationDataset

Raises:

ValueError – If the DataFrame is empty, is missing required columns, or refers to sequences that are not provided in the reference_sequences dict.

Notes

  • Mutations are grouped by ‘mutation_set_id’ to reconstruct mutation sets

  • The method automatically determines the appropriate mutation set type (AminoAcidMutationSet, CodonMutationSet, or generic MutationSet) based on the mutation types within each set

  • Metadata is extracted from columns with ‘set_’ and ‘mutation_’ prefixes

  • Only reference sequences that are actually used in the DataFrame are added to the dataset
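
The grouping step described in the notes can be sketched with plain dictionaries. This is only an illustration of the idea, assuming rows shaped like the required columns. `group_rows` and `used_references` are hypothetical helpers, the library itself works on a pandas DataFrame, and the check that a set has a single reference is an assumption.

```python
from collections import defaultdict

# Flattened rows: one row per mutation, as in the required columns.
rows = [
    {"mutation_set_id": "set1", "reference_id": "prot1", "mutation_string": "A1V"},
    {"mutation_set_id": "set1", "reference_id": "prot1", "mutation_string": "L2P"},
    {"mutation_set_id": "set2", "reference_id": "prot2", "mutation_string": "G5R"},
]


def group_rows(rows):
    """Group flattened mutation rows into sets keyed by 'mutation_set_id'."""
    grouped = defaultdict(lambda: {"reference_id": None, "mutations": []})
    for row in rows:
        entry = grouped[row["mutation_set_id"]]
        if entry["reference_id"] is None:
            entry["reference_id"] = row["reference_id"]
        elif entry["reference_id"] != row["reference_id"]:
            # Assumption: one mutation set belongs to exactly one reference.
            raise ValueError(f"Set {row['mutation_set_id']!r} mixes references")
        entry["mutations"].append(row["mutation_string"])
    return dict(grouped)


def used_references(grouped):
    """Return the reference IDs that at least one mutation set points to."""
    return sorted({entry["reference_id"] for entry in grouped.values()})


grouped = group_rows(rows)
print(len(grouped))              # 2
print(used_references(grouped))  # ['prot1', 'prot2']
```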

Examples

>>> import pandas as pd
>>> from sequences import ProteinSequence
>>> from tidymut.core.dataset import MutationDataset
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame({
...     'mutation_set_id': ['set1', 'set1', 'set2'],
...     'reference_id': ['prot1', 'prot1', 'prot2'],
...     'mutation_string': ['A1V', 'L2P', 'G5R'],
...     'position': [1, 2, 5],
...     'mutation_type': ['amino_acid', 'amino_acid', 'amino_acid'],
...     'wild_amino_acid': ['A', 'L', 'G'],
...     'mutant_amino_acid': ['V', 'P', 'R'],
...     'mutation_set_name': ['variant1', 'variant1', 'variant2'],
...     'label': ['pathogenic', 'pathogenic', 'benign']
... })
>>>
>>> # Define reference sequences
>>> ref_seqs = {
...     'prot1': ProteinSequence('ALDEFG', name='protein1'),
...     'prot2': ProteinSequence('MKGLRK', name='protein2')
... }
>>>
>>> # Create MutationDataset
>>> dataset = MutationDataset.from_dataframe(df, ref_seqs, name="my_dataset")
>>> print(len(dataset.mutation_sets))
2

get_mutation_set_label(mutation_set_index: int) Any[source]

Get the label for a specific mutation set

get_mutation_set_reference(mutation_set_index: int) str[source]

Get the reference sequence ID for a specific mutation set

get_position_coverage(reference_id: str | None = None) Dict[str, Any][source]

Get statistics about position coverage across reference sequences

get_reference_sequence(sequence_id: str) BaseSequence[source]

Get a reference sequence by ID

get_statistics() Dict[str, Any][source]

Get basic statistics about the dataset

list_reference_sequences() List[str][source]

Get list of all reference sequence IDs

classmethod load(filepath: str, load_type: str | None = None) MutationDataset[source]

Load a dataset from files.

Parameters:
  • filepath (str) – Base filepath (with or without extension)

  • load_type (Optional[str], default=None) – Type of load format (“tidymut”, “dataframe” or “pickle”). If None, auto-detect from file extension.

Return type:

MutationDataset instance

Example

>>> # Auto-detect from extension
>>> dataset = MutationDataset.load("my_study.csv")
>>> dataset = MutationDataset.load("my_study.pkl")
>>> # Explicit type
>>> dataset = MutationDataset.load("my_study", "dataframe")

classmethod load_by_reference(base_dir: str | Path, dataset_name: str | None = None, is_zero_based: bool = True) MutationDataset[source]

Load a dataset from tidymut reference-based format.

Parameters:
  • base_dir (Union[str, Path]) – Base directory containing reference folders

  • dataset_name (Optional[str], default=None) – Optional name for the loaded dataset

  • is_zero_based (bool, default=True) – Whether the original mutation positions are zero-based

Returns:

MutationDataset instance

Expected directory structure:

base_dir/
├── reference_id_1/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
├── reference_id_2/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
└── …

remove_mutation_set(mutation_set_index: int)[source]

Remove a mutation set from the dataset

remove_reference_sequence(sequence_id: str)[source]

Remove a reference sequence

save(filepath: str, save_type: Literal['tidymut', 'pickle', 'dataframe'] | None = 'tidymut')[source]

Save the dataset to files.

Parameters:
  • filepath – Base filepath (without extension)

  • save_type – Type of save format (“tidymut”, “dataframe” or “pickle”)

For save_type="dataframe":
  • Saves mutations as {filepath}.csv

  • Saves reference sequences as {filepath}_refs.pkl

  • Saves metadata as {filepath}_meta.json

For save_type="pickle":
  • Saves entire dataset as {filepath}.pkl

Example

>>> dataset.save("my_study", "dataframe")
>>> # Creates: my_study.csv, my_study_refs.pkl, my_study_meta.json
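
The file naming rules above can be checked ahead of time with a small helper. This sketch only restates the documented names for the “dataframe” and “pickle” types. `expected_save_paths` is a hypothetical function, and the files written for “tidymut” are not listed in this section, so that type is deliberately left out.

```python
from pathlib import Path


def expected_save_paths(filepath: str, save_type: str) -> list[Path]:
    """List the files that a save with the given type is documented to create."""
    if save_type == "dataframe":
        return [
            Path(f"{filepath}.csv"),        # mutations
            Path(f"{filepath}_refs.pkl"),   # reference sequences
            Path(f"{filepath}_meta.json"),  # metadata
        ]
    if save_type == "pickle":
        return [Path(f"{filepath}.pkl")]    # entire dataset
    raise ValueError(f"No documented file layout for save_type={save_type!r}")


print([p.name for p in expected_save_paths("my_study", "dataframe")])
# ['my_study.csv', 'my_study_refs.pkl', 'my_study_meta.json']
```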

save_by_reference(base_dir: str | Path) None[source]

Save dataset by reference_id, creating separate folders for each reference.

Parameters:

base_dir – Base directory to create reference folders in

For each reference_id, creates:
  • {base_dir}/{reference_id}/data.csv: mutation data with columns [mutation_name, mutated_sequence, label]

  • {base_dir}/{reference_id}/wt.fasta: wild-type reference sequence

  • {base_dir}/{reference_id}/metadata.json: statistics and metadata for this reference
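
To make the per-reference layout concrete, the sketch below writes one such folder with the standard library. It is not tidymut's writer: `write_reference_folder` is a hypothetical helper, the FASTA header and the contents of `metadata.json` are placeholders, and the record values are toy data.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_reference_folder(base_dir, reference_id, wt_sequence, records):
    """Write data.csv, wt.fasta and metadata.json for one reference."""
    ref_dir = Path(base_dir) / reference_id
    ref_dir.mkdir(parents=True, exist_ok=True)

    # data.csv uses the documented columns.
    with open(ref_dir / "data.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mutation_name", "mutated_sequence", "label"])
        writer.writerows(records)

    # wt.fasta holds the wild-type sequence (header format is an assumption).
    (ref_dir / "wt.fasta").write_text(f">{reference_id}\n{wt_sequence}\n")

    # Placeholder metadata; the real file's fields are not specified here.
    (ref_dir / "metadata.json").write_text(json.dumps({"num_records": len(records)}))
    return ref_dir


with tempfile.TemporaryDirectory() as tmp:
    # Toy record: A1V applied to 'ALDEFG' with an arbitrary label.
    ref_dir = write_reference_folder(tmp, "prot1", "ALDEFG", [("A1V", "VLDEFG", 0.5)])
    created = sorted(p.name for p in ref_dir.iterdir())

print(created)  # ['data.csv', 'metadata.json', 'wt.fasta']
```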

set_mutation_set_label(mutation_set_index: int, label: float)[source]

Set the label for a specific mutation set

set_mutation_set_reference(mutation_set_index: int, reference_id: str)[source]

Set the reference sequence for a specific mutation set

to_dataframe() DataFrame[source]

Convert dataset to pandas DataFrame

validate_against_references() Dict[str, Any][source]

Validate mutations against their reference sequences
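
One check such a validation plausibly performs is that the wild-type residue in each mutation matches the reference at the stated position. The sketch below shows that single check under stated assumptions: mutation strings look like `A1V`, positions are one-based unless `is_zero_based` is set, and `check_mutation` is a hypothetical helper whose return value does not mirror the structure of the library's result dictionary.

```python
import re

# Assumed format: wild-type residue, position, mutant residue (e.g. 'A1V').
MUTATION_PATTERN = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def check_mutation(mutation: str, reference: str, is_zero_based: bool = False):
    """Return None if the mutation fits the reference, else an error message."""
    match = MUTATION_PATTERN.match(mutation)
    if match is None:
        return f"{mutation}: cannot parse mutation string"
    wild, pos_text, _mutant = match.groups()
    index = int(pos_text) if is_zero_based else int(pos_text) - 1
    if not 0 <= index < len(reference):
        return f"{mutation}: position {pos_text} is outside the reference"
    if reference[index] != wild:
        return f"{mutation}: expected {wild} at position {pos_text}, found {reference[index]}"
    return None


print(check_mutation("A1V", "ALDEFG"))  # None
print(check_mutation("G5R", "MKGLRK"))  # G5R: expected G at position 5, found R
```

With the sequences from the `from_dataframe` example, `G5R` does not match `MKGLRK` under either indexing convention, which is the kind of inconsistency this check is meant to surface.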