tidymut.core.dataset module
- class tidymut.core.dataset.MutationDataset(name: str | None = None)[source]
Bases: object
Dataset container for cleaned mutation data with multiple reference sequences.
All mutation sets must be linked to a reference sequence when added to the dataset. This ensures data integrity and enables proper validation and analysis.
- add_mutation_set(mutation_set: MutationSet, reference_id: str, label: float | None = None)[source]
Add a mutation set to the dataset, linking to a reference sequence
- add_mutation_sets(mutation_sets: Sequence[MutationSet], reference_ids: Sequence[str], labels: Sequence[float] | None = None)[source]
Add multiple mutation sets to the dataset
- add_reference_sequence(sequence_id: str, sequence: BaseSequence)[source]
Add a reference sequence with a unique identifier
- convert_codon_to_amino_acid_sets(convert_labels: bool = False) MutationDataset [source]
Convert all codon mutation sets to amino acid mutation sets
- Parameters:
convert_labels – Whether to save the labels with the mutation sets (default: False)
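As a rough illustration of what a codon-to-amino-acid conversion involves, the sketch below translates a wild-type and a mutant codon with the standard genetic code. The helper `codon_change_to_amino_acid` and the `CODON_TABLE` constant are hypothetical names chosen for this example, not part of tidymut, and the library's own conversion may differ in detail.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def codon_change_to_amino_acid(wild_codon: str, mutant_codon: str) -> tuple[str, str]:
    """Translate a DNA or RNA codon pair into (wild, mutant) amino acids.

    Hypothetical helper for illustration only; '*' marks a stop codon.
    """

    def translate(codon: str) -> str:
        # Treat RNA codons like DNA codons by mapping U to T.
        return CODON_TABLE[codon.upper().replace("U", "T")]

    return translate(wild_codon), translate(mutant_codon)
```

For instance, `codon_change_to_amino_acid("GCT", "GTT")` corresponds to an alanine-to-valine substitution.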
- filter_by_effect_type(effect_type: str) MutationDataset [source]
Filter dataset by amino acid mutation effect type (synonymous, missense, nonsense)
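The three effect types named above can be told apart from the wild-type and mutant residues alone. The sketch below is one way to express that rule; `classify_effect` is a hypothetical helper, and the use of `*` as the stop symbol is an assumption rather than something this page states.

```python
def classify_effect(wild_amino_acid: str, mutant_amino_acid: str) -> str:
    """Return 'synonymous', 'nonsense', or 'missense' for one substitution.

    Hypothetical helper for illustration; not part of tidymut.
    """
    if wild_amino_acid == mutant_amino_acid:
        return "synonymous"  # residue unchanged
    if mutant_amino_acid == "*":  # assumption: '*' denotes a stop codon
        return "nonsense"  # substitution introduces a premature stop
    return "missense"  # residue replaced by a different amino acid
```

A string returned by such a helper has the same form as the `effect_type` argument, for example `"missense"`.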
- filter_by_mutation_type(mutation_type: Type[BaseMutation]) MutationDataset [source]
Filter dataset by mutation type
- filter_by_reference(reference_id: str) MutationDataset [source]
Filter dataset to only include mutation sets from a specific reference sequence
- classmethod from_dataframe(df: pd.DataFrame, reference_sequences: Dict[str, BaseSequence], name: str | None = None, specific_mutation_type: Type[BaseMutation] | None = None) MutationDataset [source]
Create a MutationDataset from a DataFrame containing mutation data.
This method reconstructs a MutationDataset from a flattened DataFrame representation, typically used for loading saved mutation datasets from files. The DataFrame should contain mutation information with each row representing a single mutation within mutation sets.
- Parameters:
df (pd.DataFrame) –
DataFrame containing mutation data with the following required columns:
- ‘mutation_set_id’: Identifier for grouping mutations into sets
- ‘reference_id’: Identifier for the reference sequence
- ‘mutation_string’: String representation of the mutation
- ‘position’: Position of the mutation in the sequence
- ‘mutation_type’: Type of mutation (‘amino_acid’, ‘codon_dna’, ‘codon_rna’)
Optional columns include:
- ‘mutation_set_name’: Name of the mutation set
- ‘label’: Label associated with the mutation set
- ‘wild_amino_acid’: Wild-type amino acid (for amino acid mutations)
- ‘mutant_amino_acid’: Mutant amino acid (for amino acid mutations)
- ‘wild_codon’: Wild-type codon (for codon mutations)
- ‘mutant_codon’: Mutant codon (for codon mutations)
- ‘set_*’: Columns with ‘set_’ prefix for mutation set metadata
- ‘mutation_*’: Columns with ‘mutation_’ prefix for individual mutation metadata
reference_sequences (Dict[str, BaseSequence]) – Dictionary mapping reference sequence IDs to their corresponding BaseSequence objects. Must contain all reference sequences referenced in the DataFrame.
name (Optional[str], default=None) – Optional name for the created MutationDataset.
specific_mutation_type (Optional[Type[BaseMutation]], default=None) – The type of mutations to create. If None, the type is inferred from the first mutation. Must be provided when the mutation type is neither ‘amino_acid’ nor any ‘codon_*’ type.
- Returns:
A new MutationDataset instance populated with the mutation sets and reference sequences from the DataFrame.
- Return type:
MutationDataset
- Raises:
ValueError – If the DataFrame is empty, is missing required columns, or refers to reference sequences that are not present in the reference_sequences dict.
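The three failure cases can be checked up front before calling `from_dataframe`. The sketch below works on plain row dictionaries (for example the result of `df.to_dict("records")`) so that it runs without pandas; `check_rows` and `REQUIRED_COLUMNS` are hypothetical names, and the error messages are illustrative rather than the ones tidymut raises.

```python
from typing import Any, Mapping, Sequence

# Required columns as listed in the parameter description above.
REQUIRED_COLUMNS = (
    "mutation_set_id",
    "reference_id",
    "mutation_string",
    "position",
    "mutation_type",
)


def check_rows(
    rows: Sequence[Mapping[str, Any]],
    reference_sequences: Mapping[str, Any],
) -> None:
    """Raise ValueError for the documented failure cases (hypothetical helper)."""
    if not rows:
        raise ValueError("no mutation rows provided")

    # Only the first row is inspected; all rows are assumed to share its keys.
    missing_columns = [name for name in REQUIRED_COLUMNS if name not in rows[0]]
    if missing_columns:
        raise ValueError(f"missing required columns: {missing_columns}")

    # Every reference_id used in the data must have a sequence available.
    used_ids = {row["reference_id"] for row in rows}
    unknown_ids = sorted(used_ids - set(reference_sequences))
    if unknown_ids:
        raise ValueError(f"reference sequences not provided: {unknown_ids}")
```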
Notes
- Mutations are grouped by ‘mutation_set_id’ to reconstruct mutation sets
- The method automatically determines the appropriate mutation set type (AminoAcidMutationSet, CodonMutationSet, or generic MutationSet) based on the mutation types within each set
- Metadata is extracted from columns with ‘set_’ and ‘mutation_’ prefixes
- Only reference sequences that are actually used in the DataFrame are added to the dataset
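The grouping and type-selection steps in these notes might look roughly like the following. This is a sketch over plain row dictionaries, not tidymut's implementation: `group_rows_by_set` and `set_kind` are hypothetical helpers, and the rule that a set with mixed mutation types falls back to the generic kind is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence


def group_rows_by_set(rows: Sequence[Mapping[str, Any]]) -> Dict[Any, List[Mapping[str, Any]]]:
    """Collect rows that share a 'mutation_set_id', keeping row order."""
    grouped: Dict[Any, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        grouped[row["mutation_set_id"]].append(row)
    return dict(grouped)


def set_kind(rows_in_set: Sequence[Mapping[str, Any]]) -> str:
    """Pick 'amino_acid', 'codon', or 'generic' from the rows of one set."""
    mutation_types = {row["mutation_type"] for row in rows_in_set}
    if mutation_types == {"amino_acid"}:
        return "amino_acid"
    if len(mutation_types) == 1 and next(iter(mutation_types)).startswith("codon_"):
        return "codon"
    # Assumption: anything mixed or unrecognised is treated as a generic set.
    return "generic"
```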
Examples
>>> import pandas as pd
>>> from sequences import ProteinSequence
>>> from tidymut.core.dataset import MutationDataset
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame({
...     'mutation_set_id': ['set1', 'set1', 'set2'],
...     'reference_id': ['prot1', 'prot1', 'prot2'],
...     'mutation_string': ['A1V', 'L2P', 'G3R'],
...     'position': [1, 2, 3],
...     'mutation_type': ['amino_acid', 'amino_acid', 'amino_acid'],
...     'wild_amino_acid': ['A', 'L', 'G'],
...     'mutant_amino_acid': ['V', 'P', 'R'],
...     'mutation_set_name': ['variant1', 'variant1', 'variant2'],
...     'label': ['pathogenic', 'pathogenic', 'benign']
... })
>>>
>>> # Define reference sequences
>>> ref_seqs = {
...     'prot1': ProteinSequence('ALDEFG', name='protein1'),
...     'prot2': ProteinSequence('MKGLRK', name='protein2')
... }
>>>
>>> # Create MutationDataset
>>> dataset = MutationDataset.from_dataframe(df, ref_seqs, name="my_dataset")
>>> print(len(dataset.mutation_sets))
2
- get_mutation_set_label(mutation_set_index: int) Any [source]
Get the label for a specific mutation set
- get_mutation_set_reference(mutation_set_index: int) str [source]
Get the reference sequence ID for a specific mutation set
- get_position_coverage(reference_id: str | None = None) Dict[str, Any] [source]
Get statistics about position coverage across reference sequences
- get_reference_sequence(sequence_id: str) BaseSequence [source]
Get a reference sequence by ID
- classmethod load(filepath: str, load_type: str | None = None) MutationDataset [source]
Load a dataset from files.
- Parameters:
filepath (str) – Base filepath (with or without extension)
load_type (Optional[str], default=None) – Type of load format (“tidymut”, “dataframe” or “pickle”). If None, auto-detect from file extension.
- Return type:
MutationDataset instance
Example
>>> # Auto-detect from extension
>>> dataset = MutationDataset.load("my_study.csv")
>>> dataset = MutationDataset.load("my_study.pkl")
>>>
>>> # Explicit type
>>> dataset = MutationDataset.load("my_study", "dataframe")
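Auto-detection from the file extension could be pictured as a small lookup. The mapping below is inferred from the two extensions used in the example and is an assumption, as is the helper name `guess_load_type`; how tidymut treats other extensions, or a path with no extension, is not stated here.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping, based on the ".csv" and ".pkl" examples above.
EXTENSION_TO_LOAD_TYPE = {".csv": "dataframe", ".pkl": "pickle"}


def guess_load_type(filepath: str) -> Optional[str]:
    """Return a load_type for a recognised extension, else None (hypothetical helper)."""
    suffix = Path(filepath).suffix.lower()
    return EXTENSION_TO_LOAD_TYPE.get(suffix)
```

When the helper returns None, passing `load_type` explicitly, as in the last line of the example, avoids relying on detection.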
- classmethod load_by_reference(base_dir: str | Path, dataset_name: str | None = None, is_zero_based: bool = True) MutationDataset [source]
Load a dataset from tidymut reference-based format.
- Parameters:
base_dir (Union[str, Path]) – Base directory containing reference folders
dataset_name (Optional[str], default=None) – Optional name for the loaded dataset
is_zero_based (bool, default=True) – Whether origin mutation positions are zero-based
- Returns:
MutationDataset instance
Expected directory structure:
base_dir/
├── reference_id_1/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
├── reference_id_2/
│   ├── data.csv
│   ├── wt.fasta
│   └── metadata.json
└── …
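Before calling `load_by_reference`, it can be useful to confirm that a directory matches the expected layout. The following is a minimal standard-library sketch; `find_incomplete_references` is a hypothetical helper, and it checks only that the three files exist, not their contents.

```python
from pathlib import Path
from typing import Dict, List, Union

# File names taken from the expected directory structure above.
EXPECTED_FILES = ("data.csv", "wt.fasta", "metadata.json")


def find_incomplete_references(base_dir: Union[str, Path]) -> Dict[str, List[str]]:
    """Map each reference folder name to the expected files it lacks."""
    problems: Dict[str, List[str]] = {}
    for folder in sorted(Path(base_dir).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the reference folders
        missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
        if missing:
            problems[folder.name] = missing
    return problems
```

An empty dictionary means every reference folder contains all three files.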
- save(filepath: str, save_type: Literal['tidymut', 'pickle', 'dataframe'] | None = 'tidymut')[source]
Save the dataset to files.
- Parameters:
filepath – Base filepath (without extension)
save_type – Type of save format (“tidymut”, “dataframe” or “pickle”)
- For save_type="dataframe":
Saves mutations as {filepath}.csv
Saves reference sequences as {filepath}_refs.pkl
Saves metadata as {filepath}_meta.json
- For save_type="pickle":
Saves entire dataset as {filepath}.pkl
Example
dataset.save("my_study", "dataframe")
# Creates: my_study.csv, my_study_refs.pkl, my_study_meta.json
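The file-naming rules for the two formats described above can be summarised in a few lines. `expected_save_outputs` is a hypothetical helper for illustration; it deliberately rejects "tidymut" because the files written for that format are not listed here.

```python
from typing import List


def expected_save_outputs(filepath: str, save_type: str) -> List[str]:
    """List the files the documented save formats are described as creating."""
    if save_type == "dataframe":
        return [f"{filepath}.csv", f"{filepath}_refs.pkl", f"{filepath}_meta.json"]
    if save_type == "pickle":
        return [f"{filepath}.pkl"]
    # The outputs of the "tidymut" format are not documented in this section.
    raise ValueError(f"no documented file list for save_type={save_type!r}")
```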
- save_by_reference(base_dir: str | Path) None [source]
Save dataset by reference_id, creating separate folders for each reference.
- Parameters:
base_dir – Base directory to create reference folders in
- For each reference_id, creates:
{base_dir}/{reference_id}/data.csv: mutation data with columns [mutation_name, mutated_sequence, label]
{base_dir}/{reference_id}/wt.fasta: wild-type reference sequence
{base_dir}/{reference_id}/metadata.json: statistics and metadata for this reference
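To make the per-reference layout concrete, the sketch below writes one such folder with the standard library only. It is not tidymut's implementation: `write_reference_folder` is a hypothetical helper, the FASTA header simply reuses the reference ID, and the single `num_records` metadata key is an assumption, since the exact metadata fields are not listed here.

```python
import csv
import json
from pathlib import Path
from typing import List, Tuple, Union


def write_reference_folder(
    base_dir: Union[str, Path],
    reference_id: str,
    wild_type: str,
    records: List[Tuple[str, str, float]],
) -> Path:
    """Write data.csv, wt.fasta and metadata.json for one reference."""
    folder = Path(base_dir) / reference_id
    folder.mkdir(parents=True, exist_ok=True)

    # data.csv uses the three documented columns.
    with open(folder / "data.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mutation_name", "mutated_sequence", "label"])
        writer.writerows(records)

    # wt.fasta holds the wild-type sequence as a single FASTA record.
    (folder / "wt.fasta").write_text(f">{reference_id}\n{wild_type}\n")

    # Assumption: a record count stands in for the real statistics and metadata.
    (folder / "metadata.json").write_text(json.dumps({"num_records": len(records)}))
    return folder
```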
- set_mutation_set_label(mutation_set_index: int, label: float)[source]
Set the label for a specific mutation set