tidymut.utils.sequence_io module
Utilities for reading and writing sequence files without a Biopython dependency.
- tidymut.utils.sequence_io.load_sequences(file_path: str | Path, header_parser: Callable[[str], Tuple[str, Dict[str, str]]] | None = None, format: str | None = None, id_column: str | None = None, sequence_column: str | None = None) → Dict[str, str]
Load sequences from various file formats
- Parameters:
file_path (Union[str, Path]) – Path to sequence file
header_parser (Optional[Callable], default=None) – Function to parse FASTA headers (only used for FASTA format)
format (Optional[str], default=None) – File format. If None, inferred from extension. Supported: ‘fasta’, ‘csv’, ‘tsv’, ‘json’
id_column (Optional[str], default=None) – Column name for sequence IDs (CSV/TSV only)
sequence_column (Optional[str], default=None) – Column name for sequences (CSV/TSV only)
- Returns:
Dictionary mapping sequence IDs to sequences
- Return type:
Dict[str, str]
Examples
>>> # Load UniProt FASTA
>>> seqs = load_sequences("uniprot.fasta")

>>> # Load FASTA with custom parser
>>> seqs = load_sequences("genes.fasta", header_parser=parse_simple_header)

>>> # Load CSV with specified columns
>>> seqs = load_sequences("sequences.csv", id_column="protein_id", sequence_column="aa_sequence")

>>> # Load with automatic column detection
>>> seqs = load_sequences("sequences.csv")
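The extension-based inference described for the format parameter can be pictured as a small lookup. This is a minimal sketch and not tidymut's code: infer_format and the extension table (including the '.fa' alias) are hypothetical names introduced for illustration only.

```python
from pathlib import Path

# Hypothetical extension table; the mapping used inside tidymut may differ.
# The '.fa' alias in particular is an assumption, not taken from the docs.
_EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
}


def infer_format(file_path):
    """Guess a format name from the file extension (illustrative helper)."""
    suffix = Path(file_path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from extension {suffix!r}") from None
```

When a file's extension is not one the loader recognizes, passing format explicitly (for example format="tsv") avoids relying on inference.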
- tidymut.utils.sequence_io.parse_custom_delimiter_header(delimiter: str = '|', id_position: int = 0) → Callable
Create a header parser for custom delimiter-based formats
- Parameters:
delimiter (str, default='|') – Delimiter character to split the header
id_position (int, default=0) – Position of the ID in the split parts (0-based)
- Returns:
Header parser function
- Return type:
Callable
Examples
>>> parser = parse_custom_delimiter_header('|', 1)
>>> parser("db|GENE1|other|info")
('GENE1', {'parts': ['db', 'other', 'info']})
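The factory pattern behind this function can be sketched as a closure that captures the delimiter and ID position. The name make_delimiter_parser is hypothetical, and the body is only one way to reproduce the documented example output, not necessarily how tidymut implements it.

```python
from typing import Callable, Dict, List, Tuple


def make_delimiter_parser(
    delimiter: str = "|", id_position: int = 0
) -> Callable[[str], Tuple[str, Dict[str, List[str]]]]:
    """Build a header parser (illustrative stand-in, not tidymut's code)."""

    def parser(header: str) -> Tuple[str, Dict[str, List[str]]]:
        parts = header.split(delimiter)
        seq_id = parts[id_position]
        # Keep every field except the one used as the ID, in original order.
        remaining = parts[:id_position] + parts[id_position + 1:]
        return seq_id, {"parts": remaining}

    return parser


parser = make_delimiter_parser("|", 1)
result = parser("db|GENE1|other|info")
```

The sketch assumes a non-negative id_position, matching the 0-based convention stated in the parameter list.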
- tidymut.utils.sequence_io.parse_fasta(file_path: str | Path, header_parser: Callable[[str], Tuple[str, Dict[str, str]]] | None = None, clean_sequence: bool = True) → Dict[str, Dict[str, Any]]
Parse FASTA file with custom header parsing
- Parameters:
file_path (Union[str, Path]) – Path to FASTA file
header_parser (Optional[Callable], default=None) – Function to parse headers. It must accept a header string and return (id, metadata). If None, parse_uniprot_header is used.
clean_sequence (bool, default=True) – Whether to clean sequences (remove whitespace, numbers, etc.)
- Returns:
Dictionary mapping sequence IDs to {‘sequence’: str, ‘metadata’: dict}
- Return type:
Dict[str, Dict[str, Any]]
Examples
>>> # Use default UniProt parser
>>> sequences = parse_fasta("proteins.fasta")

>>> # Use NCBI parser
>>> sequences = parse_fasta("ncbi_proteins.fasta", header_parser=parse_ncbi_header)

>>> # Use simple parser
>>> sequences = parse_fasta("genes.fasta", header_parser=parse_simple_header)

>>> # Custom parser
>>> def my_parser(header):
...     return header.split('_')[0], {'full_header': header}
>>> sequences = parse_fasta("custom.fasta", header_parser=my_parser)
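To make the documented return shape concrete, the following standard-library sketch reads a small FASTA file and applies a header parser that follows the (id, metadata) contract. Both read_fasta_sketch and first_word_parser are hypothetical stand-ins written for this illustration. They do not reproduce tidymut's sequence cleaning or error handling.

```python
import tempfile
from pathlib import Path


def first_word_parser(header):
    """Hypothetical parser: first word is the ID, the rest is the description."""
    seq_id, _, description = header.partition(" ")
    metadata = {"description": description} if description else {}
    return seq_id, metadata


def read_fasta_sketch(file_path, header_parser):
    """Illustrative stand-in for parse_fasta; not tidymut's implementation."""
    records = {}
    current_id = None
    with open(file_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: strip '>' before handing it to the parser.
                current_id, metadata = header_parser(line[1:])
                records[current_id] = {"sequence": "", "metadata": metadata}
            elif current_id is not None:
                # Sequence line: append to the record opened by the last header.
                records[current_id]["sequence"] += line
    return records


with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "genes.fasta"
    fasta_path.write_text(">GENE1 example protein\nACDE\nFGHI\n>GENE2\nKLMN\n")
    records = read_fasta_sketch(fasta_path, first_word_parser)
```

Each value in records is a dict with 'sequence' and 'metadata' keys, mirroring the structure described under Returns.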
- tidymut.utils.sequence_io.parse_ncbi_header(header: str) → Tuple[str, Dict[str, str]]
Parse NCBI FASTA header to extract ID and metadata
- Parameters:
header (str) – FASTA header line (without ‘>’)
- Returns:
(sequence_id, metadata_dict)
- Return type:
Tuple[str, Dict[str, str]]
Examples
>>> parse_ncbi_header("gi|123456|ref|NP_000001.1| protein description [Homo sapiens]")
('NP_000001.1', {'gi': '123456', 'db': 'ref', 'description': 'protein description [Homo sapiens]'})
>>> parse_ncbi_header("NP_000001.1 protein description")
('NP_000001.1', {'description': 'protein description'})
- tidymut.utils.sequence_io.parse_simple_header(header: str) → Tuple[str, Dict[str, str]]
Simple header parser that uses the first word as ID
- Parameters:
header (str) – FASTA header line (without ‘>’)
- Returns:
(sequence_id, metadata_dict)
- Return type:
Tuple[str, Dict[str, str]]
Examples
>>> parse_simple_header("GENE1 some description text")
('GENE1', {'description': 'some description text'})
>>> parse_simple_header("GENE1")
('GENE1', {})
- tidymut.utils.sequence_io.parse_uniprot_header(header: str) → Tuple[str, Dict[str, str]]
Parse UniProt FASTA header to extract ID and metadata
- Parameters:
header (str) – FASTA header line (without ‘>’)
- Returns:
(sequence_id, metadata_dict)
- Return type:
Tuple[str, Dict[str, str]]
Examples
>>> parse_uniprot_header("sp|P12345|PROT_HUMAN Protein description OS=Homo sapiens")
('P12345', {'db': 'sp', 'entry_name': 'PROT_HUMAN', 'description': 'Protein description OS=Homo sapiens'})
>>> parse_uniprot_header("P12345|PROT_HUMAN Description")
('P12345', {'entry_name': 'PROT_HUMAN', 'description': 'Description'})
>>> parse_uniprot_header("P12345")
('P12345', {})
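The three documented cases suggest a simple rule: split off the description at the first space, then interpret the pipe-separated fields by count. The sketch below follows that reading. uniprot_header_sketch is a hypothetical name, and the logic is inferred from the examples alone, so the real function may handle other header layouts differently.

```python
def uniprot_header_sketch(header):
    """Illustrative parser inferred from the documented examples only."""
    first, _, description = header.partition(" ")
    fields = first.split("|")
    metadata = {}
    if len(fields) == 3:
        # Full form: db|accession|entry_name
        db, seq_id, entry_name = fields
        metadata["db"] = db
        metadata["entry_name"] = entry_name
    elif len(fields) == 2:
        # Short form: accession|entry_name
        seq_id, entry_name = fields
        metadata["entry_name"] = entry_name
    else:
        # Bare accession (assumption: anything else falls back to the first field)
        seq_id = fields[0]
    if description:
        metadata["description"] = description
    return seq_id, metadata
```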
- tidymut.utils.sequence_io.write_fasta(sequences: Dict[str, str] | Dict[str, Dict[str, Any]], file_path: str | Path, wrap_length: int = 60, header_formatter: Callable[[str, Dict], str] | None = None) → None
Write sequences to FASTA file
- Parameters:
sequences (Union[Dict[str, str], Dict[str, Dict[str, Any]]]) – Dictionary mapping IDs either to sequence strings or to dicts of the form {‘sequence’: str, ‘metadata’: dict}
file_path (Union[str, Path]) – Output file path
wrap_length (int, default=60) – Line length for sequence wrapping (0 for no wrapping)
header_formatter (Optional[Callable], default=None) – Function to format headers. Takes (id, metadata) and returns header string.
Examples
>>> # Simple sequences
>>> seqs = {'GENE1': 'ACDEF', 'GENE2': 'KLMNO'}
>>> write_fasta(seqs, 'output.fasta')

>>> # With metadata
>>> seqs = {
...     'P12345': {
...         'sequence': 'ACDEF',
...         'metadata': {'description': 'Protein 1', 'organism': 'Human'}
...     }
... }
>>> write_fasta(seqs, 'output.fasta')
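The wrap_length and header_formatter parameters can be illustrated with two small helpers. Both format_header and wrap_sequence are hypothetical names written for this sketch. In particular, the documentation above does not say whether a formatter should include the leading '>', so this example leaves it out as an assumption.

```python
def format_header(seq_id, metadata):
    """Hypothetical formatter: 'ID key=value ...', without the leading '>'."""
    extras = " ".join(f"{key}={value}" for key, value in metadata.items())
    return f"{seq_id} {extras}" if extras else seq_id


def wrap_sequence(sequence, wrap_length=60):
    """Split a sequence into fixed-width lines; 0 means no wrapping."""
    if wrap_length <= 0:
        return [sequence]
    return [
        sequence[start:start + wrap_length]
        for start in range(0, len(sequence), wrap_length)
    ]


header = format_header("P12345", {"description": "Protein 1", "organism": "Human"})
lines = wrap_sequence("ACDEFGHIKL", wrap_length=4)
```

A function with the same (id, metadata) signature as format_header is the kind of callable that could be passed as header_formatter.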