tidymut.utils.sequence_io module

Utilities for reading and writing sequence files without a Biopython dependency.

tidymut.utils.sequence_io.load_sequences(file_path: str | Path, header_parser: Callable[[str], Tuple[str, Dict[str, str]]] | None = None, format: str | None = None, id_column: str | None = None, sequence_column: str | None = None) -> Dict[str, str]

Load sequences from various file formats

Parameters:
  • file_path (Union[str, Path]) – Path to sequence file

  • header_parser (Optional[Callable], default=None) – Function to parse FASTA headers (only used for FASTA format)

  • format (Optional[str], default=None) – File format. If None, the format is inferred from the file extension. Supported: ‘fasta’, ‘csv’, ‘tsv’, ‘json’

  • id_column (Optional[str], default=None) – Column name for sequence IDs (CSV/TSV only)

  • sequence_column (Optional[str], default=None) – Column name for sequences (CSV/TSV only)

Returns:

Dictionary mapping sequence IDs to sequences

Return type:

Dict[str, str]

Examples

>>> # Load UniProt FASTA
>>> seqs = load_sequences("uniprot.fasta")
>>> # Load FASTA with custom parser
>>> seqs = load_sequences("genes.fasta", header_parser=parse_simple_header)
>>> # Load CSV with specified columns
>>> seqs = load_sequences("sequences.csv", id_column="protein_id", sequence_column="aa_sequence")
>>> # Load with automatic column detection
>>> seqs = load_sequences("sequences.csv")
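
The extension-based format inference described for the format parameter can be sketched as follows. This is a minimal illustration, not tidymut's implementation: the helper name infer_format and the extension table are assumptions, and the extensions the library actually recognizes may differ.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical extension table (an assumption for illustration only).
_EXTENSION_FORMATS = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
}


def infer_format(file_path: Union[str, Path], format: Optional[str] = None) -> str:
    """Return the explicit format if given, else guess it from the extension."""
    if format is not None:
        return format.lower()
    suffix = Path(file_path).suffix.lower()
    try:
        return _EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from extension {suffix!r}") from None


print(infer_format("uniprot.fasta"))               # fasta
print(infer_format("table.txt", format="tsv"))     # tsv
```

Passing format explicitly is the safer choice whenever a file carries a nonstandard extension.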

tidymut.utils.sequence_io.parse_custom_delimiter_header(delimiter: str = '|', id_position: int = 0) -> Callable

Create a header parser for custom delimiter-based formats

Parameters:
  • delimiter (str, default='|') – Delimiter character to split the header

  • id_position (int, default=0) – Position of the ID in the split parts (0-based)

Returns:

Header parser function

Return type:

Callable

Examples

>>> parser = parse_custom_delimiter_header('|', 1)
>>> parser("db|GENE1|other|info")
('GENE1', {'parts': ['db', 'other', 'info']})
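
A parser factory of this kind is usually written as a closure. The sketch below reproduces the documented example output under the assumption that the ID is removed from the remaining parts; the name make_delimiter_parser is hypothetical and the library's real implementation may differ.

```python
from typing import Callable, Dict, List, Tuple


def make_delimiter_parser(
    delimiter: str = "|", id_position: int = 0
) -> Callable[[str], Tuple[str, Dict[str, List[str]]]]:
    """Build a header parser that picks the ID at a 0-based position."""
    if id_position < 0:
        raise ValueError("id_position must be a non-negative, 0-based index")

    def parser(header: str) -> Tuple[str, Dict[str, List[str]]]:
        parts = header.split(delimiter)
        seq_id = parts[id_position]  # IndexError if the header has too few fields
        # Assumption: every field except the ID is kept, in original order.
        remaining = parts[:id_position] + parts[id_position + 1:]
        return seq_id, {"parts": remaining}

    return parser


parser = make_delimiter_parser("|", 1)
print(parser("db|GENE1|other|info"))
# ('GENE1', {'parts': ['db', 'other', 'info']})
```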

tidymut.utils.sequence_io.parse_fasta(file_path: str | Path, header_parser: Callable[[str], Tuple[str, Dict[str, str]]] | None = None, clean_sequence: bool = True) -> Dict[str, Dict[str, Any]]

Parse FASTA file with custom header parsing

Parameters:
  • file_path (Union[str, Path]) – Path to FASTA file

  • header_parser (Optional[Callable], default=None) – Function to parse headers. It should take a header string and return (id, metadata). If None, parse_uniprot_header is used.

  • clean_sequence (bool, default=True) – Whether to clean sequences (remove whitespace, numbers, etc.)

Returns:

Dictionary mapping sequence IDs to {‘sequence’: str, ‘metadata’: dict}

Return type:

Dict[str, Dict[str, Any]]

Examples

>>> # Use default UniProt parser
>>> sequences = parse_fasta("proteins.fasta")
>>> # Use NCBI parser
>>> sequences = parse_fasta("ncbi_proteins.fasta", header_parser=parse_ncbi_header)
>>> # Use simple parser
>>> sequences = parse_fasta("genes.fasta", header_parser=parse_simple_header)
>>> # Custom parser
>>> def my_parser(header):
...     return header.split('_')[0], {'full_header': header}
>>> sequences = parse_fasta("custom.fasta", header_parser=my_parser)
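
To make the return structure concrete, here is a minimal sketch of a FASTA reader that works on an iterable of lines instead of a file path. The function name parse_fasta_lines is hypothetical, and the cleaning step (dropping whitespace and digits) is an assumption about what clean_sequence does; the library may remove other characters as well.

```python
import io
import re
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

HeaderParser = Callable[[str], Tuple[str, Dict[str, str]]]


def _store(records, seq_id, metadata, chunks: List[str], clean: bool) -> None:
    """Save the record collected so far, if any."""
    if seq_id is None:
        return
    sequence = "".join(chunks)
    if clean:
        # Assumed cleaning rule: strip whitespace and digits only.
        sequence = re.sub(r"[\s\d]+", "", sequence)
    records[seq_id] = {"sequence": sequence, "metadata": metadata}


def parse_fasta_lines(
    lines: Iterable[str], header_parser: HeaderParser, clean_sequence: bool = True
) -> Dict[str, Dict[str, Any]]:
    records: Dict[str, Dict[str, Any]] = {}
    seq_id: Optional[str] = None
    metadata: Dict[str, str] = {}
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            _store(records, seq_id, metadata, chunks, clean_sequence)
            seq_id, metadata = header_parser(line[1:].strip())
            chunks = []
        elif seq_id is not None and line.strip():
            chunks.append(line)
    _store(records, seq_id, metadata, chunks, clean_sequence)  # last record
    return records


def my_parser(header: str) -> Tuple[str, Dict[str, str]]:
    return header.split("_")[0], {"full_header": header}


handle = io.StringIO(">GENE1_v2 first\nACD 10\nEF\n>GENE2_x\nKLM\n")
print(parse_fasta_lines(handle, my_parser)["GENE1"]["sequence"])  # ACDEF
```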

tidymut.utils.sequence_io.parse_ncbi_header(header: str) -> Tuple[str, Dict[str, str]]

Parse NCBI FASTA header to extract ID and metadata

Parameters:

header (str) – FASTA header line (without ‘>’)

Returns:

(sequence_id, metadata_dict)

Return type:

Tuple[str, Dict[str, str]]

Examples

>>> parse_ncbi_header("gi|123456|ref|NP_000001.1| protein description [Homo sapiens]")
('NP_000001.1', {'gi': '123456', 'db': 'ref', 'description': 'protein description [Homo sapiens]'})
>>> parse_ncbi_header("NP_000001.1 protein description")
('NP_000001.1', {'description': 'protein description'})
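
The two header shapes in the examples above can be handled with a short sketch like the one below. It is written only to reproduce the documented outputs; the name ncbi_header_sketch is hypothetical, and the real parser may cover more NCBI header variants.

```python
from typing import Dict, Tuple


def ncbi_header_sketch(header: str) -> Tuple[str, Dict[str, str]]:
    """Parse 'gi|<gi>|<db>|<accession>| description' or '<accession> description'."""
    header = header.strip()
    if header.startswith("gi|"):
        fields = header.split("|", 4)
        if len(fields) == 5:
            _, gi, db, accession, rest = fields
            metadata = {"gi": gi, "db": db}
            if rest.strip():
                metadata["description"] = rest.strip()
            return accession, metadata
    # Fallback: first whitespace-delimited token is the ID.
    seq_id, _, description = header.partition(" ")
    description = description.strip()
    return seq_id, ({"description": description} if description else {})


print(ncbi_header_sketch("NP_000001.1 protein description"))
# ('NP_000001.1', {'description': 'protein description'})
```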

tidymut.utils.sequence_io.parse_simple_header(header: str) -> Tuple[str, Dict[str, str]]

Simple header parser that uses the first word as ID

Parameters:

header (str) – FASTA header line (without ‘>’)

Returns:

(sequence_id, metadata_dict)

Return type:

Tuple[str, Dict[str, str]]

Examples

>>> parse_simple_header("GENE1 some description text")
('GENE1', {'description': 'some description text'})
>>> parse_simple_header("GENE1")
('GENE1', {})
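
The first-word rule is simple enough to sketch in a few lines. The function name below is hypothetical, and the handling of an empty header (empty ID, empty metadata) is an assumption rather than documented behavior.

```python
from typing import Dict, Tuple


def simple_header_sketch(header: str) -> Tuple[str, Dict[str, str]]:
    """Use the first whitespace-delimited word as the ID, the rest as description."""
    parts = header.split(None, 1)
    if not parts:
        return "", {}  # assumed behavior for an empty header
    seq_id = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return seq_id, ({"description": description} if description else {})


print(simple_header_sketch("GENE1 some description text"))
# ('GENE1', {'description': 'some description text'})
```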

tidymut.utils.sequence_io.parse_uniprot_header(header: str) -> Tuple[str, Dict[str, str]]

Parse UniProt FASTA header to extract ID and metadata

Parameters:

header (str) – FASTA header line (without ‘>’)

Returns:

(sequence_id, metadata_dict)

Return type:

Tuple[str, Dict[str, str]]

Examples

>>> parse_uniprot_header("sp|P12345|PROT_HUMAN Protein description OS=Homo sapiens")
('P12345', {'db': 'sp', 'entry_name': 'PROT_HUMAN', 'description': 'Protein description OS=Homo sapiens'})
>>> parse_uniprot_header("P12345|PROT_HUMAN Description")
('P12345', {'entry_name': 'PROT_HUMAN', 'description': 'Description'})
>>> parse_uniprot_header("P12345")
('P12345', {})
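
The three documented cases differ only in how many pipe-separated fields precede the first space. A sketch that reproduces them is shown below; the name uniprot_header_sketch is hypothetical and this is not necessarily how the library parses the header.

```python
from typing import Dict, Tuple


def uniprot_header_sketch(header: str) -> Tuple[str, Dict[str, str]]:
    """Parse 'db|ACC|ENTRY desc', 'ACC|ENTRY desc', or a bare 'ACC'."""
    head, _, description = header.strip().partition(" ")
    fields = head.split("|")
    metadata: Dict[str, str] = {}
    if len(fields) >= 3:
        metadata["db"] = fields[0]
        accession = fields[1]
        metadata["entry_name"] = fields[2]
    elif len(fields) == 2:
        accession, metadata["entry_name"] = fields[0], fields[1]
    else:
        accession = fields[0]
    description = description.strip()
    if description:
        metadata["description"] = description
    return accession, metadata


print(uniprot_header_sketch("P12345|PROT_HUMAN Description"))
# ('P12345', {'entry_name': 'PROT_HUMAN', 'description': 'Description'})
```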

tidymut.utils.sequence_io.write_fasta(sequences: Dict[str, str] | Dict[str, Dict[str, Any]], file_path: str | Path, wrap_length: int = 60, header_formatter: Callable[[str, Dict], str] | None = None) -> None

Write sequences to FASTA file

Parameters:
  • sequences (Union[Dict[str, str], Dict[str, Dict[str, Any]]]) – Dictionary mapping IDs either to sequence strings or to records of the form {‘sequence’: str, ‘metadata’: dict}

  • file_path (Union[str, Path]) – Output file path

  • wrap_length (int, default=60) – Line length for sequence wrapping (0 for no wrapping)

  • header_formatter (Optional[Callable], default=None) – Function to format headers. Takes (id, metadata) and returns header string.

Examples

>>> # Simple sequences
>>> seqs = {'GENE1': 'ACDEF', 'GENE2': 'KLMNO'}
>>> write_fasta(seqs, 'output.fasta')
>>> # With metadata
>>> seqs = {
...     'P12345': {
...         'sequence': 'ACDEF',
...         'metadata': {'description': 'Protein 1', 'organism': 'Human'}
...     }
... }
>>> write_fasta(seqs, 'output.fasta')
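
The effect of wrap_length and header_formatter can be illustrated with a sketch that writes to any text stream instead of a file path. The name write_fasta_sketch is hypothetical, and the default header used here (the bare ID, with metadata ignored) is an assumption; the library may include metadata in its default header.

```python
import io
from typing import Any, Callable, Dict, Optional, TextIO, Union

Record = Union[str, Dict[str, Any]]


def write_fasta_sketch(
    sequences: Dict[str, Record],
    handle: TextIO,
    wrap_length: int = 60,
    header_formatter: Optional[Callable[[str, Dict], str]] = None,
) -> None:
    for seq_id, value in sequences.items():
        # Accept both plain strings and {'sequence': ..., 'metadata': ...} records.
        if isinstance(value, dict):
            sequence, metadata = value["sequence"], value.get("metadata", {})
        else:
            sequence, metadata = value, {}
        header = header_formatter(seq_id, metadata) if header_formatter else seq_id
        handle.write(f">{header}\n")
        if wrap_length > 0:
            for start in range(0, len(sequence), wrap_length):
                handle.write(sequence[start:start + wrap_length] + "\n")
        else:
            handle.write(sequence + "\n")  # 0 means no wrapping


buffer = io.StringIO()
write_fasta_sketch({"GENE1": "ACDEF"}, buffer, wrap_length=3)
print(buffer.getvalue())
```

With wrap_length=3 the five-residue sequence is written as two lines, ACD followed by EF.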