API

Dataset objects

Detailed API

class itpseq.DataSet(data_path: Path = '.', result_path: Path | None = None, samples: dict | None = None, keys=None, ref_labels: str | tuple | None = 'noa', cache_path=None, file_pattern=None, aafile_pattern=None)

Loads an iTP-Seq dataset and provides methods for analyzing and visualizing the data.

A DataSet object is constructed to handle iTP-Seq Samples with their respective Replicates. By default, it infers the files to use in the provided directory by looking for “*.processed.json” files produced during the initial step of pre-processing and filtering the fastq files. It uses the pattern of the file names to group the Replicates into a Sample, and to define which condition is the reference in the DataSet (the Sample with name “noa” by default).

data_path

Path to the data directory containing the output files from the fastq pre-processing.

Type:

str or Path

result_path

Path to the directory where the results of the analysis will be saved.

Type:

str or Path

samples

Dictionary of Samples in the DataSet. By default, it is None and will be populated automatically.

Type:

dict or None

keys

Properties in the file name to use for identifying the reference.

Type:

tuple

ref_labels

Specifies the reference: e.g. ‘noa’ or ((‘sample’, ‘noa’),)

Type:

str or tuple

cache_path

Path used to cache intermediate results. By default, this creates a subdirectory called “cache” in the result_path directory.

Type:

str or Path

file_pattern

Regex pattern used to identify the sample files in the data_path directory. If None, defaults to r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json', which matches files like nnn15_noa1.processed.json, nnn15_tcx2.processed.json, etc.
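As an illustration of how such a pattern splits a file name into its parts, here is a minimal sketch using only Python's re module. The pattern is written with the escapes (\d and \.) that a working regex needs. The file names and the grouping into a dictionary are assumptions made for demonstration, not itpseq's internal code.

```python
import re
from collections import defaultdict

# Default pattern as described above, with digits and dots escaped.
FILE_PATTERN = re.compile(
    r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json'
)

# Hypothetical file names, following the naming used in this documentation.
filenames = [
    'nnn15_noa1.processed.json',
    'nnn15_noa2.processed.json',
    'nnn15_tcx1.processed.json',
    'nnn15_tcx2.processed.json',
    'README.txt',  # does not match and is ignored
]

# Group the replicate identifiers by sample name.
replicates_by_sample = defaultdict(list)
for name in filenames:
    match = FILE_PATTERN.fullmatch(name)
    if match is None:
        continue
    fields = match.groupdict()
    replicates_by_sample[fields['sample']].append(fields['replicate'])

print(dict(replicates_by_sample))
# {'noa': ['1', '2'], 'tcx': ['1', '2']}
```

A custom file_pattern would presumably need to keep the named groups that the rest of the configuration refers to (for example 'sample' when keys or ref_labels mention it).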

Type:

str

aafile_pattern

Pattern used to identify the amino acid files in the data_path directory. It will use the values captured in the file_pattern regex to construct the file names. If None, defaults to ‘{lib_type}_{sample}{replicate}_aa.processed.txt’
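The relationship between the two patterns can be sketched as follows: the named groups captured by the file_pattern regex fill the fields of the aafile_pattern template. The helper aa_filename is hypothetical and only illustrates the substitution; it is not part of itpseq.

```python
import re

# Default patterns as described above (regex written with its escapes).
FILE_PATTERN = r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)\.processed\.json'
AAFILE_PATTERN = '{lib_type}_{sample}{replicate}_aa.processed.txt'


def aa_filename(json_name):
    """Return the amino acid file name expected for a given JSON file name."""
    match = re.fullmatch(FILE_PATTERN, json_name)
    if match is None:
        raise ValueError(f'{json_name!r} does not match the file pattern')
    # The named groups captured by the regex fill the template fields.
    return AAFILE_PATTERN.format(**match.groupdict())


print(aa_filename('nnn15_tcx2.processed.json'))
# nnn15_tcx2_aa.processed.txt
```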

Type:

str

Examples

Creating a DataSet from a simple antibiotic treatment (tcx) vs no treatment (noa) with 3 replicates each (1, 2, 3).

Load a dataset from the current directory, inferring the samples automatically
>>> from itpseq import DataSet
>>> data = DataSet(data_path='.')
>>> data
DataSet(data_path=PosixPath('.'),
   reference=Sample(noa:[1, 2, 3]),
   samples=[Sample(noa:[1, 2, 3]),
            Sample(tcx:[1, 2, 3], ref: noa)],
   )
Compute a standard report and export it as PDF
>>> data.report('my_experiment.pdf')
Display a graph of the inverse-toeprints lengths for each sample
>>> data.itp_len_plot(row='sample')
Attributes:
samples_with_ref

Methods

DE([pos])

Computes the log2-FoldChange of each motif described by pos, for each sample in the DataSet relative to its reference.

infos([html])

Displays summary information about the dataset sequences.

itoeprint

itp_len_plot

reorder_samples

report

DE(pos='E:A', **kwargs)

Computes the log2-FoldChange of each motif described by pos, for each sample in the DataSet relative to its reference.
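To make the quantity concrete, the following is a hand calculation of a log2 fold change from motif counts. It is only a sketch: the motifs and counts are made up, the normalization by total counts is an assumption, and the function log2_fold_change is hypothetical. The DE method itself works on replicates and likely relies on a proper statistical model rather than this simple ratio.

```python
import math

# Made-up motif counts for a reference and a treated sample.
reference_counts = {'QKP': 100, 'PPG': 100, 'GAK': 200}
treated_counts = {'QKP': 400, 'PPG': 100, 'GAK': 300}


def log2_fold_change(sample, reference):
    """log2 of the ratio of motif frequencies (sample over reference)."""
    sample_total = sum(sample.values())
    reference_total = sum(reference.values())
    return {
        motif: math.log2(
            (sample[motif] / sample_total) / (reference[motif] / reference_total)
        )
        for motif in reference
        if motif in sample
    }


lfc = log2_fold_change(treated_counts, reference_counts)
print(lfc['QKP'], lfc['PPG'])
# 1.0 -1.0
```

A value of 1.0 means the motif is twice as frequent in the sample as in its reference, and -1.0 means it is half as frequent.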

infos(html=False)

Displays summary information about the dataset sequences.

class itpseq.Replicate(*, replicate: str | None = None, filename: Path | None = None, sample: Sample | None = None, labels: dict | None = None, **kwargs)

Replicate instances represent a specific biological or experimental replicate of a Sample.

The purpose of the class is to handle, process, and analyze data corresponding to a replicate. Replicate objects provide methods to load associated data, compute statistical measures, and generate graphical representations such as sequence logos.

filename

Path to the file associated with the replicate. This file is expected to contain raw data relevant to the replicate.

Type:

Optional[Path]

sample

The sample object this replicate belongs to.

Type:

Optional[Sample]

replicate

Identifier or label for the replicate (e.g., “1”).

Type:

Optional[str]

labels

Dictionary of labels or metadata associated with the replicate.

Type:

Optional[dict]

name

Name of the sample, derived from sample.name if provided.

Type:

str

dataset

The DataSet the Sample belongs to, derived from sample.dataset if provided.

Type:

Any

kwargs

Additional keyword arguments and metadata stored as “meta” during initialization.

Type:

dict

Methods

logo([logo_kwargs, ax, fMet, type])

Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.

get_counts

load_data

logo([logo_kwargs, ax, fMet, type])

Generates a sequence logo based on the aligned inverse-toeprints, using the logomaker library.

Parameters:
  • logo_kwargs (dict, optional) – Additional keyword arguments passed to logomaker.Logo for customizing the sequence logo. Defaults to {‘color_scheme’: ‘NajafabadiEtAl2017’}.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing matplotlib Axes to draw the logo on. A new Axes is created if not provided.

  • fMet (bool, optional) – If False, removes m (formyl-methionine / start codon) from the alignment when building the logo. Defaults to False.

  • type (str, optional) – The transformation type applied to the counts matrix. Defaults to ‘information’. Possible values include:

    • ‘information’ for information content.

    • ‘probability’ for probabilities.

  • **kwargs (dict) – Additional keyword arguments passed to filter the input data (e.g., pos, min_peptide, max_peptide…).

Returns:

A logomaker.Logo object representing the sequence logo.

Return type:

logomaker.Logo

Notes

  • Sequence alignment data is first converted to a counts matrix via the logomaker.alignment_to_matrix method.

  • The ribosomal site corresponding to each position is annotated on the x-axis.

  • Transformation of the counts matrix (e.g., counts to information) is performed using logomaker.transform_matrix.
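The two steps in these notes (alignment to counts matrix, then counts to information) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the peptides are made up, the background is uniform over 20 amino acids, and no pseudocounts are applied, whereas logomaker handles these details itself. The functions counts_matrix and information_content are hypothetical names.

```python
import math
from collections import Counter

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

# Made-up aligned peptides of equal length.
alignment = ['KPA', 'KPG', 'KAA', 'KPA']


def counts_matrix(sequences):
    """One Counter per alignment column, counting each residue."""
    return [Counter(column) for column in zip(*sequences)]


def information_content(column_counts, alphabet_size=len(AMINO_ACIDS)):
    """Bits of information: log2(alphabet size) minus the Shannon entropy."""
    total = sum(column_counts.values())
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in column_counts.values()
    )
    return math.log2(alphabet_size) - entropy


counts = counts_matrix(alignment)
info = [information_content(column) for column in counts]
for position, bits in enumerate(info):
    print(position, round(bits, 3))
```

The fully conserved first column reaches the maximum of log2(20) bits, while the more variable columns carry less information and would appear shorter in a logo.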

Examples

# Simple logo plot with default settings
logo = obj.logo()

# Logo plot with min_peptide filtering
logo = obj.logo(min_peptide=3)

# Logo plot with custom transformation type and filtering
logo = obj.logo(type='probability', min_peptide=2, fMet=True)

class itpseq.Sample(*, labels: dict, reference=None, dataset=None, data=None, keys=('sample',), **kwargs)

Represents a sample in a dataset, its replicates, reference, and associated metadata.

The Sample class is used to encapsulate information and behavior related to samples in a dataset. It manages details like labels, references, replicates, and metadata, and provides methods for analyzing replicates, performing differential enrichment analysis, and creating visualizations.

Attributes:
name_ref
name_vs_ref

Methods

get_counts_ratio([pos, factor, exclude_empty])

get_counts_ratio_pos([pos])

Computes a DataFrame with the enrichment ratios for each ribosome position.

hmap([r, c, pos, col, transform, cmap, ...])

Generates a heatmap of enrichment for combinations of 2 positions.

hmap_grid([pos, col, transform, cmap, vmax, ...])

Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.

hmap_pos([pos, cmap, vmax, center, ax])

Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.

itp_len_plot([ax, min_codon, max_codon, ...])

Generates a line plot of inverse-toeprint (ITP) counts per length.

DE

all_logos

get_counts

infos

itoeprint

load_replicates

logo

volcano

get_counts_ratio_pos(pos=None, **kwargs)

Computes a DataFrame with the enrichment ratios for each ribosome position.

This method calculates the enrichment for amino acids at the specified positions on the ribosome and organizes the results into a DataFrame. Each row of the DataFrame corresponds to a ribosome position.

Parameters:
  • pos (iterable, optional) – An iterable of ribosome positions for which to compute enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)). If not provided, defaults to (‘-2’, ‘E’, ‘P’, ‘A’).

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • **kwargs (dict, optional) – Additional parameters to filter the data or customize the ratio computations.

Returns:

A DataFrame where rows correspond to ribosome positions and columns correspond to amino acids (ordered by a predefined amino acid sequence). The values in the DataFrame represent the enrichment ratios for each position and amino acid.

Return type:

pandas.DataFrame
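The layout of the result (ribosome positions as rows, amino acids as columns) can be illustrated with a small pure-Python sketch. The peptides are made up, the nested dictionary stands in for the DataFrame, and the helpers residue_frequencies and enrichment_ratios are hypothetical. How itpseq actually computes and normalizes the ratios is not shown here.

```python
from collections import Counter

POSITIONS = ('-2', 'E', 'P', 'A')

# Made-up peptides; each string holds the residues at positions -2, E, P, A.
reference_peptides = ['KPGA', 'KPGA', 'APGK', 'GPAK']
sample_peptides = ['KPGA', 'KPGA', 'KPGA', 'GPAK']


def residue_frequencies(peptides):
    """Frequency of each residue at each position: {position: {residue: freq}}."""
    freqs = {}
    for index, position in enumerate(POSITIONS):
        counts = Counter(peptide[index] for peptide in peptides)
        freqs[position] = {aa: n / len(peptides) for aa, n in counts.items()}
    return freqs


def enrichment_ratios(sample, reference):
    """Ratio of sample to reference frequencies for residues seen in both."""
    sample_freqs = residue_frequencies(sample)
    reference_freqs = residue_frequencies(reference)
    return {
        position: {
            aa: sample_freqs[position][aa] / ref_freq
            for aa, ref_freq in reference_freqs[position].items()
            if aa in sample_freqs[position]
        }
        for position in POSITIONS
    }


ratios = enrichment_ratios(sample_peptides, reference_peptides)
print(ratios['-2'])
# {'K': 1.5, 'G': 1.0}
```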

hmap(r=None, c=None, *, pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, ax=None, heatmap_kwargs=None, **kwargs)

Generates a heatmap of enrichment for combinations of 2 positions.

Parameters:
  • r (str) – The row position on the ribosome for the heatmap.

  • c (str) – The column position on the ribosome for the heatmap.

  • pos (str or list) – Either a specific position in the form “r:c” or a list of positions to analyze.

  • how (str) – Defines the method to compute the counts (e.g., ‘mean’, ‘sum’, ‘count’). If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • col (str) – The dataset column used for computations.

  • transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmap.

  • cmap (str or matplotlib.colors.Colormap) – The colormap to use for the heatmap visualization.

  • vmax (float, optional) – The maximum value for color scaling in the heatmap.

  • center (float, optional) – The midpoint value for centering the colormap.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. If not provided, a new figure and axes are created.

  • heatmap_kwargs (dict) – Parameters passed to the sns.heatmap method

  • kwargs (dict) – Additional parameters used to filter the dataset. This allows for fine-tuning of the data before generating the heatmap.

Returns:

The heatmap axes object containing the visualization.

Return type:

matplotlib.axes.Axes

hmap_grid(pos=None, col='auto', transform=<ufunc 'log2'>, cmap='vlag', vmax=None, center=None, **kwargs)

Creates a grid of heatmaps for all combinations of ribosome positions passed in pos.

Each cell in the upper triangle of the grid represents a heatmap of enrichment between two positions, with the visualization parameters inherited from the hmap method.

Parameters:
  • pos (iterable, optional) – An iterable of ribosome positions for generating combinations (e.g., [‘-2’, ‘E’, ‘P’, ‘A’]). If not provided, defaults to the set of positions [‘-2’, ‘E’, ‘P’, ‘A’].

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded.

  • col (str, optional) – The dataset column used for computations. Displays the enrichment by default.

  • transform (callable, optional) – A function or callable to apply to the dataset before generating the heatmaps. Defaults to numpy.log2.

  • cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualizations. Defaults to ‘vlag’.

  • vmax (float, optional) – The maximum value for color scaling in the heatmaps.

  • center (float, optional) – The midpoint value for centering the colormap.

  • kwargs (key, value pairings) – Additional parameters used to filter the dataset or control heatmap generation via the hmap method.

Returns:

The figure object containing the grid of heatmaps.

Return type:

matplotlib.figure.Figure
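As a rough illustration of "all combinations of ribosome positions", the pairs for the default positions can be enumerated with itertools. The ordering of the pairs and the way each pair is mapped to a cell of the grid are assumptions; only the idea of one heatmap per unordered pair comes from the description above.

```python
from itertools import combinations

positions = ['-2', 'E', 'P', 'A']

# Each unordered pair of distinct positions gets one heatmap.
pairs = list(combinations(positions, 2))
for row_pos, col_pos in pairs:
    print(f'{row_pos}:{col_pos}')
# -2:E
# -2:P
# -2:A
# E:P
# E:A
# P:A
```

Four positions therefore give six heatmaps, which fit in the upper triangle of a 3 x 3 grid.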

hmap_pos(pos=None, *, cmap='vlag', vmax=None, center=0, ax=None, **kwargs)

Generates a heatmap of enrichment ratios for amino acid positions across ribosome sites.

This method visualizes the enrichment ratios as a heatmap, where the rows correspond to different ribosome positions and the columns represent amino acids.

Parameters:
  • pos (tuple, optional) – Ribosome positions for which to compute and visualize enrichment ratios (e.g., (‘-2’, ‘E’, ‘P’, ‘A’)).

  • how (str, optional) – If ‘aax’ is provided, sequences with stop codons in the peptide are excluded. Default is ‘aax’.

  • col (str, optional) – The DataFrame column to utilize for enrichment visualization. Defaults to ‘auto’.

  • transform (callable, optional) – A function or callable to apply to the enrichment matrix before plotting. Defaults to numpy.log2.

  • cmap (str or matplotlib.colors.Colormap, optional) – The colormap to use for the heatmap visualization. Defaults to ‘vlag’.

  • vmax (float, optional) – The maximum value for color scaling in the heatmap. If not provided, it defaults to the maximum absolute value in the enrichment matrix.

  • center (float, optional) – The midpoint of the colormap. Defaults to 0.

  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes for the plot. A new figure and axes are created if not provided.

  • **kwargs (dict, optional) – Additional parameters to customize the enrichment computation or filtering.

Returns:

The axes object containing the heatmap visualization.

Return type:

matplotlib.axes.Axes

Notes

  • The rows of the heatmap correspond to ribosome positions, while the columns represent amino acids.

  • Tick labels are styled using the aa_colors dictionary to match the biochemical categories of amino acids.

  • Enrichment ratios are automatically log2-transformed by default.
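The interplay of the log2 transform, vmax and center can be sketched as follows. The ratios are made up, and taking the lower color limit as -vmax is an assumption about how a colormap centered on 0 is typically scaled, not a statement about itpseq's code.

```python
import math

# Made-up enrichment ratios: one row per ribosome position, one column per residue.
ratios = {
    'E': {'K': 4.0, 'P': 0.5},
    'P': {'K': 1.0, 'P': 0.125},
}

# Apply the log2 transform described above.
log_ratios = {
    position: {aa: math.log2(value) for aa, value in row.items()}
    for position, row in ratios.items()
}

# Default color limit: the largest absolute value in the transformed matrix.
vmax = max(abs(v) for row in log_ratios.values() for v in row.values())
vmin = -vmax  # assumed symmetric scale around center=0

print(vmin, vmax)
# -3.0 3.0
```

With a symmetric scale, a ratio of 1 (log2 value 0) sits at the neutral midpoint of a diverging colormap such as 'vlag', so enrichment and depletion of the same magnitude get equally intense colors.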

itp_len_plot(ax=None, min_codon=0, max_codon=10, limit=100, norm=False)

Generates a line plot of inverse-toeprint (ITP) counts per length.

This method uses the output of itp_len to create a line plot showing the counts of inverse-toeprints across lengths for each replicate. Optionally, counts can be normalized (per million reads), and the plotted lengths can be limited.

Parameters:
  • ax (matplotlib.axes.Axes, optional) – Pre-existing axes to draw the plot on. A new figure and axes are created if not provided.

  • min_codon (int, optional) – The minimum codon position to annotate on the plot. Defaults to 0.

  • max_codon (int, optional) – The maximum codon position to annotate on the plot. Defaults to 10.

  • limit (int, optional) – The maximum length to include in the plot. Defaults to 100.

  • norm (bool, optional) – Whether to normalize counts to reads per million. Defaults to False.

Returns:

The axes object containing the line plot.

Return type:

matplotlib.axes.Axes

Notes

  • The x-axis represents the distance from the 3’ end of the inverse-toeprint in nucleotides.

  • The y-axis shows the counts of inverse-toeprints, either absolute or normalized per million reads.

  • Each replicate is plotted independently and distinguished by the hue attribute in the plot.
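The normalization enabled by norm=True can be illustrated with a short calculation. The counts are made up and the helper per_million is hypothetical. Using the sum of the listed counts as the denominator is an assumption; itpseq may instead normalize by the total number of reads in the replicate.

```python
# Made-up counts of inverse-toeprints per length (in nucleotides) for one replicate.
counts_by_length = {30: 2_000, 33: 5_000, 36: 3_000}


def per_million(counts):
    """Scale counts so that they sum to one million."""
    total = sum(counts.values())
    return {length: n * 1_000_000 / total for length, n in counts.items()}


normalized = per_million(counts_by_length)
print(normalized)
# {30: 200000.0, 33: 500000.0, 36: 300000.0}
```

Normalized counts make replicates with different sequencing depths directly comparable on the same y-axis.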

property itp_len

Combines the counts of inverse-toeprints (ITPs) for each length across all replicates.

This property extracts the counts of inverse-toeprints for each length from the metadata of each replicate and combines them into a single DataFrame, keeping the data for each replicate independent.

Returns:

A DataFrame with the following columns:

  • length (int)

    The length of the inverse-toeprints.

  • replicate (str)

    The replicate identifier.

  • count (int)

    The count of inverse-toeprints of the given length for the replicate.

  • sample (str)

    The name of the sample this data belongs to.

Return type:

pandas.DataFrame
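The long format described above (one row per replicate and length) can be sketched with plain Python records. The input dictionary is a made-up stand-in for the per-replicate metadata, whose real structure is not documented here.

```python
# Made-up per-replicate metadata: {replicate identifier: {length: count}}.
length_counts = {
    '1': {30: 120, 33: 340},
    '2': {30: 100, 33: 310},
}
sample_name = 'noa'

# One record per (replicate, length), with the four columns listed above.
records = [
    {'length': length, 'replicate': replicate, 'count': count, 'sample': sample_name}
    for replicate, counts in length_counts.items()
    for length, count in counts.items()
]

for record in records:
    print(record)
```

Passing such a list of records to pandas.DataFrame would yield a table with the columns length, replicate, count and sample, which is the shape that plotting functions with a hue per replicate usually expect.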