pyllelic package

Submodules

pyllelic.config module

Configuration options for pyllelic.

class pyllelic.config.Config(base_directory: pathlib.Path = PosixPath('/'), promoter_file: pathlib.Path = PosixPath('/promoter.txt'), results_directory: pathlib.Path = PosixPath('/results'), analysis_directory: pathlib.Path = PosixPath('/test'), promoter_start: int = 1293200, promoter_end: int = 1296000, chromosome: str = '5', offset: int = 1298163, viz_backend: str = 'plotly', fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$'))

Bases: object

analysis_directory: pathlib.Path = PosixPath('/test')
base_directory: pathlib.Path = PosixPath('/')
chromosome: str = '5'
fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$')
offset: int = 1298163
promoter_end: int = 1296000
promoter_file: pathlib.Path = PosixPath('/promoter.txt')
promoter_start: int = 1293200
results_directory: pathlib.Path = PosixPath('/results')
viz_backend: str = 'plotly'
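
As a minimal sketch of how the default fname_pattern behaves, the snippet below applies the same regular expression to a few file names. The file names and the helper captured_label are hypothetical, and reading the captured group as a cell-line label is an assumption rather than something stated above.

```python
import re
from typing import Optional

# Default pattern copied from Config.fname_pattern above.
FNAME_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def captured_label(filename: str) -> Optional[str]:
    """Return the first capture group, or None if the name does not match."""
    match = FNAME_PATTERN.match(filename)
    return match.group(1) if match else None


# Hypothetical file names, used only for illustration.
print(captured_label("fh_CALU1_LUNG.TERT.bam"))  # second underscore-delimited field
print(captured_label("sample.bam"))              # no underscores, so no match
```

A name must have a letters-only prefix, an alphanumeric middle field, and end in "bam" to match; anything else yields None.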

pyllelic.process module

Utilities to pre-process and prepare data for use in pyllelic.

exception pyllelic.process.ShellCommandError

Bases: Exception

Error for shell utilities that aren’t installed.
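
The helpers in this module raise ShellCommandError when an external tool is missing. The sketch below shows one way such a check could look; run_tool and the locally defined ShellCommandError are stand-ins written for this example, not pyllelic's implementation.

```python
import shutil
import subprocess
from typing import List


class ShellCommandError(Exception):
    """Local stand-in for pyllelic.process.ShellCommandError."""


def run_tool(command: List[str]) -> str:
    """Run an external tool and return its standard output (hypothetical helper)."""
    # shutil.which returns None when the executable cannot be found on PATH.
    if shutil.which(command[0]) is None:
        raise ShellCommandError(f"{command[0]} is not installed")
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


try:
    run_tool(["definitely-not-a-real-tool-xyz", "--version"])
except ShellCommandError as err:
    print(f"Caught: {err}")
```

Checking for the executable first lets the caller distinguish "tool missing" from "tool ran and failed", which subprocess reports separately as CalledProcessError.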

pyllelic.process.bismark(genome: pathlib.Path, fastq: pathlib.Path) → str

Helper function to run external bismark tool.

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters
  • genome (Path) – filepath to directory of bismark processed genome files.

  • fastq (Path) – filepath to fastq file to process.

Returns

output from bismark shell command, usually discarded

Return type

str

Raises

ShellCommandError – bismark is not installed.

pyllelic.process.bowtie2_fastq_to_bam(index: pathlib.Path, fastq: pathlib.Path, cores: int) → str

Helper function to run the external bowtie2 and samtools tools to convert a fastq file to bam.

Parameters
  • index (Path) – filepath to bowtie index file

  • fastq (Path) – filepath to fastq file to convert to bam

  • cores (int) – number of cores to use for processing

Returns

output from bowtie2 and samtools shell command, usually discarded

Return type

str

Raises

ShellCommandError – bowtie2 is not installed.

pyllelic.process.build_bowtie2_index(fasta: pathlib.Path) → str

Helper function to run external bowtie2-build tool.

Parameters

fasta (Path) – filepath to fasta file to build index from

Returns

output from bowtie2-build shell command, usually discarded

Return type

str

Raises

ShellCommandError – bowtie2-build is not installed.

pyllelic.process.fastq_to_list(filepath: pathlib.Path) → Optional[List[Bio.SeqRecord.SeqRecord]]

Read a .fastq or fastq.gz file into an in-memory record_list.

This is a time and memory intensive operation!

Parameters

filepath (Path) – file path to a fastq.gz file

Returns

list of biopython sequence records from the fastq file

Return type

List[SeqRecord]
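
To illustrate what reading a whole fastq file into memory involves, here is a standard-library sketch. It is not the Biopython-based routine documented above: read_fastq is a hypothetical helper, it returns plain tuples instead of SeqRecord objects, and it assumes the simple four-line record layout.

```python
import gzip
import tempfile
from pathlib import Path
from typing import List, Tuple

FastqRecord = Tuple[str, str, str]  # (name, sequence, quality)


def read_fastq(filepath: Path) -> List[FastqRecord]:
    """Read every record of a .fastq or .fastq.gz file into a list."""
    # gzip.open in text mode handles compressed input; plain open handles the rest.
    opener = gzip.open if filepath.suffix == ".gz" else open
    with opener(filepath, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    records: List[FastqRecord] = []
    # Assumes four lines per record: @name, sequence, +, quality.
    for i in range(0, len(lines) - 3, 4):
        name, seq, _, qual = lines[i : i + 4]
        records.append((name[1:], seq, qual))  # drop the leading "@"
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reads.fastq.gz"
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    print(read_fastq(path))
```

Because every line is held in a list before parsing, memory use grows with file size, which is the cost the warning above refers to.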

pyllelic.process.index_bam(bamfile: pathlib.Path) → bool

Helper function to run external samtools index.

Parameters

bamfile (Path) – filepath to bam file

Returns

verification of samtools command, usually discarded

Return type

bool

pyllelic.process.make_records_to_dictionary(record_list: List[Bio.SeqRecord.SeqRecord]) → Dict[str, Bio.SeqRecord.SeqRecord]

Take in list of biopython SeqRecords and output a dictionary with keys of the record name.

Parameters

record_list (List[SeqRecord]) – biopython sequence records from a fastq file

Returns

dict of biopython SeqRecords from a fastq file

Return type

Dict[str, SeqRecord]
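
The list-to-dictionary conversion can be sketched as below. Record is a minimal stand-in for Bio.SeqRecord.SeqRecord carrying only the attributes used here, and records_to_dictionary is a hypothetical name; how pyllelic handles duplicate names is not stated above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Minimal stand-in for Bio.SeqRecord.SeqRecord (only the fields used here)."""

    name: str
    seq: str


def records_to_dictionary(record_list: List[Record]) -> Dict[str, Record]:
    """Key each record by its name; in this sketch later duplicates overwrite earlier ones."""
    return {record.name: record for record in record_list}


records = [Record("read1", "ACGT"), Record("read2", "TTGA")]
by_name = records_to_dictionary(records)
print(by_name["read2"].seq)
```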

pyllelic.process.prepare_genome(index: pathlib.Path, aligner: Optional[pathlib.Path] = None) → str

Helper function to run external bismark genome preparation tool.

Uses genomes from, e.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs

Parameters
  • index (Path) – filepath to unprocessed genome file.

  • aligner (Optional[Path]) – filepath to bowtie2 alignment program.

Returns

output from genome preparation shell command, usually discarded

Return type

str

Raises

ShellCommandError – bismark_genome_preparation is not installed.

pyllelic.process.retrieve_seq(filename: str, chrom: str, start: int, end: int) → None

Retrieve the genomic sequence of interest from UCSC Genome Browser.

Parameters
  • filename (str) – path to store genomic sequence

  • chrom (str) – chromosome of interest, e.g. “chr5”

  • start (int) – start position for region of interest

  • end (int) – end position for region of interest

pyllelic.process.sort_bam(bamfile: pathlib.Path) → bool

Helper function to run pysam samtools sort.

Parameters

bamfile (Path) – filepath to bam file

Returns

verification of samtools command, usually discarded

Return type

bool

pyllelic.pyllelic module

pyllelic: a tool for detection of allele-specific variation in reduced representation bisulfite DNA sequencing.

class pyllelic.pyllelic.AD_stats(sig: bool, stat: numpy.typing.ArrayLike, crits: List[numpy.typing.ArrayLike])

Bases: NamedTuple

Helper class for NamedTuple results from anderson_darling_test

crits: List[numpy.typing.ArrayLike]

Alias for field number 2

sig: bool

Alias for field number 0

stat: numpy.typing.ArrayLike

Alias for field number 1
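
The "Alias for field number" notes mean each field is also reachable by tuple index. A small sketch with a stand-in NamedTuple follows; ADStats is a hypothetical class that mirrors the field order sig, stat, crits, uses plain floats in place of array-like values, and holds made-up numbers.

```python
from typing import List, NamedTuple


class ADStats(NamedTuple):
    """Stand-in with the same field order as AD_stats: sig, stat, crits."""

    sig: bool
    stat: float
    crits: List[float]


# Made-up numbers, for illustration only.
result = ADStats(sig=True, stat=1.23, crits=[0.5, 0.6, 0.7])

# Positional unpacking follows the declared field order.
sig, stat, crits = result
print(result[1] == result.stat)
```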

class pyllelic.pyllelic.BamOutput(sam_directory: pathlib.Path, genome_string: str, config: pyllelic.config.Config)

Bases: object

Storage container to process BAM sequencing files and store processed results.

genome_values: Dict[str, str]

dictionary of read files and contents.

Type

Dict[str, str]

name: str

path to bam file analyzed.

Type

str

positions: List[str]

index of genomic positions in the bam file.

Type

pd.Index

values: Dict[str, str]

dictionary of reads at a given position

Type

Dict[str, str]

class pyllelic.pyllelic.GenomicPositionData(config: pyllelic.config.Config, files_set: List[str])

Bases: object

Class to process reduced representation bisulfite methylation sequencing data.

When initialized, GenomicPositionData reads sequencing file (.bam) locations from a config object, and then automatically performs alignment into BamOutput objects, and then performs methylation analysis, storing the results as QumaResults.

Finally, the aggregate data is analyzed to create some aggregate metrics such as means, modes, and differences (diffs), as well as expose methods for plotting and statistical analysis.

allelic_data: pandas.core.frame.DataFrame

dataframe of Chi-squared p-values.

Type

pd.DataFrame

cell_types: List[str]

list of cell types in the data.

Type

List[str]

config: pyllelic.config.Config

pyllelic config object.

Type

Config

diffs: pandas.core.frame.DataFrame

dataframe of differences (mean minus mode) of methylation values.

Type

pd.DataFrame
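
A worked example of a "mean minus mode" difference, using only the standard library. The input values are made up and assumed to be per-read methylation fractions at one position; mean_minus_mode is a hypothetical helper, and how pyllelic computes the quantity internally is not stated here.

```python
from statistics import mean, mode
from typing import List


def mean_minus_mode(values: List[float]) -> float:
    """Difference between the arithmetic mean and the most common value."""
    return mean(values) - mode(values)


# Three fully methylated reads and one unmethylated read (made-up data).
reads = [1.0, 1.0, 1.0, 0.0]
print(mean_minus_mode(reads))  # mean 0.75, mode 1.0, difference -0.25
```

A difference far from zero indicates that the typical read differs from the average, as would happen when reads split into distinct methylated and unmethylated groups.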

file_names: List[str]

list of bam filenames in the data.

Type

List[str]

files_set: List[str]

list of bam files analyzed.

Type

List[str]

static from_pickle(filename: str) → pyllelic.pyllelic.GenomicPositionData

Read pickled GenomicPositionData back to an object.

Parameters

filename (str) – filename to read pickle

Returns

GenomicPositionData object

Return type

GenomicPositionData

generate_ad_stats() → pandas.core.frame.DataFrame

Generate Anderson-Darling normality statistics for an individual data df.

Returns

dataframe of Anderson-Darling test statistics

Return type

pd.DataFrame

heatmap(min_values: int, width: int = 800, height: int = 2000, cell_lines: Optional[List[str]] = None, data_type: str = 'means', backend: Optional[str] = None) → None

Display a graph figure showing heatmap of mean methylation across cell lines.

Parameters
  • min_values (int) – minimum number of data points that must exist at a position

  • width (int) – figure width, defaults to 800

  • height (int) – figure height, defaults to 2000

  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • data_type (str) – type of data to plot. Can be ‘means’, ‘modes’, ‘diffs’, or ‘pvalue’.

  • backend (Optional[str]) – plotting backend to override default

Raises
  • ValueError – invalid data type

  • ValueError – invalid plotting backend

histogram(cell_line: str, position: str, backend: Optional[str] = None) → None

Display a graph figure showing fractional methylation in a given cell line at a given site.

Parameters
  • cell_line (str) – name of cell line

  • position (str) – genomic position

  • backend (Optional[str]) – plotting backend to override default

Raises
  • ValueError – invalid plotting backend

  • ValueError – No data available at that position

individual_data: pandas.core.frame.DataFrame

dataframe of individual methylation values.

Type

pd.DataFrame

means: pandas.core.frame.DataFrame

dataframe of mean methylation values.

Type

pd.DataFrame

modes: pandas.core.frame.DataFrame

dataframe of modes of methylation values.

Type

pd.DataFrame

positions: List[str]

list of genomic positions in the data.

Type

List[str]

quma_results: Dict[str, pyllelic.pyllelic.QumaResult]

dictionary of QumaResults.

Type

Dict[str, QumaResult]

reads_graph(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) → None

Display a graph figure showing methylation of reads across cell lines.

Parameters
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • backend (Optional[str]) – plotting backend to override default

Raises
  • ValueError – invalid plotting backend

  • ValueError – Unable to plot more than 20 cell lines at once.

save(filename: str = 'output.xlsx') → None

Save quma results to an excel file.

Parameters

filename (str) – Filename to save to. Defaults to “output.xlsx”.

save_pickle(filename: str) → None

Save GenomicPositionData object as a pickled file.

Parameters

filename (str) – filename to save pickle
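
The save_pickle / from_pickle pair amounts to a pickle round trip. The sketch below shows that round trip with free functions and a plain dictionary standing in for a GenomicPositionData object; the function bodies and the payload are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_pickle(obj: Any, filename: str) -> None:
    """Serialize obj to filename in binary pickle format."""
    with open(filename, "wb") as handle:
        pickle.dump(obj, handle)


def from_pickle(filename: str) -> Any:
    """Load a previously pickled object back into memory."""
    # Only unpickle files you trust: loading a pickle can run arbitrary code.
    with open(filename, "rb") as handle:
        return pickle.load(handle)


# Hypothetical payload standing in for a GenomicPositionData object.
payload = {"positions": ["1293200", "1293250"], "cell_types": ["CALU1"]}

with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "results.pickle")
    save_pickle(payload, target)
    restored = from_pickle(target)

print(restored == payload)
```

Pickling avoids re-running the analysis on every session, at the cost of tying the saved file to compatible versions of the classes it contains.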

sig_methylation_differences(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) → None

Display a graph figure showing a bar chart of significantly different mean / mode methylation across all or a subset of cell lines.

Parameters
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

  • backend (Optional[str]) – plotting backend to override default

Raises

ValueError – invalid plotting backend

summarize_allelic_data(cell_lines: Optional[List[str]] = None) → pandas.core.frame.DataFrame

Create a dataframe only of likely allelic methylation positions.

Parameters
  • cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.

Returns

dataframe of cell lines with likely allelic positions

Return type

pd.DataFrame

write_means_modes_diffs(filename: str) → None

Write out files of means, modes, and diffs for future analysis.

Parameters

filename (str) – desired root filename

class pyllelic.pyllelic.QumaResult(read_files: List[str], genomic_files: List[str], positions: List[str])

Bases: object

Storage container to process and store quma-style methylation results.

quma_output: List[pyllelic.quma.Quma]

list of Quma result objects.

Type

List[quma.Quma]

values: pandas.core.frame.DataFrame

dataframe of quma methylation analysis values.

Type

pd.DataFrame

pyllelic.pyllelic.configure(base_path: str, prom_file: str, prom_start: int, prom_end: int, chrom: str, offset: int, test_dir: Optional[str] = None, fname_pattern: Optional[str] = None, viz_backend: Optional[str] = None, results_dir: Optional[str] = None) → pyllelic.config.Config

Helper method to set up all our environmental variables, such as for testing.

Parameters
  • base_path (str) – directory where all processing will occur; put .bam files in the “test” sub-directory of this folder

  • prom_file (str) – filename of genomic sequence of promoter region of interest

  • prom_start (int) – start position to analyze in promoter region

  • prom_end (int) – final position to analyze in promoter region

  • chrom (str) – chromosome promoter is located on

  • offset (int) – genomic position of promoter to offset reads

  • test_dir (Optional[str]) – name of test directory where bam files are located

  • fname_pattern (Optional[str]) – regex pattern for processing filenames

  • viz_backend (Optional[str]) – which plotting backend to use

  • results_dir (Optional[str]) – name of results directory

Returns

configuration dataclass instance.

Return type

Config

pyllelic.pyllelic.make_list_of_bam_files(config: pyllelic.config.Config) → List[str]

Check analysis directory for all valid .bam files.

Parameters

config (Config) – pyllelic configuration options.

Returns

list of files

Return type

list[str]
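
One way to collect candidate files from an analysis directory is sketched below. It is not pyllelic's implementation: list_bam_files is a hypothetical helper, and filtering by the default Config.fname_pattern is an assumption about what "valid" means.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Default pattern copied from Config.fname_pattern.
DEFAULT_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def list_bam_files(directory: Path, pattern: "re.Pattern[str]" = DEFAULT_PATTERN) -> List[str]:
    """Return sorted names of files in directory whose names match pattern."""
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and pattern.match(entry.name)
    )


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical directory contents: one bam file, its index, and an unrelated file.
    for name in ("fh_CALU1_LUNG.bam", "fh_CALU1_LUNG.bam.bai", "notes.txt"):
        (folder / name).touch()
    found = list_bam_files(folder)

print(found)
```

Because the pattern is anchored at "bam$", index files ending in ".bai" are skipped along with unrelated files.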

pyllelic.pyllelic.pyllelic(config: pyllelic.config.Config, files_set: List[str]) → pyllelic.pyllelic.GenomicPositionData

Wrapper to call pyllelic routines.

Parameters
  • config (Config) – pyllelic config object.

  • files_set (List[str]) – list of bam files to analyze.

Returns

GenomicPositionData pyllelic object.

Return type

GenomicPositionData

pyllelic.quma module

Tools to quantify methylation in reduced representation bisulfite sequencing reads.

class pyllelic.quma.Fasta(com: str = '', pos: Optional[str] = None, seq: str = '')

Bases: object

Dataclass to wrap fasta results.

com: str = ''
pos: Optional[str] = None
seq: str = ''

class pyllelic.quma.Quma(gfile_contents: str, qfile_contents: str)

Bases: object

Quma methylation analysis parser for bisulfite conversion DNA sequencing.

data: List[pyllelic.quma.Reference]

QUMA Output in object form.

values: str

QUMA output values in tabular form.

class pyllelic.quma.Reference(fasta: pyllelic.quma.Fasta, res: pyllelic.quma.Result, dir: int, gdir: int, exc: int)

Bases: object

Dataclass of quma analysis intermediates.

Includes fasta sequence, quma results, direction of read, genomic direction, and whether result meets exclusion criteria.

dir: int
exc: int
fasta: pyllelic.quma.Fasta
gdir: int
res: pyllelic.quma.Result

class pyllelic.quma.Result(qAli: str = '', gAli: str = '', val: str = '', perc: float = 0.0, pconv: float = 0.0, gap: int = 0, menum: int = 0, unconv: int = 0, conv: int = 0, match: int = 0, aliMis: int = 0, aliLen: int = 0)

Bases: object

Dataclass of quma alignment comparison results.

aliLen: int = 0
aliMis: int = 0
conv: int = 0
gAli: str = ''
gap: int = 0
match: int = 0
menum: int = 0
pconv: float = 0.0
perc: float = 0.0
qAli: str = ''
unconv: int = 0
val: str = ''

pyllelic.visualization module

Utilities to visualize data for use in pyllelic.

pyllelic.visualization._create_heatmap(df: pandas.core.frame.DataFrame, min_values: int, width: int, height: int, title_type: str, backend: str) → Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure]

Generate a graph figure showing heatmap of mean methylation across cell lines.

Parameters
  • df (pd.DataFrame) – dataframe of mean methylation

  • min_values (int) – minimum number of data points that must exist at a position

  • width (int) – figure width

  • height (int) – figure height

  • title_type (str) – type of figure being plotted

  • backend (str) – which plotting backend to use

Returns

plotly or matplotlib figure object

Return type

Union[go.Figure, plt.Figure]

Raises

ValueError – invalid plotting backend

pyllelic.visualization._create_histogram(data: pandas.core.frame.DataFrame, cell_line: str, position: str, backend: str) → Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure]

Generate a graph figure showing fractional methylation in a given cell line at a given site.

Parameters
  • data (pd.DataFrame) – dataframe of individual data

  • cell_line (str) – name of cell line

  • position (str) – genomic position

  • backend (str) – which plotting backend to use

Returns

plotly or matplotlib figure object

Return type

Union[go.Figure, plt.Figure]

Raises

ValueError – invalid plotting backend provided

pyllelic.visualization._create_methylation_diffs_bar_graph(df: pandas.core.frame.DataFrame, backend: str) → Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure]

Generate a graph figure showing bar graph of significant methylation across cell lines.

Parameters
  • df (pd.DataFrame) – dataframe of significant methylation positions

  • backend (str) – which plotting backend to use

Returns

plotly or matplotlib figure object

Return type

Union[go.Figure, plt.Figure]

Raises

ValueError – invalid plotting backend

pyllelic.visualization._make_binary(data: Optional[List[int]]) → List[int]

pyllelic.visualization._make_methyl_df(df: pandas.core.frame.DataFrame, row: str) → pandas.core.frame.DataFrame

pyllelic.visualization._make_stacked_fig(df: pandas.core.frame.DataFrame, backend: str) → Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure]

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters
  • df (pd.DataFrame) – dataframe of individual read data

  • backend (str) – plotting backend to use

Returns

plotly or matplotlib figure

Return type

Union[go.Figure, plt.Figure]

Raises

ValueError – invalid plotting backend

pyllelic.visualization._make_stacked_mpl_fig(df: pandas.core.frame.DataFrame) → matplotlib.figure.Figure

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters

df (pd.DataFrame) – dataframe of individual read data

Returns

matplotlib figure

Return type

plt.Figure

pyllelic.visualization._make_stacked_plotly_fig(df: pandas.core.frame.DataFrame) → plotly.graph_objs._figure.Figure

Generate a graph figure showing methylated and unmethylated reads across cell lines.

Parameters

df (pd.DataFrame) – dataframe of individual read data

Returns

plotly figure

Return type

go.Figure