pyllelic package
Submodules
pyllelic.config module
Configuration options for pyllelic.
- class pyllelic.config.Config(base_directory: pathlib.Path = PosixPath('/'), promoter_file: pathlib.Path = PosixPath('/promoter.txt'), results_directory: pathlib.Path = PosixPath('/results'), analysis_directory: pathlib.Path = PosixPath('/test'), promoter_start: int = 1293200, promoter_end: int = 1296000, chromosome: str = '5', offset: int = 1298163, viz_backend: str = 'plotly', fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$'))[source]
Bases:
object
- analysis_directory: pathlib.Path = PosixPath('/test')
- base_directory: pathlib.Path = PosixPath('/')
- chromosome: str = '5'
- fname_pattern: Pattern[str] = re.compile('^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$')
- offset: int = 1298163
- promoter_end: int = 1296000
- promoter_file: pathlib.Path = PosixPath('/promoter.txt')
- promoter_start: int = 1293200
- results_directory: pathlib.Path = PosixPath('/results')
- viz_backend: str = 'plotly'
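The default `fname_pattern` captures a single group from a `.bam` filename. The following is a minimal sketch using only the standard library, assuming the captured group is the cell line name. `ConfigSketch`, `captured_name`, and the filename are hypothetical stand-ins, not part of pyllelic.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ConfigSketch:
    """Stand-in mirroring a few documented Config fields (not pyllelic's class)."""

    base_directory: Path = Path("/")
    analysis_directory: Path = Path("/test")
    chromosome: str = "5"
    fname_pattern: re.Pattern = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def captured_name(config: ConfigSketch, filename: str) -> Optional[str]:
    """Return the group captured by fname_pattern, or None if there is no match."""
    match = config.fname_pattern.match(filename)
    return match.group(1) if match else None


config = ConfigSketch()
print(captured_name(config, "fh_HELA_cervix.bam"))  # hypothetical filename -> HELA
print(captured_name(config, "notes.txt"))           # no match -> None
```

A filename has to start with letters, an underscore, and an alphanumeric token followed by another underscore before the pattern matches at all, so files named differently would need a custom `fname_pattern`.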
pyllelic.process module
Utilities to pre-process and prepare data for use in pyllelic.
- exception pyllelic.process.ShellCommandError[source]
Bases:
Exception
Error for shell utilities that aren’t installed.
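One way a helper could raise an error like this is sketched below. It assumes that a missing executable is detected with `shutil.which` before the command runs. `run_external_tool` and the local `ShellCommandError` class are hypothetical stand-ins, and pyllelic's own detection logic may differ.

```python
import shutil
import subprocess
from typing import List


class ShellCommandError(Exception):
    """Local stand-in for pyllelic.process.ShellCommandError."""


def run_external_tool(command: List[str]) -> str:
    """Run a command and return its stdout; hypothetical helper, not pyllelic's."""
    # Assumption: a missing executable is detected up front with shutil.which.
    if shutil.which(command[0]) is None:
        raise ShellCommandError(f"{command[0]} is not installed or not on PATH")
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


try:
    run_external_tool(["surely-not-an-installed-tool", "--version"])
except ShellCommandError as err:
    print(f"caught: {err}")
```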
- pyllelic.process.bismark(genome: pathlib.Path, fastq: pathlib.Path) str [source]
Helper function to run external bismark tool.
Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs
- Parameters
genome (Path) – filepath to directory of bismark processed genome files.
fastq (Path) – filepath to fastq file to process.
- Returns
output from bismark shell command, usually discarded
- Return type
str
- Raises
ShellCommandError – bismark is not installed.
- pyllelic.process.bowtie2_fastq_to_bam(index: pathlib.Path, fastq: pathlib.Path, cores: int) str [source]
Helper function to run the external bowtie2 and samtools tools to convert a fastq file to bam.
- Parameters
index (Path) – filepath to bowtie index file
fastq (Path) – filepath to fastq file to convert to bam
cores (int) – number of cores to use for processing
- Returns
output from bowtie2 and samtools shell command, usually discarded
- Return type
str
- Raises
ShellCommandError – bowtie2 is not installed.
- pyllelic.process.build_bowtie2_index(fasta: pathlib.Path) str [source]
Helper function to run external bowtie2-build tool.
- Parameters
fasta (Path) – filepath to fasta file to build index from
- Returns
output from bowtie2-build shell command, usually discarded
- Return type
str
- Raises
ShellCommandError – bowtie2-build is not installed.
- pyllelic.process.fastq_to_list(filepath: pathlib.Path) Optional[List[Bio.SeqRecord.SeqRecord]] [source]
Read a .fastq or .fastq.gz file into an in-memory record_list.
This is a time and memory intensive operation!
- Parameters
filepath (Path) – file path to a .fastq or .fastq.gz file
- Returns
list of biopython sequence records from the fastq file
- Return type
List[SeqRecord]
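The memory cost comes from holding every record at once. The sketch below shows the idea with the standard library only. It assumes strict four-line FASTQ records and uses a plain `FastqRecord` tuple where pyllelic returns Biopython `SeqRecord` objects. `read_fastq` and `FastqRecord` are hypothetical names.

```python
import gzip
import tempfile
from pathlib import Path
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    """Plain stand-in for a Biopython SeqRecord (name, sequence, quality string)."""

    name: str
    seq: str
    qual: str


def read_fastq(filepath: Path) -> List[FastqRecord]:
    """Load every record of a .fastq or .fastq.gz file into memory."""
    opener = gzip.open if filepath.suffix == ".gz" else open
    with opener(filepath, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    # Assumption: strict four-line records (header, sequence, '+', qualities).
    return [
        FastqRecord(name=lines[i][1:].split()[0], seq=lines[i + 1], qual=lines[i + 3])
        for i in range(0, len(lines) - 3, 4)
    ]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reads.fastq.gz"
    with gzip.open(path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    records = read_fastq(path)

print([record.name for record in records])  # ['read1', 'read2']
```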
- pyllelic.process.index_bam(bamfile: pathlib.Path) bool [source]
Helper function to run external samtools index.
- Parameters
bamfile (Path) – filepath to bam file
- Returns
verification of samtools command, usually discarded
- Return type
bool
- pyllelic.process.make_records_to_dictionary(record_list: List[Bio.SeqRecord.SeqRecord]) Dict[str, Bio.SeqRecord.SeqRecord] [source]
Take in a list of biopython SeqRecords and output a dictionary with keys of the record name.
- Parameters
record_list (List[SeqRecord]) – biopython sequence records from a fastq file
- Returns
dict of biopython SeqRecords from a fastq file
- Return type
Dict[str, SeqRecord]
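The conversion amounts to keying each record by its name. This is a minimal sketch with a hypothetical `Record` dataclass standing in for Biopython's `SeqRecord`. How pyllelic handles duplicate names is not stated here, so the overwrite behavior below is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Minimal stand-in for a Biopython SeqRecord with a name and a sequence."""

    name: str
    seq: str


def records_to_dictionary(record_list: List[Record]) -> Dict[str, Record]:
    """Key each record by its name; a later duplicate name overwrites an earlier one."""
    return {record.name: record for record in record_list}


lookup = records_to_dictionary([Record("read1", "ACGT"), Record("read2", "TTGA")])
print(lookup["read2"].seq)  # TTGA
```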
- pyllelic.process.prepare_genome(index: pathlib.Path, aligner: Optional[pathlib.Path] = None) str [source]
Helper function to run external bismark genome preparation tool.
Uses genomes from, e.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
Bismark documentation at: https://github.com/FelixKrueger/Bismark/tree/master/Docs
- Parameters
index (Path) – filepath to unprocessed genome file.
aligner (Optional[Path]) – filepath to bowtie2 alignment program.
- Returns
output from genome preparation shell command, usually discarded
- Return type
str
- Raises
ShellCommandError – bismark_genome_preparation is not installed.
- pyllelic.process.retrieve_seq(filename: str, chrom: str, start: int, end: int) None [source]
Retrieve the genomic sequence of interest from UCSC Genome Browser.
- Parameters
filename (str) – path to store genomic sequence
chrom (str) – chromosome of interest, e.g. “chr5”
start (int) – start position for region of interest
end (int) – end position for region of interest
pyllelic.pyllelic module
pyllelic: a tool for detection of allelic-specific variation in reduced representation bisulfite DNA sequencing.
- class pyllelic.pyllelic.AD_stats(sig: bool, stat: numpy.typing.ArrayLike, crits: List[numpy.typing.ArrayLike])[source]
Bases:
NamedTuple
Helper class for NamedTuple results from anderson_darling_test
- crits: List[numpy.typing.ArrayLike]
Alias for field number 2
- sig: bool
Alias for field number 0
- stat: numpy.typing.ArrayLike
Alias for field number 1
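Because `AD_stats` is a `NamedTuple`, each field can be read by name or by the field number listed above. The sketch below uses a hypothetical `ADStatsSketch` with simplified `float` types and made-up numbers to show that the two access styles agree.

```python
from typing import List, NamedTuple


class ADStatsSketch(NamedTuple):
    """Same field order as the documented AD_stats: sig, stat, crits."""

    sig: bool
    stat: float
    crits: List[float]


# Made-up numbers, for illustration only.
result = ADStatsSketch(sig=True, stat=1.25, crits=[0.5, 0.6, 0.7])

print(result.sig is result[0])   # field number 0
print(result.stat == result[1])  # field number 1
sig, stat, crits = result        # tuple unpacking follows the same order
```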
- class pyllelic.pyllelic.BamOutput(sam_directory: pathlib.Path, genome_string: str, config: pyllelic.config.Config)[source]
Bases:
object
Storage container to process BAM sequencing files and store processed results.
- genome_values: Dict[str, str]
dictionary of read files and contents.
- Type
Dict[str, str]
- name: str
path to bam file analyzed.
- Type
str
- positions: List[str]
index of genomic positions in the bam file.
- Type
pd.Index
- values: Dict[str, str]
dictionary of reads at a given position
- Type
Dict[str, str]
- class pyllelic.pyllelic.GenomicPositionData(config: pyllelic.config.Config, files_set: List[str])[source]
Bases:
object
Class to process reduced representation bisulfite methylation sequencing data.
When initialized, GenomicPositionData reads sequencing file (.bam) locations from a config object, and then automatically performs alignment into BamOutput objects, and then performs methylation analysis, storing the results as QumaResults.
Finally, the aggregate data is analyzed to create some aggregate metrics such as means, modes, and differences (diffs), as well as expose methods for plotting and statistical analysis.
- allelic_data: pandas.core.frame.DataFrame
dataframe of Chi-squared p-values.
- Type
pd.DataFrame
- cell_types: List[str]
list of cell types in the data.
- Type
List[str]
- config: pyllelic.config.Config
pyllelic config object.
- Type
Config
- diffs: pandas.core.frame.DataFrame
dataframe of differences between mean and mode methylation values (mean minus mode).
- Type
pd.DataFrame
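The three aggregate metrics relate as mean, mode, and mean minus mode. Below is a sketch for a single cell line at a single position, assuming the inputs are per-read methylation fractions. `mean_mode_diff` and the sample values are hypothetical, and pyllelic computes these over whole dataframes rather than plain lists.

```python
from statistics import mean, mode
from typing import List, Tuple


def mean_mode_diff(values: List[float]) -> Tuple[float, float, float]:
    """Return (mean, mode, mean - mode) for one cell line at one position."""
    avg = mean(values)
    most_common = mode(values)  # on ties, Python 3.8+ returns the first mode seen
    return avg, most_common, avg - most_common


# Hypothetical per-read methylation fractions at a single genomic position.
reads = [1.0, 1.0, 0.5, 0.0]
print(mean_mode_diff(reads))  # (0.625, 1.0, -0.375)
```

A difference far from zero means the typical read (the mode) sits away from the average, which is the kind of skew a mixed population of reads would produce.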
- file_names: List[str]
list of bam filenames in the data.
- Type
List[str]
- files_set: List[str]
list of bam files analyzed.
- Type
List[str]
- static from_pickle(filename: str) pyllelic.pyllelic.GenomicPositionData [source]
Read pickled GenomicPositionData back to an object.
- Parameters
filename (str) – filename to read pickle
- Returns
GenomicPositionData object
- Return type
GenomicPositionData
- generate_ad_stats() pandas.core.frame.DataFrame [source]
Generate Anderson-Darling normality statistics for an individual data df.
- Returns
df of a-d test statistics
- Return type
pd.DataFrame
- heatmap(min_values: int, width: int = 800, height: int = 2000, cell_lines: Optional[List[str]] = None, data_type: str = 'means', backend: Optional[str] = None) None [source]
Display a graph figure showing heatmap of mean methylation across cell lines.
- Parameters
min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width, defaults to 800
height (int) – figure height, defaults to 2000
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
data_type (str) – type of data to plot. Can be ‘means’, ‘modes’, ‘diffs’, or ‘pvalue’.
backend (Optional[str]) – plotting backend to override default
- Raises
ValueError – invalid data type
ValueError – invalid plotting backend
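The `data_type` check and the `min_values` filter can be sketched with plain dictionaries. This assumes a position is kept when at least `min_values` cell lines have a value. `select_positions`, the `Table` alias, and the sample positions are hypothetical, and pyllelic itself works on pandas dataframes.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]  # position -> one value per cell line

VALID_DATA_TYPES = ("means", "modes", "diffs", "pvalue")


def select_positions(tables: Dict[str, Table], data_type: str, min_values: int) -> Table:
    """Pick the table named by data_type and drop sparsely covered positions."""
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"Invalid data type: {data_type!r}")
    table = tables[data_type]
    # Assumption: a position is kept when at least min_values cell lines have data.
    return {
        position: values
        for position, values in table.items()
        if sum(value is not None for value in values) >= min_values
    }


tables = {"means": {"1293589": [0.9, 0.8, None], "1293690": [None, None, 0.1]}}
print(select_positions(tables, "means", min_values=2))
```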
- histogram(cell_line: str, position: str, backend: Optional[str] = None) None [source]
Display a graph figure showing fractional methylation in a given cell line at a given site.
- Parameters
cell_line (str) – name of cell line
position (str) – genomic position
backend (Optional[str]) – plotting backend to override default
- Raises
ValueError – invalid plotting backend
ValueError – No data available at that position
- individual_data: pandas.core.frame.DataFrame
dataframe of individual methylation values.
- Type
pd.DataFrame
- means: pandas.core.frame.DataFrame
dataframe of mean methylation values.
- Type
pd.DataFrame
- modes: pandas.core.frame.DataFrame
dataframe of modes of methylation values.
- Type
pd.DataFrame
- positions: List[str]
list of genomic positions in the data.
- Type
List[str]
- quma_results: Dict[str, pyllelic.pyllelic.QumaResult]
dictionary of QumaResults.
- Type
Dict[str, QumaResult]
- reads_graph(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None [source]
Display a graph figure showing methylation of reads across cell lines.
- Parameters
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default
- Raises
ValueError – invalid plotting backend
ValueError – Unable to plot more than 20 cell lines at once.
- save(filename: str = 'output.xlsx') None [source]
Save quma results to an Excel file.
- Parameters
filename (str) – Filename to save to. Defaults to “output.xlsx”.
- save_pickle(filename: str) None [source]
Save GenomicPositionData object as a pickled file.
- Parameters
filename (str) – filename to save pickle
- sig_methylation_differences(cell_lines: Optional[List[str]] = None, backend: Optional[str] = None) None [source]
Display a graph figure showing a bar chart of significantly different mean / mode methylation across all or a subset of cell lines.
- Parameters
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
backend (Optional[str]) – plotting backend to override default
- Raises
ValueError – invalid plotting backend
- summarize_allelic_data(cell_lines: Optional[List[str]] = None) pandas.core.frame.DataFrame [source]
Create a dataframe only of likely allelic methylation positions.
- Parameters
cell_lines (Optional[List[str]]) – set of cell lines to analyze, defaults to all cell lines.
- Returns
dataframe of cell lines with likely allelic positions
- Return type
pd.DataFrame
- class pyllelic.pyllelic.QumaResult(read_files: List[str], genomic_files: List[str], positions: List[str])[source]
Bases:
object
Storage container to process and store quma-style methylation results.
- quma_output: List[pyllelic.quma.Quma]
list of Quma result objects.
- Type
List[quma.Quma]
- values: pandas.core.frame.DataFrame
dataframe of quma methylation analysis values.
- Type
pd.DataFrame
- pyllelic.pyllelic.configure(base_path: str, prom_file: str, prom_start: int, prom_end: int, chrom: str, offset: int, test_dir: Optional[str] = None, fname_pattern: Optional[str] = None, viz_backend: Optional[str] = None, results_dir: Optional[str] = None) pyllelic.config.Config [source]
Helper function to set up all our configuration options, such as for testing.
- Parameters
base_path (str) – directory where all processing will occur, put .bam files in “test” sub-directory in this folder
prom_file (str) – filename of genomic sequence of promoter region of interest
prom_start (int) – start position to analyze in promoter region
prom_end (int) – final position to analyze in promoter region
chrom (str) – chromosome promoter is located on
offset (int) – genomic position of promoter to offset reads
test_dir (Optional[str]) – name of test directory where bam files are located
fname_pattern (Optional[str]) – regex pattern for processing filenames
viz_backend (Optional[str]) – which plotting backend to use
results_dir (Optional[str]) – name of results directory
- Returns
configuration dataclass instance.
- Return type
Config
- pyllelic.pyllelic.make_list_of_bam_files(config: pyllelic.config.Config) List[str] [source]
Check analysis directory for all valid .bam files.
- Parameters
config (Config) – pyllelic configuration options.
- Returns
list of files
- Return type
list[str]
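A directory scan of this kind can be sketched with `pathlib`. It assumes that "valid" means a regular file with a `.bam` suffix, which may be narrower or broader than pyllelic's actual check. `list_bam_files` and the filenames are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def list_bam_files(analysis_directory: Path) -> List[str]:
    """Return sorted names of regular files ending in .bam (hypothetical helper)."""
    # Assumption: "valid" means a regular file with a .bam suffix; index files
    # such as sample.bam.bai have the suffix .bai and are skipped.
    return sorted(
        path.name
        for path in analysis_directory.iterdir()
        if path.is_file() and path.suffix == ".bam"
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ("fh_HELA_cervix.bam", "fh_HELA_cervix.bam.bai", "notes.txt"):
        (Path(tmp) / name).touch()
    found = list_bam_files(Path(tmp))

print(found)  # ['fh_HELA_cervix.bam']
```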
- pyllelic.pyllelic.pyllelic(config: pyllelic.config.Config, files_set: List[str]) pyllelic.pyllelic.GenomicPositionData [source]
Wrapper to call pyllelic routines.
- Parameters
config (Config) – pyllelic config object.
files_set (List[str]) – list of bam files to analyze.
- Returns
GenomicPositionData pyllelic object.
- Return type
GenomicPositionData
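Taken together, `configure`, `make_list_of_bam_files`, and `pyllelic` form a three-step workflow. The sketch below chains them using only the signatures documented in this module. It assumes pyllelic is installed and that `.bam` files sit in the `test` sub-directory of `base_path`. The wrapper name `run_example_analysis` and the file names are hypothetical, and the numeric values are simply the documented `Config` defaults.

```python
from typing import Any


def run_example_analysis(base_path: str) -> Any:
    """Configure pyllelic, find the .bam files, and run the analysis."""
    # Imported lazily so this sketch can be loaded without pyllelic installed.
    from pyllelic import pyllelic

    config = pyllelic.configure(
        base_path=base_path,       # hypothetical working directory
        prom_file="promoter.txt",  # hypothetical promoter sequence file
        prom_start=1293200,        # documented Config defaults from here down
        prom_end=1296000,
        chrom="5",
        offset=1298163,
    )
    files_set = pyllelic.make_list_of_bam_files(config)
    data = pyllelic.pyllelic(config=config, files_set=files_set)
    data.save("output.xlsx")
    return data
```

The returned object would be a `GenomicPositionData`, so its plotting methods such as `heatmap(min_values=1)` could be called on it afterwards.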
pyllelic.quma module
Tools to quantify methylation in reduced representation bisulfite sequencing reads.
- class pyllelic.quma.Fasta(com: str = '', pos: Optional[str] = None, seq: str = '')[source]
Bases:
object
Dataclass to wrap fasta results.
- com: str = ''
- pos: Optional[str] = None
- seq: str = ''
- class pyllelic.quma.Quma(gfile_contents: str, qfile_contents: str)[source]
Bases:
object
Quma methylation analysis parser for bisulfite conversion DNA sequencing.
- data: List[pyllelic.quma.Reference]
QUMA Output in object form.
- values: str
QUMA output values in tabular form.
- class pyllelic.quma.Reference(fasta: pyllelic.quma.Fasta, res: pyllelic.quma.Result, dir: int, gdir: int, exc: int)[source]
Bases:
object
Dataclass of quma analysis intermediates.
Includes fasta sequence, quma results, direction of read, genomic direction, and whether result meets exclusion criteria.
- dir: int
- exc: int
- fasta: pyllelic.quma.Fasta
- gdir: int
- class pyllelic.quma.Result(qAli: str = '', gAli: str = '', val: str = '', perc: float = 0.0, pconv: float = 0.0, gap: int = 0, menum: int = 0, unconv: int = 0, conv: int = 0, match: int = 0, aliMis: int = 0, aliLen: int = 0)[source]
Bases:
object
Dataclass of quma alignment comparison results.
- aliLen: int = 0
- aliMis: int = 0
- conv: int = 0
- gAli: str = ''
- gap: int = 0
- match: int = 0
- pconv: float = 0.0
- perc: float = 0.0
- qAli: str = ''
- unconv: int = 0
- val: str = ''
pyllelic.visualization module
Utilities to visualize data for use in pyllelic.
- pyllelic.visualization._create_heatmap(df: pandas.core.frame.DataFrame, min_values: int, width: int, height: int, title_type: str, backend: str) Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure] [source]
Generate a graph figure showing heatmap of mean methylation across cell lines.
- Parameters
df (pd.DataFrame) – dataframe of mean methylation
min_values (int) – minimum number of data points that must exist at a position
width (int) – figure width
height (int) – figure height
title_type (str) – type of figure being plotted
backend (str) – which plotting backend to use
- Returns
plotly or matplotlib figure object
- Return type
Union[go.Figure, plt.Figure]
- Raises
ValueError – invalid plotting backend
- pyllelic.visualization._create_histogram(data: pandas.core.frame.DataFrame, cell_line: str, position: str, backend: str) Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure] [source]
Generate a graph figure showing fractional methylation in a given cell line at a given site.
- Parameters
data (pd.DataFrame) – dataframe of individual data
cell_line (str) – name of cell line
position (str) – genomic position
backend (str) – which plotting backend to use
- Returns
plotly or matplotlib figure object
- Return type
Union[go.Figure, plt.Figure]
- Raises
ValueError – invalid plotting backend provided
- pyllelic.visualization._create_methylation_diffs_bar_graph(df: pandas.core.frame.DataFrame, backend: str) Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure] [source]
Generate a graph figure showing bar graph of significant methylation across cell lines.
- Parameters
df (pd.DataFrame) – dataframe of significant methylation positions
backend (str) – which plotting backend to use
- Returns
plotly or matplotlib figure object
- Return type
Union[go.Figure, plt.Figure]
- Raises
ValueError – invalid plotting backend
- pyllelic.visualization._make_methyl_df(df: pandas.core.frame.DataFrame, row: str) pandas.core.frame.DataFrame [source]
- pyllelic.visualization._make_stacked_fig(df: pandas.core.frame.DataFrame, backend: str) Union[plotly.graph_objs._figure.Figure, matplotlib.figure.Figure] [source]
Generate a graph figure showing methylated and unmethylated reads across cell lines.
- Parameters
df (pd.DataFrame) – dataframe of individual read data
backend (str) – plotting backend to use
- Returns
plotly or matplotlib figure
- Return type
Union[go.Figure, plt.Figure]
- Raises
ValueError – invalid plotting backend
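The backend argument selects between a plotly builder and a matplotlib builder, and anything else raises `ValueError`. Below is a sketch of that dispatch with placeholder builders that return strings instead of figures. `"plotly"` is the documented default, while the `"matplotlib"` key and all function names here are assumptions.

```python
from typing import Callable, Dict, List


def make_plotly_fig(rows: List[str]) -> str:
    """Placeholder for a plotly-specific figure builder."""
    return f"plotly figure with {len(rows)} rows"


def make_mpl_fig(rows: List[str]) -> str:
    """Placeholder for a matplotlib-specific figure builder."""
    return f"matplotlib figure with {len(rows)} rows"


# Assumption: "plotly" is the documented default; "matplotlib" is a guessed name.
BACKENDS: Dict[str, Callable[[List[str]], str]] = {
    "plotly": make_plotly_fig,
    "matplotlib": make_mpl_fig,
}


def make_fig(rows: List[str], backend: str) -> str:
    """Dispatch to the builder for the requested backend."""
    try:
        builder = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Invalid plotting backend: {backend!r}") from None
    return builder(rows)


print(make_fig(["HELA", "SW1710"], backend="plotly"))  # plotly figure with 2 rows
```

A lookup table keeps the validation and the dispatch in one place, so adding a backend means adding one entry rather than another branch.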
- pyllelic.visualization._make_stacked_mpl_fig(df: pandas.core.frame.DataFrame) matplotlib.figure.Figure [source]
Generate a graph figure showing methylated and unmethylated reads across cell lines.
- Parameters
df (pd.DataFrame) – dataframe of individual read data
- Returns
matplotlib figure
- Return type
plt.Figure
- pyllelic.visualization._make_stacked_plotly_fig(df: pandas.core.frame.DataFrame) plotly.graph_objs._figure.Figure [source]
Generate a graph figure showing methylated and unmethylated reads across cell lines.
- Parameters
df (pd.DataFrame) – dataframe of individual read data
- Returns
plotly figure
- Return type
go.Figure