Help on gtftk Unix commands
Main parser arguments of gtftk
Getting help with -h
The -h argument can be used to get a synopsis for implemented commands.
$ gtftk -h
Usage: gtftk [-h] [-b] [-p] [-s] [-v] [-l] ...
A toolbox to handle GTF files.
Example:
gtftk get_example -f chromInfo -o simple.chromInfo ;
gtftk get_example | gtftk feature_size -t mature_rna | gtftk nb_exons |\
gtftk intron_sizes | gtftk exon_sizes | gtftk convergent -u 24 -d 24 -c simple.chromInfo | \
gtftk divergent -u 101 -d 10 -c simple.chromInfo | \
gtftk overlapping -u 0 -d 0 -t transcript -c simple.chromInfo -a | \
gtftk select_by_key -k feature -v transcript | gtftk tabulate -k "*" -b
Type 'gtftk sub-command -h' for more information.
Main command arguments:
-h, --help show this help message and exit
-b, --bash-comp Get a script to activate bash completion. (default: False)
-p, --plugin-tests Display bats tests for all plugin. (default: False)
-s, --system-info Display some info about the system. (default: False)
-v, --version show program's version number and exit
-l, --list-plugins Get the list of plugins. (default: False)
Available sub-commands/plugins:
------- editing --------
add_prefix Add a prefix or suffix to target values.
del_attr Delete attributes in the target gtf file.
discretize_key Create a new key through discretization of a numeric key.
join_attr Join attributes from a tabulated file.
join_multi_file Join attributes from mutiple files.
merge_attr Merge a set of attributes into a destination attribute.
----- information ------
add_exon_nb Add exon number transcript-wise.
apropos Search in all command description files those related to a user-defined keyword.
count Count the number of features in the gtf file.
count_key_values Count the number values for a set of keys.
feature_size Compute the size of features enclosed in the GTF.
get_attr_list Get the list of attributes from a GTF file.
get_attr_value_list Get the list of values observed for an attributes.
get_example Get example files including GTF.
get_feature_list Get the list of features enclosed in the GTF.
nb_exons Count the number of exons by transcript.
nb_transcripts Count the number of transcript per gene.
retrieve Retrieve a GTF file from ensembl.
seqid_list Returns the chromosome list.
tss_dist Computes the distance between TSS of gene transcripts.
------ selection -------
random_list Select a random list of genes or transcripts.
random_tx Select randomly up to m transcript for each gene.
rm_dup_tss If several transcripts of a gene share the same TSS, select only one representative.
select_by_go Select lines from a GTF file using a Gene Ontology ID.
select_by_intron_size Select transcripts by intron size.
select_by_key Select lines from a GTF based on attributes and values.
select_by_loc Select transcript/gene overlapping a genomic feature.
select_by_max_exon_nb For each gene select the transcript with the highest number of exons.
select_by_nb_exon Select transcripts based on the number of exons.
select_by_numeric_value Select lines from a GTF file based on a boolean test on numeric values.
select_by_regexp Select lines from a GTF file based on a regexp.
select_by_tx_size Select transcript based on their size (i.e size of mature/spliced transcript).
select_most_5p_tx Select the most 5' transcript of each gene.
short_long Get the shortest or longest transcript of each gene
------ conversion ------
bed_to_gtf Convert a bed file to a gtf but with lots of empty fields...
convert Convert a GTF to various format including bed.
convert_ensembl Convert the GTF file to ensembl format. Essentially add 'transcript'/'gene' features.
tabulate Convert a GTF to tabulated format.
------ annotation ------
closest_genes Find the n closest genes for each transcript.
convergent Find transcripts with convergent tts.
divergent Find transcripts with divergent promoters.
exon_sizes Add a new key to transcript features containing a comma separated list of exon sizes.
intron_sizes Add a new key to transcript features containing a comma separated list of intron sizes.
overlapping Find (non)overlapping transcripts.
------- sequence -------
get_feat_seq Get feature sequence (e.g exon, UTR...).
get_tx_seq Get transcript sequences in fasta format.
----- coordinates ------
get_5p_3p_coords Get the 5p or 3p coordinate for each feature. TSS or TTS for a transcript.
intergenic Extract intergenic regions.
intronic Extract intronic regions.
midpoints Get the midpoint coordinates for the requested feature.
shift Transpose coordinates.
splicing_site Compute the locations of donor and acceptor splice sites.
------- coverage -------
coverage Compute bigwig coverage in body, promoter, tts...
mk_matrix Compute a coverage matrix (see profile).
profile Create coverage profile using a bigWig as input.
----- miscellaneous ----
col_from_tab Select columns from a tabulated file based on their names.
control_list Returns a list of gene matched for expression based on reference values.
------------------------
Activating Bash completion
The code provided below can be useful to activate bash completion.
# Use the -b argument of gtftk
# This will produce a script that you
# should store in your .bashrc
gtftk -b
Alternatively, append the script directly to your .bashrc:
echo "" >> ~/.bashrc
gtftk -b >> ~/.bashrc
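Running the append command above more than once adds the completion script to .bashrc several times. A minimal sketch of a guarded variant follows; the function name install_gtftk_completion and the marker comment are hypothetical (they are not part of gtftk), and only the documented `gtftk -b` call is used.

```shell
# Hypothetical helper (not shipped with gtftk): append the completion
# script to an rc file only if it has not been added before.
install_gtftk_completion() {
    rc_file="${1:-$HOME/.bashrc}"
    # Arbitrary marker line used to detect a previous installation.
    marker="# >>> gtftk bash completion >>>"

    # Nothing to do if the marker is already present.
    if [ -f "$rc_file" ] && grep -qF "$marker" "$rc_file"; then
        return 0
    fi

    # Capture the script first so a failing gtftk call leaves the file untouched.
    completion_script=$(gtftk -b) || return 1
    printf '\n%s\n%s\n' "$marker" "$completion_script" >> "$rc_file"
}
```

Call it as `install_gtftk_completion` (defaults to ~/.bashrc) or pass another file, then open a new shell to load the completion.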
Command-wide arguments
Description: The following arguments are available in almost all gtftk commands:
- -h, --help: Show the argument list and details.
- -i, --inputfile: The input file (may be <stdin>).
- -o, --outputfile: The output file (may be <stdout>).
- -D, --no-date: Do not add date to output file names.
- -C, --add-chr: Add ‘chr’ to chromosome names before printing output.
- -V, --verbosity: Increase output verbosity (can take a value from 0 to 4).
- -K, --tmp-dir: Keep all temporary files in this folder.
- -L, --logger-file: Store the values of all command line arguments in a file.
Commands from section ‘information’
apropos
Description: Search in all command description files those related to a user-defined keyword.
Example: Search all commands related to promoters.
$ gtftk apropos -k promoter
|-- 17:42-INFO-apropos : >> Keyword 'promoter' was found in the following command:
- coverage.
- divergent.
Arguments:
$ gtftk apropos -h
Usage: gtftk apropos -k keyword [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Search in all command description files those related to a user-defined keyword.
Version: 2017-09-27
Arguments:
-k, --keyword The keyword (default: None)
-n, --notes Look also for the keywords in notes associated to each command. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
retrieve
Description: Retrieve a GTF file from Ensembl.
Example: List the GTF files available on the Ensembl FTP site. Bacteria are not listed at the moment.
$ # gtftk retrieve -l | head -5
Example: Perform basic statistics on Vicugna pacos genomic annotations.
$ # gtftk retrieve -s vicugna_pacos -c -d | gtftk count -t vicugna_pacos
Arguments:
$ gtftk retrieve -h
Usage: gtftk retrieve [-s SPECIES] [-o GTF.GZ] [-e {vertebrate,protists,fungi,plants,metazoa}] [-r RELEASE] [-l] [-hs] [-c] [-d] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Retrieve a GTF file from ensembl.
Version: 2018-01-31
Arguments:
-s, --species-name The species name. (default: homo_sapiens)
-o, --outputfile Output file (gtf.gz). (default: None)
-e, --ensembl-collection Which ensembl collection to interrogate('vertebrate', 'protists', 'fungi', 'plants', 'metazoa'). (default: vertebrate)
-r, --release Release number. By default, the latest. (default: None)
-l, --list-only If selected, list available species. (default: False)
-hs, --hide-species-name If selected, hide species names upon listing. (default: False)
-c, --to-stdout This option specifies that output will go to the standard output stream, leaving downloaded files intact (or not, see -d). (default: False)
-d, --delete Delete the gtf file after processing (e.g if -c is used). (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
get_example
Description: Get an example GTF file (or any other kind of example available in the installation directory). This command is only provided for demonstration purposes.
The example below shows that this GTF file follows the Ensembl format and contains the transcript and gene features (column 3).
Example: A very basic (and artificial) example.
$ gtftk get_example| head -2
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
Example: A more realistic example containing a subset of transcripts (n=8531) corresponding to 1058 genes from the human annotation.
$ gtftk get_example -d mini_real | gtftk count
gene 1058
transcript 8531
exon 64251
five_prime_utr 7561
CDS 41371
start_codon 4171
stop_codon 3753
three_prime_utr 6972
Selenocysteine 2
Example: Get all files from the simple dataset.
$ gtftk get_example -d simple -f '*'
|-- 17:42-INFO-get_example : Copying: add_attr_to_pos.tab
|-- 17:42-INFO-get_example : Copying: simple.1.bt2
|-- 17:42-INFO-get_example : Copying: simple.2.bt2
|-- 17:42-INFO-get_example : Copying: simple.2.bw
|-- 17:42-INFO-get_example : Copying: simple.3.bt2
|-- 17:42-INFO-get_example : Copying: simple.3.bw
|-- 17:42-INFO-get_example : Copying: simple.4.bt2
|-- 17:42-INFO-get_example : Copying: simple.bam
|-- 17:42-INFO-get_example : Copying: simple.bam.bai
|-- 17:42-INFO-get_example : Copying: simple.bw
|-- 17:42-INFO-get_example : Copying: simple.chromInfo
|-- 17:42-INFO-get_example : Copying: simple.fa
|-- 17:42-INFO-get_example : Copying: simple.fa.fai
|-- 17:42-INFO-get_example : Copying: simple.geneList
|-- 17:42-INFO-get_example : Copying: simple.genes
|-- 17:42-INFO-get_example : Copying: simple.gtf
|-- 17:42-INFO-get_example : Copying: simple.join
|-- 17:42-INFO-get_example : Copying: simple.join_mat
|-- 17:42-INFO-get_example : Copying: simple.join_mat_2
|-- 17:42-INFO-get_example : Copying: simple.join_mat_3
|-- 17:42-INFO-get_example : Copying: simple.join_with_dup
|-- 17:42-INFO-get_example : Copying: simple.loc_bed
|-- 17:42-INFO-get_example : Copying: simple.map
|-- 17:42-INFO-get_example : Copying: simple.rev.1.bt2
|-- 17:42-INFO-get_example : Copying: simple.rev.2.bt2
|-- 17:42-INFO-get_example : Copying: simple_col.csv
|-- 17:42-INFO-get_example : Copying: simple_peaks.bed
|-- 17:42-INFO-get_example : Copying: simple_peaks.bed3
|-- 17:42-INFO-get_example : Copying: simple_peaks.bed6
|-- 17:42-INFO-get_example : Copying: simple_reads.fq
Arguments:
$ gtftk get_example -h
Usage: gtftk get_example [-d {simple,mini_real,mini_real_noov_rnd_tx,simple_02,simple_03,simple_04,simple_05}] [-f {*,gtf,bed,bw,bam,join,join_mat,chromInfo,fa,fa.idx,genes,geneList,2.bw,genome}] [-o OUTPUT] [-l] [-a] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Print example files including GTF.
Notes:
* Use format '*' to get all files from a dataset.
Version: 2018-01-20
Arguments:
-d, --dataset Select a dataset. (default: simple)
-f, --format The dataset format. (default: gtf)
-o, --outputfile Output file. (default: <stdout>)
-l, --list Only list files of a dataset. (default: False)
-a, --all-dataset Get the list of all datasets. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
add_exon_nb
Description: Add exon number transcript-wise (based on 5’ to 3’ orientation).
Example:
$ gtftk get_example -f gtf | gtftk add_exon_nb | gtftk select_by_key -k feature -v exon
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001"; exon_nbr "1";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001"; exon_nbr "1";
chr1 gtftk exon 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001"; exon_nbr "1";
chr1 gtftk exon 50 54 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001"; exon_nbr "2";
chr1 gtftk exon 57 61 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002"; exon_nbr "1";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001"; exon_nbr "1";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002"; exon_nbr "2";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003"; exon_nbr "3";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001"; exon_nbr "1";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002"; exon_nbr "2";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003"; exon_nbr "3";
chr1 gtftk exon 33 35 . - . gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E001"; exon_nbr "2";
chr1 gtftk exon 42 47 . - . gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E002"; exon_nbr "1";
chr1 gtftk exon 22 25 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E001"; exon_nbr "3";
chr1 gtftk exon 28 30 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E002"; exon_nbr "2";
chr1 gtftk exon 33 35 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E003"; exon_nbr "1";
chr1 gtftk exon 28 30 . - . gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E001"; exon_nbr "2";
chr1 gtftk exon 33 35 . - . gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E002"; exon_nbr "1";
chr1 gtftk exon 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; exon_id "G0007T001E001"; exon_nbr "1";
chr1 gtftk exon 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; exon_id "G0007T002E001"; exon_nbr "1";
chr1 gtftk exon 210 214 . - . gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E001"; exon_nbr "2";
chr1 gtftk exon 220 222 . - . gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E002"; exon_nbr "1";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; exon_nbr "1";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; exon_nbr "1";
chr1 gtftk exon 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; exon_id "G0010T001E001"; exon_nbr "1";
$ gtftk get_example -f gtf | gtftk add_exon_nb -k exon_number | gtftk select_by_key -k feature -v exon | gtftk tabulate -k chrom,start,end,exon_number,transcript_id | head -n 20
seqid start end exon_number transcript_id
chr1 125 138 1 G0001T002
chr1 125 138 1 G0001T001
chr1 180 189 1 G0002T001
chr1 50 54 2 G0003T001
chr1 57 61 1 G0003T001
chr1 65 68 1 G0004T002
chr1 71 71 2 G0004T002
chr1 74 76 3 G0004T002
chr1 65 68 1 G0004T001
chr1 71 71 2 G0004T001
chr1 74 76 3 G0004T001
chr1 33 35 2 G0005T001
chr1 42 47 1 G0005T001
chr1 22 25 3 G0006T001
chr1 28 30 2 G0006T001
chr1 33 35 1 G0006T001
chr1 28 30 2 G0006T002
chr1 33 35 1 G0006T002
chr1 107 116 1 G0007T001
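The outputs above show that numbering follows transcript orientation: on the minus strand the exon with the highest coordinates is exon 1. The awk sketch below reproduces that rule on a few exons copied from the example. It is an illustration of the rule only, not gtftk's implementation; the function name number_exons and the four-column input layout are assumptions made for this sketch.

```shell
# Illustrative sketch: number exons 5' to 3' for each transcript.
# stdin: tab-separated transcript_id, strand, start, end (one exon per line).
number_exons() {
    sort -t "$(printf '\t')" -k1,1 -k3,3n |
    awk -F '\t' -v OFS='\t' '
        {
            n[$1]++                  # exons seen so far for this transcript
            line[NR] = $0
            tx[NR] = $1
            strand[NR] = $2
            rank[NR] = n[$1]         # rank by increasing start coordinate
        }
        END {
            for (i = 1; i <= NR; i++) {
                # Plus strand: leftmost exon is 1. Minus strand: rightmost exon is 1.
                nb = (strand[i] == "-") ? n[tx[i]] - rank[i] + 1 : rank[i]
                print line[i], nb
            }
        }'
}

# Exons of G0003T001 (minus strand) and G0004T002 (plus strand) from the example.
printf 'G0003T001\t-\t57\t61\nG0003T001\t-\t50\t54\nG0004T002\t+\t65\t68\nG0004T002\t+\t71\t71\nG0004T002\t+\t74\t76\n' |
    number_exons
```

The last column matches the exon_nbr values listed above: exon 50-54 of G0003T001 gets 2 and exon 57-61 gets 1, while the exons of G0004T002 are numbered 1 to 3 by increasing start.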
Arguments:
$ gtftk add_exon_nb -h
Usage: gtftk add_exon_nb [-i GTF] [-o GTF] [-k exon_numbering_key] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Add exon number transcript-wise (based on 5' to 3' orientation).
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --exon-numbering-key The name of the key containing the exon numbering. (default: exon_nbr)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
count
Description: Count the number of features (transcripts, genes, exons, introns).
Example:
$ gtftk get_example -f gtf | gtftk count -t example_gtf
gene 10 example_gtf
transcript 15 example_gtf
exon 25 example_gtf
CDS 20 example_gtf
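For readers who want to see what such a count amounts to, the sketch below tallies the third GTF column (the feature type) with awk. It is not gtftk's code: the function name count_features is hypothetical, and the output is sorted alphabetically rather than in the order shown above.

```shell
# Illustrative sketch: count features by type (third tab-separated GTF column).
count_features() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }            # skip comment/header lines
        { count[$3]++ }
        END { for (f in count) print f, count[f] }' | sort
}

# Three records taken from the "simple" dataset (9 tab-separated fields each).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 gtftk gene 125 138 . + . 'gene_id "G0001";' \
    chr1 gtftk transcript 125 138 . + . 'gene_id "G0001"; transcript_id "G0001T002";' \
    chr1 gtftk transcript 125 138 . + . 'gene_id "G0001"; transcript_id "G0001T001";' |
    count_features
```

On these three records the sketch reports one gene and two transcripts.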
Arguments:
$ gtftk count -h
Usage: gtftk count [-i GTF] [-o TXT] [-d header] [-t TEXT] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Count the number of each features in the gtf file.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-d, --header A comma-separated list of string to use as header. (default: None)
-t, --additional-text A facultative text to be printed in the third column (e.g species name). (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
count_key_values
Description: Count the number of values for a set of keys.
Example: Count the number of times gene_id and transcript_id appear in the GTF file.
$ gtftk get_example | gtftk count_key_values -k gene_id,transcript_id
gene_id 70
transcript_id 70
Example: Count the number of non-redundant entries for chromosomes and transcript_id.
$ gtftk get_example | gtftk count_key_values -k chrom,transcript_id -u
chrom 1
transcript_id 16
Arguments:
$ gtftk count_key_values -h
Usage: gtftk count_key_values [-i GTF] [-o TXT] [-k keys] [-t TEXT] [-u] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Count the number of values/levels for a set of keys.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --keys The set of keys of interest. (default: *)
-t, --additional-text A facultative text to be printed in the third column (e.g species name). (default: None)
-u, --uniq Ask for the count of non redondant values. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
get_attr_list
Description: Get the list of attributes from a GTF file.
Example: Get the list of attributes in the “simple” dataset.
$ gtftk get_example | gtftk get_attr_list
gene_id
transcript_id
exon_id
ccds_id
Arguments:
$ gtftk get_attr_list -h
Usage: gtftk get_attr_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the list of attributes from a GTF file.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --separator The separator to be used for separating key names. (default: )
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
get_attr_value_list
Description: Get the list of values observed for an attribute.
Example: Get the list of values observed for transcript_id.
$ gtftk get_example | gtftk get_attr_value_list -k transcript_id
G0001T001
G0001T002
G0002T001
G0003T001
G0004T001
G0004T002
G0005T001
G0006T001
G0006T002
G0007T001
G0007T002
G0008T001
G0009T001
G0009T002
G0010T001
Example: Get the number of times each gene_id is used.
$ gtftk get_example | gtftk get_attr_value_list -k gene_id -c -s ';'
G0001;7
G0002;4
G0003;5
G0004;13
G0005;5
G0006;13
G0007;7
G0008;5
G0009;7
G0010;4
Arguments:
$ gtftk get_attr_value_list -h
Usage: gtftk get_attr_value_list [-i GTF] [-o TXT] -k key_name [-s SEP] [-c] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the list of values observed for an attributes.
Version: 2018-02-11
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key-name Key name. (default: None)
-s, --separator The separator to be used for separating key names. (default: )
-c, --count Add the counts for each key (separator will be set to ' ' by default). (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
get_feature_list
Description: Get the list of features enclosed in the GTF.
Example: Get the list of features enclosed in the GTF.
$ gtftk get_example | gtftk get_feature_list
gene
transcript
exon
CDS
Arguments:
$ gtftk get_feature_list -h
Usage: gtftk get_feature_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the list of features enclosed in the GTF.
Version: 2018-02-11
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --separator The separator to be used for separating key names. (default: )
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
nb_exons
Description: Count the number of exons and add it as a novel key/value. Output may also be in text format if requested.
Example:
$ gtftk get_example -f gtf | gtftk nb_exons | head -n 5
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; nb_exons "1";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk CDS 125 130 . + . gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; nb_exons "1";
$ gtftk get_example -f gtf | gtftk nb_exons | gtftk select_by_key -k feature -v transcript | head -n 5
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; nb_exons "1";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; nb_exons "1";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; nb_exons "1";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";
Arguments:
$ gtftk nb_exons -h
Usage: gtftk nb_exons [-i GTF] [-o TXT/GTF] [-f] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Returns the transcript name and number of exons with nb_exons as a novel key for each transcript
feature.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-f, --text-format Return a text format. (default: False)
-a, --key-name The name of the key. (default: nb_exons)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
nb_transcripts
Description: Count the number of transcripts per gene.
Example: Count the number of transcripts per gene.
$ gtftk get_example | gtftk nb_transcripts | gtftk select_by_key -g
chr1 gtftk gene 125 138 . + . gene_id "G0001"; nb_tx "2";
chr1 gtftk gene 180 189 . + . gene_id "G0002"; nb_tx "1";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; nb_tx "1";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; nb_tx "2";
chr1 gtftk gene 33 47 . - . gene_id "G0005"; nb_tx "1";
chr1 gtftk gene 22 35 . - . gene_id "G0006"; nb_tx "2";
chr1 gtftk gene 107 116 . + . gene_id "G0007"; nb_tx "2";
chr1 gtftk gene 210 222 . - . gene_id "G0008"; nb_tx "1";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; nb_tx "2";
chr1 gtftk gene 176 186 . + . gene_id "G0010"; nb_tx "1";
Arguments:
$ gtftk nb_transcripts -h
Usage: gtftk nb_transcripts [-i GTF] [-o GTF/TXT] [-f] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Compute the number of transcript per gene.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-f, --text-format Return a text format. (default: False)
-a, --key-name If gtf format is requested, the name of the key. (default: nb_tx)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
seqid_list
Description: Returns the chromosome list.
Example: Returns the chromosome list.
$ gtftk get_example | gtftk seqid_list
chr1
Arguments:
$ gtftk seqid_list -h
Usage: gtftk seqid_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select the seqid/chromosomes
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --separator The separator to be used for separating key names. (default: )
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
tss_dist
Description: Computes the distance between the TSSs of pairs of transcripts from the same gene. The tss_num_1/tss_num_2 columns contain the TSS numbering (for transcript_id_1 and transcript_id_2 respectively) within each gene. Numbering starts at 1 (most 5’ TSS) and goes up to the number of distinct TSS coordinates. Two or more transcripts have the same tss_num if they share a TSS.
Example: Compute the distance between TSSs of transcript pairs in the mini_real dataset.
$ gtftk get_example -d mini_real | gtftk tss_dist | head -n 10
gene_id transcript_id_1 transcript_id_2 dist tss_num_1 tss_num_2
ENSG00000187608 ENST00000624652 ENST00000379389 12278 2 3
ENSG00000187608 ENST00000624697 ENST00000379389 12285 1 3
ENSG00000187608 ENST00000624697 ENST00000624652 7 1 2
ENSG00000175756 ENST00000338338 ENST00000321751 326 1 3
ENSG00000175756 ENST00000321751 ENST00000338370 12 3 5
ENSG00000175756 ENST00000378853 ENST00000321751 4 2 3
ENSG00000175756 ENST00000321751 ENST00000489799 25 3 6
ENSG00000175756 ENST00000321751 ENST00000496905 10 3 4
ENSG00000175756 ENST00000338338 ENST00000338370 338 1 5
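The numbering rule described above (1 for the most 5' TSS, same number for transcripts sharing a TSS) is a dense ranking of TSS coordinates. The sketch below illustrates it for a single gene. It is not gtftk's implementation: the function name tss_numbering, the two-column input and the transcript names and coordinates are hypothetical, and it assumes that on the minus strand the most 5' TSS is the one with the highest coordinate.

```shell
# Illustrative sketch: dense ranking of TSS coordinates within one gene.
# $1: strand of the gene ("+" or "-"); stdin: transcript_id<TAB>tss_coordinate
tss_numbering() {
    # 5' to 3' means increasing coordinates on "+", decreasing on "-".
    if [ "$1" = "-" ]; then order="-k2,2nr"; else order="-k2,2n"; fi
    sort -t "$(printf '\t')" $order |
    awk -F '\t' -v OFS='\t' '
        NR == 1 || $2 != prev { num++ }   # new coordinate: next TSS number
        { print $1, $2, num; prev = $2 }'
}

# Hypothetical plus-strand gene: txA and txB share a TSS, txC starts 7 bp downstream.
printf 'txC\t107\ntxA\t100\ntxB\t100\n' | tss_numbering +
```

Here txA and txB both receive number 1 and txC receives number 2; the distance between txA and txC is the absolute difference of their TSS coordinates (7).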
Arguments:
$ gtftk tss_dist -h
Usage: gtftk tss_dist [-i GTF] [-o TXT] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Computes the distance between TSSs of pairs of gene transcripts.
Notes:
* The tss_num_1/tss_num_1 columns contains the numbering of TSSs (transcript_id_1 and
transcript_id_2 respectively) for each gene.
* Numering starts from 1 (most 5' TSS) to the number of different TSS coordinates.
* Thus two or more transcripts will have the same tss_num if they share a TSS.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
feature_size
Description: Get the size and limits (start/end) of features enclosed in the GTF. If BED format is requested, returns the limits in BED format and the size as a score. Otherwise, outputs a GTF file with ‘feat_size’ as a new key and the size as its value.
Example: Add transcript size (mature RNA) to the GTF.
$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk select_by_key -k feature -v transcript | head -n 5
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; feat_size "14";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; feat_size "14";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; feat_size "10";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; feat_size "10";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; feat_size "8";
Example: Add transcript size (genomic coverage) to the GTF.
$ gtftk get_example | gtftk feature_size -t transcript | gtftk select_by_key -k feature -v transcript | head -n 5
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; feat_size "14";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; feat_size "14";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; feat_size "10";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; feat_size "12";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; feat_size "12";
Example: Get exon size and limits in BED format.
$ gtftk get_example | gtftk feature_size -t exon -b -n feature,exon_id,gene_id| head -n 5
chr1 124 138 exon|G0001T002E001|G0001|exon 14 +
chr1 124 138 exon|G0001T001E001|G0001|exon 14 +
chr1 179 189 exon|G0002T001E001|G0002|exon 10 +
chr1 49 54 exon|G0003T001E001|G0003|exon 5 -
chr1 56 61 exon|G0003T001E002|G0003|exon 5 -
Arguments:
$ gtftk feature_size -h
Usage: gtftk feature_size [-i GTF] [-o GTF/BED] [-t ft_type] [-n NAME] [-k KEY_NAME] [-s SEP] [-b] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the size and limits (start/end) of features enclosed in the GTF. The feature can be of any
type (as found in the 3rd column of the GTF) or 'mature_rna' to get transcript size (i.e
without introns). If bed format is requested returns the limits in bed format and the size as a
score. Otherwise output GTF file with 'feat_size' as a new key and size as value.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-t, --ft-type A target feature (as found in the 3rd column of the GTF) or 'mature_rna' to get transcript size (without introns). (default: transcript)
-n, --names The key(s) that should be used as name if bed is requested. (default: transcript_id)
-k, --key-name Name for the new key if GTF format is requested. (default: feat_size)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
-b, --bed Output bed format. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
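The arithmetic behind the three examples above can be sketched as follows. This is a minimal illustration under the usual conventions (GTF is 1-based and inclusive, BED is 0-based and half-open); the function names are hypothetical and are not part of pygtftk.

```python
def feature_size(start, end):
    """Size of a GTF feature; GTF coordinates are 1-based and inclusive."""
    return end - start + 1


def mature_rna_size(exons):
    """Transcript size without introns: the sum of its exon sizes."""
    return sum(feature_size(start, end) for start, end in exons)


def to_bed_interval(start, end):
    """Convert a 1-based inclusive GTF interval to a 0-based half-open BED interval."""
    return start - 1, end


# Exons of G0003T001 in GTF coordinates, as implied by the BED example above.
exons = [(50, 54), (57, 61)]
genomic = feature_size(50, 61)
mature = mature_rna_size(exons)
bed = [to_bed_interval(s, e) for s, e in exons]
```

For G0003T001 this gives a genomic size of 12, a mature RNA size of 10, and BED intervals 49-54 and 56-61, in agreement with the outputs shown above.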
Commands from section ‘Editing’¶
add_prefix¶
Description: Add a prefix (or suffix) to the values of an attribute (e.g. gene_id).
Example:
$ gtftk get_example| gtftk add_prefix -k transcript_id -t "novel_"| head -2
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "novel_G0001T002";
$ gtftk get_example| gtftk add_prefix -k transcript_id -t "_novel" -s | head -2
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002_novel";
Arguments:
$ gtftk add_prefix -h
Usage: gtftk add_prefix [-i GTF] [-o GTF] [-k KEY] [-t TEXT] [-s] [-f target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Add a prefix to target values. By default add 'chr' to seqid/chromosome key.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The name of the attribute for which a prefix/suffix is to be added to the corresponding values (e.g, gene_id, transcript_id,...). (default: chrom)
-t, --text The character string to add as a prefix to the values. (default: chr)
-s, --suffix Add the text as a suffix rather than as a prefix. (default: False)
-f, --target-feature The name of the target feature. (default: *)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
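The effect of -t and -s on one record can be sketched in a few lines of Python. The function name `add_affix` and the dictionary representation of the attributes are assumptions made for the illustration, not the pygtftk API.

```python
def add_affix(attributes, key, text, suffix=False):
    """Return a copy of `attributes` with `text` added before (or after) the value of `key`."""
    updated = dict(attributes)
    if key in updated:
        updated[key] = updated[key] + text if suffix else text + updated[key]
    return updated


record = {"gene_id": "G0001", "transcript_id": "G0001T002"}
prefixed = add_affix(record, "transcript_id", "novel_")
suffixed = add_affix(record, "transcript_id", "_novel", suffix=True)
```

Records that lack the key (such as gene lines without a transcript_id) are returned unchanged.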
del_attr¶
Description: Delete an attribute and its corresponding values.
Example:
$ gtftk get_example | gtftk del_attr -k transcript_id,gene_id,exon_id | head -3
chr1 gtftk gene 125 138 . + .
chr1 gtftk transcript 125 138 . + .
chr1 gtftk exon 125 138 . + .
$ gtftk get_example | gtftk del_attr -v -k transcript_id,gene_id | head -3 # delete all but transcript_id and gene_id
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
Arguments:
$ gtftk del_attr -h
Usage: gtftk del_attr [-i GTF] [-o GTF] -k KEY [-r] [-v] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Delete one or several attributes from the gtf file.
Notes:
* You may also use 'complex' regexp such as : "(^.*_id$|^.*_biotype$)"
* Example: gtftk get_example -d mini_real | gtftk del_attr -k "(^.*_id$|^.*_biotype$)" -r -v
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key Comma separated list of attribute names or a regular expression (see -r). (default: None)
-r, --reg-exp The key name is a regular expression. (default: False)
-v, --invert-match Deleted keys are those not matching any of the specified keys. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
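How -k, -r and -v interact can be sketched as below. This is an illustration only: the function name is hypothetical, the attributes are held in a plain dictionary, and `re.search` is an assumption about how the pattern is applied.

```python
import re


def delete_attributes(attributes, keys, reg_exp=False, invert_match=False):
    """Drop the attributes named in `keys`.

    `keys` is a comma-separated list of names, or a regular expression when
    `reg_exp` is True. With `invert_match`, the attributes that do NOT match
    are dropped instead.
    """
    if reg_exp:
        pattern = re.compile(keys)
        matched = {name for name in attributes if pattern.search(name)}
    else:
        matched = set(keys.split(",")) & set(attributes)
    return {k: v for k, v in attributes.items() if (k in matched) == invert_match}


# Hypothetical attribute set for one record.
record = {
    "gene_id": "G1",
    "gene_biotype": "protein_coding",
    "gene_name": "BCL7B",
    "exon_number": "1",
}
kept = delete_attributes(record, r"(^.*_id$|^.*_biotype$)", reg_exp=True, invert_match=True)
dropped = delete_attributes(record, "gene_id,gene_name")
```

With the regular expression from the notes and -v, only the `*_id` and `*_biotype` attributes survive.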
join_attr¶
Description: Add attributes from a file. This command can be used to import additional key/values into the GTF (e.g. CPAT for coding potential, DESeq for differential analysis, …). The imported file can be in one of two formats (2 columns or matrix):
- With a 2-column file:
  - first column: the value used for joining (transcript_id, gene_id, …).
  - second column: the corresponding value.
- With a matrix (see -m):
  - rows correspond to joining keys (transcript_id, gene_id, …).
  - columns correspond to the names of the novel attributes.
  - each cell of the matrix is a value for the corresponding attribute.
Example: With a 2-columns file.
$ gtftk get_example -f join > simple_join.txt
$ cat simple_join.txt
G0003 0.2322
G0004 0.999
G0009 0.5555
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; a_score "0.2322";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; a_score "0.999";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; a_score "0.5555";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Example: With a matrix
$ gtftk get_example -f join_mat > simple_join_mat.txt
$ cat simple_join_mat.txt
genes S1 S2
G0003 0.2322 0.4
G0004 0.999 0.6
G0009 0.5555 0.7
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join_mat.txt -m -t gene| gtftk select_by_key -k feature -v gene
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S1 "0.2322"; S2 "0.4";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S1 "0.999"; S2 "0.6";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Arguments:
$ gtftk join_attr -h
Usage: gtftk join_attr [-i GTF] [-o GTF] -k KEY -j JOIN_FILE [-H] [-m] [-n NEW_KEY] [-t target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Join attributes from a tabulated file.
Version: 2018-02-05
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key-to-join The name of the key used to join (e.g transcript_id). (default: None)
-j, --join-file A two columns file with (i) the value for joining (e.g value for transcript_id), (ii) the value for novel key (e.g the coding potential computed value). (default: None)
-H, --has-header Indicates that the 'join-file' has a header. (default: False)
-m, --matrix 'join-file' expect a matrix with row names as target keys column names as novel key and each cell as value. (default: False)
-n, --new-key The name of the novel key. Mutually exclusive with --matrix. (default: None)
-t, --target-feature The name(s) of the target feature(s). Comma separated. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
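The two join modes can be sketched in plain Python. The function names and the in-memory layout (a list of attribute dictionaries, a lookup table, a header plus rows) are assumptions chosen for the illustration and do not reflect the pygtftk internals.

```python
def join_two_columns(records, key, table, new_key):
    """Attach `new_key` to each record whose `key` value appears in a {value: score} table."""
    joined = []
    for rec in records:
        rec = dict(rec)
        if rec.get(key) in table:
            rec[new_key] = table[rec[key]]
        joined.append(rec)
    return joined


def join_matrix(records, key, header, rows):
    """Attach one attribute per matrix column.

    `header` holds the column names that follow the row-name column and
    `rows` maps each row name to its list of cell values.
    """
    joined = []
    for rec in records:
        rec = dict(rec)
        values = rows.get(rec.get(key))
        if values is not None:
            rec.update(zip(header, values))
        joined.append(rec)
    return joined


genes = [{"gene_id": "G0002"}, {"gene_id": "G0003"}]
two_col = join_two_columns(genes, "gene_id", {"G0003": "0.2322"}, "a_score")
matrix = join_matrix(genes, "gene_id", ["S1", "S2"], {"G0003": ["0.2322", "0.4"]})
```

Records without a matching row (G0002 here) are left untouched, as in the outputs above.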
join_multi_file¶
Description: Join attributes from multiple files.
Example: Add key/value to gene feature.
$ gtftk get_example | gtftk join_multi_file -k gene_id -t gene simple.join_mat_2 simple.join_mat_3| gtftk select_by_key -g
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S3 "A"; S4 "B"; S5 "0.2322"; S6 "0.4";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S3 "C"; S4 "D"; S5 "0.999|0.999"; S6 "0.6|0.6";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S3 "E"; S4 "F"; S5 "0.5555|20"; S6 "0.7|30";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Arguments:
$ gtftk join_multi_file -h
Usage: gtftk join_multi_file [-i GTF] [-o GTF] -k KEY [-t target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L] matrice_files [matrice_files ...]
Description:
* Join attributes from multiple files.
Version: 2018-02-05
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key-to-join The name of the key used to join (e.g transcript_id). (default: None)
-t, --target-feature The name(s) of the target feature(s). Comma separated. (default: None)
matrice_files A set of matrix files with row names as target keys, column names as novel keys and each cell as value.
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
merge_attr¶
Description: Merge a set of attributes into a destination attribute.
Example: Merge gene_id and transcript_id into a new key associated to transcript features.
$ gtftk get_example | gtftk merge_attr -k transcript_id,gene_id -d txgn_id -s "|" -f transcript | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; txgn_id "G0001T002|G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; txgn_id "G0001T001|G0001";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; txgn_id "G0002T001|G0002";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; txgn_id "G0003T001|G0003";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; txgn_id "G0004T002|G0004";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; txgn_id "G0004T001|G0004";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; txgn_id "G0005T001|G0005";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; txgn_id "G0006T001|G0006";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; txgn_id "G0006T002|G0006";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; txgn_id "G0007T001|G0007";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; txgn_id "G0007T002|G0007";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; txgn_id "G0008T001|G0008";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; txgn_id "G0009T002|G0009";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; txgn_id "G0009T001|G0009";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; txgn_id "G0010T001|G0010";
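The merge itself amounts to joining the source values with the separator, in the order given to -k. A minimal sketch, assuming a dictionary of attributes in which every source key is present (the function name is hypothetical):

```python
def merge_attributes(attributes, src_keys, dest_key, separator="|"):
    """Concatenate the values of `src_keys` into `dest_key`, in the order given."""
    merged = dict(attributes)
    # Assumes every source key exists in the record; missing keys are not handled here.
    merged[dest_key] = separator.join(attributes[k] for k in src_keys)
    return merged


record = {"gene_id": "G0001", "transcript_id": "G0001T002"}
merged = merge_attributes(record, ["transcript_id", "gene_id"], "txgn_id")
```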
Example: Merge gene_id and transcript_id into an existing key (transcript_id) associated to transcript features.
$ gtftk get_example | gtftk merge_attr -k transcript_id,gene_id -d transcript_id -s "|" -f transcript | gtftk tabulate -k feature,transcript_id -x | head -n 6
feature transcript_id
gene ?
transcript G0001T002|G0001
exon G0001T002
CDS G0001T002
transcript G0001T001|G0001
Example: Merge gene_id and transcript_id into an existing key (transcript_id) associated to all features.
$ gtftk get_example | gtftk merge_attr -k transcript_id,gene_id -d transcript_id -s "|" -f "*" | gtftk tabulate -k feature,transcript_id -x | head -n 6
Traceback (most recent call last):
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/bin/gtftk", line 4, in <module>
__import__('pkg_resources').run_script('pygtftk==0.9.6', 'gtftk')
File "/Users/puthier/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 661, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/Users/puthier/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1441, in run_script
exec(code, namespace, namespace)
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/EGG-INFO/scripts/gtftk", line 83, in <module>
args = main()
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/EGG-INFO/scripts/gtftk", line 68, in main
CmdManager.run(args)
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/pygtftk/cmd_manager.py", line 987, in run
fun(**args)
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/pygtftk/plugins/merge_attr.py", line 92, in merge_attr
separator).write(outputfile,
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/pygtftk/gtf_interface.py", line 538, in merge_attr
self = self.add_attr_column(tmp_file, new_key)
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pygtftk-0.9.6-py3.6-macosx-10.9-x86_64.egg/pygtftk/gtf_interface.py", line 3369, in add_attr_column
tmp = input_file.readlines()
File "/Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/tempfile.py", line 485, in func_wrapper
return func(*args, **kwargs)
io.UnsupportedOperation: not readable
feature transcript_id
discretize_key¶
Description: Create a new key by discretizing a numeric key. This can be helpful to create new classes on the fly that can be used subsequently. The default is to create equally spaced intervals. The intervals can also be created by computing percentiles (-p).
Example: Suppose the matrix simple.join_mat gives the expression level of genes (rows) in samples (columns). We can join this information to the GTF and then transform key S1 into a new discretized key S1_d. Particular labels can be applied to this factor using -l.
$ gtftk get_example | gtftk join_attr -j simple.join_mat -k gene_id -m | gtftk discretize_key -k S1 -d S1_d -n 2 | gtftk select_by_key -k feature -v gene
|-- 17:43-INFO-discretize_key : Categories: ['(0.231_0.616]', '(0.616_0.999]']
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S1 "0.2322"; S2 "0.4"; S1_d "(0.231_0.616]";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S1 "0.999"; S2 "0.6"; S1_d "(0.616_0.999]";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7"; S1_d "(0.231_0.616]";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
$ gtftk get_example | gtftk join_attr -j simple.join_mat -k gene_id -m | gtftk discretize_key -k S1 -d S1_d -n 2 -l A,B | gtftk select_by_key -k feature -v gene
|-- 17:43-INFO-discretize_key : Categories: ['A', 'B']
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S1 "0.2322"; S2 "0.4"; S1_d "A";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S1 "0.999"; S2 "0.6"; S1_d "B";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7"; S1_d "A";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Arguments:
$ gtftk discretize_key -h
Usage: gtftk discretize_key [-i GTF] [-o GTF] -k src_key -d dest_key -n KEY [-l labels] [-p] [-g] [-u] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Create a new key by discretizing a numeric key. This can be helpful to create new classes on the
fly that can be used subsequently.
Notes:
* if ---ft-type is not set the destination key will be assigned to all feature containing the
source key.
* Non-numeric value for source key will be translated into 'NA'.
* The default is to create equally spaced interval. The interval can also be created by
computing the percentiles (-p).
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --src-key The name of the source key (default: None)
-d, --dest-key The name of the target key. (default: None)
-n, --nb-levels The number of levels/classes to create. (default: 2)
-l, --labels A comma separated list of labels of size --nb-levels. (default: None)
-p, --percentiles Compute --nb-levels classes using percentiles. (default: False)
-g, --log Compute breaks based on log-scale. (default: False)
-u, --percentiles-of-uniq Compute percentiles based on non-redundant values. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
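The default behaviour (equally spaced intervals) can be sketched with the standard library only. This is an illustration rather than the actual implementation: the function name is hypothetical and the handling of values that fall exactly on an inner break is simplified (the tool reports right-closed intervals such as (0.231_0.616]).

```python
def equal_width_classes(values, nb_levels, labels):
    """Assign each value to one of `nb_levels` equally spaced intervals between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / nb_levels
    classes = []
    for v in values:
        # The maximum falls on the upper edge, so clamp it into the last interval.
        index = min(int((v - low) / width), nb_levels - 1) if width else 0
        classes.append(labels[index])
    return classes


# S1 values of G0003, G0004 and G0009 from the example above.
s1 = [0.2322, 0.999, 0.5555]
s1_d = equal_width_classes(s1, 2, ["A", "B"])
```

With two levels the break sits near 0.616, so G0003 and G0009 fall in class A and G0004 in class B, matching the labelled output above.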
Commands from section ‘selection’¶
select_by_key¶
Description: Extract lines from the gtf based on key and values.
Example: Select some features (genes) then some gene_id.
$ gtftk get_example |gtftk select_by_key -k feature -v gene | gtftk select_by_key -k gene_id -v G0002,G0003,G0004
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003";
chr1 gtftk gene 65 76 . + . gene_id "G0004";
Example: Select gene list in column 1 of file simple_join.txt.
$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id | gtftk tabulate -k gene_id -Hun
G0003
G0004
G0009
Example: Select the gene list enclosed in column 1 of file simple_join.txt. Ask for bed format.
$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id -b
chr1 49 61 G0003|? . -
chr1 49 61 G0003|G0003T001 . -
chr1 49 54 G0003|G0003T001 . -
chr1 56 61 G0003|G0003T001 . -
chr1 49 52 G0003|G0003T001 . -
chr1 64 76 G0004|? . +
chr1 64 76 G0004|G0004T002 . +
chr1 64 68 G0004|G0004T002 . +
chr1 70 71 G0004|G0004T002 . +
chr1 73 76 G0004|G0004T002 . +
chr1 65 68 G0004|G0004T002 . +
chr1 70 71 G0004|G0004T002 . +
chr1 73 75 G0004|G0004T002 . +
chr1 64 76 G0004|G0004T001 . +
chr1 64 68 G0004|G0004T001 . +
chr1 70 71 G0004|G0004T001 . +
chr1 73 76 G0004|G0004T001 . +
chr1 64 67 G0004|G0004T001 . +
chr1 2 14 G0009|? . -
chr1 2 14 G0009|G0009T002 . -
chr1 2 14 G0009|G0009T002 . -
chr1 4 10 G0009|G0009T002 . -
chr1 2 14 G0009|G0009T001 . -
chr1 2 14 G0009|G0009T001 . -
chr1 2 8 G0009|G0009T001 . -
Example: Select all but genes in column 1 of file simple_join.txt.
$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id -n | gtftk tabulate -k gene_id -Hun
G0001
G0002
G0005
G0006
G0007
G0008
G0010
Arguments:
$ gtftk select_by_key -h
Usage: gtftk select_by_key [-i GTF] [-o GTF] [-k KEY] [-v VALUE] [-f FILE] [-c COL] [-n] [-b] [-m NAME] [-s SEP] [-l] [-t] [-g] [-e] [-d] [-a] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select lines from a GTF file based on attributes and associated values.
Version: 2018-01-31
optional arguments:
-v, --value A comma separated list of values. (default: None)
-f, --file-with-values A file containing values as a single column. (default: None)
-t, --select-transcripts A shortcut for "-k feature -v transcript". (default: False)
-g, --select-genes A shortcut for "-k feature -v gene". (default: False)
-e, --select-exons A shortcut for "-k feature -v exon". (default: False)
-d, --select-cds A shortcut for "-k feature -v CDS". (default: False)
-a, --select-start-codon A shortcut for "-k feature -v start_codon". (default: False)
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The key name. (default: None)
-c, --col The column number (one-based) that contains the values in the file. File is tab-delimited. (default: 1)
-n, --invert-match Not/invert match. Selected lines whose requested key is not associated with the requested value. (default: False)
-b, --bed-format Ask for bed format output. (default: False)
-m, --names If Bed output. The key(s) that should be used as name. (default: gene_id,transcript_id)
-s, --separator If Bed output. The separator to be used for separating name elements (see -m). (default: |)
-l, --log Print some statistics about selected features. To be used in conjunction with -V 1/2. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
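The selection logic, including -n, reduces to a membership test on one key. A minimal sketch over attribute dictionaries (the function name and record layout are assumptions for illustration):

```python
def select_by_value(records, key, values, invert_match=False):
    """Keep records whose `key` is (or, with `invert_match`, is not) one of `values`."""
    wanted = set(values)
    return [rec for rec in records if (rec.get(key) in wanted) != invert_match]


genes = [{"gene_id": g} for g in ("G0001", "G0002", "G0003", "G0004")]
selected = select_by_value(genes, "gene_id", ["G0002", "G0003"])
others = select_by_value(genes, "gene_id", ["G0002", "G0003"], invert_match=True)
```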
select_by_regexp¶
Description: Select lines by testing the values of a particular key against a regular expression.
Example: Select lines corresponding to gene_names matching the regular expression ‘BCL.*’.
$ gtftk get_example -d mini_real | gtftk select_by_regexp -k gene_name -r "BCL.*" | gtftk tabulate -Hun -k gene_name
BCL2L2-PABPN1
BCL7B
Arguments:
$ gtftk select_by_regexp -h
Usage: gtftk select_by_regexp [-i GTF] [-o GTF] -k KEY -r regexp [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select lines from a GTF file based on a regexp.
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The key name (default: chrom)
-r, --regexp The regular expression. (default: ^chr[0-9XY]+$)
-n, --invert-match Not/invert match. Selected lines whose requested key do not match the regexp. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
select_by_intron_size¶
Description: Delete genes containing an intron whose size is below s. If -m is selected, any gene whose sum of intronic region length is above s is deleted. Monoexonic genes are kept.
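The rule stated in the description can be sketched for a single exon list. This is only an illustration: the function names are hypothetical, the threshold `size` plays the role of s, and the sketch works on one transcript's exons rather than on whole genes.

```python
def intron_sizes(exons):
    """Sizes of the gaps between consecutive exons (1-based inclusive coordinates)."""
    ordered = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


def keep_exon_set(exons, size, merged=False):
    """Apply the rule described above to one exon list.

    Monoexonic inputs are always kept. Otherwise the input is rejected when an
    intron is shorter than `size` or, with `merged`, when the summed intron
    length exceeds `size`.
    """
    introns = intron_sizes(exons)
    if not introns:
        return True
    if merged:
        return sum(introns) <= size
    return min(introns) >= size


# Two exons separated by a 2-nucleotide intron (positions 55-56).
exons = [(50, 54), (57, 61)]
introns = intron_sizes(exons)
```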
select_by_max_exon_nb¶
Description: For each gene select the transcript with the highest number of exons.
Example: For each gene, select the transcript with the highest number of exons.
$ gtftk get_example | gtftk select_by_max_exon_nb | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001";
Arguments:
$ gtftk select_by_max_exon_nb -h
Usage: gtftk select_by_max_exon_nb [-i GTF] [-o GTF] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* For each gene select the transcript with the highest number of exons.
Version: 2018-02-11
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
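The per-gene selection can be sketched as a single pass over exon counts. The function name, the input layout and the exon counts are hypothetical, and keeping the first transcript seen on ties is an assumption of this sketch, not a documented behaviour.

```python
def select_max_exon_transcripts(exon_counts):
    """For each gene, pick the transcript with the most exons.

    `exon_counts` maps (gene_id, transcript_id) to a number of exons.
    On ties the first transcript seen is kept (an assumption of this sketch).
    """
    best = {}
    for (gene, tx), n in exon_counts.items():
        if gene not in best or n > best[gene][1]:
            best[gene] = (tx, n)
    return {gene: tx for gene, (tx, _) in best.items()}


# Hypothetical exon counts.
counts = {
    ("G0004", "G0004T002"): 3,
    ("G0004", "G0004T001"): 3,
    ("G0006", "G0006T002"): 2,
    ("G0006", "G0006T001"): 3,
}
chosen = select_max_exon_transcripts(counts)
```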
select_by_loc¶
Description: Select transcripts/genes overlapping one or more given locations. A transcript is defined here as the genomic region from TSS to TTS, including introns. This function returns the transcript and all its associated elements (exons, UTR, …) even if only a fraction (e.g. an intron) of the transcript overlaps the location. If --ft-type is set to ‘gene’, the gene and all its associated elements are returned.
Example: Select transcripts at a given location.
$ gtftk get_example | gtftk select_by_key -k feature -v transcript | gtftk select_by_loc -l chr1:10-15
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
Arguments:
$ gtftk select_by_loc -h
Usage: gtftk select_by_loc [-i GTF] [-o GTF] (-l LOC | -f BEDFILE) [-t {transcript,gene}] [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select transcripts/gene overlapping a given locations.
Notes:
* A transcript is defined here as the genomic region from TSS to TTS including introns.
* This function will return the transcript and all its associated elements (exons, utr,...)
even if only a fraction (e.g intron) of the transcript is overlapping the feature.
* If -/-ft-type is set to 'gene' returns the gene and all its associated elements.
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-l, --location List of chromosomal locations (chr:start-end[,chr:start-end]). 0-based (default: None)
-f, --location-file Bed file with chromosomal location. (default: None)
-t, --ft-type The feature of interest. (default: transcript)
-n, --invert-match Not/invert match. Select transcript not overlapping. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
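The overlap rule described for select_by_loc can be sketched in Python as follows. This is only an illustration, not gtftk's implementation: the function name overlaps is hypothetical, and reading the 0-based location as half-open (BED-like) is an assumption, since the help text only says "0-based".

```python
def overlaps(loc_start, loc_end, gtf_start, gtf_end):
    """True if a 0-based, half-open location overlaps a 1-based, inclusive GTF feature."""
    # Convert the GTF interval to 0-based half-open: [start - 1, end).
    feat_start, feat_end = gtf_start - 1, gtf_end
    return loc_start < feat_end and feat_start < loc_end


# Transcript coordinates (1-based, inclusive) taken from the example GTF.
transcripts = {
    "G0009T001": (3, 14),
    "G0006T001": (22, 35),
    "G0005T001": (33, 47),
}

# Location chr1:10-15, as in the example above.
selected = [tx for tx, (s, e) in transcripts.items() if overlaps(10, 15, s, e)]
print(selected)
```

Because the whole TSS-to-TTS span is tested, a location falling in an intron still selects the transcript.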
select_by_nb_exon
Description: Select transcripts based on the number of exons.
Example:
$ gtftk get_example | gtftk select_by_nb_exon -m 2 | gtftk nb_exons| gtftk select_by_key -t
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; nb_exons "3";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; nb_exons "2";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; nb_exons "3";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; nb_exons "2";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; nb_exons "2";
Arguments:
$ gtftk select_by_nb_exon -h
Usage: gtftk select_by_nb_exon [-i GTF] [-o GTF] [-m min_exon_number] [-M max_exon_number] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select transcripts based on the number of exons.
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --min-exon-number Minimum number of exons. (default: 0)
-M, --max-exon-number Maximum number of exons. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
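A minimal Python sketch of the filtering idea behind select_by_nb_exon, assuming exons are counted per transcript_id and that both bounds are inclusive (the example confirms this for the minimum; it is an assumption for the maximum). The function below is hypothetical and is not part of gtftk.

```python
from collections import Counter


def select_by_nb_exon(exon_tx_ids, min_exons=0, max_exons=None):
    """Return transcript IDs whose exon count lies in [min_exons, max_exons]."""
    counts = Counter(exon_tx_ids)
    return [
        tx for tx, n in counts.items()
        if n >= min_exons and (max_exons is None or n <= max_exons)
    ]


# One transcript_id per exon line. Counts follow the nb_exons values shown above;
# G0002T001 is absent from that output, so it is given a single exon here.
exon_lines = ["G0003T001"] * 2 + ["G0004T001"] * 3 + ["G0006T002"] * 2 + ["G0002T001"]
kept = select_by_nb_exon(exon_lines, min_exons=2)
print(kept)
```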
select_by_numeric_value
Description: Select lines from a GTF file based on a boolean test on numeric values.
Example:
$ gtftk join_attr -i simple.gtf -j simple.join_mat -k gene_id -m| gtftk select_by_numeric_value -t 'start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7' -n ".,?"
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; S1 "0.5555"; S2 "0.7";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; S1 "0.5555"; S2 "0.7";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; S1 "0.5555"; S2 "0.7";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; S1 "0.5555"; S2 "0.7";
Arguments:
$ gtftk select_by_numeric_value -h
Usage: gtftk select_by_numeric_value [-i GTF] [-o GTF] -t test [-n na_omit] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select lines from a GTF file based on a boolean test on numeric values.
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-t, --test The test to be applied. (default: None)
-n, --na-omit If one of the evaluated values is enclosed in this list (csv), line is skipped. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
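The idea of a boolean test on numeric values, combined with an NA list, can be illustrated with the Python sketch below. It is an assumption-laden illustration rather than gtftk's code: passes_test is a hypothetical helper, and it relies on Python's eval with builtins disabled, which is acceptable for a demonstration but not for untrusted input.

```python
def passes_test(record, test, na_values=()):
    """Evaluate a boolean test on one record; skip it if a needed value is NA."""
    code = compile(test, "<test>", "eval")
    # co_names lists the identifiers used in the expression (start, end, S1, ...).
    # A name missing from the record raises KeyError in this sketch.
    names = code.co_names
    if any(record[name] in na_values for name in names):
        return False
    values = {name: float(record[name]) for name in names}
    # Builtins are disabled so that only the record's values are visible.
    return bool(eval(code, {"__builtins__": {}}, values))


TEST = "start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7"
NA = (".", "?")

g0009 = {"start": "3", "end": "14", "S1": "0.5555", "S2": "0.7"}
unset = {"start": "3", "end": "14", "S1": "?", "S2": "0.7"}

print(passes_test(g0009, TEST, NA), passes_test(unset, TEST, NA))
```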
random_list
Description: Select a random list of genes or transcripts.
Example: Randomly select 3 transcripts.
$ gtftk get_example | gtftk random_list -n 3| gtftk count
transcript 3
exon 6
CDS 5
Arguments:
$ gtftk random_list -h
Usage: gtftk random_list [-i GTF] [-o GTF] [-n NUMBER] [-t {gene,transcript}] [-s SEED] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select a random list of genes or transcripts. Note that if transcripts are requested the 'gene'
feature is not returned.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-n, --number The number of transcripts or gene to select. (default: 1)
-t, --ft-type The type of feature. (default: transcript)
-s, --seed-value Seed value for the random number generator. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
random_tx
Description: Randomly select up to m transcripts for each gene.
Example: Randomly select 1 transcript per gene (-m 1).
$ gtftk get_example | gtftk random_tx -m 1| gtftk select_by_key -k feature -v gene,transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T001
G0002 G0002T001
G0003 G0003T001
G0004 G0004T002
G0005 G0005T001
G0006 G0006T001
G0007 G0007T001
G0008 G0008T001
G0009 G0009T001
G0010 G0010T001
Arguments:
$ gtftk random_tx -h
Usage: gtftk random_tx [-i GTF] [-o GTF] [-m MAX] [-s SEED] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select randomly up to m transcript for each gene.
Version: 2018-01-30
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --max-transcript The maximum number of transcripts to select for each gene. (default: 1)
-s, --seed-value Seed value for the random number generator. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
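The per-gene sampling performed by random_tx can be sketched as below. This is a hedged illustration only: the function and its argument names are hypothetical, and Python's random module is used in place of whatever generator gtftk relies on, so a given seed will not reproduce gtftk's selection.

```python
import random
from collections import defaultdict


def random_tx(pairs, max_tx=1, seed=None):
    """Pick up to max_tx transcripts per gene from (gene_id, transcript_id) pairs."""
    rng = random.Random(seed)
    by_gene = defaultdict(list)
    for gene_id, tx_id in pairs:
        by_gene[gene_id].append(tx_id)
    # Genes with fewer than max_tx transcripts keep all of them.
    return {
        gene: rng.sample(txs, min(max_tx, len(txs)))
        for gene, txs in by_gene.items()
    }


pairs = [
    ("G0001", "G0001T002"), ("G0001", "G0001T001"),
    ("G0002", "G0002T001"),
    ("G0004", "G0004T002"), ("G0004", "G0004T001"),
]
picked = random_tx(pairs, max_tx=1, seed=123)
print(picked)
```

Fixing the seed makes the selection reproducible from one run to the next.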
rm_dup_tss
Description: If several transcripts of a gene share the same TSS, select only one of them.
Example: Use rm_dup_tss to select transcripts that will be used for mk_matrix -k 5 (see later).
$ gtftk get_example | gtftk rm_dup_tss| gtftk select_by_key -k feature -v transcript
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001";
Arguments:
$ gtftk rm_dup_tss -h
Usage: gtftk rm_dup_tss [-i GTF] [-o GTF] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* If several transcripts of a gene share the same TSS, select one transcript per TSS.
Notes:
* The alphanumeric order of transcript_id is used to select the representative of a TSS.
Version: 2018-01-20
Argument:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
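The selection rule stated in the notes (one transcript per TSS, chosen by alphanumeric order of transcript_id) can be sketched in Python as follows. The helper names are hypothetical, and the TSS is assumed to be the start coordinate on the + strand and the end coordinate on the - strand.

```python
def tss(start, end, strand):
    """TSS of a feature: its start on '+', its end on '-'."""
    return start if strand == "+" else end


def rm_dup_tss(transcripts):
    """transcripts: list of (gene_id, transcript_id, start, end, strand)."""
    kept = {}
    for gene, tx, start, end, strand in transcripts:
        key = (gene, tss(start, end, strand))
        # Keep the alphanumerically smallest transcript_id for each (gene, TSS).
        if key not in kept or tx < kept[key]:
            kept[key] = tx
    return sorted(kept.values())


# Coordinates taken from the example GTF.
example = [
    ("G0001", "G0001T002", 125, 138, "+"),
    ("G0001", "G0001T001", 125, 138, "+"),
    ("G0002", "G0002T001", 180, 189, "+"),
    ("G0006", "G0006T001", 22, 35, "-"),
    ("G0006", "G0006T002", 28, 35, "-"),
]
print(rm_dup_tss(example))
```

G0006T001 and G0006T002 start at different positions but share the same TSS (35) because they lie on the - strand, so only G0006T001 is kept.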
select_by_go
Description: Select genes from a GTF file using a Gene Ontology ID (e.g. GO:0050789).
Example: Select genes with transcription factor activity from the GTF (no -g is given, so the default GO ID, GO:0003700, is used). They can subsequently be used to test their epigenetic features (see later).
$ gtftk get_example -d mini_real -f gtf| gtftk select_by_go -s hsapiens | gtftk select_by_key -k feature -v gene | gtftk tabulate -k gene_id,gene_name -Hun | head -6
ENSG00000142611 PRDM16
ENSG00000069812 HES2
ENSG00000127124 HIVEP3
ENSG00000162367 TAL1
ENSG00000143006 DMRTB1
ENSG00000054267 ARID4B
Arguments:
$ gtftk select_by_go -h
Usage: gtftk select_by_go [-i GTF] [-o GTF] [-g go_id] (-l | -s species) [-n] [-p1 http_proxy] [-p2 https_proxy] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select lines/genes from a GTF file using a Gene Ontology ID (e.g GO:0097194).
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-g, --go-id The GO ID (with or without "GO:" prefix). (default: GO:0003700)
-l, --list-datasets Do not select lines. Only get a list of available datasets/species. (default: False)
-s, --species The dataset/species. (default: None)
-n, --invert-match Not/invert match. (default: False)
-p1, --http-proxy Use this http proxy (not tested/experimental). (default: )
-p2, --https-proxy Use this https proxy (not tested/experimental). (default: )
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
select_by_tx_size
Description: Select transcripts based on their size (i.e. the size of the mature/spliced transcript).
Example:
$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk select_by_tx_size -m 14 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id transcript_id feat_size
G0001 G0001T002 14
G0001 G0001T001 14
$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk select_by_tx_size -m 11 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id transcript_id feat_size
G0001 G0001T002 14
G0001 G0001T001 14
G0009 G0009T002 12
G0009 G0009T001 12
G0010 G0010T001 11
$ gtftk get_example -d mini_real | gtftk feature_size -t mature_rna | gtftk select_by_tx_size -m 8000 -M 1000000000 | gtftk tabulate -n -k gene_id,transcript_id,feat_size -H | sort -k3,3n | tail -n 10
ENSG00000215182 ENST00000621226 17448
ENSG00000127603 ENST00000361689 17538
ENSG00000127603 ENST00000289893 19141
ENSG00000173517 ENST00000560626 19217
ENSG00000229807 ENST00000429829 19275
ENSG00000129682 ENST00000315930 19519
ENSG00000280383 ENST00000623075 23112
ENSG00000127603 ENST00000372915 23440
ENSG00000127603 ENST00000567887 24319
ENSG00000127603 ENST00000564288 24828
Arguments:
$ gtftk select_by_tx_size -h
Usage: gtftk select_by_tx_size [-i GTF] [-o GTF] [-m min_size] [-M max_size] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select transcript based on their size (i.e size of mature/spliced transcript).
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --min-size Minimum size. (default: 0)
-M, --max-size Maximum size. (default: 1000000000)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
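As a rough Python sketch of what "size of the mature/spliced transcript" means, the code below sums exon lengths (1-based, inclusive coordinates) and filters on the result. The function names are hypothetical, TX_demo is an invented two-exon transcript added for illustration, and treating the maximum as inclusive is an assumption (the examples only confirm it for the minimum).

```python
def mature_size(exons):
    """Sum of exon lengths for 1-based, inclusive (start, end) pairs."""
    return sum(end - start + 1 for start, end in exons)


def select_by_tx_size(tx_exons, min_size=0, max_size=1_000_000_000):
    """Return transcript IDs whose mature size lies in [min_size, max_size]."""
    return [
        tx for tx, exons in tx_exons.items()
        if min_size <= mature_size(exons) <= max_size
    ]


tx_exons = {
    "G0001T001": [(125, 138)],        # single exon from the example GTF, size 14
    "G0009T001": [(3, 14)],           # single exon from the example GTF, size 12
    "TX_demo": [(50, 54), (58, 61)],  # hypothetical: introns are not counted, size 9
}
print(select_by_tx_size(tx_exons, min_size=11))
```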
select_most_5p_tx
Description: Select the most 5’ transcript of each gene.
Example:
$ gtftk get_example | gtftk select_most_5p_tx | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T002
G0002 G0002T001
G0003 G0003T001
G0004 G0004T002
G0005 G0005T001
G0006 G0006T001
G0007 G0007T001
G0008 G0008T001
G0009 G0009T002
G0010 G0010T001
Arguments:
$ gtftk select_most_5p_tx -h
Usage: gtftk select_most_5p_tx [-i GTF] [-o GTF] [-g] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select the most 5' transcript of each gene.
Notes:
* If several transcript share the samemost 5' TSS, only one transcript is selected.
Version: 2018-01-20
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-g, --keep-gene-lines Add gene lines to the output (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
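One way to picture "most 5' transcript" is the Python sketch below: on the + strand the smallest start wins, on the - strand the largest end wins. It is not gtftk's implementation. Keeping the first transcript encountered on a tie is an assumption that happens to agree with the example output above, and GX with its two transcripts is a hypothetical gene added for illustration.

```python
def most_5p_tx(transcripts):
    """transcripts: list of (gene_id, transcript_id, start, end, strand)."""
    best = {}
    for gene, tx, start, end, strand in transcripts:
        # Rank so that a smaller value always means "more 5'".
        rank = start if strand == "+" else -end
        # Strict comparison: on a tie the first transcript seen is kept.
        if gene not in best or rank < best[gene][0]:
            best[gene] = (rank, tx)
    return {gene: tx for gene, (rank, tx) in best.items()}


example = [
    ("G0001", "G0001T002", 125, 138, "+"),
    ("G0001", "G0001T001", 125, 138, "+"),
    ("G0006", "G0006T001", 22, 35, "-"),
    ("G0006", "G0006T002", 28, 35, "-"),
    ("GX", "GX_a", 100, 200, "+"),  # hypothetical
    ("GX", "GX_b", 80, 200, "+"),   # hypothetical, starts further 5'
]
print(most_5p_tx(example))
```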
short_long
Description: Get the shortest mature transcript of each gene, or the longest if -l is used.
Example:
$ gtftk get_example | gtftk short_long | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T002
G0002 G0002T001
G0003 G0003T001
G0004 G0004T002
G0005 G0005T001
G0006 G0006T002
G0007 G0007T001
G0008 G0008T001
G0009 G0009T002
G0010 G0010T001
Arguments:
$ gtftk short_long -h
Usage: gtftk short_long [-i GTF] [-o GTF] [-l] [-g] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select the shortest mature transcript (i.e without introns) for each gene or the longest if the -l
arguments is used.
Notes:
*
Version: 2018-01-25
Argument:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-l, --longs Take the longest transcript of each gene (default: False)
-g, --keep-gene-lines Add gene lines to the output (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
Commands from section ‘conversion’
convert
Description: Convert a GTF to another format. Only a limited number of formats are currently supported:
- bed: classical bed6 format.
- bed6: classical bed6 format.
- bed3: bed3 format.
Example: Get the gene features and convert them to bed6.
$ gtftk get_example | gtftk select_by_key -k feature -v gene | gtftk convert -n gene_id | head -n 3
chr1 124 138 G0001 . +
chr1 179 189 G0002 . +
chr1 49 61 G0003 . -
Example: Get the gene features and convert them to bed3.
$ gtftk get_example | gtftk select_by_key -k feature -v gene | gtftk convert -f bed3 | head -n 3
chr1 124 138
chr1 179 189
chr1 49 61
Example: Get the exonic features and convert them to bed6, using gene_id, transcript_id and exon_id to build the name column.
$ gtftk get_example | gtftk select_by_key -k feature -v exon | gtftk convert -n gene_id,transcript_id,exon_id | head -3
chr1 124 138 G0001|G0001T002|G0001T002E001 . +
chr1 124 138 G0001|G0001T001|G0001T001E001 . +
chr1 179 189 G0002|G0002T001|G0002T001E001 . +
Arguments:
$ gtftk convert -h
Usage: gtftk convert [-i GTF] [-o BED/BED3/BED6] [-n NAME] [-s SEP] [-m more_names] [-f {bed,bed3,bed6}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Convert a GTF to various format (still limited).
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-n, --names The key(s) that should be used as name. (default: gene_id,transcript_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
-m, --more-names Add this information to the 'name' column of the BED file. (default: )
-f, --format Currently one of bed3, bed6 (default: bed6)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
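The coordinate shift visible in the convert examples (GTF start 125 becomes BED start 124) follows from GTF being 1-based inclusive and BED being 0-based half-open. A minimal Python sketch, with a hypothetical function name and tab-separated output as the BED format expects:

```python
def gtf_to_bed6(seqid, start, end, strand, names, sep="|"):
    """Build a BED6 line from 1-based, inclusive GTF coordinates."""
    # Only the start shifts: BED starts are 0-based, BED ends are exclusive.
    fields = [seqid, str(start - 1), str(end), sep.join(names), ".", strand]
    return "\t".join(fields)


gene_line = gtf_to_bed6("chr1", 125, 138, "+", ["G0001"])
exon_line = gtf_to_bed6("chr1", 125, 138, "+", ["G0001", "G0001T002", "G0001T002E001"])
print(gene_line)
print(exon_line)
```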
tabulate
Description: Extract key/values from the GTF and convert them to tabulated format. When coordinates are requested, they are provided in 1-based format.
Example: Get the list of transcripts together with their gene.
$ gtftk get_example -f gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id -s "|"
gene_id|transcript_id
G0001|G0001T002
G0001|G0001T001
G0002|G0002T001
G0003|G0003T001
G0004|G0004T002
G0004|G0004T001
G0005|G0005T001
G0006|G0006T001
G0006|G0006T002
G0007|G0007T001
G0007|G0007T002
G0008|G0008T001
G0009|G0009T002
G0009|G0009T001
G0010|G0010T001
Example: Join novel attributes (see join_attr examples) and convert the resulting GTF stream to tabulated format.
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene| gtftk tabulate -k feature,start,end,seqid,gene_id,a_score
feature start end seqid gene_id a_score
gene 50 61 chr1 G0003 0.2322
gene 65 76 chr1 G0004 0.999
gene 3 14 chr1 G0009 0.5555
Example: You may also delete the header, ask for non-redundant lines, and drop any line containing not-available values (‘.’).
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene| gtftk tabulate -k feature,start,end,seqid,gene_id,a_score -Hun
gene 50 61 chr1 G0003 0.2322
gene 65 76 chr1 G0004 0.999
gene 3 14 chr1 G0009 0.5555
Arguments:
$ gtftk tabulate -h
Usage: gtftk tabulate [-i GTF] [-o TXT] [-s SEPARATOR] [-k KEY,KEY,...] [-u] [-H] [-n] [-x] [-b] [-t | -g | -a | -e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Convert a GTF to tabulated format.
Notes:
* To refer to default keys use: seqid,source,feature,start,end,frame,gene_id...
* Note that 'all' or '*' are special keys that can be used to convert the whole GTF into a
tabulated file. Thanks @fafa13.
Version: 2018-01-20
optional arguments:
-t, --select-transcript-ids A shortcuts for "-k transcript_id". (default: False)
-g, --select-gene_ids A shortcuts for "-k gene_id". (default: False)
-a, --select-gene-names A shortcuts for "-k gene_name". (default: False)
-e, --select-exon-ids A shortcuts for "-k exon_ids". (default: False)
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --separator The output field separator. (default: )
-k, --key A comma separated list of key names. (default: *)
-u, --unique Print a non redondant list of lines. (default: False)
-H, --no-header Don't print the header line. (default: False)
-n, --no-unset Don't print lines containing '.' (unsetined values) (default: False)
-x, --accept-undef Print line for which the key is undefined (i.e, '?', does not exists). (default: False)
-b, --no-basic In case key is set to 'all' or '*', don't write basic attributes. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
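The key extraction done by tabulate can be approximated with the Python sketch below. It is an illustration under stated assumptions: the helper names are hypothetical, the names used for the eight basic columns (in particular score and strand) are assumed, and undefined keys are rendered as '?' as the help text suggests.

```python
import re

BASIC_KEYS = ["seqid", "source", "feature", "start", "end", "score", "strand", "frame"]


def parse_gtf_line(line):
    """Split a GTF line into its 8 basic fields plus its key/value attributes."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(BASIC_KEYS, fields[:8]))
    # Attributes look like: key "value"; key "value";
    record.update(re.findall(r'(\S+) "([^"]*)";', fields[8]))
    return record


def tabulate(lines, keys, sep="\t"):
    """Return a header row followed by one row per GTF line."""
    rows = [sep.join(keys)]
    for line in lines:
        record = parse_gtf_line(line)
        rows.append(sep.join(record.get(key, "?") for key in keys))
    return rows


gtf = ['chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T002";']
table = tabulate(gtf, ["gene_id", "transcript_id"], sep="|")
print("\n".join(table))
```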
bed_to_gtf
Description: Convert a bed file to gtf-like format.
Example:
$ gtftk get_example |gtftk convert| gtftk bed_to_gtf -t transcript | head -n 5
chr1 Unknown transcript 125 138 . + . gene_id "G0001|?"; transcript_id "G0001|?";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 130 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T001"; transcript_id "G0001|G0001T001";
Arguments:
$ gtftk bed_to_gtf -h
Usage: gtftk bed_to_gtf [-i BED] [-o GTF] [-t ft_type] [-s source] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Convert a bed file to a gtf. This will make the poor bed feel as if it was a big/fat gtf (but with
lots of empty fields...sniff). May be helpful sometimes...
Version: 2018-02-11
Arguments:
-i, --inputfile Path to the poor BED file to would like to behave as if it was a GTF. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-t, --ft-type The type of features you are trying to mimic... (default: transcript)
-s, --source The source of annotation. (default: Unknown)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
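The reverse conversion can be sketched the same way. In the example output above, the BED name column is reused for both gene_id and transcript_id and the start is shifted back to 1-based; the hypothetical Python function below mimics that behaviour. Passing the BED score through unchanged is an assumption.

```python
def bed6_to_gtf(bed_line, ft_type="transcript", source="Unknown"):
    """Build a GTF-like line from a tab-separated BED6 line."""
    chrom, start, end, name, score, strand = bed_line.rstrip("\n").split("\t")[:6]
    # The single BED name fills both identifiers, as in the example output.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    # BED starts are 0-based, GTF starts are 1-based.
    fields = [chrom, source, ft_type, str(int(start) + 1), end, score, strand, ".", attributes]
    return "\t".join(fields)


gtf_line = bed6_to_gtf("chr1\t124\t138\tG0001|G0001T002\t.\t+")
print(gtf_line)
```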
convert_ensembl
Description: Convert the GTF file to Ensembl format. Essentially, add ‘transcript’/‘gene’ features.
Example: Delete the gene and transcript features, then regenerate them.
$ gtftk get_example | gtftk select_by_key -k feature -v gene,transcript -n| gtftk convert_ensembl | gtftk select_by_key -k gene_id -v G0001
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk CDS 125 130 . + . gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1 gtftk CDS 130 132 . + . gene_id "G0001"; transcript_id "G0001T001"; ccds_id "CDS_G0001T001";
Arguments:
$ gtftk convert_ensembl -h
Commands from section ‘annotation’
closest_genes
Description: Find the n closest genes for each gene.
Example:
$ gtftk get_example | bedtools sort | gtftk closest_genes -f
genes closest_genes distances
G0009 G0006 21
G0006 G0005 12
G0005 G0006 12
G0003 G0004 4
G0004 G0003 4
G0007 G0001 18
G0001 G0007 18
G0010 G0002 4
G0002 G0010 4
G0008 G0002 42
Arguments:
$ gtftk closest_genes -h
Usage: gtftk closest_genes [-i GTF] [-o GTF/TXT] [-r {tss,tts,gene}] [-nb nb_neighbors] [-t {tss,tts,gene}] [-s] [-S] [-f] [-H] [-k] [-id {gene_id,gene_name}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Find the n closest genes for each genes.
Notes:
* The reference region for each gene can be the TSS (the most 5'), the TTS (The most 3') or
the whole gene.
* The reference region for each closest gene can be the TSS, the whole gene or the TTS.
* The closest genes can be searched in a stranded or unstranded fashion.
Version: 2018-02-11
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Arguments:
-r, --from-region-type What is region to consider for each gene. (default: tss)
-nb, --nb-neighbors The size of the neighborhood. (default: 1)
-t, --to-region-type What is region to consider for each closest gene. (default: tss)
-s, --same-strandedness Require same strandedness (default: False)
-S, --diff-strandedness Require different strandedness (default: False)
-f, --text-format Return a text format. (default: False)
-H, --no-header Don't print the header line. (default: False)
-k, --collapse Unwrap. Don't use comma. Print closest genes line by line. (default: False)
-id, --identifier The key used as gene identifier. (default: gene_id)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
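With the default settings (TSS to TSS, unstranded, one neighbour), the distances in the example can be reproduced with the Python sketch below. It is a simplified illustration, not gtftk's algorithm: the function names are hypothetical, a single chromosome is assumed, and gene coordinates are taken from the transcript lines of the example GTF.

```python
def tss(start, end, strand):
    """TSS of a feature: its start on '+', its end on '-'."""
    return start if strand == "+" else end


def closest_genes(genes):
    """genes: {gene_id: (start, end, strand)}, all on the same chromosome."""
    pos = {gene: tss(*coords) for gene, coords in genes.items()}
    result = {}
    for gene, p in pos.items():
        # Nearest other gene by absolute TSS-to-TSS distance.
        other = min((o for o in pos if o != gene), key=lambda o: abs(pos[o] - p))
        result[gene] = (other, abs(pos[other] - p))
    return result


genes = {
    "G0001": (125, 138, "+"), "G0002": (180, 189, "+"), "G0003": (50, 61, "-"),
    "G0004": (65, 76, "+"), "G0005": (33, 47, "-"), "G0006": (22, 35, "-"),
    "G0007": (107, 116, "+"), "G0008": (210, 222, "-"), "G0009": (3, 14, "-"),
    "G0010": (176, 186, "+"),
}
closest = closest_genes(genes)
for gene in sorted(closest):
    print(gene, *closest[gene])
```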
overlapping
Description: Find transcripts whose body/TSS/TTS region, extended in 5’ and 3’ (-u/-d), overlaps with any transcript from another gene. Strandedness is not considered by default. Use --invert-match to find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file are printed and a new key containing the list of overlapping transcripts is added to the transcript features/lines (key will be ‘overlapping_*’ with * one of body/TSS/TTS). The --annotate-gtf and --invert-match arguments are mutually exclusive.
Example: Find transcripts whose promoter overlaps transcripts from other genes.
$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example | gtftk overlapping -c simple_join_chromInfo.txt -t promoter -u 10 -d 10 -a | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_promoter_u0.01k_d0.01k | head
transcript_id overlap_promoter_u0.01k_d0.01k
G0001T002 G0007T001,G0007T002
G0001T001 G0007T001,G0007T002
G0002T001 G0010T001
G0003T001 G0004T002,G0004T001
G0004T002 G0003T001
G0004T001 G0003T001
G0005T001 G0003T001
G0006T001 G0005T001
G0006T002 G0005T001
Example: Find transcripts whose TTS overlaps transcripts from other genes (on the other strand).
$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example | gtftk overlapping -c simple_join_chromInfo.txt -t tts -u 30 -d 30 -a -S | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_tts_u0.03k_d0.03k | head
transcript_id overlap_tts_u0.03k_d0.03k
G0002T001 G0008T001
G0003T001 G0004T002,G0004T001
G0004T002 G0003T001,G0005T001
G0004T001 G0003T001,G0005T001
G0008T001 G0002T001,G0010T001
G0010T001 G0008T001
Arguments:
$ gtftk overlapping -h
Usage: gtftk overlapping [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-t {transcript,promoter,tts}] [-s] [-S] [-n] [-a] [-k key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Find transcripts whose body/TSS/TTS region extended in 5' and 3' (-u/-d) overlaps with any
transcript from another gene. Strandness is not considered by default. Used --invert-match to
find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file will
be printed and a new key containing the list of overlapping transcripts will be added to the
transcript features/lines (key will be 'overlapping_*' with * one of body/TSS/TTS). The
--annotate-gtf and --invert-match arguments are mutually exclusive.
Version: 2018-01-24
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Chromosome information. A tabulated two-columns file with chromosomes as column 1 and sizes as column 2 (default: None)
-u, --upstream Extend the region in 5' by a given value (int). Used to define the region around the TSS/TTS. (default: 1500)
-d, --downstream Extend the region in 3' by a given value (int). Used to define the region around the TSS/TTS. (default: 1500)
-t, --feature-type The feature of interest. (default: transcript)
-s, --same-strandedness Require same strandedness (default: False)
-S, --diff-strandedness Require different strandedness (default: False)
-n, --invert-match Not/Invert match. (default: False)
-a, --annotate-gtf All lines of the original GTF will be printed. (default: False)
-k, --key-name The name of the key. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
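The region logic described for overlapping can be illustrated with a short Python sketch. This is not gtftk's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (as in the GTF), the promoter is assumed to be the TSS extended by -u bases upstream and -d bases downstream relative to the transcript strand, and clipping to chromosome sizes is ignored. The transcript coordinates come from the example dataset.

```python
def promoter_region(start, end, strand, upstream, downstream):
    """Return (start, end) of the TSS extended by upstream/downstream bases.

    Assumption: 1-based inclusive coordinates, no clipping to chromosome ends.
    """
    if strand == "+":
        tss = start
        return tss - upstream, tss + downstream
    tss = end  # on the minus strand the TSS is the highest coordinate
    return tss - downstream, tss + upstream


def overlaps(region, other):
    """True if two inclusive intervals share at least one base."""
    return region[0] <= other[1] and other[0] <= region[1]


# Transcript coordinates (start, end, strand) from the example dataset.
g0003t001 = (50, 61, "-")
g0004t002 = (65, 76, "+")

region = promoter_region(*g0003t001, upstream=10, downstream=10)
print(region, overlaps(region, g0004t002[:2]))
```

With -u 10 -d 10, the promoter of G0003T001 spans 51-71 under these assumptions and reaches G0004T002 (65-76), in line with the first example above. The -s and -S options would add a comparison of the two strands to the overlap test.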
divergent¶
Description: Find transcripts with divergent promoters. These transcripts are defined here as those whose promoter region (defined by -u/-d) overlaps with the TSS of another gene in reverse/antisense orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as described in “Divergent transcription is associated with promoters of transcriptional regulators” (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key (‘divergent’) whose value is set to ‘.’ if the gene has no antisense transcript in its promoter region. If the gene has an antisense transcript in its promoter region, the ‘divergent’ key is set to the identifier of the transcript whose TSS is the closest to the considered promoter. The TSS-to-TSS distance is also provided as an additional key (dist_to_divergent).
Example: Flag divergent transcripts in the example dataset. Select them and produce a tabulated output.
$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example | gtftk divergent -c simple_join_chromInfo.txt -u 10 -d 10| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,divergent,dist_to_divergent | head -n 7
transcript_id divergent dist_to_divergent
G0003T001 G0004T002 4.0
G0004T002 G0003T001 4.0
G0004T001 G0003T001 4.0
Arguments:
$ gtftk divergent -h
Usage: gtftk divergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-n] [-S] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Find transcripts with divergent promoters. These transcripts will be defined here as those whose
promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisens
orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as
described in "Divergent transcription is associated with promoters of transcriptional
regulators" (Lepoivre C, BMC Genomics, 2013). The ouput is a GTF with an additional key
('divergent') whose value is set to '.' if the gene has no antisens transcript in its promoter
region. If the gene has an antisens transcript in its promoter region the 'divergent' key is
set to the identifier of the transcript whose tss is the closest relative to the considered
promoter. The tss to tss distance is also provided as an additional key (dist_to_divergent).
Version: 2018-01-24
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
-u, --upstream Extend the promoter in 5' by a given value (int). Defines the region around the tss. (default: 1500)
-d, --downstream Extend the region in 3' by a given value (int). Defines the region around the tss. (default: 1500)
-n, --no-annotation Do not annotate the GTF. Just select the divergent transcripts. (default: False)
-S, --no-strandness Do not consider strandness (only look whether the promoter from a transcript overlap with the promoter from another gene). (default: False)
-a, --key-name The name of the key. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
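The definition of a divergent transcript given above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gtftk implementation: the function names and the tuple layout are hypothetical, coordinates are 1-based inclusive, and chromosome boundaries are ignored. The distance is returned as an integer, whereas gtftk prints it as 4.0 in the example.

```python
def tss(start, end, strand):
    """TSS of a transcript given 1-based inclusive coordinates."""
    return start if strand == "+" else end


def divergent_partner(tx, others, upstream, downstream):
    """Return (transcript_id, distance) of the closest antisense TSS, or None.

    tx and others are (transcript_id, gene_id, start, end, strand) tuples.
    """
    _, gene, start, end, strand = tx
    pos = tss(start, end, strand)
    # Promoter region around the TSS, oriented along the transcript strand.
    if strand == "+":
        lo, hi = pos - upstream, pos + downstream
    else:
        lo, hi = pos - downstream, pos + upstream
    best = None
    for o_id, o_gene, o_start, o_end, o_strand in others:
        if o_gene == gene or o_strand == strand:
            continue  # keep only other genes on the opposite strand
        o_pos = tss(o_start, o_end, o_strand)
        if lo <= o_pos <= hi:
            dist = abs(o_pos - pos)
            if best is None or dist < best[1]:
                best = (o_id, dist)
    return best


# A subset of the example dataset.
transcripts = [
    ("G0003T001", "G0003", 50, 61, "-"),
    ("G0004T002", "G0004", 65, 76, "+"),
    ("G0005T001", "G0005", 33, 47, "-"),
]
for tx in transcripts:
    print(tx[0], divergent_partner(tx, transcripts, upstream=10, downstream=10))
```

On this subset, G0003T001 and G0004T002 are reported as each other's partner at a distance of 4, and G0005T001 has none, which corresponds to a ‘.’ value in the annotated GTF.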
convergent¶
Description: Find transcripts with convergent TTS. These transcripts are defined here as those whose TTS region (defined by -u/-d) overlaps with the TTS of another gene in reverse/antisense orientation. The output is a GTF with an additional key (‘convergent’) whose value is set to ‘.’ if the gene has no convergent transcript in its TTS region. If the gene has an antisense transcript in its TTS region, the ‘convergent’ key is set to the identifier of the transcript whose TTS is the closest to the considered TTS. The TTS-to-TTS distance is also provided as an additional key (dist_to_convergent).
Example: Flag convergent transcripts in the example dataset. Select them and produce a tabulated output.
$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example | gtftk convergent -c simple_join_chromInfo.txt -u 25 -d 25| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,convergent,dist_to_convergent| head -n 4
transcript_id convergent dist_to_convergent
G0002T001 G0008T001 21.0
G0008T001 G0002T001 21.0
G0010T001 G0008T001 24.0
Arguments:
$ gtftk convergent -h
Usage: gtftk convergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Find transcripts with convergent tts. These transcripts will be defined here as those whose tts
region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisens
orientation. The ouput is a GTF with an additional key ('convergent') whose value is set to '.'
if the gene has no convergent transcript in its tts region. If the gene has an antisens
transcript in its tts region the 'convergent' key is set to the identifier of the transcript
whose tts is the closest relative to the considered tts. The tts to tts distance is also
provided as an additional key (dist_to_convergent).
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
-u, --upstream Extend the tts in 5' by a given value (int). Defines the region around the tts. (default: 1500)
-d, --downstream Extend the region in 3' by a given value (int). Defines the region around the tts. (default: 1500)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
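When several antisense TTS fall in the region, the description above says the closest one is retained. The Python sketch below illustrates only that selection step. It is not gtftk's code: the function name is hypothetical, the TTS positions are derived from the transcript coordinates of the example dataset (end for ‘+’, start for ‘-’), and a single symmetric window stands in for equal -u and -d values. It also does not check that the two transcripts belong to different genes.

```python
# TTS position (1-based) and strand for three transcripts of the example dataset.
tts_by_tx = {
    "G0002T001": (189, "+"),
    "G0010T001": (186, "+"),
    "G0008T001": (210, "-"),
}


def closest_convergent(tx_id, tts_by_tx, window):
    """Return (transcript_id, distance) of the closest antisense TTS, or None."""
    pos, strand = tts_by_tx[tx_id]
    candidates = [
        (abs(other_pos - pos), other_id)
        for other_id, (other_pos, other_strand) in tts_by_tx.items()
        if other_strand != strand and abs(other_pos - pos) <= window
    ]
    if not candidates:
        return None
    dist, other_id = min(candidates)  # smallest distance wins
    return other_id, dist


for tx_id in tts_by_tx:
    print(tx_id, closest_convergent(tx_id, tts_by_tx, window=25))
```

G0008T001 has two candidates here (G0002T001 at 21 and G0010T001 at 24) and the closer one is kept, as in the example output.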
exon_sizes¶
Description: Add a new key to transcript features containing a comma separated list of exon sizes.
Example:
$ gtftk get_example | gtftk exon_sizes | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_sizes "14";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_sizes "14";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_sizes "10";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_sizes "5,5";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_sizes "4,1,3";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_sizes "4,1,3";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; exon_sizes "6,3";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_sizes "3,3,4";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; exon_sizes "3,3";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; exon_sizes "10";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; exon_sizes "10";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; exon_sizes "3,5";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_sizes "12";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_sizes "12";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; exon_sizes "11";
Arguments:
$ gtftk exon_sizes -h
Usage: gtftk exon_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Add a new key to transcript features containing a comma separated list of exon sizes.
Version: 2018-01-24
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output GTF file. (default: <stdout>)
-a, --key-name The name of the key. (default: exon_sizes)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
intron_sizes¶
Description: Add a new key to transcript features containing a comma separated list of intron sizes.
Example:
$ gtftk get_example | gtftk intron_sizes | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; intron_sizes "0";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; intron_sizes "0";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; intron_sizes "0";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; intron_sizes "2";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; intron_sizes "2,2";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; intron_sizes "2,2";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; intron_sizes "6";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; intron_sizes "2,2";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; intron_sizes "2";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; intron_sizes "0";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; intron_sizes "0";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; intron_sizes "5";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; intron_sizes "0";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; intron_sizes "0";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; intron_sizes "0";
Arguments:
$ gtftk intron_sizes -h
Usage: gtftk intron_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Add a new key to transcript features containing a comma separated list of intron-size.
Version: 2018-01-24
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-a, --key-name The name of the key. (default: intron_sizes)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
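The values reported by exon_sizes and intron_sizes can be checked by hand from exon coordinates. The following Python sketch assumes 1-based inclusive coordinates; the function name is hypothetical, and the exons of G0003T001 (50-54 and 57-61) are those shown in the get_feat_seq example on this page. Sizes are listed in genomic order here, which may differ from the order gtftk uses for minus-strand transcripts.

```python
def exon_and_intron_sizes(exons):
    """exons: list of (start, end) tuples, 1-based inclusive, in any order."""
    exons = sorted(exons)
    exon_sizes = [end - start + 1 for start, end in exons]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_sizes = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    # The example output reports "0" for single-exon transcripts.
    return exon_sizes, intron_sizes or [0]


print(exon_and_intron_sizes([(50, 54), (57, 61)]))  # G0003T001
print(exon_and_intron_sizes([(125, 138)]))          # G0001T001 (single exon)
```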
Commands from section ‘coordinates’¶
midpoints¶
Description: Get the genomic midpoint of each feature: genes, transcripts, exons or introns. Output is currently in BED format only.
Example: Get midpoints of all transcripts and exons.
$ gtftk get_example | gtftk midpoints -t transcript,exon -n transcript_id,feature | head -n 5
chr1 7 9 G0009T002|transcript . -
chr1 7 9 G0009T001|exon . -
chr1 7 9 G0009T001|transcript . -
chr1 7 9 G0009T002|exon . -
chr1 27 29 G0006T001|transcript . -
Arguments:
$ gtftk midpoints -h
Usage: gtftk midpoints [-i GTF/BED] [-o BED] [-t ft_type] [-n NAME] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the midpoint coordinates for the requested feature. Output is bed format.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-t, --ft-type The target feature (as found in the 3rd column of the GTF). (default: transcript)
-n, --names The key(s) that should be used as name. (default: transcript_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
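The midpoint coordinates in the example can be reproduced with a small amount of arithmetic. The sketch below is a guess at the rule, written in Python with a hypothetical function name: for features of even length, the two central bases are returned, which matches the example output (3-14 gives 7-9, 22-35 gives 27-29). For odd lengths it assumes the single central base is returned; that case is not shown in the example and is an assumption.

```python
def midpoint_bed(start, end):
    """BED (0-based, half-open) midpoint of a 1-based inclusive feature."""
    start0 = start - 1          # convert the start to 0-based
    length = end - start0
    half = length // 2
    if length % 2 == 0:
        # Even length: keep the two central bases.
        return start0 + half - 1, start0 + half + 1
    # Odd length (assumption): keep the single central base.
    return start0 + half, start0 + half + 1


print(midpoint_bed(3, 14))   # G0009 transcripts
print(midpoint_bed(22, 35))  # G0006T001
```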
get_5p_3p_coords¶
Description: Get the 5p or 3p coordinates for each feature (e.g. TSS or TTS for a transcript). Output is in BED format.
Example: Get the 5p ends of transcripts and exons.
$ gtftk get_example | gtftk get_5p_3p_coords -t transcript,exon -n transcript_id,gene_id,feature | head -n 5
chr1 124 125 G0001T002|G0001|transcript . +
chr1 124 125 G0001T002|G0001|exon . +
chr1 124 125 G0001T001|G0001|transcript . +
chr1 124 125 G0001T001|G0001|exon . +
chr1 179 180 G0002T001|G0002|transcript . +
Example: Get the 3p ends of transcripts and exons.
$ gtftk get_example | gtftk get_5p_3p_coords -t transcript,exon -n transcript_id,gene_id,feature -v -s "^"| head -n 5
chr1 137 138 G0001T002^G0001^transcript . +
chr1 137 138 G0001T002^G0001^exon . +
chr1 137 138 G0001T001^G0001^transcript . +
chr1 137 138 G0001T001^G0001^exon . +
chr1 188 189 G0002T001^G0002^transcript . +
Arguments:
$ gtftk get_5p_3p_coords -h
Usage: gtftk get_5p_3p_coords [-i GTF] [-o BED] [-t ft_type] [-v] [-p transpose] [-n NAME] [-m more_names] [-s SEP] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get the 5p or 3p coordinate for each feature (e.g TSS or TTS for a transcript).
Notes:
* Output is in BED format.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-t, --ft-type The target feature (as found in the 3rd column of the GTF). (default: transcript)
-v, --invert Get 3' coordinate. (default: False)
-p, --transpose Transpose coordinate in 5' (use negative value) or in 3' (use positive values). (default: 0)
-n, --names The key(s) that should be used as name. (default: gene_id,transcript_id)
-m, --more-names A comma separated list of information to be added to the 'name' column of the bed file. (default: None)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
-e, --explicit Write explicitly the name of the keys in the header. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
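How a 5' or 3' end maps to a one-base BED interval can be sketched as follows. This Python function is hypothetical and only mirrors the plus-strand cases shown in the examples (125-138 gives 124-125 for the 5' end and 137-138 for the 3' end). The minus-strand branch follows the usual convention that the 5' end is the highest coordinate, which is an assumption here, and the -p/--transpose option is not modelled.

```python
def end_coord_bed(start, end, strand, three_prime=False):
    """One-base BED interval for the 5' (default) or 3' end of a feature.

    start/end are 1-based inclusive GTF coordinates.
    """
    five_prime_pos = start if strand == "+" else end
    three_prime_pos = end if strand == "+" else start
    pos = three_prime_pos if three_prime else five_prime_pos
    return pos - 1, pos  # 0-based, half-open


print(end_coord_bed(125, 138, "+"))                    # 5' end of G0001T002
print(end_coord_bed(125, 138, "+", three_prime=True))  # 3' end, as with -v
```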
intergenic¶
Description: Extract intergenic regions. This command requires a chromInfo file to compute the bed file boundaries. The command will print the coordinates of genomic regions without transcript features.
Example: Simply get intergenic regions.
$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example | gtftk intergenic -c simple_join_chromInfo.txt
chr1 0 2 region_1 0 .
chr1 14 21 region_2 0 .
chr1 47 49 region_3 0 .
chr1 61 64 region_4 0 .
chr1 76 106 region_5 0 .
chr1 116 124 region_6 0 .
chr1 138 175 region_7 0 .
chr1 189 209 region_8 0 .
chr1 222 300 region_9 0 .
chr2 0 600 region_10 0 .
Arguments:
$ gtftk intergenic -h
Usage: gtftk intergenic [-i GTF] [-o BED] -c CHROMINFO [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Extract intergenic regions. This command requires a chromInfo file to compute the bed file
boundaries. The command will print the coordinates of genomic regions without any transcript
features.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
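Intergenic regions are the complement of the transcript intervals within each chromosome. The Python sketch below illustrates this idea; it is not the gtftk implementation. The function name is hypothetical, transcript coordinates are those of the example dataset, and the chromosome sizes (300 and 600) are read off the example output rather than from the chromInfo file.

```python
def intergenic(transcripts, chrom_sizes):
    """Return BED intervals (chrom, start, end) not covered by any transcript.

    transcripts: list of (chrom, start, end), 1-based inclusive.
    chrom_sizes: dict mapping chromosome name to its size.
    """
    regions = []
    for chrom, size in chrom_sizes.items():
        # Convert to 0-based half-open intervals and sort by start.
        intervals = sorted((s - 1, e) for c, s, e in transcripts if c == chrom)
        cursor = 0  # first position not yet covered
        for s, e in intervals:
            if s > cursor:
                regions.append((chrom, cursor, s))
            cursor = max(cursor, e)
        if cursor < size:
            regions.append((chrom, cursor, size))
    return regions


# Distinct transcript coordinates from the example dataset (all on chr1).
coords = [(125, 138), (180, 189), (50, 61), (65, 76), (33, 47), (22, 35),
          (28, 35), (107, 116), (210, 222), (3, 14), (176, 186)]
transcripts = [("chr1", s, e) for s, e in coords]

for region in intergenic(transcripts, {"chr1": 300, "chr2": 600}):
    print(*region, sep="\t")
```

Under these assumptions the sketch yields the same ten intervals as the example above, including the whole of chr2, which carries no transcript.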
intronic¶
Description: Returns a BED file containing the intronic regions. If -b/--by-transcript is not set (default), returns merged genic regions with no exonic overlap (“strict” mode). Otherwise, the intronic regions corresponding to each transcript are returned (may contain exonic overlap and redundancy).
Example: Simply get intronic regions.
$ gtftk get_example | gtftk intronic | head -n 5
chr1 25 27
chr1 30 32
chr1 35 41
chr1 54 56
chr1 68 70
Example: Intronic regions of each transcript.
$ gtftk get_example | gtftk intronic -b
chr1 54 56 intron|G0003|G0003T001 1 -
chr1 68 70 intron|G0004|G0004T002 1 +
chr1 71 73 intron|G0004|G0004T002 2 +
chr1 68 70 intron|G0004|G0004T001 1 +
chr1 71 73 intron|G0004|G0004T001 2 +
chr1 35 41 intron|G0005|G0005T001 1 -
chr1 25 27 intron|G0006|G0006T001 2 -
chr1 30 32 intron|G0006|G0006T001 1 -
chr1 30 32 intron|G0006|G0006T002 1 -
chr1 214 219 intron|G0008|G0008T001 1 -
Arguments:
$ gtftk intronic -h
Usage: gtftk intronic [-i GTF] [-o BED] [-b] [-n NAME] [-s SEP] [-w] [-F] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Returns a bed file containing the intronic regions. If by_transcript is false (default), returns
merged genic regions with no exonic overlap ("strict" mode). Otherwise, the intronic regions
corresponding to each transcript are returned (may contain exonic overlap and redundancy).
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-b, --by-transcript The intronic regions are returned for each transcript. (default: False)
-n, --names The key(s) that should be used as name (if -b is used). (default: gene_id,transcript_id)
-s, --separator The separator to be used for separating name elements (if -b is used). (default: |)
-w, --intron-nb-in-name By default intron number is written in 'score' column. Force it to be written in 'name' column. transcript. (default: False)
-F, --no-feature-name Don't add the feature name ('intron') in the name column. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
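The per-transcript mode (-b) can be pictured as listing the gaps between consecutive exons and numbering them along the transcript. The Python sketch below is illustrative only: the function name is hypothetical, and the exon coordinates of G0006T001 and G0004T002 are not printed on this page but inferred from the transcript coordinates and the exon_sizes/intron_sizes outputs. Numbering from the 5' end of the transcript is an assumption that happens to match the example output.

```python
def introns_bed(exons, strand):
    """Return (bed_start, bed_end, intron_number) tuples for one transcript.

    exons: list of (start, end), 1-based inclusive.
    """
    exons = sorted(exons)
    # Intron between two exons, as a 0-based half-open interval.
    introns = [(cur[1], nxt[0] - 1) for cur, nxt in zip(exons, exons[1:])]
    n = len(introns)
    numbered = []
    for i, (start, end) in enumerate(introns):
        # Number introns from the 5' end of the transcript.
        number = i + 1 if strand == "+" else n - i
        numbered.append((start, end, number))
    return numbered


print(introns_bed([(22, 25), (28, 30), (33, 35)], "-"))  # G0006T001 (inferred exons)
print(introns_bed([(65, 68), (71, 71), (74, 76)], "+"))  # G0004T002 (inferred exons)
```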
splicing_site¶
Description: Compute the locations of donor and acceptor splice sites. This command will return a single position which corresponds to the most 5’ and/or the most 3’ intronic region. If the GTF file does not contain exon numbering, you can compute it using the add_exon_nb command. The score column of the BED file contains the number of the closest exon relative to the splice site.
Example:
$ gtftk get_example | gtftk add_exon_nb -k exon_nbr | gtftk splicing_site -k exon_nbr| head
chr1 54 55 acceptor|G0003T001E001|G0003T001|G0003 2 -
chr1 55 56 donor|G0003T001E002|G0003T001|G0003 1 -
chr1 68 69 donor|G0004T002E001|G0004T002|G0004 1 +
chr1 71 72 donor|G0004T002E002|G0004T002|G0004 2 +
chr1 69 70 acceptor|G0004T002E002|G0004T002|G0004 2 +
chr1 72 73 acceptor|G0004T002E003|G0004T002|G0004 3 +
chr1 68 69 donor|G0004T001E001|G0004T001|G0004 1 +
chr1 71 72 donor|G0004T001E002|G0004T001|G0004 2 +
chr1 69 70 acceptor|G0004T001E002|G0004T001|G0004 2 +
chr1 72 73 acceptor|G0004T001E003|G0004T001|G0004 3 +
Arguments:
$ gtftk splicing_site -h
Usage: gtftk splicing_site [-i GTF] [-o BED] [-k exon_numbering_key] [-n NAME] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Compute the locations of donor and acceptor splice sites.
Notes:
* This will return a single position which corresponds to the most 5' and/or the most 3'
intronic region. If the gtf file does not contain exon numbering you can compute it using the
add_exon_nb command. The score column of the bed file contain the number of the closest exon
relative to the splice site.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --exon-numbering-key The name of the key containing the exon numbering (exon_number in ensembl) (default: exon_number)
-n, --names The key(s) that should be used as name. (default: exon_id,transcript_id,gene_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
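The relation between an intron and its donor/acceptor positions can be written down in a few lines. In this Python sketch (hypothetical function name, not gtftk's code), an intron is given as a BED interval and the donor is taken to be the first intronic base in transcript orientation, the acceptor the last one. This reading is consistent with the example output for G0004T002 (‘+’) and G0003T001 (‘-’).

```python
def splice_sites(intron_bed, strand):
    """Return donor and acceptor sites as one-base BED intervals.

    intron_bed: (start, end), 0-based half-open.
    """
    start, end = intron_bed
    lowest = (start, start + 1)  # lowest intronic base on the genome
    highest = (end - 1, end)     # highest intronic base on the genome
    if strand == "+":
        return {"donor": lowest, "acceptor": highest}
    # On the minus strand the transcript runs from high to low coordinates.
    return {"donor": highest, "acceptor": lowest}


print(splice_sites((68, 70), "+"))  # first intron of G0004T002
print(splice_sites((54, 56), "-"))  # intron of G0003T001
```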
shift¶
Description: Shift coordinates in 3’ or 5’ direction.
Example:
$ gtftk get_example| head -n 1
chr1 gtftk gene 125 138 . + . gene_id "G0001";
$ gtftk get_example -f chromInfo > simple.chromInfo; gtftk get_example | gtftk shift -s -10 -c simple.chromInfo | head -n 1
chr1 gtftk gene 115 128 . + . gene_id "G0001";
Arguments:
$ gtftk shift -h
Usage: gtftk shift [-i GTF] [-o GTF] -s shift_value [-d] [-a] -c CHROMINFO [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Transpose coordinates in 3' or 5' direction.
Notes:
* By default shift is not strand specific. Meaning that if --shift-value is set to 10, all
coordinates will be moved 10 bases in 5' direction relative to the forward/watson/plus/top
strand.
* Use a negative value to shift in 3' direction, a positive value to shift in 5' direction.
* If --stranded is true, features are transposed in 5' direction relative to their associated
strand.
* By default, features are not allowed to go outside the genome coordinates. In the current
implementation, in case this would happen (using a very large --shift-value), feature would
accumulate at the ends of chromosomes irrespectively of gene or transcript structures giving
rise, ultimately, to several exons from the same transcript having the same starts or ends.
* One can forced features to go outside the genome and ultimatly dissapear with large --shift-
value by using -a.
Version: 2018-01-20
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --shift-value Shift coordinate by s nucleotides. (default: 0)
-d, --stranded By default shift not . (default: False)
-a, --allow-outside Accept the partial or total disappearance of a feature upon shifting. (default: False)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
Commands from section ‘sequence’¶
get_tx_seq¶
Description: Get transcript sequences in fasta format.
Example: Get sequences of transcripts in 5’ to 3’ orientation.
$ gtftk get_example -f fa > simple.fa; gtftk get_example | gtftk get_tx_seq -g simple.fa | head -n 4
>transcript|G0001T002|G0001|chr1|125|138
cccccgttacgtag
>transcript|G0001T001|G0001|chr1|125|138
cccccgttacgtag
Note that the format is rather flexible and any combination of keys can be exported to the header.
$ gtftk get_example | gtftk get_tx_seq -g simple.fa -l gene_id,transcript_id,feature,chrom,start,end,strand | head -n 2
>G0001|G0001T002|transcript|chr1|125|138|+
cccccgttacgtag
You can ask to add explicitly (-e) the name of the keys in the header. Here we also add the size of the mature transcript and the number of exons.
$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk nb_exons| gtftk get_tx_seq -g simple.fa -l feature,transcript_id,gene_id,seqid,start,end,feat_size,nb_exons -e | head -n 2
>feature=transcript|transcript_id=G0001T002|gene_id=G0001|seqid=chr1|start=125|end=138|feat_size=14|nb_exons=1
cccccgttacgtag
You may use a wildcard (path enclosed within quotes) in case the genome is split into several chromosome files:
$ gtftk get_example | gtftk get_tx_seq -g '*.fa' -l gene_id,transcript_id,feature,chrom,start,end,strand -s "," | head -n 2
>G0001,G0001T002,transcript,chr1,125,138,+
cccccgttacgtag
A particular header format that should be compliant with sleuth is also available.
$ gtftk get_example | gtftk get_tx_seq -g '*.fa' -f -n | head -n 2
>G0001T002 chromosome:GRCm38:chr1:125:138:1 gene:G0001 gene_biotype:? transcript_biotype:?
cccccgttacgtag
Arguments:
$ gtftk get_tx_seq -h
Usage: gtftk get_tx_seq [-i GTF] [-o FASTA] -g FASTA [-w] [-s SEP] [-l label] [-f] [-d] [-a assembly] [-c] [-n] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get transcripts sequences in a flexible fasta format from a GTF file.
Notes:
* The sequences are returned in 5' to 3' orientation.
* If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.
Version:
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output FASTA file. (default: <stdout>)
-g, --genome The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
-w, --with-introns Set to true to include intronic regions. (default: False)
-s, --separator To separate info in header. (default: |)
-l, --label A set of key for the header. (default: feature,transcript_id,gene_id,seqid,start,end)
-f, --sleuth-format Produce output in sleuth format. (default: False)
-d, --delete-version In case of --sleuth-format, delete gene_id or transcript_id version number (e.g '.2' in ENSG56765.2). (default: False)
-a, --assembly In case of --sleuth-format, an assembly version. (default: GRCm38)
-c, --del-chr When using --sleuth-format delete 'chr' in sequence id. (default: False)
-n, --no-rev-comp Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
-e, --explicit Write explicitly the name of the keys in the header. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
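The header options (-l, -s, -e) and the reverse-complement step can be mimicked with plain Python to see what each one contributes. The helpers below are hypothetical and do not call gtftk; they only rebuild the header strings shown in the examples from a dictionary of key/value pairs, and reverse-complement a lowercase or uppercase DNA string as would be expected for minus-strand transcripts when -n is not set.

```python
def fasta_header(values, keys, sep="|", explicit=False):
    """Join the selected keys into a FASTA header line."""
    if explicit:
        fields = [f"{key}={values[key]}" for key in keys]
    else:
        fields = [str(values[key]) for key in keys]
    return ">" + sep.join(fields)


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (a, c, g, t in either case)."""
    table = str.maketrans("acgtACGT", "tgcaTGCA")
    return seq.translate(table)[::-1]


tx = {"feature": "transcript", "transcript_id": "G0001T002", "gene_id": "G0001",
      "seqid": "chr1", "start": 125, "end": 138}
keys = ["feature", "transcript_id", "gene_id", "seqid", "start", "end"]

print(fasta_header(tx, keys))                 # default label and separator
print(fasta_header(tx, keys, explicit=True))  # as with -e
print(fasta_header(tx, keys[:3], sep=","))    # as with -s ","
```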
get_feat_seq¶
Description: Get feature sequences (e.g. exon, UTR…).
Example:
$ gtftk get_feat_seq -i simple.gtf -g simple.fa -l feature,transcript_id,start -t exon -n | head -10
>G0001T002|125|exon|125|138
cccccgttacgtag
>G0001T001|125|exon|125|138
cccccgttacgtag
>G0002T001|180|exon|180|189
ggccttatta
>G0003T001|50|exon|50|54
caagc
>G0003T001|50|exon|57|61
taatt
Arguments:
$ gtftk get_feat_seq -h
Usage: gtftk get_feat_seq [-i GTF] [-o FASTA] -g genome [-s separator] [-l label] [-t feature_type] [-n] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Get feature sequences in a flexible fasta format from a GTF file.
Notes:
* The sequences are returned in 5' to 3' orientation.
* If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.
Version:
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output FASTA file. (default: <stdout>)
-g, --genome The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
-s, --separator To separate info in header. (default: |)
-l, --label A set of key for the header that will be extracted from the transcript line. (default: feature,transcript_id,gene_id,seqid,start,end)
-t, --feature-type The feature type (one defined in column 3). (default: exon)
-n, --no-rev-comp Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
-e, --explicit Write explicitly the name of the keys in the header. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
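The strand handling described in the notes above (sequences returned in 5' to 3' orientation unless -n/--no-rev-comp is set) can be sketched as follows. This is only an illustration: feature_sequence is a hypothetical helper, not part of gtftk, and the input is an arbitrary sequence.

```python
# Hypothetical helper (not gtftk code): mimics the assumed effect of
# -n/--no-rev-comp on a sequence extracted from the forward strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def feature_sequence(seq, strand, no_rev_comp=False):
    """Return seq oriented 5'->3' with respect to the feature strand.

    Minus-strand sequences are reverse complemented unless
    no_rev_comp is True.
    """
    if strand == "-" and not no_rev_comp:
        # Complement each base, then reverse the string.
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


print(feature_sequence("ggccttatta", "-"))
print(feature_sequence("ggccttatta", "-", no_rev_comp=True))
```

With no_rev_comp=True the sequence is returned exactly as read from the genome, which is what the example above requests with -n.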
Commands from section ‘coverage’¶
coverage¶
Description: Takes a GTF as input to compute bigwig coverage in regions of interest (promoter, transcript body, intron, intron_by_tx, tts…) or a BED6 to focus on user-defined regions. If --n-highest is used, the program computes the coverage of each bigwig as the average value of the n windows (--nb-window) with the highest coverage values. When a GTF file is used as input, regions where signal can be computed are promoter, tts, introns, intergenic regions or any feature available in the GTF file (transcript, exon, gene…). If --matrix-out is selected, the signal for each bigwig is provided in a dedicated column. Otherwise, the signal for each bigwig is provided on a dedicated line.
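The --n-highest idea can be illustrated with a minimal sketch. It assumes that each window is summarised by its mean and that the region is at least as long as the number of windows. region_coverage is a hypothetical helper written for this illustration, not the gtftk implementation.

```python
def region_coverage(values, nb_window=1, n_highest=None):
    """Summarise the per-base signal of one region.

    values: per-base coverage values of the region.
    nb_window: number of bins the region is split into.
    n_highest: if set, average only the n bins with the highest means.
    """
    if nb_window < 1 or nb_window > len(values):
        raise ValueError("nb_window must be between 1 and the region length")
    # Bin boundaries: nb_window bins of near-equal size.
    bounds = [i * len(values) // nb_window for i in range(nb_window + 1)]
    bin_means = [sum(values[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    if n_highest is not None:
        # Keep only the n bins with the highest mean signal.
        bin_means = sorted(bin_means, reverse=True)[:n_highest]
    return sum(bin_means) / len(bin_means)


signal = [1, 1, 5, 5, 9, 9, 1, 1]
print(region_coverage(signal, nb_window=4))               # all four bins
print(region_coverage(signal, nb_window=4, n_highest=2))  # two best bins
```

Restricting the average to the best windows makes the value less sensitive to long stretches without signal inside a region.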
Example:
We will first request a lightweight example dataset.
$ gtftk get_example -d mini_real -f '*'
|-- 17:46-INFO-get_example : Copying: #H3K4me3_cond_1.bed#
|-- 17:46-INFO-get_example : Copying: airway_love.txt.gz
|-- 17:46-INFO-get_example : Copying: ENCFF112BHN_H3K4me3_K562_sub.bed
|-- 17:46-INFO-get_example : Copying: ENCFF119BYM_H3K36me3_K562_sub.bed
|-- 17:46-INFO-get_example : Copying: ENCFF431HAA_H3K36me3_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: ENCFF742FDS_H3K4me3_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: ENCFF947DVY_H3K79me2_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_1.bed
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_2.bed
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_3.bed
|-- 17:46-INFO-get_example : Copying: hg38.genome
|-- 17:46-INFO-get_example : Copying: hg38.genome.back
|-- 17:46-INFO-get_example : Copying: mini_real.gtf.gz
|-- 17:46-INFO-get_example : Copying: mini_real_control_1.txt
|-- 17:46-INFO-get_example : Copying: mini_real_counts_ENCFF630HEX.txt
|-- 17:46-INFO-get_example : Copying: mini_real_gn_list_hg38.txt
|-- 17:46-INFO-get_example : Copying: tx_classes.txt
Although we could work on the full dataset, we will focus on transcripts whose promoter region does not overlap with any transcript from another gene.
$ gtftk overlapping -i mini_real.gtf.gz -c hg38.genome -n > mini_real_noov.gtf
We will select a representative transcript for each gene. Here we will perform this step using random_tx although another interesting choice would be rm_dup_tss.
$ gtftk random_tx -i mini_real_noov.gtf -m 1 -s 123 > mini_real_noov_rnd_tx.gtf
Now we will compute the coverage of promoter regions using 3 bigWig files as input.
$ gtftk coverage -l H3K4me3,H3K79me2,H3K36me3 -u 5000 -d 5000 -i mini_real_noov_rnd_tx.gtf -c hg38.genome -m transcript_id,gene_name -x ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -k 4 > coverage.bed
Now we can have a look at the result:
$ head -n 10 coverage.bed
chrom start end name strand H3K4me3 H3K79me2 H3K36me3
chr1 996137 1006138 ENST00000624697 + 5.859314 4.0025 1.632737
chr1 1370168 1380169 ENST00000321751 - 25.743926000000002 12.174783 3.20208
chr1 1913796 1923797 ENST00000378602 - 5.943206 2.382962 1.3906610000000001
chr1 2189747 2199748 ENST00000420515 - 9.861013999999999 6.5540449999999995 2.187781
chr1 2492781 2502782 ENST00000473964 + 1.312969 1.083992 1.123888
chr1 3064210 3074211 ENST00000270722 + 1.0 1.0 1.0
chr1 3645072 3655073 ENST00000270708 - 10.365663 4.864514 3.6577339999999996
chr1 6419669 6429670 ENST00000377837 - 2.9228080000000003 2.569843 2.130887
chr1 9178749 9188750 ENST00000437157 - 1.331067 1.272973 1.436956
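The matrix output can then be post-processed with a few lines of Python. The sketch below assumes the file is tab-separated with the header shown above. top_regions is a hypothetical helper, and the sample rows are copied from the listing above.

```python
import csv
import io


def top_regions(handle, label, n=3):
    """Return the n (name, value) pairs with the highest signal for label."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["name"], float(row[label])) for row in reader]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


# A few rows from coverage.bed; with the real file use open("coverage.bed").
sample = io.StringIO(
    "chrom\tstart\tend\tname\tstrand\tH3K4me3\tH3K79me2\tH3K36me3\n"
    "chr1\t996137\t1006138\tENST00000624697\t+\t5.859314\t4.0025\t1.632737\n"
    "chr1\t1370168\t1380169\tENST00000321751\t-\t25.743926000000002\t12.174783\t3.20208\n"
    "chr1\t3064210\t3074211\tENST00000270722\t+\t1.0\t1.0\t1.0\n"
)
for name, value in top_regions(sample, "H3K4me3", n=2):
    print(name, value)
```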
Arguments:
$ gtftk coverage -h
Usage: gtftk coverage [-i GTF/BED] [-o TXT] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-w nb_window] [-k nb_proc] [-f ft_type] [-l labels] [-m name_column] [-p pseudo_count] [-n n_highest] [-x] [-zn] [-a key_name] [-s {mean,sum}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L] bw_list [bw_list ...]
Description:
* Takes a GTF as input to compute bigwig coverage in regions of interest (promoter, transcript body,
intron, intron_by_tx, tts...) or a BED6 to focus on user-defined regions. If --n-highest is
used the program will compute the coverage of each bigwig based on the average value of the n
windows (--nb-window) with the highest coverage values.
Notes:
* Regions were signal can be computed (if GTF file as input): promoter/tss, tts, introns,
intron_by_tx, intergenic regions or any feature available in the GTF file (transcript, exon,
gene...).
* If --matrix-out is selected, the signal for each bigwig will be provided in a dedicated
column. Otherwise, signal for each bigwig is provided through a dedicated line.
* If bed is used as input, each region should have its own name (column 4).
Version: 2018-02-05
Arguments:
bw_list A list of Bigwig file (last argument).
-i, --inputfile The input GTF/BED file. Only GTF file if <stdin> is used. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
-u, --upstream Extend the regions in 5' by a given value (int). (default: 0)
-d, --downstream Extend the regions in 3' by a given value (int). (default: 0)
-w, --nb-window Split the region into w bins (see -n). (default: 1)
-k, --nb-proc Use this many threads to compute coverage. (default: 1)
-f, --ft-type Region in which coverage is to be computed (promoter, intron, intergenic, tts or any feature defined in the column 3 of the GTF). (default: promoter)
-l, --labels Bigwig labels. (default: None)
-m, --name-column Use this ids to compute the name (4th column in bed output). (default: transcript_id)
-p, --pseudo-count A pseudo-count to add in case count is equal to 0. (default: 1)
-n, --n-highest For each bigwig, use the n windows with higher values to compute coverage. (default: None)
-x, --matrix-out Matrix output format. Bigwigs as column names features as rows. (default: False)
-zn, --zero-to-na Use NA not zero when region is undefined in bigwig or below window size. (default: False)
-a, --key-name If gtf format is requested, the name of the key. (default: cov)
-s, --stat The statistics to be computed for each region. (default: mean)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
mk_matrix¶
Description: Gtftk implements commands that can be used to produce coverage profiles around genomic features or inside user-defined regions. A coverage matrix first needs to be produced from a bigwig using the mk_matrix command.
Example:
We will use the same dataset (mini_real.gtf) as produced above (see help on coverage command).
We can now create a coverage matrix around TSS/TTS or along the full transcript (with or without 5’ and 3’ regions). Provide a BED file as --inputfile if you want to use your own, user-specific, regions. We will create several example datasets:
First we will create a coverage matrix around promoters based on a subset of randomly chosen transcripts (one per gene) from the ‘mini_real’ dataset (see section on the coverage command to get info about the construction of the mini_real_noov_rnd_tx.gtf.gz dataset).
$ gtftk get_example -f '*' -d mini_real
|-- 17:46-INFO-get_example : Copying: #H3K4me3_cond_1.bed#
|-- 17:46-INFO-get_example : Copying: airway_love.txt.gz
|-- 17:46-INFO-get_example : Copying: ENCFF112BHN_H3K4me3_K562_sub.bed
|-- 17:46-INFO-get_example : Copying: ENCFF119BYM_H3K36me3_K562_sub.bed
|-- 17:46-INFO-get_example : Copying: ENCFF431HAA_H3K36me3_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: ENCFF742FDS_H3K4me3_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: ENCFF947DVY_H3K79me2_K562_sub.bw
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_1.bed
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_2.bed
|-- 17:46-INFO-get_example : Copying: H3K4me3_cond_3.bed
|-- 17:46-INFO-get_example : Copying: hg38.genome
|-- 17:46-INFO-get_example : Copying: hg38.genome.back
|-- 17:46-INFO-get_example : Copying: mini_real.gtf.gz
|-- 17:46-INFO-get_example : Copying: mini_real_control_1.txt
|-- 17:46-INFO-get_example : Copying: mini_real_counts_ENCFF630HEX.txt
|-- 17:46-INFO-get_example : Copying: mini_real_gn_list_hg38.txt
|-- 17:46-INFO-get_example : Copying: tx_classes.txt
$ gtftk get_example -f '*' -d mini_real_noov_rnd_tx
|-- 17:46-INFO-get_example : Copying: mini_real_noov_rnd_tx.gtf.gz
$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -d 5000 -u 5000 -w 200 -c hg38.genome -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_promoter
Then we will also compute a coverage profile around the TTS.
$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -t tts -d 5000 -u 5000 -w 200 -c hg38.genome -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tts
The following command computes a coverage profile along the whole transcript.
$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -t transcript -d 5000 -u 5000 -w 200 -c hg38.genome -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tx
|-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:47-WARNING-mk_matrix : ENST00000612829 has length : 85
|-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:47-WARNING-mk_matrix : Filter them out please.
|-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:47-WARNING-mk_matrix : ENST00000385018 has length : 82
|-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:47-WARNING-mk_matrix : Filter them out please.
|-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:47-WARNING-mk_matrix : ENST00000583764 has length : 85
|-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:47-WARNING-mk_matrix : Filter them out please.
|-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:47-WARNING-mk_matrix : ENST00000637495 has length : 68
|-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:47-WARNING-mk_matrix : Filter them out please.
The same along the whole transcript, but increasing the number of windows dedicated to upstream and downstream regions.
$ gtftk mk_matrix -k 5 --bin-around-frac 0.5 -i mini_real_noov_rnd_tx.gtf.gz -t transcript -d 5000 -u 5000 -w 200 -c hg38.genome -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tx_2
|-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:48-WARNING-mk_matrix : ENST00000612829 has length : 85
|-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:48-WARNING-mk_matrix : Filter them out please.
|-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:48-WARNING-mk_matrix : ENST00000385018 has length : 82
|-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:48-WARNING-mk_matrix : Filter them out please.
|-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:48-WARNING-mk_matrix : ENST00000583764 has length : 85
|-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:48-WARNING-mk_matrix : Filter them out please.
|-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:48-WARNING-mk_matrix : ENST00000637495 has length : 68
|-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:48-WARNING-mk_matrix : Filter them out please.
Along a user-defined set of regions (in bed6 format). Here we will use the transcript coordinates in bed format as an example.
$ gtftk select_by_key -i mini_real_noov_rnd_tx.gtf.gz -k feature -v transcript | gtftk convert -f bed6 > mini_real_rnd_tx.bed
$ gtftk mk_matrix -k 5 --bin-around-frac 0.5 -i mini_real_rnd_tx.bed -t user_regions -d 5000 -u 5000 -w 200 -c hg38.genome -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_user_def
|-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:49-WARNING-mk_matrix : ENSG00000187514|ENST00000612829 has length : 85
|-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:49-WARNING-mk_matrix : Filter them out please.
|-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:49-WARNING-mk_matrix : ENSG00000207751|ENST00000385018 has length : 82
|-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:49-WARNING-mk_matrix : Filter them out please.
|-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:49-WARNING-mk_matrix : ENSG00000110717|ENST00000583764 has length : 85
|-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:49-WARNING-mk_matrix : Filter them out please.
|-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
|-- 17:49-WARNING-mk_matrix : ENSG00000148120|ENST00000637495 has length : 68
|-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
|-- 17:49-WARNING-mk_matrix : Filter them out please.
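The warnings above ask for regions shorter than the bin number to be filtered out. A minimal sketch of such a filter for a BED file (for instance mini_real_rnd_tx.bed) is given here. split_by_length is a hypothetical helper, the region names and coordinates are toy values, and the BED file is assumed to be tab-separated.

```python
def split_by_length(bed_lines, nb_window):
    """Split BED lines into (kept, too_short) for a given bin number.

    A region is kept when its length (end - start in BED coordinates)
    is at least nb_window.
    """
    kept, too_short = [], []
    for line in bed_lines:
        if not line.strip():
            continue  # skip empty lines
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        if end - start >= nb_window:
            kept.append(line)
        else:
            too_short.append(line)
    return kept, too_short


# Toy regions (made-up names and coordinates).
toy_bed = [
    "chr1\t1000\t1085\ttx_short\t0\t+\n",
    "chr1\t5000\t9000\ttx_long\t0\t-\n",
]
kept, too_short = split_by_length(toy_bed, nb_window=200)
print(len(kept), "kept;", len(too_short), "shorter than 200 bins")
```

The kept lines can be written to a new BED file and passed to mk_matrix with -t user_regions.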
And finally using a set of single-nucleotide coordinates that will be extended (-u/-d) and assessed for coverage. Here we will take the coordinates of TSS as an example.
$ gtftk select_by_key -i mini_real_noov_rnd_tx.gtf.gz -k feature -v transcript | gtftk get_5p_3p_coords > tss.bed
$ gtftk mk_matrix -k 5 -u 5000 -d 5000 -i tss.bed -w 200 -l H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_single_nuc -c hg38.genome -t single_nuc
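How a single-nucleotide coordinate may be turned into a window can be sketched as follows. This is an assumption about the behaviour, not the gtftk code: upstream and downstream are taken relative to the strand, and the window is clipped to the chromosome size read from the chrom-info file. extend_position and the toy chromosome are hypothetical.

```python
def extend_position(chrom, pos, strand, upstream, downstream, chrom_sizes):
    """Return a 0-based, half-open (start, end) window around pos.

    pos is a 0-based coordinate. On the minus strand, upstream lies
    towards higher coordinates, so the two extensions are swapped.
    """
    if strand == "-":
        left, right = downstream, upstream
    else:
        left, right = upstream, downstream
    start = max(0, pos - left)                      # clip at chromosome start
    end = min(chrom_sizes[chrom], pos + right + 1)  # clip at chromosome end
    return start, end


sizes = {"chrToy": 20000}  # toy chromosome size
print(extend_position("chrToy", 10000, "+", 5000, 5000, sizes))
print(extend_position("chrToy", 10000, "-", 5000, 2000, sizes))
print(extend_position("chrToy", 3000, "+", 5000, 5000, sizes))
```

With -u 5000 -d 5000 an unclipped window spans 10001 bases, which matches the region widths in the coverage example (e.g. 996137 to 1006138).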
profile¶
Description: This command is used to create profile diagrams from a mk_matrix output. The two important arguments for this command are --group-by, which defines the variable controlling the set of colored lines, and --facet-var, which defines the variable controlling the way the plot is faceted. Both --group-by and --facet-var should be set to one of bwig, tx_classes or chrom.
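The roles of the grouping and faceting variables can be illustrated with a small sketch. It assumes the matrix has been melted into one record per (region, bin) with the keys bwig, tx_classes, chrom, bin and value. Both this record layout and the summarise helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def summarise(records, group_by="bwig", facet_var=None):
    """Return the mean signal per (facet, group, bin).

    Each facet corresponds to one panel, each group to one colored line,
    and each bin to one point along the x axis.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        facet = rec[facet_var] if facet_var else "all"
        key = (facet, rec[group_by], rec["bin"])
        sums[key][0] += rec["value"]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Toy records (made-up values).
records = [
    {"bwig": "H3K4me3", "tx_classes": "protein_coding", "chrom": "chr1", "bin": 1, "value": 2.0},
    {"bwig": "H3K4me3", "tx_classes": "lincRNA", "chrom": "chr1", "bin": 1, "value": 4.0},
    {"bwig": "H3K36me3", "tx_classes": "lincRNA", "chrom": "chr2", "bin": 1, "value": 1.0},
]
print(summarise(records))                                   # one panel, one line per bwig
print(summarise(records, group_by="bwig", facet_var="tx_classes"))
```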
Basic profiles
A simple overlaid profile of all epigenetic marks around the promoter. Here --group-by is set to bwig by default and --facet-var is set to None. Thus a single plot with several lines, one per bwig, is obtained.
$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_01.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

The same diagram is obtained if a bed file pointing to TSS was provided to mk_matrix and used in single_nuc mode.
$ gtftk profile -i mini_real_single_nuc.zip -o profile_prom -pf png -if example_01a.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

Changing colors and applying color order can be done using the following syntax:
$ gtftk profile -D -i mini_real_promoter.zip -c 'red,blue,violet' -d H3K79me,H3K4me3,H3K36me3 -o profile_prom -pf png -if example_01b.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

A subset of the bigwigs assessed for coverage can be selected for plotting. This is achieved using the --subset-bwig argument:
$ gtftk profile -f bwig -g tx_classes -D -i mini_real_tx.zip -fo -o profile_tx -pf png -if example_01c.png -fo -c 'red' -V 2 -w -tl -e -lw 0.5 -u H3K4me3
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_xf6dvl8r
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_xf6dvl8r/mini_real_tx
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-DEBUG-profile : Color order :['All transcripts']
|-- 17:50-DEBUG-profile : Profile color :['red']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Ceiling
|-- 17:50-INFO-profile : Zero value detected. Adding a pseudocount (+1) before log transformation.
|-- 17:50-INFO-profile : Converting to log2.
|-- 17:50-INFO-profile : Computing column ordering.
|-- 17:50-INFO-profile : Preparing diagram
|-- 17:50-INFO-profile : Theming and ordering. Please be patient...
|-- 17:50-INFO-profile : Preparing x axis
|-- 17:50-INFO-profile : facet_col 1
|-- 17:50-INFO-profile : Highlighting upstream regions
|-- 17:50-INFO-profile : Page width set to 4
|-- 17:50-INFO-profile : Page height set to 2.0
|-- 17:50-INFO-profile : Saving diagram to file : example_01c.png
|-- 17:50-INFO-profile : Be patient. This may be long for large datasets.

Transcript coverage is obtained using the mini_real_tx.zip matrix. This provides a simple overlaid profile of all epigenetic marks along the transcript body extended in 5’ and 3’ regions:
$ gtftk profile -D -i mini_real_tx.zip -o profile_tx -pf png -if example_02.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

Almost the same, but increasing the number of bins dedicated to upstream and downstream regions (see the --bin-around-frac argument of mk_matrix).
$ gtftk profile -D -i mini_real_tx_2.zip -o profile_tx -pf png -if example_03.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

Note that the same is obtained when using user-defined regions (i.e. when providing as input a bed file corresponding to transcript coordinates).
$ gtftk profile -D -i mini_real_user_def.zip -o profile_udef_4 -pf png -if example_04.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.

The same dataset is used for plotting, but a normalization step (ranging) is added. With ranging normalization, values are expressed as a percentage of the range between the minimum and maximum values.
$ gtftk profile -D -nm ranging -i mini_real_user_def.zip -o profile_udef_5 -pf png -if example_04b.png
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
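The ranging transformation, (x_i - min(x))/(max(x) - min(x)), can be sketched in a few lines. The ranging function is a hypothetical stand-alone helper, and returning zeros for a constant input is a choice made for this sketch only.

```python
def ranging(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: the range is zero, so return zeros (sketch choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(ranging([2, 4, 6]))
```

Multiplying the result by 100 expresses each value as a percentage of the range.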

Two examples using the statistic ‘max’ and two different values of ‘--upper-limit’.
$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_04_max_a.png -V 2 -lw 1 -at 5 -s max -ul 1
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_7qeuvl3_
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_7qeuvl3_/mini_real_promoter
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
|-- 17:50-DEBUG-profile : Color order :['H3K36me3', 'H3K4me3', 'H3K79me']
|-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Computing column ordering.
|-- 17:50-INFO-profile : Preparing diagram
|-- 17:50-INFO-profile : Theming and ordering. Please be patient...
|-- 17:50-INFO-profile : Preparing x axis
|-- 17:50-INFO-profile : facet_col 1
|-- 17:50-INFO-profile : Page width set to 3
|-- 17:50-INFO-profile : Page height set to 2
|-- 17:50-INFO-profile : Saving diagram to file : example_04_max_a.png
|-- 17:50-INFO-profile : Be patient. This may be long for large datasets.

$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_04_max_b.png -V 2 -lw 1 -at 5 -s max -ul 0.99
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_fc_fcunw
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_fc_fcunw/mini_real_promoter
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
|-- 17:50-DEBUG-profile : Color order :['H3K79me', 'H3K4me3', 'H3K36me3']
|-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Ceiling
|-- 17:50-INFO-profile : Computing column ordering.
|-- 17:50-INFO-profile : Preparing diagram
|-- 17:50-INFO-profile : Theming and ordering. Please be patient...
|-- 17:50-INFO-profile : Preparing x axis
|-- 17:50-INFO-profile : facet_col 1
|-- 17:50-INFO-profile : Page width set to 3
|-- 17:50-INFO-profile : Page height set to 2
|-- 17:50-INFO-profile : Saving diagram to file : example_04_max_b.png
|-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
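The log of the run with -ul 0.99 contains a "Ceiling" step that is absent from the run with -ul 1. A plausible reading, stated here as an assumption and not taken from the gtftk source, is that --upper-limit is a quantile above which values are clipped. The sketch below illustrates that reading with a nearest-rank quantile. ceil_at_quantile is a hypothetical helper.

```python
import math


def ceil_at_quantile(values, upper_limit):
    """Clip values above the given quantile (assumed meaning of -ul).

    upper_limit = 1 leaves the values untouched.
    """
    if not 0 < upper_limit <= 1:
        raise ValueError("upper_limit must be in (0, 1]")
    if upper_limit == 1:
        return list(values)
    ordered = sorted(values)
    # Nearest-rank quantile: smallest value with rank >= upper_limit * n.
    rank = max(1, math.ceil(upper_limit * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


print(ceil_at_quantile([1, 2, 3, 100], 1))
print(ceil_at_quantile([1, 2, 3, 100], 0.75))
```

Under this assumption, clipping matters most with the ‘max’ statistic, where a single extreme value would otherwise dominate the profile.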

Faceted profiles
Faceted plot of epigenetic profiles. The groups (i.e. colors/lines) can be set to bwig classes and the facets to transcript classes. This is simply done by providing an additional file containing the transcripts and their associated classes.
Example:
$ gtftk profile -D -i mini_real_promoter.zip -f tx_classes -g bwig -fo -t tx_classes.txt -o profile_prom -pf png -if example_05.png -e -V 2 -fc 2
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_86ftqnm9
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_86ftqnm9/mini_real_promoter
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-INFO-profile : Reading transcript file.
|-- 17:50-INFO-profile : Deleting duplicates in transcript-file.
|-- 17:50-INFO-profile : Checking how many genes where found in the transcript list.
|-- 17:50-INFO-profile : Keeping 804 transcript out of 833.
|-- 17:50-DEBUG-profile : Color order :['H3K79me', 'H3K4me3', 'H3K36me3']
|-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Ceiling
|-- 17:50-INFO-profile : Computing column ordering.
|-- 17:50-INFO-profile : Preparing diagram
|-- 17:50-INFO-profile : Theming and ordering. Please be patient...
|-- 17:50-INFO-profile : Preparing x axis
|-- 17:50-INFO-profile : facet_col 2
|-- 17:50-INFO-profile : Page width set to 6
|-- 17:50-INFO-profile : Page height set to 5.0
|-- 17:50-INFO-profile : Saving diagram to file : example_05.png
|-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
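The transcript file is a two-column file (transcript, class), and the log above reports duplicate removal and the number of transcripts kept. A minimal sketch of these two steps is shown here, assuming tab-separated columns and that the first occurrence of a duplicated transcript wins. load_classes and the identifiers are hypothetical.

```python
def load_classes(lines, known_transcripts):
    """Map transcript id -> class for transcripts present in the matrix.

    Duplicated transcript ids keep their first class only.
    """
    classes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        tx, cls = fields[0], fields[1]
        if tx in known_transcripts and tx not in classes:
            classes[tx] = cls
    return classes


# Toy input (made-up identifiers).
tx_file = ["tx1\tprotein_coding\n", "tx2\tlincRNA\n", "tx1\tantisense\n", "tx9\tlincRNA\n"]
in_matrix = {"tx1", "tx2", "tx3"}
classes = load_classes(tx_file, in_matrix)
print("Keeping", len(classes), "transcripts out of", len(in_matrix))
```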

Alternatively, the groups can be set to chromosomes or transcript classes:
$ gtftk profile -D -i mini_real_promoter.zip -g tx_classes -f bwig -fo -t tx_classes.txt -o profile_prom -pf png -if example_06.png -V 2 -nm ranging
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_9k2idgi4
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_9k2idgi4/mini_real_promoter
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-INFO-profile : Reading transcript file.
|-- 17:50-INFO-profile : Deleting duplicates in transcript-file.
|-- 17:50-INFO-profile : Checking how many genes where found in the transcript list.
|-- 17:50-INFO-profile : Keeping 804 transcript out of 833.
|-- 17:50-DEBUG-profile : Color order :['lincRNA', 'protein_coding', 'antisense']
|-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Ceiling
|-- 17:50-INFO-profile : Normalizing (ranging)
|-- 17:50-INFO-profile : Computing column ordering.
|-- 17:50-INFO-profile : Preparing diagram
|-- 17:50-INFO-profile : Theming and ordering. Please be patient...
|-- 17:50-INFO-profile : Preparing x axis
|-- 17:50-INFO-profile : facet_col 3
|-- 17:50-INFO-profile : Page width set to 9
|-- 17:50-INFO-profile : Page height set to 2.0
|-- 17:50-INFO-profile : Saving diagram to file : example_06.png
|-- 17:50-INFO-profile : Be patient. This may be long for large datasets.

$ gtftk profile -D -i mini_real_promoter.zip -g chrom -f bwig -fo -t tx_classes.txt -o profile_prom -pf png -if example_06b.png -V 2 -nm ranging
|-- 17:50-DEBUG-profile : Using pandas version 0.23.4
|-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
|-- 17:50-DEBUG-profile : Using numpy version 1.13.3
|-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
|-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
|-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
|-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_z89mi377
|-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_z89mi377/mini_real_promoter
|-- 17:50-INFO-profile : Getting configuration info from input file.
|-- 17:50-DEBUG-profile : Color order :['chr11', 'chr15', 'chr2', 'chr3', 'chr13', 'chr14', 'chr1', 'chr6', 'chr4', 'chr9', 'chr21', 'chr8', 'chr7', 'chr10', 'chr17', 'chr18', 'chrX', 'chr19', 'chr12', 'chr22', 'chr5', 'chr20', 'chr16']
|-- 17:50-DEBUG-profile : Profile color :['#000000', '#6c007c', '#850096', '#2500a5', '#0000ca', '#0041dd', '#0086dd', '#009fca', '#00aaa1', '#00a76f', '#009c00', '#00bb00', '#00da00', '#00f900', '#88ff00', '#dbf400', '#f7db00', '#ffb500', '#ff6100', '#f60000', '#da0000', '#cc1313', '#cccccc']
|-- 17:50-INFO-profile : Searching coverage columns.
|-- 17:50-INFO-profile : Melting.
|-- 17:50-INFO-profile : Ceiling
|-- 17:51-INFO-profile : Normalizing (ranging)
|-- 17:51-INFO-profile : Computing column ordering.
|-- 17:51-INFO-profile : Preparing diagram
|-- 17:51-INFO-profile : Theming and ordering. Please be patient...
|-- 17:51-INFO-profile : Preparing x axis
|-- 17:51-INFO-profile : facet_col 3
|-- 17:51-INFO-profile : Page width set to 9
|-- 17:51-INFO-profile : Page height set to 2.0
|-- 17:51-INFO-profile : Saving diagram to file : example_06b.png
|-- 17:51-INFO-profile : Be patient. This may be long for large datasets.

Note that facets may also be associated with epigenetic marks. In this case, --group-by can be set to tx_classes or chrom.
$ gtftk profile -D -i mini_real_tx_2.zip -g tx_classes -t tx_classes.txt -f bwig -o profile_tx -pf png -if example_07.png -fo -w -nm ranging

$ gtftk profile -D -i mini_real_tx_2.zip -g chrom -f bwig -o profile_tx -pf png -if example_08.png -fo -w -nm ranging

Theming
The --theme argument controls plotnine theming.
$ gtftk profile -th classic -D -i mini_real_promoter.zip -g bwig -f chrom -o profile_prom -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_09b.png

$ gtftk profile -th seaborn -D -i mini_real_promoter.zip -g bwig -f chrom -o profile_prom -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_10.png

$ gtftk profile -th matplotlib -D -i mini_real_promoter.zip -g bwig -f chrom -o profile_prom -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_11.png

Playing with various commands
It is also possible to combine several of the previously seen commands to easily achieve more complex analyses. Here we will plot the epigenetic signal according to RNA-seq counts.
$ gtftk join_attr -i mini_real_noov_rnd_tx.gtf -j mini_real_counts_ENCFF630HEX.txt -k gene_name -n counts | gtftk discretize_key -k counts -n 6 -d count_levels -pu | gtftk tabulate -k transcript_id,count_levels -o tx_exp_classes.txt -Hun
|-- 17:52-INFO-discretize_key : Categories: ['(-41.703_183.167]', '(183.167_574.667]', '(574.667_1035.0]', '(1035.0_1647.333]', '(1647.333_3212.667]', '(3212.667_41703.0]']
$ gtftk profile -D -i mini_real_tx.zip -o profile_tx -pf png -if example_12.png -g tx_classes -t tx_exp_classes.txt -f bwig -w -nm ranging -m viridis
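The file passed to -t is described in the help as a two-column file listing the transcripts of interest and their classes. As a rough illustration, the sketch below parses such a file and counts transcripts per class, similar in spirit to what -w displays. It assumes tab-separated columns and no header; the function name and the sample IDs are hypothetical and not part of gtftk.

```python
from collections import Counter
import io


def count_transcripts_per_class(handle):
    """Count transcripts per class in a two-column, tab-separated file.

    Assumed layout (not checked against gtftk's own parser):
    transcript_id<TAB>class, one transcript per line, no header.
    """
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        transcript_id, tx_class = line.split("\t")[:2]
        counts[tx_class] += 1
    return counts


# Hypothetical content mimicking a file such as tx_exp_classes.txt.
example = io.StringIO(
    "tx_A\tlow\n"
    "tx_B\tlow\n"
    "tx_C\thigh\n"
)
class_sizes = count_transcripts_per_class(example)
print(dict(class_sizes))
```

A quick check of this kind can help spot empty or very small classes before plotting.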

Arguments:
$ gtftk profile -h
Usage: gtftk profile -i MATRIX [-o DIR] [-t transcript_file] [-s {mean,median,sum,min,max}] [-e] [-c profile_colors] [-d color_order] [-g {bwig,tx_classes,chrom}] [-f {bwig,tx_classes,chrom}] [-pw page_width] [-ph page_height] [-pf {pdf,png}] [-lw line_width] [-bc border_color] [-x x_lab] [-at axis_text] [-st strip_text] [-u subset_bwig] [-fc facet_col] [-fo] [-w] [-if user_img_file] [-ul upper_limit] [-nm {none,ranging}] [-tl] [-ti title] [-dpi dpi] [-th {538,bw,grey,gray,linedraw,light,dark,minimal,classic,void,test,matplotlib,seaborn}] [-m palette] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Produces bigWig coverage profiles using calls to plotnine graphic package.
Notes:
* The ranging normalization method [1] implies the following transformation:
* - (x_i - min(x))/(max(x) - min(x)).
* Think about using normalized bigWig files as input to mk_matrix. This will limit the
requirement for an additional normalization step (see Deeptools for a set of useful methods
implemented in bamCoverage/bamCompare).
References:
* [1] Numerical Ecology - second Edition - P. Legendre, L. Legendre (1998) Elsevier.
Version: 2018-01-20
Arguments:
-i, --inputfile A zip file containing a matrix as produced by mk_matrix. (default: None)
-o, --out-dir Output directory name. (default: draw_profile)
-t, --transcript-file A two columns file with the transcripts of interest and their classes. (default: None)
-s, --stat The statistics to be computed. (default: mean)
-e, --confidence-interval Add a confidence interval to estimate standard error of the mean. (default: False)
-c, --profile-colors Colors. (default: None)
-d, --color-order Factor ordering. Comma separated bwig labels or tx classes. (default: None)
-g, --group-by The variable used for grouping. (default: None)
-f, --facet-var The variable to be used for splitting into facets. (default: None)
-pw, --page-width Output pdf file width (e.g. 7 inches). (default: None)
-ph, --page-height Output file height (e.g. 5 inches). (default: None)
-pf, --page-format Output file format. (default: pdf)
-lw, --line-width Line width. (default: 1.25)
-bc, --border-color Border color for the plot. (default: #777777)
-x, --x-lab X axis label. (default: Selected genomic regions)
-at, --axis-text Size of axis text. (default: 8)
-st, --strip-text Size of strip text. (default: 8)
-u, --subset-bwig Use only a subset of the bigwigs for plotting (default: None)
-fc, --facet-col Number of facet columns. (default: 4)
-fo, --force-tx-class Force even if some transcripts from --transcript-file were not found. (default: False)
-w, --show-group-number Show the number of element per group. (default: False)
-if, --user-img-file Provide an alternative path for the image. (default: None)
-ul, --upper-limit Upper limit based on quantile computed from unique values. (default: 0.95)
-nm, --normalization-method The normalization method performed on a per bigwig basis. (default: none)
-tl, --to-log Control whether the data should be log2-transform before plotting. (default: False)
-ti, --title A title for the diagram. (default: )
-dpi, --dpi Dpi to use. (default: 300)
-th, --theme-plotnine The theme for plotnine diagram. (default: bw)
-m, --palette A color palette (see: https://tinyurl.com/ydacyfxx). (default: nipy_spectral)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
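The ranging transformation quoted in the notes of the profile help, (x_i - min(x))/(max(x) - min(x)), rescales each signal to the [0, 1] interval. A minimal pure-Python sketch follows; the function name, the sample values and the handling of constant input are assumptions made for illustration, not gtftk's implementation.

```python
def ranging(values):
    """Rescale values to [0, 1] using (x_i - min(x)) / (max(x) - min(x))."""
    values = list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: constant input is mapped to zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


coverage = [2.0, 4.0, 6.0, 10.0]  # hypothetical coverage values
scaled = ranging(coverage)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the help states that normalization is performed on a per-bigWig basis, such a function would be applied separately to the values of each bigWig, which makes profiles with different dynamic ranges comparable on one axis.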
Commands from section ‘miscellaneous’
col_from_tab
Description: Select columns from a tabulated file based on their names.
Example:
$ gtftk get_example | gtftk tabulate -k all | gtftk col_from_tab -c start,end,seqid | head -n 20
start end seqid
Arguments:
$ gtftk col_from_tab -h
Usage: gtftk col_from_tab [-i GTF] [-o TXT] -c columns [-n] [-u] [-s SEP] [-H] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Select columns from a tabulated file based on their names.
Version: 2018-01-20
Arguments:
-i, --inputfile The tabulated file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --columns The list (csv) of column names. (default: None)
-n, --invert-match Not/invert match. (default: False)
-u, --unique Write non redondant lines. (default: False)
-s, --separator The separator to be used for separating name elements (see -n). (default: )
-H, --no-header Don't print the header line. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
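Selecting columns by name, as col_from_tab does, amounts to looking up each requested name in the header and reordering every row accordingly. The sketch below shows the idea with the Python standard library; the function name and the sample table are hypothetical, and options such as -n, -u and -H are not reproduced.

```python
import csv
import io


def select_columns(handle, columns, sep="\t"):
    """Yield rows restricted to the named columns, in the requested order."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    # Map each requested name to its position in the header.
    indices = [header.index(name) for name in columns]
    yield [header[i] for i in indices]
    for row in reader:
        yield [row[i] for i in indices]


# Hypothetical tabulated input; the values are made up for illustration.
table = io.StringIO(
    "seqid\tstart\tend\tfeature\n"
    "chr1\t125\t138\tgene\n"
    "chr1\t180\t189\tgene\n"
)
selected = list(select_columns(table, ["start", "end", "seqid"]))
for row in selected:
    print("\t".join(row))
```

In this sketch, the output columns follow the order given by the caller rather than the order in the input, and an unknown column name raises a ValueError.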
control_list
Description: Returns a list of genes matched for expression based on reference values. Given a reference gene list (or, more generally, a list of IDs), this command tries to extract a set of other genes/IDs matched for signal/expression. The reference file (-r, --referenceGeneFile) contains the list of reference IDs, while the input file (-i, --in-file) contains a gene/signal pair for every gene.
Example:
$ gtftk control_list -i mini_real_counts_ENCFF630HEX.txt -r mini_real_control_1.txt -D -V 2 -s -l -p 1 -ju -if example_13.png -pf png
|-- 17:52-INFO-control_list : 0 duplicate lines have been deleted in reference file.
|-- 17:52-INFO-control_list : Found 50 genes of the reference in the provided signal file
|-- 17:52-INFO-control_list : All reference genes were found.
|-- 17:52-INFO-control_list : Searching for genes with matched signal.
|-- 17:53-INFO-control_list : Preparing a dataframe for plotting.
|-- 17:53-INFO-control_list : Saving diagram to file : example_13.png
|-- 17:53-INFO-control_list : Be patient. This may be long for large datasets.

Arguments:
$ gtftk control_list -h
Usage: gtftk control_list --in-file TXT --referenceGeneFile TXT [--out-dir DIR] [--log2] [--pseudo-count pseudo_count] [-pw page_width] [-ph page_height] [-pf {pdf,png}] [-dpi dpi] [--skip-first] [--rug] [--jitter] [-if user_img_file] [-c set_colors] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]
Description:
* Based on a reference gene list (or more generally IDs) this command tries to extract a set of
other genes/IDs matched for signal/expression. The --reference-gene-file contains the list of
reference IDs while the --inputfile contains a tuple gene/signal for all genes.
Notes:
* --infile is a two columns tabulated file. The first column contains the list of ids
(including reference IDs) and the second column contains the expression/signal values. This
file should contain no header.
* Think about discarding any unwanted IDs from --infile before calling control_list.
Version: 2018-01-20
optional arguments:
--in-file, -i A two columns tab-file. See notes. (default: None)
--referenceGeneFile, -r The file containing the reference gene list (1 column, transcript ids). No header. (default: None)
--out-dir, -o Name of the output directory. (default: control_list)
--log2, -l If selected, data will be log transformed. (default: False)
--pseudo-count, -p The value for a pseudo-count to be added. (default: 0)
-pw, --page-width Output pdf file width (e.g. 7 inches). (default: None)
-ph, --page-height Output file height (e.g. 5 inches). (default: None)
-pf, --page-format Output file format. (default: pdf)
-dpi, --dpi Dpi to use. (default: 300)
--skip-first, -s Indicates that infile hase a header. (default: False)
--rug, -u Add rugs to the diagram. (default: False)
--jitter, -j Add jittered points. (default: False)
-if, --user-img-file Provide an alternative path for the image. (default: None)
-c, --set-colors Colors for the two sets (comma separated). (default: #b2df8a,#6a3d9a)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Increase output verbosity. (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
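The idea of extracting control IDs matched for signal can be illustrated with a simple greedy nearest-neighbor strategy. This is a sketch under stated assumptions and not necessarily the algorithm used by control_list: for each reference ID it picks the not-yet-used non-reference ID whose signal is closest, optionally after a log2(x + pseudo_count) transform in the spirit of --log2 and --pseudo-count. The function name and the sample values are hypothetical.

```python
import math


def matched_controls(signal, reference_ids, log2=False, pseudo_count=0):
    """Pick, for each reference ID, the unused non-reference ID with the closest signal.

    signal: dict mapping ID -> numeric signal (the two-column input file).
    reference_ids: iterable of IDs that are present in `signal`.
    Greedy nearest-neighbor matching; an illustrative assumption only.
    """
    def transform(x):
        # log2(0) is undefined, so a positive pseudo-count is needed for zeros.
        return math.log2(x + pseudo_count) if log2 else x

    values = {key: transform(val) for key, val in signal.items()}
    reference = set(reference_ids)
    candidates = {key for key in values if key not in reference}
    controls = []
    for ref in sorted(reference):  # fixed order for reproducibility
        if not candidates:
            break
        # Ties on distance are broken by ID to keep the result deterministic.
        best = min(candidates, key=lambda k: (abs(values[k] - values[ref]), k))
        controls.append(best)
        candidates.remove(best)
    return controls


# Hypothetical signal table (ID -> expression value).
signal = {"g1": 10, "g2": 12, "g3": 100, "g4": 95, "g5": 500, "g6": 11}
controls = matched_controls(signal, ["g1", "g3"], log2=True, pseudo_count=1)
print(controls)  # ['g6', 'g4']
```

Each candidate is used at most once here, so the control set has the same size as the reference set whenever enough non-reference IDs are available.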