Help on gtftk Unix commands

Main parser arguments of gtftk

Getting help with -h

The -h argument can be used to get a synopsis for implemented commands.

$ gtftk -h
  Usage: gtftk [-h] [-b] [-p] [-s] [-v] [-l]  ...

  A toolbox to handle GTF files.

  Example:

  gtftk get_example -f chromInfo -o simple.chromInfo ; 
  gtftk get_example  | gtftk feature_size -t mature_rna | gtftk nb_exons |\
  gtftk intron_sizes | gtftk exon_sizes | gtftk convergent -u 24 -d 24  -c simple.chromInfo | \
  gtftk divergent -u 101 -d 10  -c simple.chromInfo  | \
  gtftk overlapping -u 0 -d 0 -t transcript -c simple.chromInfo -a |  \
  gtftk select_by_key -k feature -v transcript |   gtftk tabulate -k "*" -b
  

  Type 'gtftk sub-command -h' for more information.

    

Main command arguments:
 -h, --help                   show this help message and exit
 -b, --bash-comp              Get a script to activate bash completion. (default: False)
 -p, --plugin-tests           Display bats tests for all plugin. (default: False)
 -s, --system-info            Display some info about the system. (default: False)
 -v, --version                show program's version number and exit
 -l, --list-plugins           Get the list of plugins. (default: False)

Available sub-commands/plugins:
 
  
------- editing --------

   add_prefix                 Add a prefix or suffix to target values.
   del_attr                   Delete attributes in the target gtf file.
   discretize_key             Create a new key through discretization of a numeric key.
   join_attr                  Join attributes from a tabulated file.
   join_multi_file            Join attributes from mutiple files.
   merge_attr                 Merge a set of attributes into a destination attribute.
  
----- information ------

   add_exon_nb                Add exon number transcript-wise.
   apropos                    Search in all command description files those related to a user-defined keyword.
   count                      Count the number of features in the gtf file.
   count_key_values           Count the number values for a set of keys.
   feature_size               Compute the size of features enclosed in the GTF.
   get_attr_list              Get the list of attributes from a GTF file.
   get_attr_value_list        Get the list of values observed for an attributes.
   get_example                Get example files including GTF.
   get_feature_list           Get the list of features enclosed in the GTF.
   nb_exons                   Count the number of exons by transcript.
   nb_transcripts             Count the number of transcript per gene.
   retrieve                   Retrieve a GTF file from ensembl.
   seqid_list                 Returns the chromosome list.
   tss_dist                   Computes the distance between TSS of gene transcripts.
  
------ selection -------

   random_list                Select a random list of genes or transcripts.
   random_tx                  Select randomly up to m transcript for each gene.
   rm_dup_tss                 If several transcripts of a gene share the same TSS, select only one representative.
   select_by_go               Select lines from a GTF file using a Gene Ontology ID.
   select_by_intron_size      Select transcripts by intron size.
   select_by_key              Select lines from a GTF based on attributes and values.
   select_by_loc              Select transcript/gene overlapping a genomic feature.
   select_by_max_exon_nb      For each gene select the transcript with the highest number of exons.
   select_by_nb_exon          Select transcripts based on the number of exons.
   select_by_numeric_value    Select lines from a GTF file based on a boolean test on numeric values.
   select_by_regexp           Select lines from a GTF file based on a regexp.
   select_by_tx_size          Select transcript based on their size (i.e size of mature/spliced transcript).
   select_most_5p_tx          Select the most 5' transcript of each gene.
   short_long                 Get the shortest or longest transcript of each gene
  
------ conversion ------

   bed_to_gtf                 Convert a bed file to a gtf but with lots of empty fields...
   convert                    Convert a GTF to various format including bed.
   convert_ensembl            Convert the GTF file to ensembl format. Essentially add 'transcript'/'gene' features.
   tabulate                   Convert a GTF to tabulated format.
  
------ annotation ------

   closest_genes              Find the n closest genes for each transcript.
   convergent                 Find transcripts with convergent tts.
   divergent                  Find transcripts with divergent promoters.
   exon_sizes                 Add a new key to transcript features containing a comma separated list of exon sizes.
   intron_sizes               Add a new key to transcript features containing a comma separated list of intron sizes.
   overlapping                Find (non)overlapping transcripts.
  
------- sequence -------

   get_feat_seq               Get feature sequence (e.g exon, UTR...).
   get_tx_seq                 Get transcript sequences in fasta format.
  
----- coordinates ------

   get_5p_3p_coords           Get the 5p or 3p coordinate for each feature. TSS or TTS for a transcript.
   intergenic                 Extract intergenic regions.
   intronic                   Extract intronic regions.
   midpoints                  Get the midpoint coordinates for the requested feature.
   shift                      Transpose coordinates.
   splicing_site              Compute the locations of donor and acceptor splice sites.
  
------- coverage -------

   coverage                   Compute bigwig coverage in body, promoter, tts...
   mk_matrix                  Compute a coverage matrix (see profile).
   profile                    Create coverage profile using a bigWig as input.
  
----- miscellaneous ----

   col_from_tab               Select columns from a tabulated file based on their names.
   control_list               Returns a list of gene matched for expression based on reference values.

------------------------

Activating Bash completion

The commands below activate bash completion.

# Use the -b argument of gtftk
# This will produce a script that you
# should store in your .bashrc
gtftk -b

Or alternatively

echo "" >> ~/.bashrc
gtftk -b >> ~/.bashrc
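
Note that the second form appends the completion script every time it is run. A minimal sketch of a guard is given below, assuming the script can be prefixed with a comment line acting as a marker. The append_once helper and the marker text are hypothetical and not part of gtftk.

```shell
# Hypothetical helper (not part of gtftk): append stdin to FILE only if
# a line "# MARKER" is not already present in it.
# Usage: some_command | append_once MARKER FILE
append_once() {
    marker=$1
    file=$2
    # Create the file if needed so that grep does not fail on a missing path.
    touch "$file"
    if grep -qxF "# $marker" "$file"; then
        # Marker already present: discard stdin and leave the file untouched.
        cat > /dev/null
    else
        {
            printf '\n# %s\n' "$marker"
            cat
        } >> "$file"
    fi
}
```

With this helper, gtftk -b | append_once "gtftk bash completion" ~/.bashrc can be run repeatedly without duplicating the script.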

Getting the list of functional tests

The list of implemented tests can be obtained with the -p/--plugin-tests argument. These tests may be run using bats (Bash Automated Testing System).

# gtftk --plugin-tests

Command-wide arguments

Description: The following arguments are available in almost all gtftk commands:

  • -h, --help: Show the argument list and details.
  • -i, --inputfile: The input file (may be <stdin>).
  • -o, --outputfile: The output file (may be <stdout>).
  • -D, --no-date: Do not add date to output file names.
  • -C, --add-chr: Add ‘chr’ to chromosome names before printing output.
  • -V, --verbosity: Increase output verbosity (takes a value from 0 to 4).
  • -K, --tmp-dir: Keep all temporary files in this folder.
  • -L, --logger-file: Store the values of all command line arguments in a file.

Commands from section ‘information’

apropos

Description: Search in all command description files those related to a user-defined keyword.

Example: Search all commands related to promoters.

$ gtftk apropos -k promoter
 |-- 17:42-INFO-apropos : >> Keyword 'promoter' was found in the following command:
	- coverage.
	- divergent.

Arguments:

$ gtftk apropos -h
  Usage: gtftk apropos -k keyword [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Search in all command description files those related to a user-defined keyword.

  Version:  2017-09-27

Arguments:
 -k, --keyword      The keyword (default: None)
 -n, --notes        Look also for the keywords in notes associated to each command. (default: False)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

retrieve

Description: Retrieve a GTF file from ensembl.

Example: List the GTF files available on the Ensembl FTP site. Bacteria are not listed at the moment.

$ # gtftk retrieve -l | head -5

Example: Perform basic statistics on Vicugna pacos genomic annotations.

$ # gtftk retrieve -s vicugna_pacos -c  -d | gtftk  count -t vicugna_pacos

Arguments:

$ gtftk retrieve -h
  Usage: gtftk retrieve [-s SPECIES] [-o GTF.GZ] [-e {vertebrate,protists,fungi,plants,metazoa}] [-r RELEASE] [-l] [-hs] [-c] [-d] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Retrieve a GTF file from ensembl.

  Version:  2018-01-31

Arguments:
 -s, --species-name        The species name. (default: homo_sapiens)
 -o, --outputfile          Output file (gtf.gz). (default: None)
 -e, --ensembl-collection  Which ensembl collection to interrogate('vertebrate', 'protists', 'fungi', 'plants', 'metazoa'). (default: vertebrate)
 -r, --release             Release number. By default, the latest. (default: None)
 -l, --list-only           If selected, list available species. (default: False)
 -hs, --hide-species-name  If selected, hide species names upon listing. (default: False)
 -c, --to-stdout           This option specifies that output will go to the standard output stream, leaving downloaded files intact (or not, see -d). (default: False)
 -d, --delete              Delete the gtf file after processing (e.g if -c is used). (default: False)

Command-wise optional arguments:
 -h, --help                Show this help message and exit.
 -V, --verbosity           Increase output verbosity. (default: 0)
 -D, --no-date             Do not add date to output file names. (default: False)
 -C, --add-chr             Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir             Keep all temporary files into this folder. (default: None)
 -A, --keep-all            Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file         Stores the arguments passed to the command into a file. (default: None)

get_example

Description: Get an example GTF file (or any other kind of example available in the installation directory). This command is only provided for demonstration purposes.

We can see from the example below that this gtf file follows the ensembl format and contains the transcript and gene features (column 3).

Example: A very basic (and artificial) example.

$ gtftk get_example| head -2
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";

Example: A more realistic example containing a subset of transcripts (n=8531) corresponding to 1058 genes from the human annotation.

$ gtftk get_example -d mini_real | gtftk count
gene	1058
transcript	8531
exon	64251
five_prime_utr	7561
CDS	41371
start_codon	4171
stop_codon	3753
three_prime_utr	6972
Selenocysteine	2

Example: Get all files from the simple dataset.

$ gtftk get_example -d simple -f '*'
 |-- 17:42-INFO-get_example : Copying: add_attr_to_pos.tab
 |-- 17:42-INFO-get_example : Copying: simple.1.bt2
 |-- 17:42-INFO-get_example : Copying: simple.2.bt2
 |-- 17:42-INFO-get_example : Copying: simple.2.bw
 |-- 17:42-INFO-get_example : Copying: simple.3.bt2
 |-- 17:42-INFO-get_example : Copying: simple.3.bw
 |-- 17:42-INFO-get_example : Copying: simple.4.bt2
 |-- 17:42-INFO-get_example : Copying: simple.bam
 |-- 17:42-INFO-get_example : Copying: simple.bam.bai
 |-- 17:42-INFO-get_example : Copying: simple.bw
 |-- 17:42-INFO-get_example : Copying: simple.chromInfo
 |-- 17:42-INFO-get_example : Copying: simple.fa
 |-- 17:42-INFO-get_example : Copying: simple.fa.fai
 |-- 17:42-INFO-get_example : Copying: simple.geneList
 |-- 17:42-INFO-get_example : Copying: simple.genes
 |-- 17:42-INFO-get_example : Copying: simple.gtf
 |-- 17:42-INFO-get_example : Copying: simple.join
 |-- 17:42-INFO-get_example : Copying: simple.join_mat
 |-- 17:42-INFO-get_example : Copying: simple.join_mat_2
 |-- 17:42-INFO-get_example : Copying: simple.join_mat_3
 |-- 17:42-INFO-get_example : Copying: simple.join_with_dup
 |-- 17:42-INFO-get_example : Copying: simple.loc_bed
 |-- 17:42-INFO-get_example : Copying: simple.map
 |-- 17:42-INFO-get_example : Copying: simple.rev.1.bt2
 |-- 17:42-INFO-get_example : Copying: simple.rev.2.bt2
 |-- 17:42-INFO-get_example : Copying: simple_col.csv
 |-- 17:42-INFO-get_example : Copying: simple_peaks.bed
 |-- 17:42-INFO-get_example : Copying: simple_peaks.bed3
 |-- 17:42-INFO-get_example : Copying: simple_peaks.bed6
 |-- 17:42-INFO-get_example : Copying: simple_reads.fq

Arguments:

$ gtftk get_example -h
  Usage: gtftk get_example [-d {simple,mini_real,mini_real_noov_rnd_tx,simple_02,simple_03,simple_04,simple_05}] [-f {*,gtf,bed,bw,bam,join,join_mat,chromInfo,fa,fa.idx,genes,geneList,2.bw,genome}] [-o OUTPUT] [-l] [-a] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Print example files including GTF.

  Notes:
     *  Use format '*' to get all files from a dataset.

  Version:  2018-01-20

Arguments:
 -d, --dataset      Select a dataset. (default: simple)
 -f, --format       The dataset format. (default: gtf)
 -o, --outputfile   Output file. (default: <stdout>)
 -l, --list         Only list files of a dataset. (default: False)
 -a, --all-dataset  Get the list of all datasets. (default: False)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

add_exon_nb

Description: Add exon number transcript-wise (based on 5’ to 3’ orientation).

Example:

$ gtftk  get_example -f gtf | gtftk add_exon_nb  | gtftk select_by_key -k feature -v exon
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001"; exon_nbr "1";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001"; exon_nbr "1";
chr1	gtftk	exon	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001"; exon_nbr "1";
chr1	gtftk	exon	50	54	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001"; exon_nbr "2";
chr1	gtftk	exon	57	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002"; exon_nbr "1";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001"; exon_nbr "1";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002"; exon_nbr "2";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003"; exon_nbr "3";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001"; exon_nbr "1";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002"; exon_nbr "2";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003"; exon_nbr "3";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E001"; exon_nbr "2";
chr1	gtftk	exon	42	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E002"; exon_nbr "1";
chr1	gtftk	exon	22	25	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E001"; exon_nbr "3";
chr1	gtftk	exon	28	30	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E002"; exon_nbr "2";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E003"; exon_nbr "1";
chr1	gtftk	exon	28	30	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E001"; exon_nbr "2";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E002"; exon_nbr "1";
chr1	gtftk	exon	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001"; exon_id "G0007T001E001"; exon_nbr "1";
chr1	gtftk	exon	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002"; exon_id "G0007T002E001"; exon_nbr "1";
chr1	gtftk	exon	210	214	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E001"; exon_nbr "2";
chr1	gtftk	exon	220	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E002"; exon_nbr "1";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; exon_nbr "1";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; exon_nbr "1";
chr1	gtftk	exon	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001"; exon_id "G0010T001E001"; exon_nbr "1";
$ gtftk get_example -f gtf | gtftk add_exon_nb  -k exon_number | gtftk select_by_key -k feature -v exon | gtftk tabulate -k chrom,start,end,exon_number,transcript_id | head -n 20
seqid	start	end	exon_number	transcript_id
chr1	125	138	1	G0001T002
chr1	125	138	1	G0001T001
chr1	180	189	1	G0002T001
chr1	50	54	2	G0003T001
chr1	57	61	1	G0003T001
chr1	65	68	1	G0004T002
chr1	71	71	2	G0004T002
chr1	74	76	3	G0004T002
chr1	65	68	1	G0004T001
chr1	71	71	2	G0004T001
chr1	74	76	3	G0004T001
chr1	33	35	2	G0005T001
chr1	42	47	1	G0005T001
chr1	22	25	3	G0006T001
chr1	28	30	2	G0006T001
chr1	33	35	1	G0006T001
chr1	28	30	2	G0006T002
chr1	33	35	1	G0006T002
chr1	107	116	1	G0007T001
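
The output above shows that numbering follows transcript orientation: on the minus strand, the exon with the highest coordinates gets number 1. The sketch below reproduces this rule with standard tools only. It is not the gtftk implementation; the number_exons function name and its four-column output (transcript_id, start, end, exon number) are assumptions made for illustration.

```shell
# Sketch: number exons 5' to 3' within each transcript of a GTF stream.
# Output columns: transcript_id, start, end, exon number.
number_exons() {
    awk -F'\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]*"/) {
            # Strip the 15-character prefix and the closing quote.
            tx = substr($9, RSTART + 15, RLENGTH - 16)
            # Sort key: start on the plus strand, negated start on the minus
            # strand, so that an ascending sort always runs 5-prime to 3-prime.
            key = ($7 == "-") ? -$4 : $4
            print tx "\t" key "\t" $4 "\t" $5
        }
    ' | sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' '
        $1 != prev { n = 0; prev = $1 }     # restart numbering per transcript
        { printf "%s\t%s\t%s\t%d\n", $1, $3, $4, ++n }
    '
}
```

For instance, gtftk get_example | number_exons should give number 1 to exon 57-61 of G0003T001 and number 2 to exon 50-54, in line with the exon_nbr values shown above.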

Arguments:

$ gtftk add_exon_nb -h
  Usage: gtftk add_exon_nb [-i GTF] [-o GTF] [-k exon_numbering_key] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Add exon number transcript-wise (based on 5' to 3' orientation).

  Version:  2018-01-20

Arguments:
 -i, --inputfile           Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile          Output file. (default: <stdout>)
 -k, --exon-numbering-key  The name of the key containing the exon numbering. (default: exon_nbr)

Command-wise optional arguments:
 -h, --help                Show this help message and exit.
 -V, --verbosity           Increase output verbosity. (default: 0)
 -D, --no-date             Do not add date to output file names. (default: False)
 -C, --add-chr             Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir             Keep all temporary files into this folder. (default: None)
 -A, --keep-all            Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file         Stores the arguments passed to the command into a file. (default: None)

count

Description: Count the number of features (transcripts, genes, exons, introns).

Example:

$ gtftk  get_example -f gtf | gtftk count  -t example_gtf
gene	10	example_gtf
transcript	15	example_gtf
exon	25	example_gtf
CDS	20	example_gtf
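
For readers who want to cross-check these counts without gtftk, the sketch below tallies the third GTF column (the feature type) with awk. The count_features function name is hypothetical, and the sketch does not reproduce the optional third column added by -t.

```shell
# Sketch: count the lines of each feature type (column 3) in a GTF stream,
# reporting feature types in order of first appearance.
count_features() {
    awk -F'\t' '
        /^#/ { next }                      # skip comment lines
        !($3 in n) { order[++k] = $3 }     # remember first appearance
        { n[$3]++ }
        END {
            for (i = 1; i <= k; i++)
                printf "%s\t%d\n", order[i], n[order[i]]
        }
    '
}
```

It reads a GTF on standard input, e.g. gtftk get_example | count_features.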

Arguments:

$ gtftk count -h
  Usage: gtftk count [-i GTF] [-o TXT] [-d header] [-t TEXT] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Count the number of each features in the gtf file.

  Version:  2018-01-20

Arguments:
 -i, --inputfile        Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile       Output file. (default: <stdout>)
 -d, --header           A comma-separated list of string to use as header. (default: None)
 -t, --additional-text  A facultative text to be printed in the third column (e.g species name). (default: None)

Command-wise optional arguments:
 -h, --help             Show this help message and exit.
 -V, --verbosity        Increase output verbosity. (default: 0)
 -D, --no-date          Do not add date to output file names. (default: False)
 -C, --add-chr          Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir          Keep all temporary files into this folder. (default: None)
 -A, --keep-all         Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file      Stores the arguments passed to the command into a file. (default: None)

count_key_values

Description: Count the number of values for a set of keys.

Example: Count the number of times gene_id and transcript_id appear in the GTF file.

$ gtftk get_example | gtftk count_key_values -k gene_id,transcript_id
gene_id	70
transcript_id	70

Example: Count the number of non-redundant entries for chromosomes and transcript_id.

$ gtftk get_example | gtftk count_key_values -k chrom,transcript_id -u
chrom	1
transcript_id	16

Arguments:

$ gtftk count_key_values -h
  Usage: gtftk count_key_values [-i GTF] [-o TXT] [-k keys] [-t TEXT] [-u] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Count the number of values/levels for a set of keys.

  Version:  2018-01-20

Arguments:
 -i, --inputfile        Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile       Output file. (default: <stdout>)
 -k, --keys             The set of keys of interest. (default: *)
 -t, --additional-text  A facultative text to be printed in the third column (e.g species name). (default: None)
 -u, --uniq             Ask for the count of non redondant values. (default: False)

Command-wise optional arguments:
 -h, --help             Show this help message and exit.
 -V, --verbosity        Increase output verbosity. (default: 0)
 -D, --no-date          Do not add date to output file names. (default: False)
 -C, --add-chr          Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir          Keep all temporary files into this folder. (default: None)
 -A, --keep-all         Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file      Stores the arguments passed to the command into a file. (default: None)

get_attr_list

Description: Get the list of attributes from a GTF file.

Example: Get the list of attributes in the “simple” dataset.

$ gtftk get_example | gtftk get_attr_list
gene_id
transcript_id
exon_id
ccds_id
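
The attribute names listed above are the keys found in the ninth GTF column. A rough equivalent with awk is sketched below, assuming that attribute values never contain a semicolon. The list_attr_keys function name is hypothetical and this is not how gtftk itself parses the file.

```shell
# Sketch: list the attribute keys found in column 9 of a GTF stream,
# in order of first appearance.
# Assumption: attribute values do not contain ";".
list_attr_keys() {
    awk -F'\t' '
        /^#/ { next }                        # skip comment lines
        {
            n = split($9, fields, ";")
            for (i = 1; i <= n; i++) {
                f = fields[i]
                sub(/^[ \t]+/, "", f)        # trim leading whitespace
                if (f == "") continue        # empty piece after the last ";"
                split(f, kv, " ")            # the key is the first word
                if (!(kv[1] in seen)) { seen[kv[1]] = 1; print kv[1] }
            }
        }
    '
}
```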

Arguments:

$ gtftk get_attr_list -h
  Usage: gtftk get_attr_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the list of attributes from a GTF file.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -s, --separator    The separator to be used for separating key names. (default: )

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

get_attr_value_list

Description: Get the list of values observed for an attribute.

Example: Get the list of values observed for transcript_id.

$ gtftk get_example | gtftk get_attr_value_list -k transcript_id
G0001T001
G0001T002
G0002T001
G0003T001
G0004T001
G0004T002
G0005T001
G0006T001
G0006T002
G0007T001
G0007T002
G0008T001
G0009T001
G0009T002
G0010T001

Example: Get the number of times each gene_id is used.

$ gtftk get_example | gtftk get_attr_value_list -k gene_id -c -s ';'
G0001;7
G0002;4
G0003;5
G0004;13
G0005;5
G0006;13
G0007;7
G0008;5
G0009;7
G0010;4
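
The counts above are the number of GTF lines carrying each gene_id value. The sketch below computes the same kind of table with awk and sort. The count_attr_values function name, its positional arguments and its sorted output are assumptions made for this illustration; the key is matched as plain text, so a key that is the suffix of another key name would also match.

```shell
# Sketch: count how many lines carry each value of a given attribute key.
# Usage: count_attr_values KEY [SEP] < file.gtf   (SEP defaults to a space)
count_attr_values() {
    key=$1
    sep=${2:-" "}
    awk -F'\t' -v key="$key" -v sep="$sep" '
        /^#/ { next }                       # skip comment lines
        {
            # Look for: key "value" inside the attribute column.
            if (match($9, key " \"[^\"]*\"")) {
                v = substr($9, RSTART, RLENGTH)
                sub(/^[^"]*"/, "", v)       # drop everything up to the opening quote
                sub(/"$/, "", v)            # drop the closing quote
                n[v]++
            }
        }
        END { for (v in n) print v sep n[v] }
    ' | sort
}
```

For example, gtftk get_example | count_attr_values gene_id ';' should list one line per gene_id in the same value;count layout as above.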

Arguments:

$ gtftk get_attr_value_list -h
  Usage: gtftk get_attr_value_list [-i GTF] [-o TXT] -k key_name [-s SEP] [-c] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the list of values observed for an attributes.

  Version:  2018-02-11

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -k, --key-name     Key name. (default: None)
 -s, --separator    The separator to be used for separating key names. (default: )
 -c, --count        Add the counts for each key (separator will be set to ' ' by default). (default: False)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

get_feature_list

Description: Get the list of features enclosed in the GTF.

Example: Get the list of features enclosed in the GTF.

$ gtftk get_example | gtftk get_feature_list
gene
transcript
exon
CDS

Arguments:

$ gtftk get_feature_list -h
  Usage: gtftk get_feature_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the list of features enclosed in the GTF.

  Version:  2018-02-11

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -s, --separator    The separator to be used for separating key names. (default: )

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

nb_exons

Description: Count the number of exons and add it as a novel key/value. Output may also be in text format if requested.

Example:

$ gtftk  get_example -f gtf | gtftk nb_exons | head -n 5
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; nb_exons "1";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	CDS	125	130	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; nb_exons "1";
$ gtftk  get_example -f gtf | gtftk nb_exons  | gtftk select_by_key -k feature -v transcript | head -n 5
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; nb_exons "1";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; nb_exons "1";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; nb_exons "1";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";

Arguments:

$ gtftk nb_exons -h
  Usage: gtftk nb_exons [-i GTF] [-o TXT/GTF] [-f] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Returns the transcript name and number of exons with nb_exons as a novel key for each transcript
     feature.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -f, --text-format  Return a text format. (default: False)
 -a, --key-name     The name of the key. (default: nb_exons)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

nb_transcripts

Description: Count the number of transcripts per gene.

Example: Count the number of transcripts per gene.

$ gtftk get_example |  gtftk nb_transcripts  | gtftk select_by_key -g
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001"; nb_tx "2";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002"; nb_tx "1";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; nb_tx "1";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; nb_tx "2";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005"; nb_tx "1";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006"; nb_tx "2";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007"; nb_tx "2";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008"; nb_tx "1";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; nb_tx "2";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010"; nb_tx "1";

Arguments:

$ gtftk nb_transcripts -h
  Usage: gtftk nb_transcripts [-i GTF] [-o GTF/TXT] [-f] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Compute the number of transcript per gene.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -f, --text-format  Return a text format. (default: False)
 -a, --key-name     If gtf format is requested, the name of the key. (default: nb_tx)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

seqid_list

Description: Returns the chromosome list.

Example: Returns the chromosome list.

$ gtftk get_example |  gtftk seqid_list
chr1

Arguments:

$ gtftk seqid_list -h
  Usage: gtftk seqid_list [-i GTF] [-o TXT] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select the seqid/chromosomes

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -s, --separator    The separator to be used for separating key names. (default: )

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

tss_dist

Description: Computes the distance between the TSSs of each pair of transcripts of a gene. The tss_num_1/tss_num_2 columns contain the TSS numbers of transcript_id_1 and transcript_id_2, respectively, within each gene. Numbering starts at 1 (most 5’ TSS) and goes up to the number of distinct TSS coordinates. Two or more transcripts have the same tss_num if they share a TSS.

Example: Compute TSS distances for the mini_real dataset (first 10 lines).

$ gtftk get_example -d mini_real |  gtftk tss_dist | head -n 10
gene_id	transcript_id_1	transcript_id_2	dist	tss_num_1	tss_num_2
ENSG00000187608	ENST00000624652	ENST00000379389	12278	2	3
ENSG00000187608	ENST00000624697	ENST00000379389	12285	1	3
ENSG00000187608	ENST00000624697	ENST00000624652	7	1	2
ENSG00000175756	ENST00000338338	ENST00000321751	326	1	3
ENSG00000175756	ENST00000321751	ENST00000338370	12	3	5
ENSG00000175756	ENST00000378853	ENST00000321751	4	2	3
ENSG00000175756	ENST00000321751	ENST00000489799	25	3	6
ENSG00000175756	ENST00000321751	ENST00000496905	10	3	4
ENSG00000175756	ENST00000338338	ENST00000338370	338	1	5
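
The numbering and distance rules described above can be sketched in Python as follows. This is only an illustration, not gtftk's implementation: the function name tss_pairs, the input layout (one gene, transcript id mapped to start, end and strand) and the use of an absolute coordinate difference as the distance are all assumptions.

```python
from itertools import combinations


def tss_pairs(transcripts):
    """transcripts: dict mapping transcript_id -> (start, end, strand) for ONE gene.

    Returns a list of (tx_1, tx_2, dist, tss_num_1, tss_num_2) tuples.
    Hypothetical helper, not part of gtftk.
    """
    # The TSS is the 5' end: 'start' on the + strand, 'end' on the - strand.
    tss = {tx: (s if strand == "+" else e)
           for tx, (s, e, strand) in transcripts.items()}
    strand = next(iter(transcripts.values()))[2]
    # Rank the distinct TSS coordinates, the most 5' one getting rank 1.
    distinct = sorted(set(tss.values()), reverse=(strand == "-"))
    rank = {coord: i + 1 for i, coord in enumerate(distinct)}
    return [
        (a, b, abs(tss[a] - tss[b]), rank[tss[a]], rank[tss[b]])
        for a, b in combinations(sorted(tss), 2)
    ]


# Hypothetical gene with three transcripts, two of which share a TSS.
example = {"T1": (100, 500, "+"), "T2": (100, 400, "+"), "T3": (150, 500, "+")}
pairs = tss_pairs(example)
```

In this made-up gene, T1 and T2 share a TSS and therefore receive the same number, while T3 gets the next one.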

Arguments:

$ gtftk tss_dist -h
  Usage: gtftk tss_dist [-i GTF] [-o TXT] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Computes the distance between TSSs of pairs of gene transcripts.

  Notes:
     *  The tss_num_1/tss_num_1 columns contains the numbering of TSSs (transcript_id_1 and
     transcript_id_2 respectively) for each gene.
     *  Numering starts from 1 (most 5' TSS) to the number of different TSS coordinates.
     *  Thus two or more transcripts will have the same tss_num if they share a TSS.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

feature_size

Description: Get the size and limits (start/end) of features enclosed in the GTF. If BED format is requested, the limits are returned in BED format with the size as the score. Otherwise, the output is a GTF file with ‘feat_size’ as a new key and the size as its value.

Example: Add transcript size (mature RNA) to the gtf.

$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk select_by_key -k feature -v transcript | head -n 5
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; feat_size "14";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; feat_size "14";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; feat_size "10";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; feat_size "10";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; feat_size "8";

Example: Add transcript size (genomic coverage) to the gtf.

$ gtftk get_example | gtftk feature_size -t transcript | gtftk select_by_key -k feature -v transcript | head -n 5
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; feat_size "14";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; feat_size "14";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; feat_size "10";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; feat_size "12";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; feat_size "12";

Example: Get exon size and limits in BED format.

$ gtftk get_example | gtftk feature_size  -t exon -b -n feature,exon_id,gene_id| head -n 5
chr1	124	138	exon|G0001T002E001|G0001|exon	14	+
chr1	124	138	exon|G0001T001E001|G0001|exon	14	+
chr1	179	189	exon|G0002T001E001|G0002|exon	10	+
chr1	49	54	exon|G0003T001E001|G0003|exon	5	-
chr1	56	61	exon|G0003T001E002|G0003|exon	5	-
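
The arithmetic behind the three examples above can be reproduced with a few lines of Python. This is a sketch with hypothetical function names, not gtftk code; it assumes the usual conventions (GTF coordinates are 1-based and inclusive, BED starts are 0-based).

```python
def feature_size(start, end):
    """Size of a feature given 1-based, inclusive GTF coordinates."""
    return end - start + 1


def mature_rna_size(exons):
    """Sum of exon sizes, i.e. the transcript size without introns."""
    return sum(feature_size(s, e) for s, e in exons)


def to_bed(chrom, start, end, name, strand):
    """BED uses a 0-based start; the size is stored in the score column."""
    return (chrom, start - 1, end, name, feature_size(start, end), strand)


# Transcript G0003T001 of the example dataset: exons 50-54 and 57-61.
exons = [(50, 54), (57, 61)]
genomic = feature_size(50, 61)       # genomic coverage of the transcript
mature = mature_rna_size(exons)      # size without the intron
bed = to_bed("chr1", 50, 54, "G0003T001E001", "-")
```

The values agree with the outputs shown above for G0003T001: 12 for the genomic coverage, 10 for the mature RNA, and a BED interval 49-54 with score 5 for its first exon.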

Arguments:

$ gtftk feature_size -h
  Usage: gtftk feature_size [-i GTF] [-o GTF/BED] [-t ft_type] [-n NAME] [-k KEY_NAME] [-s SEP] [-b] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the size and limits (start/end) of features enclosed in the GTF. The feature can be of any
     type (as found in the 3rd column of the GTF) or 'mature_rna' to get transcript size (i.e
     without introns). If bed format is requested returns the limits in bed format and the size as a
     score. Otherwise output GTF file with 'feat_size' as a new key and size as value.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file (BED). (default: <stdout>)
 -t, --ft-type      A target feature (as found in the 3rd column of the GTF) or 'mature_rna' to get transcript size (without introns). (default: transcript)
 -n, --names        The key(s) that should be used as name if bed is requested. (default: transcript_id)
 -k, --key-name     Name for the new key if GTF format is requested. (default: feat_size)
 -s, --separator    The separator to be used for separating name elements (see -n). (default: |)
 -b, --bed          Output bed format. (default: False)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

Commands from section ‘Editing’

add_prefix

Description: Add a prefix (or suffix) to the values of an attribute (e.g. gene_id).

Example:

$ gtftk get_example| gtftk add_prefix -k transcript_id -t "novel_"| head -2
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "novel_G0001T002";
$ gtftk get_example| gtftk add_prefix -k transcript_id -t "_novel" -s | head -2
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002_novel";
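
For readers who want to see the transformation spelled out, here is a minimal Python sketch that applies the same edit to a single GTF line. The function add_affix is hypothetical and only handles keys of the attribute column (not the default chrom key); it is not how gtftk implements the command.

```python
import re


def add_affix(gtf_line, key, text, suffix=False):
    """Add 'text' before (or after, if suffix=True) the value of 'key'."""
    def repl(match):
        value = match.group(2)
        new_value = value + text if suffix else text + value
        return f'{match.group(1)}"{new_value}"'

    # The key must start an attribute: preceded by a tab, a space, ';' or line start.
    pattern = r'((?:^|[\t ;])' + re.escape(key) + r' )"([^"]*)"'
    return re.sub(pattern, repl, gtf_line)


line = ('chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\t'
        'gene_id "G0001"; transcript_id "G0001T002";')
prefixed = add_affix(line, "transcript_id", "novel_")
suffixed = add_affix(line, "transcript_id", "_novel", suffix=True)
```

Lines that do not carry the requested key (such as the gene line in the example above) are returned unchanged.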

Arguments:

$ gtftk add_prefix -h
  Usage: gtftk add_prefix [-i GTF] [-o GTF] [-k KEY] [-t TEXT] [-s] [-f target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Add a prefix to target values. By default add 'chr' to seqid/chromosome key.

  Version:  2018-01-20

Arguments:
 -i, --inputfile       Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile      Output file. (default: <stdout>)
 -k, --key             The name of the attribute for which a prefix/suffix is to be added to the corresponding values (e.g, gene_id, transcript_id,...). (default: chrom)
 -t, --text            The character string to add as a prefix to the values. (default: chr)
 -s, --suffix          The character string to add as a prefix to the values. (default: False)
 -f, --target-feature  The name of the target feature. (default: *)

Command-wise optional arguments:
 -h, --help            Show this help message and exit.
 -V, --verbosity       Increase output verbosity. (default: 0)
 -D, --no-date         Do not add date to output file names. (default: False)
 -C, --add-chr         Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir         Keep all temporary files into this folder. (default: None)
 -A, --keep-all        Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file     Stores the arguments passed to the command into a file. (default: None)

del_attr

Description: Delete an attribute and its corresponding values.

Example:

$ gtftk get_example | gtftk del_attr -k transcript_id,gene_id,exon_id | head -3
chr1	gtftk	gene	125	138	.	+	.	
chr1	gtftk	transcript	125	138	.	+	.	
chr1	gtftk	exon	125	138	.	+	.
$ gtftk get_example | gtftk del_attr -v  -k transcript_id,gene_id | head -3 # delete all but transcript_id and gene_id
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
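
The selection logic (key list or regular expression, optionally inverted) can be illustrated with the following Python sketch. It works on a plain dictionary of attributes; the function name and the use of re.search are assumptions made for the example, not a description of gtftk internals.

```python
import re


def del_attr(attributes, keys, reg_exp=False, invert_match=False):
    """Return a copy of 'attributes' (dict) without the targeted keys.

    keys: comma-separated key names, or a regular expression if reg_exp=True.
    invert_match: delete the keys that are NOT targeted instead.
    """
    if reg_exp:
        pattern = re.compile(keys)
        matched = {k for k in attributes if pattern.search(k)}
    else:
        matched = set(keys.split(",")) & set(attributes)
    # A key survives when its "matched" status equals invert_match.
    return {k: v for k, v in attributes.items() if (k in matched) == invert_match}


attrs = {"gene_id": "G0001", "transcript_id": "G0001T002", "exon_id": "G0001T002E001"}
kept = del_attr(attrs, "transcript_id,gene_id", invert_match=True)
```

With invert_match set, only transcript_id and gene_id remain, which mirrors the second command above.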

Arguments:

$ gtftk del_attr -h
  Usage: gtftk del_attr [-i GTF] [-o GTF] -k KEY [-r] [-v] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Delete one or several attributes from the gtf file.

  Notes:
     *  You may also use 'complex' regexp such as : "(^.*_id$|^.*_biotype$)"
     *  Example: gtftk get_example -d mini_real | gtftk del_attr -k "(^.*_id$|^.*_biotype$)" -r -v

  Version:  2018-01-20

Arguments:
 -i, --inputfile     Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile    Output file. (default: <stdout>)
 -k, --key           Comma separated list of attribute names or a regular expression (see -r). (default: None)
 -r, --reg-exp       The key name is a regular expression. (default: False)
 -v, --invert-match  Delected keys are those not matching any of the specified key. (default: False)

Command-wise optional arguments:
 -h, --help          Show this help message and exit.
 -V, --verbosity     Increase output verbosity. (default: 0)
 -D, --no-date       Do not add date to output file names. (default: False)
 -C, --add-chr       Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir       Keep all temporary files into this folder. (default: None)
 -A, --keep-all      Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file   Stores the arguments passed to the command into a file. (default: None)

join_attr

Description: Add attributes from a file. This command can be used to import additional key/values into the gtf (e.g. CPAT for coding potential, DESeq for differential analysis, …). The imported file can be in one of two formats (2 columns or matrix):

  • With a 2-columns file:
    • value for joining (transcript_id or gene_id or …).
    • corresponding value.
  • With a matrix (see -m):
    • rows corresponding to joining keys (transcript_id or gene_id or…).
    • columns corresponding to novel attributes name.
    • Each cell of the matrix is a value for the corresponding attribute.

Example: With a 2-columns file.

$ gtftk get_example -f join > simple_join.txt
$ cat simple_join.txt
G0003	0.2322
G0004	0.999
G0009	0.5555
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; a_score "0.2322";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; a_score "0.999";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; a_score "0.5555";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010";

Example: With a matrix

$ gtftk get_example -f join_mat  >  simple_join_mat.txt
$ cat simple_join_mat.txt
genes	S1	S2
G0003	0.2322	0.4
G0004	0.999	0.6
G0009	0.5555	0.7
$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join_mat.txt -m -t gene| gtftk select_by_key -k feature -v gene
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; S1 "0.2322"; S2 "0.4";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; S1 "0.999"; S2 "0.6";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010";
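
The two input layouts can be summarised by the Python sketch below, which loads either kind of file into a lookup table and then extends the attributes of one feature. All function names are hypothetical and tab-delimited input is assumed; this is an illustration of the join, not gtftk's code.

```python
def read_two_columns(lines, new_key):
    """Two-column input: joining value, then the value stored under 'new_key'."""
    table = {}
    for line in lines:
        join_value, value = line.rstrip("\n").split("\t")
        table[join_value] = {new_key: value}
    return table


def read_matrix(lines):
    """Matrix input: the header gives the new key names, column 1 the joining values."""
    header = lines[0].rstrip("\n").split("\t")[1:]
    table = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        table[fields[0]] = dict(zip(header, fields[1:]))
    return table


def join_attr(attributes, key, table):
    """Return a copy of 'attributes' extended with the values joined on 'key'."""
    return {**attributes, **table.get(attributes.get(key), {})}


two_col = read_two_columns(["G0003\t0.2322\n", "G0004\t0.999\n"], "a_score")
matrix = read_matrix(["genes\tS1\tS2\n", "G0003\t0.2322\t0.4\n"])
joined = join_attr({"gene_id": "G0003"}, "gene_id", matrix)
```

Features whose joining value is absent from the table (G0001 in the examples above) keep their attributes unchanged.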

Arguments:

$ gtftk join_attr -h
  Usage: gtftk join_attr [-i GTF] [-o GTF] -k KEY -j JOIN_FILE [-H] [-m] [-n NEW_KEY] [-t target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Join attributes from a tabulated file.

  Version:  2018-02-05

Arguments:
 -i, --inputfile       Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile      Output file. (default: <stdout>)
 -k, --key-to-join     The name of the key used to join (e.g transcript_id). (default: None)
 -j, --join-file       A two columns file with (i) the value for joining (e.g value for transcript_id), (ii) the value for novel key (e.g the coding potential computed value). (default: None)
 -H, --has-header      Indicates that the 'join-file' has a header. (default: False)
 -m, --matrix          'join-file' expect a matrix with row names as target keys column names as novel key and each cell as value. (default: False)
 -n, --new-key         The name of the novel key. Mutually exclusive with --matrix. (default: None)
 -t, --target-feature  The name(s) of the target feature(s). Comma separated. (default: None)

Command-wise optional arguments:
 -h, --help            Show this help message and exit.
 -V, --verbosity       Increase output verbosity. (default: 0)
 -D, --no-date         Do not add date to output file names. (default: False)
 -C, --add-chr         Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir         Keep all temporary files into this folder. (default: None)
 -A, --keep-all        Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file     Stores the arguments passed to the command into a file. (default: None)

join_multi_file

Description: Join attributes from multiple files.

Example: Add key/value to gene feature.

$ gtftk get_example |  gtftk join_multi_file -k gene_id -t gene simple.join_mat_2 simple.join_mat_3| gtftk select_by_key -g
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; S3 "A"; S4 "B"; S5 "0.2322"; S6 "0.4";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; S3 "C"; S4 "D"; S5 "0.999|0.999"; S6 "0.6|0.6";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S3 "E"; S4 "F"; S5 "0.5555|20"; S6 "0.7|30";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010";
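
In the output above, some values contain several entries separated by "|" (for instance S5 "0.5555|20"). The content of the input matrices is not shown here, so the Python sketch below relies on an assumption: when the same key is provided more than once for a joining value, the values are concatenated with "|". The function name and the sample tables are hypothetical.

```python
def merge_tables(tables, sep="|"):
    """Merge several {join_value: {key: value}} tables.

    Values supplied more than once for the same key are concatenated with 'sep'.
    """
    merged = {}
    for table in tables:
        for join_value, new_attrs in table.items():
            target = merged.setdefault(join_value, {})
            for key, value in new_attrs.items():
                target[key] = target[key] + sep + value if key in target else value
    return merged


# Hypothetical tables standing in for two matrix files.
table_a = {"G0009": {"S3": "E", "S4": "F"}, "G0003": {"S3": "A", "S4": "B"}}
table_b = {"G0009": {"S5": "0.5555", "S6": "0.7"}}
table_c = {"G0009": {"S5": "20", "S6": "30"}}
merged = merge_tables([table_a, table_b, table_c])
```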

Arguments:

$ gtftk join_multi_file -h
  Usage: gtftk join_multi_file [-i GTF] [-o GTF] -k KEY [-t target_feature] [-h] [-V ] [-D] [-C] [-K] [-A] [-L] matrice_files [matrice_files ...]

  Description: 
     *  Join attributes from mutiple files.

  Version:  2018-02-05

Arguments:
 -i, --inputfile       Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile      Output file. (default: <stdout>)
 -k, --key-to-join     The name of the key used to join (e.g transcript_id). (default: None)
 -t, --target-feature  The name(s) of the target feature(s). Comma separated. (default: None)
 matrice_files         'A set of matrix files with row names as target keys column names as novel key and each cell as value.

Command-wise optional arguments:
 -h, --help            Show this help message and exit.
 -V, --verbosity       Increase output verbosity. (default: 0)
 -D, --no-date         Do not add date to output file names. (default: False)
 -C, --add-chr         Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir         Keep all temporary files into this folder. (default: None)
 -A, --keep-all        Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file     Stores the arguments passed to the command into a file. (default: None)

merge_attr

Description: Merge a set of attributes into a destination attribute.

Example: Merge gene_id and transcript_id into a new key associated to transcript features.

$ gtftk get_example |  gtftk merge_attr -k transcript_id,gene_id -d txgn_id -s "|" -f transcript | gtftk select_by_key -t
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; txgn_id "G0001T002|G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; txgn_id "G0001T001|G0001";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; txgn_id "G0002T001|G0002";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; txgn_id "G0003T001|G0003";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; txgn_id "G0004T002|G0004";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; txgn_id "G0004T001|G0004";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; txgn_id "G0005T001|G0005";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; txgn_id "G0006T001|G0006";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; txgn_id "G0006T002|G0006";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001"; txgn_id "G0007T001|G0007";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002"; txgn_id "G0007T002|G0007";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; txgn_id "G0008T001|G0008";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; txgn_id "G0009T002|G0009";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; txgn_id "G0009T001|G0009";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001"; txgn_id "G0010T001|G0010";

Example: Merge gene_id and transcript_id into an existing key (transcript_id) associated to transcript features.

$ gtftk get_example |  gtftk merge_attr -k transcript_id,gene_id -d transcript_id -s "|" -f transcript | gtftk tabulate -k feature,transcript_id -x | head -n 6
feature	transcript_id
gene	?
transcript	G0001T002|G0001
exon	G0001T002
CDS	G0001T002
transcript	G0001T001|G0001
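
The two examples above differ only in the destination key: a new key is appended, while an existing key is overwritten. A minimal Python sketch of that behaviour follows. The function is hypothetical, and the "?" placeholder for a missing source key is an assumption borrowed from the tabulate output, not documented behaviour of merge_attr.

```python
def merge_attr(attributes, src_keys, dest_key, sep="|"):
    """Concatenate the values of 'src_keys' into 'dest_key' (created or overwritten)."""
    # Assumption: a missing source key is rendered as "?".
    merged = sep.join(attributes.get(k, "?") for k in src_keys)
    return {**attributes, dest_key: merged}


tx = {"gene_id": "G0001", "transcript_id": "G0001T002"}
new_key = merge_attr(tx, ["transcript_id", "gene_id"], "txgn_id")
overwritten = merge_attr(tx, ["transcript_id", "gene_id"], "transcript_id")
```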

Example: Merge gene_id and transcript_id into an existing key (transcript_id) associated to all features.

$ gtftk get_example |  gtftk merge_attr -k transcript_id,gene_id -d transcript_id -s "|" -f "*" | gtftk tabulate -k feature,transcript_id -x | head -n 6
io.UnsupportedOperation: not readable
feature	transcript_id

Note: with pygtftk 0.9.6 this command aborted with the exception above (raised from add_attr_column in gtf_interface.py), so only the header line was printed.

Arguments:

Run gtftk merge_attr -h for the full argument list. The examples above use -k (comma-separated source keys), -d (destination key), -s (separator) and -f (target feature).

discretize_key

Description: Create a new key by discretizing a numeric key. This can be helpful to create new classes on the fly that can be used subsequently. The default is to create equally spaced intervals. The intervals can also be created by computing percentiles (-p).

Example: Let's say we have a matrix giving the expression level of genes (rows) in samples (columns). We can join this information to the GTF and then transform key S1 into a new discretized key S1_d. Particular labels can be applied to the resulting classes using -l.

$ gtftk get_example |  gtftk join_attr -j simple.join_mat -k gene_id -m | gtftk discretize_key -k S1 -d S1_d -n 2 | gtftk select_by_key -k feature -v gene
 |-- 17:43-INFO-discretize_key : Categories: ['(0.231_0.616]', '(0.616_0.999]']
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; S1 "0.2322"; S2 "0.4"; S1_d "(0.231_0.616]";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; S1 "0.999"; S2 "0.6"; S1_d "(0.616_0.999]";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S1 "0.5555"; S2 "0.7"; S1_d "(0.231_0.616]";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010";
$ gtftk get_example |  gtftk join_attr -j simple.join_mat -k gene_id -m | gtftk discretize_key -k S1 -d S1_d -n 2 -l A,B  | gtftk select_by_key -k feature -v gene
 |-- 17:43-INFO-discretize_key : Categories: ['A', 'B']
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003"; S1 "0.2322"; S2 "0.4"; S1_d "A";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004"; S1 "0.999"; S2 "0.6"; S1_d "B";
chr1	gtftk	gene	33	47	.	-	.	gene_id "G0005";
chr1	gtftk	gene	22	35	.	-	.	gene_id "G0006";
chr1	gtftk	gene	107	116	.	+	.	gene_id "G0007";
chr1	gtftk	gene	210	222	.	-	.	gene_id "G0008";
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S1 "0.5555"; S2 "0.7"; S1_d "A";
chr1	gtftk	gene	176	186	.	+	.	gene_id "G0010";
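
The default behaviour (equally spaced intervals) can be sketched in pure Python as shown below. The function is hypothetical and simplified: it assumes right-closed intervals, as suggested by the category names printed above, and it does not cover the -p, -g or -u options.

```python
from bisect import bisect_left


def discretize(values, nb_levels=2, labels=None):
    """Assign each value to one of 'nb_levels' equally spaced, right-closed intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nb_levels
    # Inner break points; a value equal to a break falls in the lower class.
    inner = [lo + i * width for i in range(1, nb_levels)]
    classes = [bisect_left(inner, v) for v in values]
    if labels is None:
        return classes
    return [labels[c] for c in classes]


s1 = [0.2322, 0.999, 0.5555]   # S1 values of G0003, G0004 and G0009
result = discretize(s1, nb_levels=2, labels=["A", "B"])
```

With two levels the single break point lies near 0.616, so G0003 and G0009 fall in class A and G0004 in class B, as in the second command above.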

Arguments:

$ gtftk discretize_key -h
  Usage: gtftk discretize_key [-i GTF] [-o GTF] -k src_key -d dest_key -n KEY [-l labels] [-p] [-g] [-u] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Create a new key by discretizing a numeric key. This can be helpful to create new classes on the
     fly that can be used subsequently.

  Notes:
     *  if ---ft-type is not set the destination key will be assigned to all feature containing the
     source key.
     *  Non-numeric value for source key will be translated into 'NA'.
     *  The default is to create equally spaced interval. The interval can also be created by
     computing the percentiles (-p).

  Version:  2018-01-20

Arguments:
 -i, --inputfile            Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile           Output file. (default: <stdout>)
 -k, --src-key              The name of the source key (default: None)
 -d, --dest-key             The name of the target key. (default: None)
 -n, --nb-levels            The number of levels/classes to create. (default: 2)
 -l, --labels               A comma separated list of labels of size --nb-levels. (default: None)
 -p, --percentiles          Compute --nb-levels classes using percentiles. (default: False)
 -g, --log                  Compute breaks based on log-scale. (default: False)
 -u, --percentiles-of-uniq  Compute percentiles based on non-redondant values. (default: False)

Command-wise optional arguments:
 -h, --help                 Show this help message and exit.
 -V, --verbosity            Increase output verbosity. (default: 0)
 -D, --no-date              Do not add date to output file names. (default: False)
 -C, --add-chr              Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir              Keep all temporary files into this folder. (default: None)
 -A, --keep-all             Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file          Stores the arguments passed to the command into a file. (default: None)

Commands from section ‘selection’

select_by_key

Description: Extract lines from the gtf based on key and values.

Example: Select some features (genes) then some gene_id.

$ gtftk get_example |gtftk select_by_key -k feature -v gene | gtftk select_by_key -k gene_id -v G0002,G0003,G0004
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004";

Example: Select gene list in column 1 of file simple_join.txt.

$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id | gtftk tabulate -k gene_id -Hun
G0003
G0004
G0009

Example: Select the gene list enclosed in column 1 of file simple_join.txt. Ask for bed format.

$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id -b
chr1	49	61	G0003|?	.	-
chr1	49	61	G0003|G0003T001	.	-
chr1	49	54	G0003|G0003T001	.	-
chr1	56	61	G0003|G0003T001	.	-
chr1	49	52	G0003|G0003T001	.	-
chr1	64	76	G0004|?	.	+
chr1	64	76	G0004|G0004T002	.	+
chr1	64	68	G0004|G0004T002	.	+
chr1	70	71	G0004|G0004T002	.	+
chr1	73	76	G0004|G0004T002	.	+
chr1	65	68	G0004|G0004T002	.	+
chr1	70	71	G0004|G0004T002	.	+
chr1	73	75	G0004|G0004T002	.	+
chr1	64	76	G0004|G0004T001	.	+
chr1	64	68	G0004|G0004T001	.	+
chr1	70	71	G0004|G0004T001	.	+
chr1	73	76	G0004|G0004T001	.	+
chr1	64	67	G0004|G0004T001	.	+
chr1	2	14	G0009|?	.	-
chr1	2	14	G0009|G0009T002	.	-
chr1	2	14	G0009|G0009T002	.	-
chr1	4	10	G0009|G0009T002	.	-
chr1	2	14	G0009|G0009T001	.	-
chr1	2	14	G0009|G0009T001	.	-
chr1	2	8	G0009|G0009T001	.	-

Example: Select all but genes in column 1 of file simple_join.txt.

$ gtftk get_example -f join > simple_join.txt ; gtftk get_example| gtftk select_by_key -f simple_join.txt -c 1 -k gene_id -n | gtftk tabulate -k gene_id -Hun
G0001
G0002
G0005
G0006
G0007
G0008
G0010
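
The selection and its inversion (-n) amount to a simple membership test, as in the Python sketch below. Records are represented as dictionaries and the function name is hypothetical; the ten gene identifiers are those of the example dataset.

```python
def select_by_key(records, key, values, invert_match=False):
    """Keep records whose 'key' is (or, with invert_match, is not) among 'values'."""
    values = set(values)
    return [r for r in records if (r.get(key) in values) != invert_match]


genes = [{"feature": "gene", "gene_id": f"G{n:04d}"} for n in range(1, 11)]
wanted = ["G0003", "G0004", "G0009"]   # column 1 of simple_join.txt
selected = [r["gene_id"] for r in select_by_key(genes, "gene_id", wanted)]
others = [r["gene_id"] for r in select_by_key(genes, "gene_id", wanted, invert_match=True)]
```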

Arguments:

$ gtftk select_by_key -h
  Usage: gtftk select_by_key [-i GTF] [-o GTF] [-k KEY] [-v VALUE] [-f FILE] [-c COL] [-n] [-b] [-m NAME] [-s SEP] [-l] [-t] [-g] [-e] [-d] [-a] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select lines from a GTF file based on attributes and associated values.

  Version:  2018-01-31

optional arguments:
 -v, --value               A comma separated list of values. (default: None)
 -f, --file-with-values    A file containing values as a single column. (default: None)
 -t, --select-transcripts  A shortcuts for "-k feature -v transcript". (default: False)
 -g, --select-genes        A shortcuts for "-k feature -v gene". (default: False)
 -e, --select-exons        A shortcuts for "-k feature -v exon". (default: False)
 -d, --select-cds          A shortcuts for "-k feature -v CDS". (default: False)
 -a, --select-start-codon  A shortcuts for "-k feature -v start_codon". (default: False)

Arguments:
 -i, --inputfile           Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile          Output file. (default: <stdout>)
 -k, --key                 The key name. (default: None)
 -c, --col                 The column number (one-based) that contains the values in the file. File is tab-delimited. (default: 1)
 -n, --invert-match        Not/invert match. Selected lines whose requested key is not associated with the requested value. (default: False)
 -b, --bed-format          Ask for bed format output. (default: False)
 -m, --names               If Bed output. The key(s) that should be used as name. (default: gene_id,transcript_id)
 -s, --separator           If Bed output. The separator to be used for separating name elements (see -n). (default: |)
 -l, --log                 Print some statistics about selected features. To be used in conjunction with -V 1/2. (default: False)

Command-wise optional arguments:
 -h, --help                Show this help message and exit.
 -V, --verbosity           Increase output verbosity. (default: 0)
 -D, --no-date             Do not add date to output file names. (default: False)
 -C, --add-chr             Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir             Keep all temporary files into this folder. (default: None)
 -A, --keep-all            Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file         Stores the arguments passed to the command into a file. (default: None)

select_by_regexp

Description: Select lines by testing the values of a particular key against a regular expression.

Example: Select lines corresponding to gene_names matching the regular expression ‘BCL.*’.

$ gtftk get_example -d mini_real |  gtftk select_by_regexp -k gene_name -r "BCL.*" | gtftk tabulate -Hun -k gene_name
BCL2L2-PABPN1
BCL7B
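
A rough Python equivalent of this filter is given below. It is only a sketch: the function name is hypothetical, and the use of re.search (a match anywhere in the value) is an assumption, so anchor the expression with ^ if a match at the start of the value is required. TP53 is a made-up non-matching record added for contrast.

```python
import re


def select_by_regexp(records, key, regexp, invert_match=False):
    """Keep records whose value for 'key' matches 'regexp' (or not, if inverted)."""
    pattern = re.compile(regexp)
    return [
        r for r in records
        if (key in r and pattern.search(r[key]) is not None) != invert_match
    ]


records = [
    {"gene_name": "BCL2L2-PABPN1"},
    {"gene_name": "BCL7B"},
    {"gene_name": "TP53"},
]
matching = [r["gene_name"] for r in select_by_regexp(records, "gene_name", "BCL.*")]
```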

Arguments:

$ gtftk select_by_regexp -h
  Usage: gtftk select_by_regexp [-i GTF] [-o GTF] -k KEY -r regexp [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select lines from a GTF file based on a regexp.

  Version:  2018-01-20

optional arguments:
 -i, --inputfile     Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile    Output file. (default: <stdout>)
 -k, --key           The key name (default: chrom)
 -r, --regexp        The regular expression. (default: ^chr[0-9XY]+$)
 -n, --invert-match  Not/invert match. Selected lines whose requested key do not match the regexp. (default: False)

Command-wise optional arguments:
 -h, --help          Show this help message and exit.
 -V, --verbosity     Increase output verbosity. (default: 0)
 -D, --no-date       Do not add date to output file names. (default: False)
 -C, --add-chr       Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir       Keep all temporary files into this folder. (default: None)
 -A, --keep-all      Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file   Stores the arguments passed to the command into a file. (default: None)

select_by_intron_size

Description: Delete genes containing an intron whose size is below s. If -m is selected, any gene whose sum of intronic region length is above s is deleted. Monoexonic genes are kept.


select_by_max_exon_nb

Description: For each gene select the transcript with the highest number of exons.

Example: For each gene, keep only the transcript with the highest number of exons.

$ gtftk get_example |  gtftk select_by_max_exon_nb | gtftk select_by_key -t
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001";
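
The idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the input is reduced to one (gene_id, transcript_id) pair per exon line, and ties are resolved by keeping the first transcript seen, which may differ from what gtftk does.

```python
from collections import Counter

def select_by_max_exon_nb(exons):
    """exons: iterable of (gene_id, transcript_id) pairs, one per exon line.
    Returns {gene_id: transcript_id with the most exons}."""
    exon_count = Counter(exons)
    best = {}
    for (gene, tx), n in exon_count.items():
        # Strictly greater: on a tie the first transcript seen is kept (assumption).
        if gene not in best or n > exon_count[(gene, best[gene])]:
            best[gene] = tx
    return best

# Hypothetical exon listing: GA has one transcript with 3 exons and one with 2.
exons = [("GA", "GA_T1")] * 2 + [("GA", "GA_T2")] * 3 + [("GB", "GB_T1")]
print(select_by_max_exon_nb(exons))
```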

Arguments:

$ gtftk select_by_max_exon_nb -h
  Usage: gtftk select_by_max_exon_nb [-i GTF] [-o GTF] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  For each gene select the transcript with the highest number of exons.

  Version:  2018-02-11

optional arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

select_by_loc

Description: Select transcripts/genes overlapping the given locations. A transcript is defined here as the genomic region from TSS to TTS, including introns. This function returns the transcript and all its associated elements (exons, UTRs, …) even if only a fraction (e.g. an intron) of the transcript overlaps the location. If --ft-type is set to ‘gene’, the gene and all its associated elements are returned.

Example: Select transcripts at a given location.

$ gtftk get_example | gtftk select_by_key -k feature -v transcript | gtftk  select_by_loc -l chr1:10-15
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
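
The overlap test itself can be sketched in Python. This is a minimal illustration, assuming that the location is 0-based and half-open (BED-like) while GTF coordinates are 1-based and inclusive; the function names are hypothetical and the exact boundary handling in gtftk may differ.

```python
def parse_location(loc):
    """Split 'chr1:10-15' into ('chr1', 10, 15)."""
    chrom, span = loc.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

def overlaps(feature, loc):
    """feature: (chrom, start, end) in 1-based inclusive GTF coordinates."""
    chrom, start, end = feature
    loc_chrom, loc_start, loc_end = parse_location(loc)
    # GTF [start, end] is [start - 1, end) in 0-based half-open coordinates.
    return chrom == loc_chrom and start - 1 < loc_end and loc_start < end

# Coordinates taken from the example GTF shown in this document.
transcripts = {"G0009T001": ("chr1", 3, 14), "G0006T001": ("chr1", 22, 35)}
print([tx for tx, feat in transcripts.items() if overlaps(feat, "chr1:10-15")])
```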

Arguments:

$ gtftk select_by_loc -h
  Usage: gtftk select_by_loc [-i GTF] [-o GTF] (-l LOC | -f BEDFILE) [-t {transcript,gene}] [-n] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select transcripts/gene overlapping a given locations.

  Notes:
     *  A transcript is defined here as the genomic region from TSS to TTS including introns.
     *  This function will return the transcript and all its associated elements (exons, utr,...)
     even if only a fraction (e.g intron) of the transcript is overlapping the feature.
     *  If --ft-type is set to 'gene' returns the gene and all its associated elements.

  Version:  2018-01-20

optional arguments:
 -i, --inputfile      Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile     Output file. (default: <stdout>)
 -l, --location       List of chromosomal locations (chr:start-end[,chr:start-end]). 0-based (default: None)
 -f, --location-file  Bed file with chromosomal location. (default: None)
 -t, --ft-type        The feature of interest. (default: transcript)
 -n, --invert-match   Not/invert match. Select transcript not overlapping. (default: False)

Command-wise optional arguments:
 -h, --help           Show this help message and exit.
 -V, --verbosity      Increase output verbosity. (default: 0)
 -D, --no-date        Do not add date to output file names. (default: False)
 -C, --add-chr        Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir        Keep all temporary files into this folder. (default: None)
 -A, --keep-all       Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file    Stores the arguments passed to the command into a file. (default: None)

select_by_nb_exon

Description: Select transcripts based on the number of exons.

Example:

$ gtftk get_example |  gtftk select_by_nb_exon -m 2 | gtftk nb_exons| gtftk select_by_key -t
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; nb_exons "3";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; nb_exons "2";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; nb_exons "3";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; nb_exons "2";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; nb_exons "2";
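
A minimal Python sketch of the filter, assuming both bounds are inclusive (the example above, where -m 2 keeps transcripts with exactly 2 exons, suggests this for the minimum). The function name and the input shape (one transcript_id per exon line) are hypothetical.

```python
from collections import Counter

def select_by_nb_exon(exon_tx_ids, min_exons=0, max_exons=None):
    """exon_tx_ids: one transcript_id per exon line.
    Returns transcript ids whose exon count lies within [min_exons, max_exons]."""
    counts = Counter(exon_tx_ids)
    return [
        tx for tx, n in counts.items()
        if n >= min_exons and (max_exons is None or n <= max_exons)
    ]

# Hypothetical exon listing: T1 has 1 exon, T2 has 2, T3 has 3.
exon_tx_ids = ["T1", "T2", "T2", "T3", "T3", "T3"]
print(select_by_nb_exon(exon_tx_ids, min_exons=2))
```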

Arguments:

$ gtftk select_by_nb_exon -h
  Usage: gtftk select_by_nb_exon [-i GTF] [-o GTF] [-m min_exon_number] [-M max_exon_number] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select transcripts based on the number of exons.

  Version:  2018-01-20

optional arguments:
 -i, --inputfile        Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile       Output file. (default: <stdout>)
 -m, --min-exon-number  Minimum number of exons. (default: 0)
 -M, --max-exon-number  Maximum number of exons. (default: None)

Command-wise optional arguments:
 -h, --help             Show this help message and exit.
 -V, --verbosity        Increase output verbosity. (default: 0)
 -D, --no-date          Do not add date to output file names. (default: False)
 -C, --add-chr          Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir          Keep all temporary files into this folder. (default: None)
 -A, --keep-all         Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file      Stores the arguments passed to the command into a file. (default: None)

select_by_numeric_value

Description: Select lines from a GTF file based on a boolean test on numeric values.

Example:
$ gtftk join_attr -i simple.gtf  -j simple.join_mat -k gene_id -m|  gtftk select_by_numeric_value -t 'start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7' -n ".,?"
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; S1 "0.5555"; S2 "0.7";
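
To make the role of the test expression and of -n concrete, here is a hedged Python sketch. It is not gtftk's implementation: the function name `passes_test` is hypothetical, and the expression is evaluated with Python's `eval`, which should never be used on untrusted input.

```python
def passes_test(record, test, na_omit=(".", "?")):
    """Evaluate a boolean expression such as 'start < 10 and S1 == 0.5555'
    against one record (a dict mapping key names to string values)."""
    code = compile(test, "<test>", "eval")
    values = {}
    for key in code.co_names:  # names referenced by the expression
        raw = record.get(key)
        if raw is None or raw in na_omit:
            return False  # a not-available value: the line is skipped
        values[key] = float(raw)
    return bool(eval(code, {"__builtins__": {}}, values))

test = "start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7"
kept = {"start": "3", "end": "14", "S1": "0.5555", "S2": "0.7"}
skipped = {"start": "125", "end": "138", "S1": "?", "S2": "?"}
print(passes_test(kept, test), passes_test(skipped, test))
```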

Arguments:

$ gtftk select_by_numeric_value -h
  Usage: gtftk select_by_numeric_value [-i GTF] [-o GTF] -t test [-n na_omit] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select lines from a GTF file based on a boolean test on numeric values.

  Version:  2018-01-20

optional arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -t, --test         The test to be applied. (default: None)
 -n, --na-omit      If one of the evaluated values is enclosed in this list (csv), line is skipped. (default: None)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

random_list

Description: Select a random list of genes or transcripts.

Example: Randomly select 3 transcripts.

$ gtftk get_example | gtftk random_list -n 3| gtftk count
transcript	3
exon	6
CDS	5

Arguments:

$ gtftk random_list -h
  Usage: gtftk random_list [-i GTF] [-o GTF] [-n NUMBER] [-t {gene,transcript}] [-s SEED] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select a random list of genes or transcripts. Note that if transcripts are requested the 'gene'
     feature is not returned.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -n, --number       The number of transcripts or gene to select. (default: 1)
 -t, --ft-type      The type of feature. (default: transcript)
 -s, --seed-value   Seed value for the random number generator. (default: None)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

random_tx

Description: Randomly select up to m transcripts for each gene.

Example: Randomly select 1 transcript per gene (-m 1).

$ gtftk get_example |  gtftk random_tx -m 1| gtftk select_by_key -k feature -v gene,transcript| gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T001
G0002	G0002T001
G0003	G0003T001
G0004	G0004T002
G0005	G0005T001
G0006	G0006T001
G0007	G0007T001
G0008	G0008T001
G0009	G0009T001
G0010	G0010T001
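
Per-gene sampling with a seed can be sketched in Python as below. The function name and input shape are hypothetical, and the random number generator used here is Python's, so the transcripts drawn for a given seed will not match those drawn by gtftk.

```python
import random

def random_tx(gene_to_tx, max_tx=1, seed=None):
    """gene_to_tx: {gene_id: [transcript_id, ...]}.
    Draw up to max_tx transcripts per gene, reproducibly if a seed is given."""
    rng = random.Random(seed)
    return {
        gene: rng.sample(txs, min(max_tx, len(txs)))
        for gene, txs in gene_to_tx.items()
    }

# Hypothetical gene-to-transcript mapping.
gene_to_tx = {"GA": ["GA_T1", "GA_T2", "GA_T3"], "GB": ["GB_T1"]}
print(random_tx(gene_to_tx, max_tx=2, seed=123))
```

Fixing the seed makes the selection repeatable, which is the purpose of an option such as -s.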

Arguments:

$ gtftk random_tx -h
  Usage: gtftk random_tx [-i GTF] [-o GTF] [-m MAX] [-s SEED] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select randomly up to m transcript for each gene.

  Version:  2018-01-30

Arguments:
 -i, --inputfile       Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile      Output file. (default: <stdout>)
 -m, --max-transcript  The maximum number of transcripts to select for each gene. (default: 1)
 -s, --seed-value      Seed value for the random number generator. (default: None)

Command-wise optional arguments:
 -h, --help            Show this help message and exit.
 -V, --verbosity       Increase output verbosity. (default: 0)
 -D, --no-date         Do not add date to output file names. (default: False)
 -C, --add-chr         Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir         Keep all temporary files into this folder. (default: None)
 -A, --keep-all        Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file     Stores the arguments passed to the command into a file. (default: None)

rm_dup_tss

Description: If several transcripts of a gene share the same TSS, select only one.

Example: Use rm_dup_tss to select transcripts that will be used for mk_matrix -k 5 (see later).

$ gtftk get_example |  gtftk rm_dup_tss| gtftk select_by_key -k feature -v transcript
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001";
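
The selection rule can be sketched in Python. This is an illustration only: the function name and record layout are hypothetical, the TSS is taken as the start on the + strand and the end on the - strand, and the representative is the transcript whose transcript_id comes first in alphanumeric order, as stated in the command notes.

```python
def rm_dup_tss(transcripts):
    """transcripts: list of dicts with gene_id, transcript_id, start, end, strand.
    Keep one transcript per (gene, TSS) pair."""
    chosen = {}
    for tx in transcripts:
        tss = tx["start"] if tx["strand"] == "+" else tx["end"]
        key = (tx["gene_id"], tss)
        if key not in chosen or tx["transcript_id"] < chosen[key]["transcript_id"]:
            chosen[key] = tx
    return list(chosen.values())

# Coordinates taken from the example GTF shown in this document.
transcripts = [
    {"gene_id": "G0001", "transcript_id": "G0001T002", "start": 125, "end": 138, "strand": "+"},
    {"gene_id": "G0001", "transcript_id": "G0001T001", "start": 125, "end": 138, "strand": "+"},
    {"gene_id": "G0006", "transcript_id": "G0006T001", "start": 22, "end": 35, "strand": "-"},
    {"gene_id": "G0006", "transcript_id": "G0006T002", "start": 28, "end": 35, "strand": "-"},
]
print([tx["transcript_id"] for tx in rm_dup_tss(transcripts)])
```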

Arguments:

$ gtftk rm_dup_tss -h
  Usage: gtftk rm_dup_tss [-i GTF] [-o GTF] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  If several transcripts of a gene share the same TSS, select one transcript per TSS.

  Notes:
     *  The alphanumeric order of transcript_id is used to select the representative of a TSS.

  Version:  2018-01-20

Argument:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

select_by_go

Description: Select genes from a GTF file using a Gene Ontology ID (e.g GO:0050789).

Example: Select genes with transcription factor activity from the GTF. They could be used subsequently to test their epigenetic features (see later).

$ gtftk get_example -d mini_real -f gtf| gtftk select_by_go -s hsapiens | gtftk select_by_key -k feature -v gene | gtftk tabulate -k gene_id,gene_name -Hun | head -6
ENSG00000142611	PRDM16
ENSG00000069812	HES2
ENSG00000127124	HIVEP3
ENSG00000162367	TAL1
ENSG00000143006	DMRTB1
ENSG00000054267	ARID4B

Arguments:

$ gtftk select_by_go -h
  Usage: gtftk select_by_go [-i GTF] [-o GTF] [-g go_id] (-l | -s species) [-n] [-p1 http_proxy] [-p2 https_proxy] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select lines/genes from a GTF file using a Gene Ontology ID (e.g GO:0097194).

  Version:  2018-01-20

optional arguments:
 -i, --inputfile      Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile     Output file. (default: <stdout>)
 -g, --go-id          The GO ID (with or without "GO:" prefix). (default: GO:0003700)
 -l, --list-datasets  Do not select lines. Only get a list of available datasets/species. (default: False)
 -s, --species        The dataset/species. (default: None)
 -n, --invert-match   Not/invert match. (default: False)
 -p1, --http-proxy    Use this http proxy (not tested/experimental). (default: )
 -p2, --https-proxy   Use this https proxy (not tested/experimental). (default: )

Command-wise optional arguments:
 -h, --help           Show this help message and exit.
 -V, --verbosity      Increase output verbosity. (default: 0)
 -D, --no-date        Do not add date to output file names. (default: False)
 -C, --add-chr        Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir        Keep all temporary files into this folder. (default: None)
 -A, --keep-all       Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file    Stores the arguments passed to the command into a file. (default: None)

select_by_tx_size

Description: Select transcripts based on their size (i.e. the size of the mature/spliced transcript).

Example:

$ gtftk get_example | gtftk feature_size -t mature_rna |  gtftk select_by_tx_size -m 14 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id	transcript_id	feat_size
G0001	G0001T002	14
G0001	G0001T001	14
$ gtftk get_example | gtftk feature_size -t mature_rna |  gtftk select_by_tx_size -m 11 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id	transcript_id	feat_size
G0001	G0001T002	14
G0001	G0001T001	14
G0009	G0009T002	12
G0009	G0009T001	12
G0010	G0010T001	11
$ gtftk get_example -d mini_real | gtftk feature_size -t mature_rna |  gtftk select_by_tx_size -m 8000  -M 1000000000 | gtftk tabulate -n -k gene_id,transcript_id,feat_size -H  | sort -k3,3n | tail -n 10
ENSG00000215182	ENST00000621226	17448
ENSG00000127603	ENST00000361689	17538
ENSG00000127603	ENST00000289893	19141
ENSG00000173517	ENST00000560626	19217
ENSG00000229807	ENST00000429829	19275
ENSG00000129682	ENST00000315930	19519
ENSG00000280383	ENST00000623075	23112
ENSG00000127603	ENST00000372915	23440
ENSG00000127603	ENST00000567887	24319
ENSG00000127603	ENST00000564288	24828
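
The size of a mature transcript is the sum of its exon lengths. A minimal Python sketch, assuming 1-based inclusive GTF coordinates and inclusive size bounds; the function names and the second transcript are hypothetical.

```python
def mature_size(exons):
    """exons: list of (start, end) in 1-based inclusive GTF coordinates."""
    return sum(end - start + 1 for start, end in exons)

def select_by_tx_size(tx_exons, min_size=0, max_size=1000000000):
    """tx_exons: {transcript_id: [(start, end), ...]}."""
    return [
        tx for tx, exons in tx_exons.items()
        if min_size <= mature_size(exons) <= max_size
    ]

tx_exons = {
    "G0001T002": [(125, 138)],           # single exon from the example GTF
    "TX_HYPOTHETICAL": [(10, 12), (20, 24)],  # made-up two-exon transcript
}
print(mature_size(tx_exons["G0001T002"]))
print(select_by_tx_size(tx_exons, min_size=14))
```

The first value agrees with the feat_size of 14 reported for G0001T002 in the example above.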

Arguments:

$ gtftk select_by_tx_size -h
  Usage: gtftk select_by_tx_size [-i GTF] [-o GTF] [-m min_size] [-M max_size] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select transcript based on their size (i.e size of mature/spliced transcript).

  Version:  2018-01-20

optional arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -m, --min-size     Minimum size. (default: 0)
 -M, --max-size     Maximum size. (default: 1000000000)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

select_most_5p_tx

Description: Select the most 5’ transcript of each gene.

Example:

$ gtftk get_example | gtftk select_most_5p_tx | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T002
G0002	G0002T001
G0003	G0003T001
G0004	G0004T002
G0005	G0005T001
G0006	G0006T001
G0007	G0007T001
G0008	G0008T001
G0009	G0009T002
G0010	G0010T001
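
What "most 5'" means depends on the strand, which the following Python sketch tries to make explicit. It assumes the TSS is the start on the + strand and the end on the - strand, and keeps the first transcript seen on a tie; the function name and the data are hypothetical.

```python
def select_most_5p_tx(transcripts):
    """transcripts: list of dicts with gene_id, transcript_id, start, end, strand.
    Returns {gene_id: transcript_id of the most 5' transcript}."""
    best = {}
    for tx in transcripts:
        current = best.get(tx["gene_id"])
        if current is None:
            best[tx["gene_id"]] = tx
        elif tx["strand"] == "+" and tx["start"] < current["start"]:
            best[tx["gene_id"]] = tx  # smaller start is more 5' on the + strand
        elif tx["strand"] == "-" and tx["end"] > current["end"]:
            best[tx["gene_id"]] = tx  # larger end is more 5' on the - strand
    return {gene: tx["transcript_id"] for gene, tx in best.items()}

# Made-up transcripts for two genes on opposite strands.
transcripts = [
    {"gene_id": "GA", "transcript_id": "GA_T1", "start": 100, "end": 200, "strand": "+"},
    {"gene_id": "GA", "transcript_id": "GA_T2", "start": 50, "end": 200, "strand": "+"},
    {"gene_id": "GB", "transcript_id": "GB_T1", "start": 10, "end": 90, "strand": "-"},
    {"gene_id": "GB", "transcript_id": "GB_T2", "start": 10, "end": 120, "strand": "-"},
]
print(select_most_5p_tx(transcripts))
```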

Arguments:

$ gtftk select_most_5p_tx -h
  Usage: gtftk select_most_5p_tx [-i GTF] [-o GTF] [-g] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select the most 5' transcript of each gene.

  Notes:
     *  If several transcripts share the same most 5' TSS, only one transcript is selected.

  Version:  2018-01-20

optional arguments:
 -i, --inputfile        Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile       Output file. (default: <stdout>)
 -g, --keep-gene-lines  Add gene lines to the output (default: False)

Command-wise optional arguments:
 -h, --help             Show this help message and exit.
 -V, --verbosity        Increase output verbosity. (default: 0)
 -D, --no-date          Do not add date to output file names. (default: False)
 -C, --add-chr          Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir          Keep all temporary files into this folder. (default: None)
 -A, --keep-all         Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file      Stores the arguments passed to the command into a file. (default: None)

short_long

Description: Get the shortest or longest transcript of each gene.

Example:

$ gtftk get_example | gtftk short_long | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T002
G0002	G0002T001
G0003	G0003T001
G0004	G0004T002
G0005	G0005T001
G0006	G0006T002
G0007	G0007T001
G0008	G0008T001
G0009	G0009T002
G0010	G0010T001
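
Given mature transcript sizes, the choice per gene reduces to a minimum or a maximum. A small Python sketch under assumptions: the function name and the sizes are hypothetical, and ties go to the first transcript listed.

```python
def short_long(tx_sizes, longest=False):
    """tx_sizes: {gene_id: {transcript_id: mature size}}.
    Returns the shortest (or longest) transcript id for each gene."""
    pick = max if longest else min
    return {gene: pick(sizes, key=sizes.get) for gene, sizes in tx_sizes.items()}

# Made-up mature transcript sizes.
tx_sizes = {"GA": {"GA_T1": 300, "GA_T2": 120}, "GB": {"GB_T1": 80}}
print(short_long(tx_sizes))
print(short_long(tx_sizes, longest=True))
```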

Arguments:

$ gtftk short_long -h
  Usage: gtftk short_long [-i GTF] [-o GTF] [-l] [-g] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select the shortest mature transcript (i.e without introns) for each gene or the longest if the -l
     arguments is used.

  Notes:
     *

  Version:  2018-01-25

Argument:
 -i, --inputfile        Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile       Output file. (default: <stdout>)
 -l, --longs            Take the longest transcript of each gene (default: False)
 -g, --keep-gene-lines  Add gene lines to the output (default: False)

Command-wise optional arguments:
 -h, --help             Show this help message and exit.
 -V, --verbosity        Increase output verbosity. (default: 0)
 -D, --no-date          Do not add date to output file names. (default: False)
 -C, --add-chr          Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir          Keep all temporary files into this folder. (default: None)
 -A, --keep-all         Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file      Stores the arguments passed to the command into a file. (default: None)

Commands from section ‘conversion’

convert

Description: This command can be used to convert to various formats. Currently only a limited number is supported.

  • bed: classical bed6 format.
  • bed6: classical bed6 format.
  • bed3: bed3 format.

Example: Get the gene features and convert them to bed6.

$ gtftk get_example | gtftk select_by_key -k feature -v gene | gtftk convert -n gene_id | head -n 3
chr1	124	138	G0001	.	+
chr1	179	189	G0002	.	+
chr1	49	61	G0003	.	-

Example: Get the gene features and convert them to bed3.

$ gtftk get_example | gtftk select_by_key -k feature -v gene | gtftk convert -f bed3 | head -n 3
chr1	124	138
chr1	179	189
chr1	49	61

Example: Get the exonic features and convert them to bed6, using gene_id, transcript_id and exon_id to build the name column.

$ gtftk get_example | gtftk select_by_key -k feature -v exon | gtftk convert -n gene_id,transcript_id,exon_id | head -3
chr1	124	138	G0001|G0001T002|G0001T002E001	.	+
chr1	124	138	G0001|G0001T001|G0001T001E001	.	+
chr1	179	189	G0002|G0002T001|G0002T001E001	.	+
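
The coordinate shift visible in these examples (GTF start 125 becomes BED start 124) comes from GTF being 1-based and inclusive while BED is 0-based and half-open. A minimal Python sketch of the conversion of one feature; the function name and signature are hypothetical.

```python
def gtf_to_bed6(chrom, start, end, strand, names, sep="|"):
    """Convert one GTF feature (1-based, inclusive) to a BED6 line
    (0-based start, end unchanged). `names` are joined with `sep`."""
    return "\t".join([chrom, str(start - 1), str(end), sep.join(names), ".", strand])

print(gtf_to_bed6("chr1", 125, 138, "+", ["G0001"]))
print(gtf_to_bed6("chr1", 125, 138, "+", ["G0001", "G0001T002", "G0001T002E001"]))
```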

Arguments:

$ gtftk convert -h
  Usage: gtftk convert [-i GTF] [-o BED/BED3/BED6] [-n NAME] [-s SEP] [-m more_names] [-f {bed,bed3,bed6}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Convert a GTF to various format (still limited).

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -n, --names        The key(s) that should be used as name. (default: gene_id,transcript_id)
 -s, --separator    The separator to be used for separating name elements (see -n). (default: |)
 -m, --more-names   Add this information to the 'name' column of the BED file. (default: )
 -f, --format       Currently one of bed3, bed6 (default: bed6)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

tabulate

Description: Extract key/values from the GTF and convert them to tabulated format. When requesting coordinates they will be provided in 1-based format.

Example: Simply get the list of transcripts and genes.

$ gtftk get_example -f gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id -s "|"
gene_id|transcript_id
G0001|G0001T002
G0001|G0001T001
G0002|G0002T001
G0003|G0003T001
G0004|G0004T002
G0004|G0004T001
G0005|G0005T001
G0006|G0006T001
G0006|G0006T002
G0007|G0007T001
G0007|G0007T002
G0008|G0008T001
G0009|G0009T002
G0009|G0009T001
G0010|G0010T001

Example: Join novel attributes (see join_attr examples) and convert the resulting GTF stream to tabulated format.

$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene| gtftk tabulate -k feature,start,end,seqid,gene_id,a_score
feature	start	end	seqid	gene_id	a_score
gene	50	61	chr1	G0003	0.2322
gene	65	76	chr1	G0004	0.999
gene	3	14	chr1	G0009	0.5555

Example: You may also delete the header, ask for non-redundant lines and delete any line containing a not-available value (‘.’).

$ gtftk get_example -f gtf | gtftk join_attr -k gene_id -j simple_join.txt -n a_score -t gene| gtftk select_by_key -k feature -v gene| gtftk tabulate -k feature,start,end,seqid,gene_id,a_score -Hun
gene	50	61	chr1	G0003	0.2322
gene	65	76	chr1	G0004	0.999
gene	3	14	chr1	G0009	0.5555
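
The key extraction can be sketched in Python as follows. This is an illustration, not the gtftk implementation: the function name is hypothetical, the names used for the eight basic columns (in particular `score` and `strand`) are assumptions, and a missing key is rendered as '?'.

```python
import re

BASIC = ["seqid", "source", "feature", "start", "end", "score", "strand", "frame"]

def tabulate(lines, keys, sep="\t"):
    """Return a header row followed by one row per GTF line with the requested keys."""
    rows = [sep.join(keys)]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(BASIC, fields[:8]))
        # The attribute column holds 'key "value";' pairs.
        record.update(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        rows.append(sep.join(record.get(key, "?") for key in keys))
    return rows

lines = ['chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T002";']
print(tabulate(lines, ["gene_id", "transcript_id"], sep="|"))
```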

Arguments:

$ gtftk tabulate -h
  Usage: gtftk tabulate [-i GTF] [-o TXT] [-s SEPARATOR] [-k KEY,KEY,...] [-u] [-H] [-n] [-x] [-b] [-t | -g | -a | -e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Convert a GTF to tabulated format.

  Notes:
     *  To refer to default keys use: seqid,source,feature,start,end,frame,gene_id...
     *  Note that 'all' or '*' are special keys that can be used to convert the whole GTF into a
     tabulated file. Thanks @fafa13.

  Version:  2018-01-20

optional arguments:
 -t, --select-transcript-ids  A shortcuts for "-k transcript_id". (default: False)
 -g, --select-gene_ids        A shortcuts for "-k gene_id". (default: False)
 -a, --select-gene-names      A shortcuts for "-k gene_name". (default: False)
 -e, --select-exon-ids        A shortcuts for "-k exon_ids". (default: False)

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -s, --separator              The output field separator. (default: )
 -k, --key                    A comma separated list of key names. (default: *)
 -u, --unique                 Print a non redondant list of lines. (default: False)
 -H, --no-header              Don't print the header line. (default: False)
 -n, --no-unset               Don't print lines containing '.' (undefined values) (default: False)
 -x, --accept-undef           Print line for which the key is undefined (i.e, '?', does not exists). (default: False)
 -b, --no-basic               In case key is set to 'all' or '*', don't write basic attributes. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Increase output verbosity. (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)

bed_to_gtf

Description: Convert a bed file to gtf-like format.

Example:

$ gtftk get_example |gtftk convert| gtftk bed_to_gtf -t transcript | head -n 5
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|?"; transcript_id "G0001|?";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	130	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T001"; transcript_id "G0001|G0001T001";
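
The reverse conversion can be sketched in Python. As in the output above, the BED name is reused for both gene_id and transcript_id; the function name is hypothetical and only BED6 input is handled in this sketch.

```python
def bed6_to_gtf(bed_line, ft_type="transcript", source="Unknown"):
    """Convert one BED6 line (0-based start) to a GTF-like line (1-based start)."""
    chrom, start, end, name, score, strand = bed_line.rstrip("\n").split("\t")[:6]
    attributes = 'gene_id "{0}"; transcript_id "{0}";'.format(name)
    return "\t".join(
        [chrom, source, ft_type, str(int(start) + 1), end, score, strand, ".", attributes]
    )

print(bed6_to_gtf("chr1\t124\t138\tG0001|G0001T002\t.\t+"))
```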

Arguments:

$ gtftk bed_to_gtf -h
  Usage: gtftk bed_to_gtf [-i BED] [-o GTF] [-t ft_type] [-s source] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Convert a bed file to a gtf. This will make the poor bed feel as if it was a big/fat gtf (but with
     lots of empty fields...sniff). May be helpful sometimes...

  Version:  2018-02-11

Arguments:
 -i, --inputfile    Path to the poor BED file that would like to behave as if it was a GTF. (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -t, --ft-type      The type of features you are trying to mimic... (default: transcript)
 -s, --source       The source of annotation. (default: Unknown)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)

convert_ensembl

Description: Convert the GTF file to Ensembl format. Essentially add ‘transcript’/‘gene’ features.

Example: Delete gene and transcript feature. Regenerate them.

$ gtftk get_example | gtftk select_by_key -k feature -v gene,transcript -n| gtftk convert_ensembl | gtftk select_by_key -k gene_id -v G0001
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	CDS	125	130	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1	gtftk	CDS	130	132	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; ccds_id "CDS_G0001T001";
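
Regenerating ‘transcript’ and ‘gene’ lines amounts to taking, for each identifier, the smallest start and the largest end of its exons. The Python sketch below illustrates that idea only; it is not how gtftk implements the command. The helper name spans is hypothetical, and the exon coordinates of gene G0006 are inferred from the example dataset.

```python
def spans(exons, key):
    """Return {identifier: (smallest start, largest end)} for exons sharing `key`."""
    result = {}
    for exon in exons:
        ident = exon[key]
        start, end = result.get(ident, (exon["start"], exon["end"]))
        result[ident] = (min(start, exon["start"]), max(end, exon["end"]))
    return result


# Exons of gene G0006 (1-based, inclusive), inferred from the example dataset.
exons = [
    {"gene_id": "G0006", "transcript_id": "G0006T001", "start": 22, "end": 25},
    {"gene_id": "G0006", "transcript_id": "G0006T001", "start": 28, "end": 30},
    {"gene_id": "G0006", "transcript_id": "G0006T001", "start": 33, "end": 35},
    {"gene_id": "G0006", "transcript_id": "G0006T002", "start": 28, "end": 30},
    {"gene_id": "G0006", "transcript_id": "G0006T002", "start": 33, "end": 35},
]

transcript_spans = spans(exons, "transcript_id")  # one entry per transcript
gene_spans = spans(exons, "gene_id")              # one entry per gene
print(transcript_spans)
print(gene_spans)
```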

Arguments:

Type 'gtftk convert_ensembl -h' to display the arguments of this command.

Commands from section ‘annotation’

closest_genes

Description: Find the n closest genes for each gene.

Example:

$ gtftk get_example |  bedtools sort | gtftk closest_genes -f
genes	closest_genes	distances
G0009	G0006	21
G0006	G0005	12
G0005	G0006	12
G0003	G0004	4
G0004	G0003	4
G0007	G0001	18
G0001	G0007	18
G0010	G0002	4
G0002	G0010	4
G0008	G0002	42

Arguments:

$ gtftk closest_genes -h
  Usage: gtftk closest_genes [-i GTF] [-o GTF/TXT] [-r {tss,tts,gene}] [-nb nb_neighbors] [-t {tss,tts,gene}] [-s] [-S] [-f] [-H] [-k] [-id {gene_id,gene_name}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Find the n closest genes for each genes.

  Notes:
     *  The reference region for each gene can be the TSS (the most 5'), the TTS (The most 3') or
     the whole gene.
     *  The reference region for each closest gene can be the TSS, the whole gene or the TTS.
     *  The closest genes can be searched in a stranded or unstranded fashion.

  Version:  2018-02-11

optional arguments:
 -i, --inputfile          Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile         Output file. (default: <stdout>)

Arguments:
 -r, --from-region-type   What is region to consider for each gene. (default: tss)
 -nb, --nb-neighbors      The size of the neighborhood. (default: 1)
 -t, --to-region-type     What is region to consider for each closest gene. (default: tss)
 -s, --same-strandedness  Require same strandedness (default: False)
 -S, --diff-strandedness  Require different strandedness (default: False)
 -f, --text-format        Return a text format. (default: False)
 -H, --no-header          Don't print the header line. (default: False)
 -k, --collapse           Unwrap. Don't use comma. Print closest genes line by line. (default: False)
 -id, --identifier        The key used as gene identifier. (default: gene_id)

Command-wise optional arguments:
 -h, --help               Show this help message and exit.
 -V, --verbosity          Increase output verbosity. (default: 0)
 -D, --no-date            Do not add date to output file names. (default: False)
 -C, --add-chr            Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir            Keep all temporary files into this folder. (default: None)
 -A, --keep-all           Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file        Stores the arguments passed to the command into a file. (default: None)
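
With the default settings (reference region tss for both genes, one neighbor, strand ignored), the distances in the closest_genes example can be recomputed by hand. The Python sketch below is an illustration under those assumptions, not the gtftk code. The gene coordinates are read from the transcript lines of the example dataset, and the names GENES, tss and closest_gene are hypothetical.

```python
# gene_id: (start, end, strand) on chr1, read from the example dataset.
GENES = {
    "G0001": (125, 138, "+"), "G0002": (180, 189, "+"),
    "G0003": (50, 61, "-"), "G0004": (65, 76, "+"),
    "G0005": (33, 47, "-"), "G0006": (22, 35, "-"),
    "G0007": (107, 116, "+"), "G0008": (210, 222, "-"),
    "G0009": (3, 14, "-"), "G0010": (176, 186, "+"),
}


def tss(start, end, strand):
    """Most 5' position of a gene: start on '+', end on '-'."""
    return start if strand == "+" else end


def closest_gene(gene_id):
    """Return (closest gene, TSS-to-TSS distance) for one gene."""
    ref = tss(*GENES[gene_id])
    candidates = [
        (abs(tss(*coords) - ref), other)
        for other, coords in GENES.items()
        if other != gene_id
    ]
    distance, other = min(candidates)
    return other, distance


for gene_id in ("G0009", "G0001", "G0008"):
    print(gene_id, *closest_gene(gene_id), sep="\t")
```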

overlapping

Description: Find transcripts whose body/TSS/TTS region, extended in 5’ and 3’ (-u/-d), overlaps with any transcript from another gene. Strandedness is not considered by default. Use --invert-match to find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file are printed and a new key containing the list of overlapping transcripts is added to the transcript features/lines (the key is described as ‘overlapping_*’ with * one of body/TSS/TTS; in the examples of this section it appears as e.g. overlap_promoter_u0.01k_d0.01k). The --annotate-gtf and --invert-match arguments are mutually exclusive.

Example: Find transcripts whose promoter overlaps transcripts from other genes.

$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt;  gtftk get_example | gtftk overlapping -c simple_join_chromInfo.txt -t promoter -u 10 -d 10 -a    | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_promoter_u0.01k_d0.01k | head
transcript_id	overlap_promoter_u0.01k_d0.01k
G0001T002	G0007T001,G0007T002
G0001T001	G0007T001,G0007T002
G0002T001	G0010T001
G0003T001	G0004T002,G0004T001
G0004T002	G0003T001
G0004T001	G0003T001
G0005T001	G0003T001
G0006T001	G0005T001
G0006T002	G0005T001

Example: Find transcripts whose tts overlaps transcripts from other genes (on the other strand).

$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt;  gtftk get_example | gtftk overlapping -c simple_join_chromInfo.txt -t tts -u 30 -d 30 -a -S     | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_tts_u0.03k_d0.03k | head
transcript_id	overlap_tts_u0.03k_d0.03k
G0002T001	G0008T001
G0003T001	G0004T002,G0004T001
G0004T002	G0003T001,G0005T001
G0004T001	G0003T001,G0005T001
G0008T001	G0002T001,G0010T001
G0010T001	G0008T001

Arguments:

$ gtftk overlapping -h
  Usage: gtftk overlapping [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-t {transcript,promoter,tts}] [-s] [-S] [-n] [-a] [-k key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Find transcripts whose body/TSS/TTS region extended in 5' and 3' (-u/-d) overlaps with any
     transcript from another gene. Strandness is not considered by default. Used --invert-match to
     find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file will
     be printed and a new key containing the list of overlapping transcripts will be added to the
     transcript features/lines (key will be 'overlapping_*' with * one of body/TSS/TTS). The
     --annotate-gtf and --invert-match arguments are mutually exclusive.

  Version:  2018-01-24

Arguments:
 -i, --inputfile          Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile         Output file. (default: <stdout>)
 -c, --chrom-info         Chromosome information. A tabulated two-columns file with chromosomes as column 1 and sizes as column 2 (default: None)
 -u, --upstream           Extend the region in 5' by a given value (int). Used to define the region around the TSS/TTS. (default: 1500)
 -d, --downstream         Extend the region in 3' by a given value (int). Used to define the region around the TSS/TTS. (default: 1500)
 -t, --feature-type       The feature of interest. (default: transcript)
 -s, --same-strandedness  Require same strandedness (default: False)
 -S, --diff-strandedness  Require different strandedness (default: False)
 -n, --invert-match       Not/Invert match. (default: False)
 -a, --annotate-gtf       All lines of the original GTF will be printed. (default: False)
 -k, --key-name           The name of the key. (default: None)

Command-wise optional arguments:
 -h, --help               Show this help message and exit.
 -V, --verbosity          Increase output verbosity. (default: 0)
 -D, --no-date            Do not add date to output file names. (default: False)
 -C, --add-chr            Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir            Keep all temporary files into this folder. (default: None)
 -A, --keep-all           Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file        Stores the arguments passed to the command into a file. (default: None)
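
The promoter case of overlapping can be sketched in Python as follows. This is a simplified illustration, not the gtftk implementation: it works on 1-based inclusive coordinates, ignores strand, and the exact boundary handling of the real command may differ. The names TRANSCRIPTS, promoter and overlapping are hypothetical, and only a subset of the example dataset is used.

```python
# transcript_id: (gene_id, start, end, strand); subset of the example dataset.
TRANSCRIPTS = {
    "G0001T001": ("G0001", 125, 138, "+"),
    "G0001T002": ("G0001", 125, 138, "+"),
    "G0003T001": ("G0003", 50, 61, "-"),
    "G0004T001": ("G0004", 65, 76, "+"),
    "G0005T001": ("G0005", 33, 47, "-"),
    "G0007T001": ("G0007", 107, 116, "+"),
    "G0007T002": ("G0007", 107, 116, "+"),
}


def promoter(start, end, strand, upstream, downstream):
    """Region around the TSS, extended in 5' (upstream) and 3' (downstream)."""
    if strand == "+":
        return start - upstream, start + downstream
    return end - downstream, end + upstream


def overlapping(transcript_id, upstream=10, downstream=10):
    """Transcripts from other genes whose body overlaps the promoter region."""
    gene, start, end, strand = TRANSCRIPTS[transcript_id]
    low, high = promoter(start, end, strand, upstream, downstream)
    return sorted(
        other
        for other, (other_gene, o_start, o_end, _) in TRANSCRIPTS.items()
        if other_gene != gene and o_start <= high and o_end >= low
    )


print(overlapping("G0001T002"))
print(overlapping("G0005T001"))
```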

divergent

Description: Find transcripts with divergent promoters. These transcripts are defined here as those whose promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisense orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as described in “Divergent transcription is associated with promoters of transcriptional regulators” (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key (‘divergent’) whose value is set to ‘.’ if the gene has no antisense transcript in its promoter region. If the gene has an antisense transcript in its promoter region, the ‘divergent’ key is set to the identifier of the transcript whose tss is closest to the considered promoter. The tss-to-tss distance is also provided as an additional key (dist_to_divergent).

Example: Flag divergent transcripts in the example dataset. Select them and produce a tabulated output.

$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt;  gtftk get_example |  gtftk divergent -c simple_join_chromInfo.txt -u 10 -d 10| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,divergent,dist_to_divergent | head  -n 7
transcript_id	divergent	dist_to_divergent
G0003T001	G0004T002	4.0
G0004T002	G0003T001	4.0
G0004T001	G0003T001	4.0

Arguments:

$ gtftk divergent -h
  Usage: gtftk divergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-n] [-S] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Find transcripts with divergent promoters. These transcripts will be defined here as those whose
     promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisens
     orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as
     described in "Divergent transcription is associated with promoters of transcriptional
     regulators" (Lepoivre C, BMC Genomics, 2013). The ouput is a GTF with an additional key
     ('divergent') whose value is set to '.' if the gene has no antisens transcript in its promoter
     region. If the gene has an antisens transcript in its promoter region the 'divergent' key is
     set to the identifier of the transcript whose tss is the closest relative to the considered
     promoter. The tss to tss distance is also provided as an additional key (dist_to_divergent).

  Version:  2018-01-24

Arguments:
 -i, --inputfile      Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile     Output file. (default: <stdout>)
 -c, --chrom-info     Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
 -u, --upstream       Extend the promoter in 5' by a given value (int). Defines the region around the tss. (default: 1500)
 -d, --downstream     Extend the region in 3' by a given value (int). Defines the region around the tss. (default: 1500)
 -n, --no-annotation  Do not annotate the GTF. Just select the divergent transcripts. (default: False)
 -S, --no-strandness  Do not consider strandness (only look whether the promoter from a transcript overlap with the promoter from another gene). (default: False)
 -a, --key-name       The name of the key. (default: None)

Command-wise optional arguments:
 -h, --help           Show this help message and exit.
 -V, --verbosity      Increase output verbosity. (default: 0)
 -D, --no-date        Do not add date to output file names. (default: False)
 -C, --add-chr        Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir        Keep all temporary files into this folder. (default: None)
 -A, --keep-all       Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file    Stores the arguments passed to the command into a file. (default: None)
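
The definition of a divergent transcript can be made concrete with a short Python sketch. It is an illustration under simplifying assumptions (1-based coordinates, ties resolved arbitrarily), not the gtftk implementation. The names TRANSCRIPTS, tss and divergent are hypothetical, and only a subset of the example dataset is used.

```python
# transcript_id: (gene_id, start, end, strand); subset of the example dataset.
TRANSCRIPTS = {
    "G0001T001": ("G0001", 125, 138, "+"),
    "G0003T001": ("G0003", 50, 61, "-"),
    "G0004T001": ("G0004", 65, 76, "+"),
    "G0005T001": ("G0005", 33, 47, "-"),
}


def tss(start, end, strand):
    """Most 5' position of a transcript: start on '+', end on '-'."""
    return start if strand == "+" else end


def divergent(transcript_id, upstream=10, downstream=10):
    """Closest antisense TSS inside the promoter: (transcript, distance) or None."""
    gene, start, end, strand = TRANSCRIPTS[transcript_id]
    ref = tss(start, end, strand)
    if strand == "+":
        low, high = ref - upstream, ref + downstream
    else:
        low, high = ref - downstream, ref + upstream
    hits = []
    for other, (other_gene, o_start, o_end, o_strand) in TRANSCRIPTS.items():
        other_tss = tss(o_start, o_end, o_strand)
        # Keep only TSSs from another gene, on the opposite strand, in the promoter.
        if other_gene != gene and o_strand != strand and low <= other_tss <= high:
            hits.append((abs(other_tss - ref), other))
    if not hits:
        return None  # the command would report '.' in this case
    distance, other = min(hits)
    return other, distance


print(divergent("G0003T001"))
print(divergent("G0004T001"))
print(divergent("G0001T001"))
```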

convergent

Description: Find transcripts with convergent tts. These transcripts are defined here as those whose tts region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisense orientation. The output is a GTF with an additional key (‘convergent’) whose value is set to ‘.’ if the gene has no convergent transcript in its tts region. If the gene has an antisense transcript in its tts region, the ‘convergent’ key is set to the identifier of the transcript whose tts is closest to the considered tts. The tts-to-tts distance is also provided as an additional key (dist_to_convergent).

Example: Flag convergent transcripts in the example dataset. Select them and produce a tabulated output.

$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt;  gtftk get_example |  gtftk convergent -c simple_join_chromInfo.txt -u 25 -d 25| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,convergent,dist_to_convergent| head -n 4
transcript_id	convergent	dist_to_convergent
G0002T001	G0008T001	21.0
G0008T001	G0002T001	21.0
G0010T001	G0008T001	24.0

Arguments:

$ gtftk convergent -h
  Usage: gtftk convergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Find transcripts with convergent tts. These transcripts will be defined here as those whose tts
     region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisens
     orientation. The ouput is a GTF with an additional key ('convergent') whose value is set to '.'
     if the gene has no convergent transcript in its tts region. If the gene has an antisens
     transcript in its tts region the 'convergent' key is set to the identifier of the transcript
     whose tts is the closest relative to the considered tts. The tts to tts distance is also
     provided as an additional key (dist_to_convergent).

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -c, --chrom-info   Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
 -u, --upstream     Extend the tts in 5' by a given value (int). Defines the region around the tts. (default: 1500)
 -d, --downstream   Extend the region in 3' by a given value (int). Defines the region around the tts. (default: 1500)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
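
The tts-to-tts distances reported by convergent can be checked by hand. The Python sketch below is only an illustration of the definition, not the gtftk code; the names TRANSCRIPTS, tts and convergent are hypothetical, and the coordinates are those of the three transcripts listed in the example.

```python
# transcript_id: (start, end, strand), taken from the example dataset.
TRANSCRIPTS = {
    "G0002T001": (180, 189, "+"),
    "G0008T001": (210, 222, "-"),
    "G0010T001": (176, 186, "+"),
}


def tts(start, end, strand):
    """Most 3' position of a transcript: end on '+', start on '-'."""
    return end if strand == "+" else start


def convergent(transcript_id, window=25):
    """Closest antisense TTS within `window` bases: (transcript, distance) or None."""
    start, end, strand = TRANSCRIPTS[transcript_id]
    ref = tts(start, end, strand)
    hits = [
        (abs(tts(*coords) - ref), other)
        for other, coords in TRANSCRIPTS.items()
        if coords[2] != strand and abs(tts(*coords) - ref) <= window
    ]
    if not hits:
        return None
    distance, other = min(hits)
    return other, distance


for transcript_id in TRANSCRIPTS:
    print(transcript_id, convergent(transcript_id))
```

A single symmetric window is used here because the example sets -u and -d to the same value (25).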

exon_sizes

Description: Add a new key to transcript features containing a comma-separated list of exon sizes.

Example:

$ gtftk get_example | gtftk exon_sizes | gtftk select_by_key -t
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_sizes "14";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_sizes "14";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_sizes "10";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_sizes "5,5";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_sizes "4,1,3";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_sizes "4,1,3";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; exon_sizes "6,3";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_sizes "3,3,4";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; exon_sizes "3,3";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001"; exon_sizes "10";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002"; exon_sizes "10";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; exon_sizes "3,5";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_sizes "12";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_sizes "12";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001"; exon_sizes "11";

Arguments:

$ gtftk exon_sizes -h
  Usage: gtftk exon_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Add a new key to transcript features containing a comma separated list of exon sizes.

  Version:  2018-01-24

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile   Output GTF file. (default: <stdout>)
 -a, --key-name     The name of the key. (default: exon_sizes)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
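
The exon_sizes values can be recomputed from exon coordinates: in a GTF, coordinates are 1-based and inclusive, so a size is end - start + 1. The Python sketch below illustrates this; it is not the gtftk implementation. The function name is reused for readability only, and the ordering (5' to 3' along the transcript) is an assumption that matches the example output.

```python
def exon_sizes(exons, strand):
    """Comma-separated exon sizes, listed 5' to 3' along the transcript."""
    # On the minus strand the first exon is the one with the largest coordinates.
    ordered = sorted(exons, reverse=(strand == "-"))
    return ",".join(str(end - start + 1) for start, end in ordered)


# Exons (1-based, inclusive) inferred from the example dataset.
print(exon_sizes([(22, 25), (28, 30), (33, 35)], "-"))  # G0006T001
print(exon_sizes([(65, 68), (71, 71), (74, 76)], "+"))  # G0004T001
```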

intron_sizes

Description: Add a new key to transcript features containing a comma-separated list of intron sizes.

Example:

$ gtftk get_example | gtftk intron_sizes | gtftk select_by_key -t
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; intron_sizes "0";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; intron_sizes "0";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; intron_sizes "0";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; intron_sizes "2";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; intron_sizes "2,2";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; intron_sizes "2,2";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; intron_sizes "6";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; intron_sizes "2,2";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; intron_sizes "2";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001"; intron_sizes "0";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002"; intron_sizes "0";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; intron_sizes "5";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; intron_sizes "0";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; intron_sizes "0";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001"; intron_sizes "0";

Arguments:

$ gtftk intron_sizes -h
  Usage: gtftk intron_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Add a new key to transcript features containing a comma separated list of intron-size.

  Version:  2018-01-24

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile   Output file. (default: <stdout>)
 -a, --key-name     The name of the key. (default: intron_sizes)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
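
An intron size is the gap between two consecutive exons of the same transcript. The Python sketch below shows the arithmetic; it is not the gtftk implementation. Reporting "0" for single-exon transcripts follows the example output, while the 5' to 3' ordering is an assumption.

```python
def intron_sizes(exons, strand):
    """Comma-separated intron sizes; "0" when the transcript has a single exon."""
    ordered = sorted(exons)
    # Gap between the end of one exon and the start of the next one.
    gaps = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if strand == "-":
        gaps.reverse()
    return ",".join(str(gap) for gap in gaps) if gaps else "0"


# Exons (1-based, inclusive) inferred from the example dataset.
print(intron_sizes([(65, 68), (71, 71), (74, 76)], "+"))  # G0004T001
print(intron_sizes([(33, 35), (42, 47)], "-"))            # G0005T001
print(intron_sizes([(125, 138)], "+"))                    # G0001T001
```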

Commands from section ‘coordinates’

midpoints

Description: Get the genomic midpoint of each feature: genes, transcripts, exons or introns. Output is currently in BED format only.

Example: Get midpoints of all transcripts and exons.

$ gtftk get_example | gtftk midpoints -t transcript,exon -n transcript_id,feature | head -n 5
chr1	7	9	G0009T002|transcript	.	-
chr1	7	9	G0009T001|exon	.	-
chr1	7	9	G0009T001|transcript	.	-
chr1	7	9	G0009T002|exon	.	-
chr1	27	29	G0006T001|transcript	.	-

Arguments:

$ gtftk midpoints -h
  Usage: gtftk midpoints [-i GTF/BED] [-o BED] [-t ft_type] [-n NAME] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the midpoint coordinates for the requested feature. Output is bed format.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file (BED). (default: <stdout>)
 -t, --ft-type      The target feature (as found in the 3rd column of the GTF). (default: transcript)
 -n, --names        The key(s) that should be used as name. (default: transcript_id)
 -s, --separator    The separator to be used for separating name elements (see -n). (default: |)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
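
The midpoint coordinates in the example are consistent with the arithmetic sketched below in Python. This is an illustration, not the gtftk implementation: for features of even length the two central bases are returned, as in the example output, while the single-base result for odd lengths is an assumption. The function name midpoint is hypothetical.

```python
def midpoint(start, end):
    """BED (0-based, half-open) midpoint of a 1-based inclusive feature."""
    bed_start = start - 1
    length = end - bed_start
    middle = bed_start + length // 2
    if length % 2 == 0:
        return middle - 1, middle + 1  # two central bases
    return middle, middle + 1          # single central base (assumed convention)


print(midpoint(3, 14))   # transcripts of G0009
print(midpoint(22, 35))  # G0006T001
```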

get_5p_3p_coords

Description: Get the 5p or 3p coordinates for each feature (e.g. TSS or TTS for a transcript). Output is in BED format.

Example: Get the 5p ends of transcripts and exons.

$ gtftk get_example | gtftk get_5p_3p_coords -t transcript,exon -n transcript_id,gene_id,feature | head -n 5
chr1	124	125	G0001T002|G0001|transcript	.	+
chr1	124	125	G0001T002|G0001|exon	.	+
chr1	124	125	G0001T001|G0001|transcript	.	+
chr1	124	125	G0001T001|G0001|exon	.	+
chr1	179	180	G0002T001|G0002|transcript	.	+

Example: Get the 3p ends of transcripts and exons.

$ gtftk get_example | gtftk get_5p_3p_coords -t transcript,exon -n transcript_id,gene_id,feature -v -s "^"| head -n 5
chr1	137	138	G0001T002^G0001^transcript	.	+
chr1	137	138	G0001T002^G0001^exon	.	+
chr1	137	138	G0001T001^G0001^transcript	.	+
chr1	137	138	G0001T001^G0001^exon	.	+
chr1	188	189	G0002T001^G0002^transcript	.	+

Arguments:

$ gtftk get_5p_3p_coords -h
  Usage: gtftk get_5p_3p_coords [-i GTF] [-o BED] [-t ft_type] [-v] [-p transpose] [-n NAME] [-m more_names] [-s SEP] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get the 5p or 3p coordinate for each feature (e.g TSS or TTS for a transcript).

  Notes:
     *  Output is in BED format.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file (BED). (default: <stdout>)
 -t, --ft-type      The target feature (as found in the 3rd column of the GTF). (default: transcript)
 -v, --invert       Get 3' coordinate. (default: False)
 -p, --transpose    Transpose coordinate in 5' (use negative value) or in 3' (use positive values). (default: 0)
 -n, --names        The key(s) that should be used as name. (default: gene_id,transcript_id)
 -m, --more-names   A comma separated list of information to be added to the 'name' column of the bed file. (default: None)
 -s, --separator    The separator to be used for separating name elements (see -n). (default: |)
 -e, --explicit     Write explicitly the name of the keys in the header. (default: False)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
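
Which end of a feature is its 5p or 3p end depends on the strand. The Python sketch below illustrates the selection and the conversion to a one-base BED interval; it is not the gtftk implementation, the function name end_coord is hypothetical, and the --transpose option is not covered.

```python
def end_coord(start, end, strand, three_prime=False):
    """One-base BED interval for the 5' end (default) or 3' end of a feature."""
    # The 5' end is the start on '+' and the end on '-'; the 3' end is the opposite.
    use_start = (strand == "+") != three_prime
    position = start if use_start else end
    return position - 1, position


print(end_coord(125, 138, "+"))                    # 5' end of G0001T002
print(end_coord(125, 138, "+", three_prime=True))  # 3' end of G0001T002
print(end_coord(50, 61, "-"))                      # 5' end of a minus-strand feature
```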

intergenic

Description: Extract intergenic regions. This command requires a chromInfo file to compute the bed file boundaries. The command will print the coordinates of genomic regions without transcript features.

Example: Simply get intergenic regions.

$ gtftk get_example -f chromInfo > simple_join_chromInfo.txt; gtftk get_example |  gtftk intergenic   -c simple_join_chromInfo.txt
chr1	0	2	region_1	0	.
chr1	14	21	region_2	0	.
chr1	47	49	region_3	0	.
chr1	61	64	region_4	0	.
chr1	76	106	region_5	0	.
chr1	116	124	region_6	0	.
chr1	138	175	region_7	0	.
chr1	189	209	region_8	0	.
chr1	222	300	region_9	0	.
chr2	0	600	region_10	0	.

Arguments:

$ gtftk intergenic -h
  Usage: gtftk intergenic [-i GTF] [-o BED] -c CHROMINFO [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Extract intergenic regions. This command requires a chromInfo file to compute the bed file
     boundaries. The command will print the coordinates of genomic regions without any transcript
     features.

  Version:  2018-01-20

Arguments:
 -i, --inputfile    Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile   Output file (BED). (default: <stdout>)
 -c, --chrom-info   Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)

Command-wise optional arguments:
 -h, --help         Show this help message and exit.
 -V, --verbosity    Increase output verbosity. (default: 0)
 -D, --no-date      Do not add date to output file names. (default: False)
 -C, --add-chr      Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir      Keep all temporary files into this folder. (default: None)
 -A, --keep-all     Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file  Stores the arguments passed to the command into a file. (default: None)
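
Intergenic regions are the complement of the merged transcript intervals within each chromosome, which is why a chromInfo file is required. The Python sketch below reproduces the chr1 lines of the example under that reading; it is not the gtftk implementation. The function name intergenic is reused for readability, and the chromosome size (300) is read from the last chr1 line of the example output.

```python
def intergenic(features, chrom_size):
    """Gaps (BED, 0-based half-open) not covered by 1-based inclusive features."""
    gaps, cursor = [], 0
    for start, end in sorted(features):
        if start - 1 > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end)  # features may overlap or be nested
    if cursor < chrom_size:
        gaps.append((cursor, chrom_size))
    return gaps


# Transcript coordinates on chr1, from the example dataset.
transcripts = [
    (125, 138), (180, 189), (50, 61), (65, 76), (33, 47), (22, 35),
    (28, 35), (107, 116), (210, 222), (3, 14), (176, 186),
]
gaps = intergenic(transcripts, chrom_size=300)
for number, (start, end) in enumerate(gaps, start=1):
    print("chr1", start, end, f"region_{number}", 0, ".", sep="\t")
```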

intronic

Description: Returns a bed file containing the intronic regions. If by_transcript is false (default), returns merged genic regions with no exonic overlap (“strict” mode). Otherwise, the intronic regions corresponding to each transcript are returned (may contain exonic overlap and redundancy).

Example: Simply get intronic regions.

$ gtftk get_example |  gtftk intronic | head -n 5
chr1	25	27
chr1	30	32
chr1	35	41
chr1	54	56
chr1	68	70

Example: Intronic regions of each transcript.

$ gtftk get_example |  gtftk intronic -b
chr1	54	56	intron|G0003|G0003T001	1	-
chr1	68	70	intron|G0004|G0004T002	1	+
chr1	71	73	intron|G0004|G0004T002	2	+
chr1	68	70	intron|G0004|G0004T001	1	+
chr1	71	73	intron|G0004|G0004T001	2	+
chr1	35	41	intron|G0005|G0005T001	1	-
chr1	25	27	intron|G0006|G0006T001	2	-
chr1	30	32	intron|G0006|G0006T001	1	-
chr1	30	32	intron|G0006|G0006T002	1	-
chr1	214	219	intron|G0008|G0008T001	1	-

Arguments:

$ gtftk intronic -h
  Usage: gtftk intronic [-i GTF] [-o BED] [-b] [-n NAME] [-s SEP] [-w] [-F] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Returns a bed file containing the intronic regions. If by_transcript is false (default), returns
     merged genic regions with no exonic overlap ("strict" mode). Otherwise, the intronic regions
     corresponding to each transcript are returned (may contain exonic overlap and redundancy).

  Version:  2018-01-20

Arguments:
 -i, --inputfile          Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile         Output file (BED). (default: <stdout>)
 -b, --by-transcript      The intronic regions are returned for each transcript. (default: False)
 -n, --names              The key(s) that should be used as name (if -b is used). (default: gene_id,transcript_id)
 -s, --separator          The separator to be used for separating name elements (if -b is used). (default: |)
 -w, --intron-nb-in-name  By default intron number is written in 'score' column. Force it to be written in 'name' column. transcript. (default: False)
 -F, --no-feature-name    Don't add the feature name ('intron') in the name column. (default: False)

Command-wise optional arguments:
 -h, --help               Show this help message and exit.
 -V, --verbosity          Increase output verbosity. (default: 0)
 -D, --no-date            Do not add date to output file names. (default: False)
 -C, --add-chr            Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir            Keep all temporary files into this folder. (default: None)
 -A, --keep-all           Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file        Stores the arguments passed to the command into a file. (default: None)
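
The difference between the two modes of intronic can be sketched in Python as follows. This is an illustration, not the gtftk implementation, and it enumerates positions, which is only reasonable for toy data. The names introns and strict_introns are hypothetical; exon coordinates are inferred from the example dataset.

```python
def introns(exons):
    """Introns of one transcript as BED intervals, from 1-based inclusive exons."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def strict_introns(transcripts):
    """Merged BED intervals that are intronic in some transcript and exonic in none."""
    intronic = {p for exons in transcripts for s, e in introns(exons) for p in range(s, e)}
    exonic = {p for exons in transcripts for s, e in exons for p in range(s - 1, e)}
    intervals = []
    for position in sorted(intronic - exonic):
        if intervals and intervals[-1][1] == position:
            intervals[-1][1] = position + 1  # extend the current interval
        else:
            intervals.append([position, position + 1])
    return [tuple(interval) for interval in intervals]


# Exons of G0006T001, G0006T002 and G0005T001.
transcripts = [
    [(22, 25), (28, 30), (33, 35)],
    [(28, 30), (33, 35)],
    [(33, 35), (42, 47)],
]
print(introns(transcripts[0]))      # per-transcript mode (-b)
print(strict_introns(transcripts))  # default "strict" mode
```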

splicing_site

Description: Compute the locations of donor and acceptor splice sites. This command will return a single position which corresponds to the most 5’ and/or the most 3’ intronic region. If the GTF file does not contain exon numbering, you can compute it using the add_exon_nb command. The score column of the BED file contains the number of the closest exon relative to the splice site.

Example:

$ gtftk get_example | gtftk add_exon_nb -k exon_nbr | gtftk splicing_site  -k exon_nbr| head
chr1	54	55	acceptor|G0003T001E001|G0003T001|G0003	2	-
chr1	55	56	donor|G0003T001E002|G0003T001|G0003	1	-
chr1	68	69	donor|G0004T002E001|G0004T002|G0004	1	+
chr1	71	72	donor|G0004T002E002|G0004T002|G0004	2	+
chr1	69	70	acceptor|G0004T002E002|G0004T002|G0004	2	+
chr1	72	73	acceptor|G0004T002E003|G0004T002|G0004	3	+
chr1	68	69	donor|G0004T001E001|G0004T001|G0004	1	+
chr1	71	72	donor|G0004T001E002|G0004T001|G0004	2	+
chr1	69	70	acceptor|G0004T001E002|G0004T001|G0004	2	+
chr1	72	73	acceptor|G0004T001E003|G0004T001|G0004	3	+

Arguments:

$ gtftk splicing_site -h
  Usage: gtftk splicing_site [-i GTF] [-o BED] [-k exon_numbering_key] [-n NAME] [-s SEP] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Compute the locations of donor and acceptor splice sites.

  Notes:
     *  This will return a single position which corresponds to the most 5' and/or the most 3'
     intronic region. If the gtf file does not contain exon numbering you can compute it using the
     add_exon_nb command. The score column of the bed file contain the number of the closest exon
     relative to the splice site.

  Version:  2018-01-20

Arguments:
 -i, --inputfile           Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile          Output file. (default: <stdout>)
 -k, --exon-numbering-key  The name of the key containing the exon numbering (exon_number in ensembl) (default: exon_number)
 -n, --names               The key(s) that should be used as name. (default: exon_id,transcript_id,gene_id)
 -s, --separator           The separator to be used for separating name elements (see -n). (default: |)

Command-wise optional arguments:
 -h, --help                Show this help message and exit.
 -V, --verbosity           Increase output verbosity. (default: 0)
 -D, --no-date             Do not add date to output file names. (default: False)
 -C, --add-chr             Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir             Keep all temporary files into this folder. (default: None)
 -A, --keep-all            Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file         Stores the arguments passed to the command into a file. (default: None)

shift

Description: Shift coordinates in 3’ or 5’ direction.

Example:

$ gtftk get_example|  head -n 1
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
$ gtftk get_example -f chromInfo > simple.chromInfo; gtftk get_example |  gtftk shift -s -10 -c simple.chromInfo | head -n 1
chr1	gtftk	gene	115	128	.	+	.	gene_id "G0001";
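
The non-stranded behaviour shown above (125-138 becomes 115-128 with -s -10) can be sketched as follows. This is an illustration under assumptions, not the gtftk code: the function name shift_feature is hypothetical, the chromosome size of 300 is assumed for the example, and the handling of features leaving the chromosome is a simplified reading of the notes on -a.

```python
def shift_feature(start, end, shift, chrom_size, allow_outside=False):
    """Shift a 1-based inclusive interval by `shift` bases (non-stranded).

    Without allow_outside, coordinates are clamped to [1, chrom_size], so
    features pile up at chromosome ends when the shift is very large.
    With allow_outside, a feature is truncated at the chromosome ends and
    dropped (None) once it lies entirely outside.
    """
    new_start, new_end = start + shift, end + shift
    if allow_outside:
        if new_end < 1 or new_start > chrom_size:
            return None                       # feature disappeared
        return max(new_start, 1), min(new_end, chrom_size)

    def clamp(pos):
        return min(max(pos, 1), chrom_size)

    return clamp(new_start), clamp(new_end)


print(shift_feature(125, 138, -10, chrom_size=300))                        # same move as the example
print(shift_feature(125, 138, -1000, chrom_size=300))                      # clamped at chromosome start
print(shift_feature(125, 138, -1000, chrom_size=300, allow_outside=True))  # dropped
```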

Arguments:

$ gtftk shift -h
  Usage: gtftk shift [-i GTF] [-o GTF] -s shift_value [-d] [-a] -c CHROMINFO [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Transpose coordinates in 3' or 5' direction.

  Notes:
     *  By default shift is not strand specific. Meaning that if --shift-value is set to 10, all
     coordinates will be moved 10 bases in 5' direction relative to the forward/watson/plus/top
     strand.
     *  Use a negative value to shift in 3' direction, a positive value to shift in 5' direction.
     *  If --stranded is true, features are transposed in 5' direction relative to their associated
     strand.
     *  By default, features are not allowed to go outside the genome coordinates. In the current
     implementation, in case this would happen (using a very large --shift-value), feature would
     accumulate at the ends of chromosomes irrespectively of gene or transcript structures giving
     rise, ultimately, to several exons from the same transcript having the same starts or ends.
     *  One can forced features to go outside the genome and ultimatly dissapear with large --shift-
     value by using -a.

  Version:  2018-01-20

Arguments:
 -i, --inputfile      Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile     Output file. (default: <stdout>)
 -s, --shift-value    Shift coordinate by s nucleotides. (default: 0)
 -d, --stranded       By default shift not . (default: False)
 -a, --allow-outside  Accept the partial or total disappearance of a feature upon shifting. (default: False)
 -c, --chrom-info     Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)

Command-wise optional arguments:
 -h, --help           Show this help message and exit.
 -V, --verbosity      Increase output verbosity. (default: 0)
 -D, --no-date        Do not add date to output file names. (default: False)
 -C, --add-chr        Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir        Keep all temporary files into this folder. (default: None)
 -A, --keep-all       Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file    Stores the arguments passed to the command into a file. (default: None)

Commands from section ‘sequence’

get_tx_seq

Description: Get transcript sequences in fasta format.

Example: Get sequences of transcripts in 5’ to 3’ orientation

$ gtftk get_example -f fa > simple.fa; gtftk get_example | gtftk get_tx_seq -g simple.fa | head -n 4
>transcript|G0001T002|G0001|chr1|125|138
cccccgttacgtag
>transcript|G0001T001|G0001|chr1|125|138
cccccgttacgtag

Note that the format is rather flexible and any combination of keys can be exported to the header.

$ gtftk get_example | gtftk get_tx_seq -g simple.fa  -l gene_id,transcript_id,feature,chrom,start,end,strand  | head -n 2
>G0001|G0001T002|transcript|chr1|125|138|+
cccccgttacgtag

You can explicitly add the names of the keys to the header with -e. Here we also add the size of the mature transcript and the number of exons.

$ gtftk get_example | gtftk feature_size -t mature_rna | gtftk nb_exons| gtftk get_tx_seq -g simple.fa -l feature,transcript_id,gene_id,seqid,start,end,feat_size,nb_exons -e | head -n 2
>feature=transcript|transcript_id=G0001T002|gene_id=G0001|seqid=chr1|start=125|end=138|feat_size=14|nb_exons=1
cccccgttacgtag
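
The way the header is assembled from the label keys, the separator and the explicit flag can be sketched as below. The function fasta_header and the use of "?" for a missing key are assumptions made for illustration; only the resulting header strings are taken from the examples above.

```python
def fasta_header(record, keys, sep="|", explicit=False):
    """Build a FASTA header from selected keys of a transcript record.

    record   : dict mapping key names to values.
    keys     : ordered list of keys to export (cf. -l).
    sep      : separator between fields (cf. -s).
    explicit : prefix each value with 'key=' (cf. -e).
    """
    fields = []
    for key in keys:
        value = str(record.get(key, "?"))     # '?' for a missing key is an assumption
        fields.append(f"{key}={value}" if explicit else value)
    return ">" + sep.join(fields)


tx = {
    "feature": "transcript", "transcript_id": "G0001T002", "gene_id": "G0001",
    "seqid": "chr1", "start": 125, "end": 138, "feat_size": 14, "nb_exons": 1,
}
default_keys = ["feature", "transcript_id", "gene_id", "seqid", "start", "end"]
print(fasta_header(tx, default_keys))
print(fasta_header(tx, default_keys + ["feat_size", "nb_exons"], explicit=True))
```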

You may use a wildcard (with the path enclosed in quotes) if the genome is split into several chromosome files:

$ gtftk get_example |  gtftk get_tx_seq -g '*.fa' -l gene_id,transcript_id,feature,chrom,start,end,strand -s "," | head -n 2
>G0001,G0001T002,transcript,chr1,125,138,+
cccccgttacgtag

A header format intended to be compliant with sleuth is also available.

$ gtftk get_example |  gtftk get_tx_seq -g '*.fa'  -f -n  | head -n 2
>G0001T002 chromosome:GRCm38:chr1:125:138:1 gene:G0001 gene_biotype:? transcript_biotype:?
cccccgttacgtag
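
To make the notion of a sequence returned in 5’ to 3’ orientation more concrete, here is a minimal sketch of how a transcript sequence can be assembled from exon coordinates. It is not the gtftk implementation: the function names and the toy chromosome are hypothetical, and the with_introns and rev_comp parameters only mimic what -w and -n are described to do.

```python
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequence(chrom_seq, exons, strand, with_introns=False, rev_comp=True):
    """Extract a transcript sequence from a chromosome sequence.

    exons : list of (start, end) pairs, 1-based inclusive (GTF convention).
    """
    ordered = sorted(exons)
    if with_introns:
        # Whole genomic span, from the first exon start to the last exon end.
        seq = chrom_seq[ordered[0][0] - 1:ordered[-1][1]]
    else:
        # Spliced (mature) sequence: exons concatenated in genomic order.
        seq = "".join(chrom_seq[start - 1:end] for start, end in ordered)
    if strand == "-" and rev_comp:
        seq = reverse_complement(seq)         # report minus-strand genes 5' to 3'
    return seq


toy_chrom = "aaccggttacgt"                    # hypothetical 12-base chromosome
print(transcript_sequence(toy_chrom, [(1, 3), (7, 9)], "+"))
print(transcript_sequence(toy_chrom, [(1, 3), (7, 9)], "-"))
```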

Arguments:

$ gtftk get_tx_seq -h
  Usage: gtftk get_tx_seq [-i GTF] [-o FASTA] -g FASTA [-w] [-s SEP] [-l label] [-f] [-d] [-a assembly] [-c] [-n] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get transcripts sequences in a flexible fasta format from a GTF file.

  Notes:
     *  The sequences are returned in 5' to 3' orientation.
     *  If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.

  Version:

Arguments:
 -i, --inputfile       Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile      Output FASTA file. (default: <stdout>)
 -g, --genome          The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
 -w, --with-introns    Set to true to include intronic regions. (default: False)
 -s, --separator       To separate info in header. (default: |)
 -l, --label           A set of key for the header. (default: feature,transcript_id,gene_id,seqid,start,end)
 -f, --sleuth-format   Produce output in sleuth format. (default: False)
 -d, --delete-version  In case of --sleuth-format, delete gene_id or transcript_id version number (e.g '.2' in ENSG56765.2). (default: False)
 -a, --assembly        In case of --sleuth-format, an assembly version. (default: GRCm38)
 -c, --del-chr         When using --sleuth-format delete 'chr' in sequence id. (default: False)
 -n, --no-rev-comp     Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
 -e, --explicit        Write explicitly the name of the keys in the header. (default: False)

Command-wise optional arguments:
 -h, --help            Show this help message and exit.
 -V, --verbosity       Increase output verbosity. (default: 0)
 -D, --no-date         Do not add date to output file names. (default: False)
 -C, --add-chr         Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir         Keep all temporary files into this folder. (default: None)
 -A, --keep-all        Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file     Stores the arguments passed to the command into a file. (default: None)

get_feat_seq

Description: Get feature sequences (e.g. exon, UTR…).

Example:

$ gtftk get_feat_seq -i simple.gtf -g simple.fa  -l feature,transcript_id,start -t  exon -n | head -10
>G0001T002|125|exon|125|138
cccccgttacgtag
>G0001T001|125|exon|125|138
cccccgttacgtag
>G0002T001|180|exon|180|189
ggccttatta
>G0003T001|50|exon|50|54
caagc
>G0003T001|50|exon|57|61
taatt

Arguments:

$ gtftk get_feat_seq -h
  Usage: gtftk get_feat_seq [-i GTF] [-o FASTA] -g genome [-s separator] [-l label] [-t feature_type] [-n] [-e] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Get feature sequences in a flexible fasta format from a GTF file.

  Notes:
     *  The sequences are returned in 5' to 3' orientation.
     *  If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.

  Version:

Arguments:
 -i, --inputfile     Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile    Output FASTA file. (default: <stdout>)
 -g, --genome        The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
 -s, --separator     To separate info in header. (default: |)
 -l, --label         A set of key for the header that will be extracted from the transcript line. (default: feature,transcript_id,gene_id,seqid,start,end)
 -t, --feature-type  The feature type (one defined in column 3). (default: exon)
 -n, --no-rev-comp   Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
 -e, --explicit      Write explicitly the name of the keys in the header. (default: False)

Command-wise optional arguments:
 -h, --help          Show this help message and exit.
 -V, --verbosity     Increase output verbosity. (default: 0)
 -D, --no-date       Do not add date to output file names. (default: False)
 -C, --add-chr       Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir       Keep all temporary files into this folder. (default: None)
 -A, --keep-all      Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file   Stores the arguments passed to the command into a file. (default: None)

Commands from section ‘coverage’

coverage

Description: Takes a GTF as input to compute bigWig coverage in regions of interest (promoter, transcript body, intron, intron_by_tx, tts…) or a BED6 to focus on user-defined regions. If --n-highest is used, the program computes the coverage of each bigWig as the average value of the n windows (--nb-window) with the highest coverage values. With a GTF file as input, the regions where signal can be computed are promoter, tts, introns, intergenic regions or any feature available in the GTF file (transcript, exon, gene…). If --matrix-out is selected, the signal for each bigWig is provided in a dedicated column. Otherwise, the signal for each bigWig is provided on a dedicated line.

Example:

We will first request a lightweight example dataset.

$ gtftk get_example -d mini_real -f '*'
 |-- 17:46-INFO-get_example : Copying: #H3K4me3_cond_1.bed#
 |-- 17:46-INFO-get_example : Copying: airway_love.txt.gz
 |-- 17:46-INFO-get_example : Copying: ENCFF112BHN_H3K4me3_K562_sub.bed
 |-- 17:46-INFO-get_example : Copying: ENCFF119BYM_H3K36me3_K562_sub.bed
 |-- 17:46-INFO-get_example : Copying: ENCFF431HAA_H3K36me3_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: ENCFF742FDS_H3K4me3_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: ENCFF947DVY_H3K79me2_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_1.bed
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_2.bed
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_3.bed
 |-- 17:46-INFO-get_example : Copying: hg38.genome
 |-- 17:46-INFO-get_example : Copying: hg38.genome.back
 |-- 17:46-INFO-get_example : Copying: mini_real.gtf.gz
 |-- 17:46-INFO-get_example : Copying: mini_real_control_1.txt
 |-- 17:46-INFO-get_example : Copying: mini_real_counts_ENCFF630HEX.txt
 |-- 17:46-INFO-get_example : Copying: mini_real_gn_list_hg38.txt
 |-- 17:46-INFO-get_example : Copying: tx_classes.txt

Although we could work on the full dataset, we will focus on transcripts whose promoter region does not overlap any transcript from another gene.

$ gtftk overlapping -i mini_real.gtf.gz -c hg38.genome  -n > mini_real_noov.gtf

We will select a representative transcript for each gene. Here we will perform this step using random_tx although another interesting choice would be rm_dup_tss.

$ gtftk random_tx -i mini_real_noov.gtf  -m 1 -s 123 > mini_real_noov_rnd_tx.gtf

Now we will compute the coverage of promoter regions using 3 bigWig files as input.

$ gtftk coverage -l H3K4me3,H3K79me2,H3K36me3 -u 5000 -d 5000 -i mini_real_noov_rnd_tx.gtf -c hg38.genome -m transcript_id,gene_name -x ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -k 4 > coverage.bed

Now we can have a look at the result:

$ head -n 10 coverage.bed
chrom	start	end	name	strand	H3K4me3	H3K79me2	H3K36me3
chr1	996137	1006138	ENST00000624697	+	5.859314	4.0025	1.632737
chr1	1370168	1380169	ENST00000321751	-	25.743926000000002	12.174783	3.20208
chr1	1913796	1923797	ENST00000378602	-	5.943206	2.382962	1.3906610000000001
chr1	2189747	2199748	ENST00000420515	-	9.861013999999999	6.5540449999999995	2.187781
chr1	2492781	2502782	ENST00000473964	+	1.312969	1.083992	1.123888
chr1	3064210	3074211	ENST00000270722	+	1.0	1.0	1.0
chr1	3645072	3655073	ENST00000270708	-	10.365663	4.864514	3.6577339999999996
chr1	6419669	6429670	ENST00000377837	-	2.9228080000000003	2.569843	2.130887
chr1	9178749	9188750	ENST00000437157	-	1.331067	1.272973	1.436956
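
The effect of combining --nb-window with --n-highest, as described at the top of this section, can be illustrated with a small sketch. It works on a plain list of per-base signal values and is not the gtftk implementation; the function names and the way the region is cut into windows are assumptions.

```python
def window_means(values, nb_window):
    """Split per-base signal values into nb_window bins and average each bin."""
    n = len(values)
    if n < nb_window:
        raise ValueError("region shorter than the number of windows")
    # Integer bounds so that every bin holds at least one value.
    bounds = [i * n // nb_window for i in range(nb_window + 1)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def n_highest_coverage(values, nb_window, n_highest):
    """Average of the n_highest windows with the highest mean signal."""
    means = sorted(window_means(values, nb_window), reverse=True)
    top = means[:n_highest]
    return sum(top) / len(top)


signal = [1, 1, 5, 5, 2, 2, 9, 9]             # hypothetical per-base signal
print(window_means(signal, nb_window=4))
print(n_highest_coverage(signal, nb_window=4, n_highest=2))
```

Using only the highest windows makes the score less sensitive to the parts of a region that carry no signal, at the cost of ignoring the overall shape of the profile.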

Arguments:

$ gtftk coverage -h
  Usage: gtftk coverage [-i GTF/BED] [-o TXT] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-w nb_window] [-k nb_proc] [-f ft_type] [-l labels] [-m name_column] [-p pseudo_count] [-n n_highest] [-x] [-zn] [-a key_name] [-s {mean,sum}] [-h] [-V ] [-D] [-C] [-K] [-A] [-L] bw_list [bw_list ...]

  Description: 
     *  Takes a GTF as input to compute bigwig coverage in regions of interest (promoter, transcript body,
     intron, intron_by_tx, tts...) or a BED6 to focus on user-defined regions. If --n-highest is
     used the program will compute the coverage of each bigwig based on the average value of the n
     windows (--nb-window) with the highest coverage values.

  Notes:
     *  Regions were signal can be computed (if GTF file as input): promoter/tss, tts, introns,
     intron_by_tx, intergenic regions or any feature available in the GTF file (transcript, exon,
     gene...).
     *  If --matrix-out is selected, the signal for each bigwig will be provided in a dedicated
     column. Otherwise, signal for each bigwig is provided through a dedicated line.
     *  If bed is used as input, each region should have its own name (column 4).

  Version:  2018-02-05

Arguments:
 bw_list             A list of Bigwig file (last argument).
 -i, --inputfile     The input GTF/BED file. Only GTF file if <stdin> is used. (default: <stdin>)
 -o, --outputfile    Output file. (default: <stdout>)
 -c, --chrom-info    Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
 -u, --upstream      Extend the regions in 5' by a given value (int). (default: 0)
 -d, --downstream    Extend the regions in 3' by a given value (int). (default: 0)
 -w, --nb-window     Split the region into w bins (see -n). (default: 1)
 -k, --nb-proc       Use this many threads to compute coverage. (default: 1)
 -f, --ft-type       Region in which coverage is to be computed (promoter, intron, intergenic, tts or any feature defined in the column 3 of the GTF). (default: promoter)
 -l, --labels        Bigwig labels. (default: None)
 -m, --name-column   Use this ids to compute the name (4th column in bed output). (default: transcript_id)
 -p, --pseudo-count  A pseudo-count to add in case count is equal to 0. (default: 1)
 -n, --n-highest     For each bigwig, use the n windows with higher values to compute coverage. (default: None)
 -x, --matrix-out    Matrix output format. Bigwigs as column names features as rows. (default: False)
 -zn, --zero-to-na   Use NA not zero when region is undefined in bigwig or below window size. (default: False)
 -a, --key-name      If gtf format is requested, the name of the key. (default: cov)
 -s, --stat          The statistics to be computed for each region. (default: mean)

Command-wise optional arguments:
 -h, --help          Show this help message and exit.
 -V, --verbosity     Increase output verbosity. (default: 0)
 -D, --no-date       Do not add date to output file names. (default: False)
 -C, --add-chr       Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir       Keep all temporary files into this folder. (default: None)
 -A, --keep-all      Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file   Stores the arguments passed to the command into a file. (default: None)

mk_matrix

Description: Gtftk implements commands that can be used to produce coverage profiles around genomic features or inside user-defined regions. A coverage matrix must first be produced from one or more bigWig files using the mk_matrix command.

Example:

We will use the same dataset (mini_real.gtf) as produced above (see help on coverage command).

We can now create a coverage matrix around TSS/TTS or along the full transcript (with or without 5’ and 3’ regions). Provide a BED file as --inputfile if you want to use your own, user-specific, regions. We will create several example datasets:

First we will create a coverage matrix around promoters based on a subset of randomly chosen transcripts (one per gene) from the ‘mini_real’ dataset (see the section on the coverage command for details about the construction of the mini_real_noov_rnd_tx.gtf.gz dataset).

$ gtftk get_example -f '*' -d mini_real
 |-- 17:46-INFO-get_example : Copying: #H3K4me3_cond_1.bed#
 |-- 17:46-INFO-get_example : Copying: airway_love.txt.gz
 |-- 17:46-INFO-get_example : Copying: ENCFF112BHN_H3K4me3_K562_sub.bed
 |-- 17:46-INFO-get_example : Copying: ENCFF119BYM_H3K36me3_K562_sub.bed
 |-- 17:46-INFO-get_example : Copying: ENCFF431HAA_H3K36me3_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: ENCFF742FDS_H3K4me3_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: ENCFF947DVY_H3K79me2_K562_sub.bw
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_1.bed
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_2.bed
 |-- 17:46-INFO-get_example : Copying: H3K4me3_cond_3.bed
 |-- 17:46-INFO-get_example : Copying: hg38.genome
 |-- 17:46-INFO-get_example : Copying: hg38.genome.back
 |-- 17:46-INFO-get_example : Copying: mini_real.gtf.gz
 |-- 17:46-INFO-get_example : Copying: mini_real_control_1.txt
 |-- 17:46-INFO-get_example : Copying: mini_real_counts_ENCFF630HEX.txt
 |-- 17:46-INFO-get_example : Copying: mini_real_gn_list_hg38.txt
 |-- 17:46-INFO-get_example : Copying: tx_classes.txt
$ gtftk get_example -f '*' -d mini_real_noov_rnd_tx
 |-- 17:46-INFO-get_example : Copying: mini_real_noov_rnd_tx.gtf.gz
$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -d 5000 -u 5000 -w 200 -c hg38.genome  -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_promoter

Then we will also compute the coverage profile around the TTS.

$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -t tts  -d 5000 -u 5000 -w 200 -c hg38.genome  -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tts

The following command computes the coverage profile along the whole transcript:

$ gtftk mk_matrix -k 5 -i mini_real_noov_rnd_tx.gtf.gz -t transcript  -d 5000 -u 5000 -w 200 -c hg38.genome  -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tx
 |-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:47-WARNING-mk_matrix : ENST00000612829 has length : 85
 |-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:47-WARNING-mk_matrix : Filter them out please.
 |-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:47-WARNING-mk_matrix : ENST00000385018 has length : 82
 |-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:47-WARNING-mk_matrix : Filter them out please.
 |-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:47-WARNING-mk_matrix : ENST00000583764 has length : 85
 |-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:47-WARNING-mk_matrix : Filter them out please.
 |-- 17:47-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:47-WARNING-mk_matrix : ENST00000637495 has length : 68
 |-- 17:47-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:47-WARNING-mk_matrix : Filter them out please.

The same along the whole transcript, but increasing the number of windows dedicated to upstream and downstream regions:

$ gtftk mk_matrix -k 5 --bin-around-frac 0.5 -i mini_real_noov_rnd_tx.gtf.gz -t transcript  -d 5000 -u 5000 -w 200 -c hg38.genome  -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_tx_2
 |-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:48-WARNING-mk_matrix : ENST00000612829 has length : 85
 |-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:48-WARNING-mk_matrix : Filter them out please.
 |-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:48-WARNING-mk_matrix : ENST00000385018 has length : 82
 |-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:48-WARNING-mk_matrix : Filter them out please.
 |-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:48-WARNING-mk_matrix : ENST00000583764 has length : 85
 |-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:48-WARNING-mk_matrix : Filter them out please.
 |-- 17:48-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:48-WARNING-mk_matrix : ENST00000637495 has length : 68
 |-- 17:48-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:48-WARNING-mk_matrix : Filter them out please.

Along a user-defined set of regions (in BED6 format). Here we will use the transcript coordinates in BED format as an example.

$ gtftk select_by_key -i mini_real_noov_rnd_tx.gtf.gz -k feature -v transcript | gtftk convert -f bed6 > mini_real_rnd_tx.bed
$ gtftk mk_matrix -k 5 --bin-around-frac 0.5 -i mini_real_rnd_tx.bed -t user_regions  -d 5000 -u 5000 -w 200 -c hg38.genome  -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_user_def
 |-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:49-WARNING-mk_matrix : ENSG00000187514|ENST00000612829 has length : 85
 |-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:49-WARNING-mk_matrix : Filter them out please.
 |-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:49-WARNING-mk_matrix : ENSG00000207751|ENST00000385018 has length : 82
 |-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:49-WARNING-mk_matrix : Filter them out please.
 |-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:49-WARNING-mk_matrix : ENSG00000110717|ENST00000583764 has length : 85
 |-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:49-WARNING-mk_matrix : Filter them out please.
 |-- 17:49-WARNING-mk_matrix : Encountered regions shorter than bin number.
 |-- 17:49-WARNING-mk_matrix : ENSG00000148120|ENST00000637495 has length : 68
 |-- 17:49-WARNING-mk_matrix : They will be set to NA or --pseudo-count depending on --zero-to-na.
 |-- 17:49-WARNING-mk_matrix : Filter them out please.
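
The warnings above ask for regions shorter than the number of bins to be filtered out. One possible way to do this on a BED file before calling mk_matrix is sketched below. The function name and the coordinates in the usage lines are hypothetical; only the region name and its length of 85 are taken from the warnings.

```python
def split_short_regions(bed_lines, nb_window):
    """Separate BED lines into regions long enough for nb_window bins and the rest.

    A BED region spans end - start bases (0-based, half-open coordinates).
    Returns (kept, short) as two lists of the original lines.
    """
    kept, short = [], []
    for line in bed_lines:
        if not line.strip():
            continue                          # skip empty lines
        fields = line.rstrip("\n").split("\t")
        length = int(fields[2]) - int(fields[1])
        (kept if length >= nb_window else short).append(line)
    return kept, short


# Hypothetical coordinates; the second region is 85 bases long.
bed = [
    "chr1\t10000\t15000\tGENE_A|TX_A\t0\t+\n",
    "chr1\t20000\t20085\tENSG00000187514|ENST00000612829\t0\t+\n",
]
kept, short = split_short_regions(bed, nb_window=200)
print(len(kept), "kept,", len(short), "too short")
```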

And finally, using a set of single-nucleotide coordinates that will be extended (-u/-d) and assessed for coverage. Here we will take the coordinates of TSSs as an example.

$ gtftk select_by_key -i mini_real_noov_rnd_tx.gtf.gz -k feature -v transcript |  gtftk get_5p_3p_coords > tss.bed
$ gtftk mk_matrix -k 5 -u 5000 -d 5000 -i tss.bed -w 200 -l  H3K4me3,H3K79me,H3K36me3 ENCFF742FDS_H3K4me3_K562_sub.bw ENCFF947DVY_H3K79me2_K562_sub.bw ENCFF431HAA_H3K36me3_K562_sub.bw -o mini_real_single_nuc -c hg38.genome -t single_nuc
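
The extension of a single-nucleotide coordinate with -u and -d can be sketched as follows. This is an illustration, not the gtftk code: the function name is hypothetical, the clamping to chromosome bounds is an assumption, and the TSS position was chosen so that the result matches the first region of the coverage example (chr1 996137 1006138).

```python
def extend_position(start, end, strand, upstream, downstream, chrom_size):
    """Extend a BED interval in a strand-aware way.

    upstream / downstream are relative to the feature orientation, so they
    are swapped on the minus strand. The result is clamped to
    [0, chrom_size] (BED coordinates, 0-based, half-open).
    """
    if strand == "-":
        left, right = downstream, upstream
    else:
        left, right = upstream, downstream
    return max(start - left, 0), min(end + right, chrom_size)


CHR1_HG38 = 248956422                         # size of chr1 in hg38
print(extend_position(1001137, 1001138, "+", 5000, 5000, CHR1_HG38))
```

A one-base interval extended by 5000 bases on each side gives a region of 10001 bases, which is the width of the regions in the coverage output.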

profile

Description: This command is used to create profile diagrams from a mk_matrix output. The two important arguments for this command are --group-by, which defines the variable controlling the set of colored lines, and --facet-var, which defines the variable controlling the way the plot is faceted. Both --group-by and --facet-var should be set to one of bwig, tx_classes or chrom.

Basic profiles

A simple overlaid profile of all epigenetic marks around promoters. Here --group-by is set to bwig by default and --facet-var is set to None. Thus a single plot with several lines corresponding to bwig coverage is obtained.

$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_01.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_01.png

The same diagram is obtained if a bed file pointing to TSS was provided to mk_matrix and used in single_nuc mode.

$ gtftk profile -i mini_real_single_nuc.zip -o profile_prom -pf png -if example_01a.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_01a.png

Changing colors and applying color order can be done using the following syntax:

$ gtftk profile -D -i mini_real_promoter.zip -c 'red,blue,violet' -d H3K79me,H3K4me3,H3K36me3 -o profile_prom -pf png -if example_01b.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_01b.png

A subset of the bigWigs assessed for coverage can be selected for plotting. This is achieved using the --subset-bwig argument:

$ gtftk profile -f bwig -g tx_classes -D -i mini_real_tx.zip  -fo  -o profile_tx -pf png -if example_01c.png  -fo -c 'red' -V 2  -w  -tl -e -lw 0.5 -u H3K4me3
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_xf6dvl8r
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_xf6dvl8r/mini_real_tx
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-DEBUG-profile : Color order :['All transcripts']
 |-- 17:50-DEBUG-profile : Profile color :['red']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Ceiling
 |-- 17:50-INFO-profile : Zero value detected. Adding a pseudocount (+1) before log transformation.
 |-- 17:50-INFO-profile : Converting to log2.
 |-- 17:50-INFO-profile : Computing column ordering.
 |-- 17:50-INFO-profile : Preparing diagram
 |-- 17:50-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:50-INFO-profile : Preparing x axis
 |-- 17:50-INFO-profile : facet_col 1
 |-- 17:50-INFO-profile : Highlighting upstream regions
 |-- 17:50-INFO-profile : Page width set to 4
 |-- 17:50-INFO-profile : Page height set to 2.0
 |-- 17:50-INFO-profile : Saving diagram to file : example_01c.png
 |-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
example_01c.png

Transcript coverage is obtained using the mini_real_tx.zip matrix. This provides a simple overlaid profile of all epigenetic marks along the transcript body extended in 5’ and 3’ regions:

$ gtftk profile -D -i mini_real_tx.zip -o profile_tx -pf png -if example_02.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_02.png

Almost the same, but increasing the number of bins dedicated to upstream and downstream regions (see the --bin-around-frac argument of mk_matrix).

$ gtftk profile -D -i mini_real_tx_2.zip -o profile_tx -pf png -if example_03.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_03.png

Note that the same profile is obtained when using user-defined regions (i.e. when providing as input a BED file corresponding to transcript coordinates).

$ gtftk profile -D -i mini_real_user_def.zip -o profile_udef_4  -pf png -if example_04.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_04.png

The same dataset is used for plotting, but with an added normalization step (ranging). With ranging normalization, values are expressed as a percentage of the range between the min and max values.

$ gtftk profile -D -nm ranging -i mini_real_user_def.zip -o profile_udef_5  -pf png -if example_04b.png
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
example_04b.png
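
Ranging normalization as described above amounts to a min-max rescaling expressed in percent. A minimal sketch is given below. The function name is hypothetical and the handling of a flat signal (all values equal) is an assumption.

```python
def ranging(values):
    """Express each value as a percentage of the range between min and max."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]          # flat signal: assumption, no range to scale by
    return [100.0 * (v - lowest) / (highest - lowest) for v in values]


print(ranging([2, 4, 6]))                     # min maps to 0, max maps to 100
```

After this step, profiles from bigWigs with very different signal intensities share the same 0 to 100 scale and can be compared by shape.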

Two examples using the statistic ‘max’ and two different values of --upper-limit.

$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_04_max_a.png  -V 2 -lw 1 -at 5 -s max -ul 1
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_7qeuvl3_
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_7qeuvl3_/mini_real_promoter
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
 |-- 17:50-DEBUG-profile : Color order :['H3K36me3', 'H3K4me3', 'H3K79me']
 |-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Computing column ordering.
 |-- 17:50-INFO-profile : Preparing diagram
 |-- 17:50-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:50-INFO-profile : Preparing x axis
 |-- 17:50-INFO-profile : facet_col 1
 |-- 17:50-INFO-profile : Page width set to 3
 |-- 17:50-INFO-profile : Page height set to 2
 |-- 17:50-INFO-profile : Saving diagram to file : example_04_max_a.png
 |-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
example_04_max_a.png
$ gtftk profile -D -i mini_real_promoter.zip -o profile_prom -pf png -if example_04_max_b.png  -V 2 -lw 1 -at 5 -s max -ul 0.99
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_fc_fcunw
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_fc_fcunw/mini_real_promoter
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-WARNING-profile : --group-by not set. Choosing 'bwig'.
 |-- 17:50-DEBUG-profile : Color order :['H3K79me', 'H3K4me3', 'H3K36me3']
 |-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Ceiling
 |-- 17:50-INFO-profile : Computing column ordering.
 |-- 17:50-INFO-profile : Preparing diagram
 |-- 17:50-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:50-INFO-profile : Preparing x axis
 |-- 17:50-INFO-profile : facet_col 1
 |-- 17:50-INFO-profile : Page width set to 3
 |-- 17:50-INFO-profile : Page height set to 2
 |-- 17:50-INFO-profile : Saving diagram to file : example_04_max_b.png
 |-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
example_04_max_b.png
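
The -ul 0.99 option used in the command above sets an upper limit from a quantile computed on unique values (see the --upper-limit help text). The sketch below illustrates that idea only. The linear-interpolation quantile, the clipping of values above the limit, and the function names are assumptions for illustration, not the code used by profile.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def ceil_at_quantile(values, q):
    """Clip values above the q-th quantile computed on the unique values."""
    # Assumption: the limit is derived from unique values, then applied to all values.
    limit = quantile(sorted(set(values)), q)
    return [min(v, limit) for v in values]


# Made-up coverage values with one extreme outlier.
signal = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 100.0]
print(ceil_at_quantile(signal, 0.75))  # [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Computing the quantile on unique values keeps a large number of repeated zeros from pulling the limit down.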

Faceted profiles

Faceted plots of epigenetic profiles. The groups (i.e., colors/lines) can be set to bwig classes and the facets to transcript classes. This is done by providing an additional file listing the transcripts and their associated classes.

Example:

$ gtftk profile -D -i mini_real_promoter.zip -f tx_classes -g bwig -fo -t tx_classes.txt -o profile_prom  -pf png -if example_05.png -e -V 2 -fc 2
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_86ftqnm9
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_86ftqnm9/mini_real_promoter
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-INFO-profile : Reading transcript file.
 |-- 17:50-INFO-profile : Deleting duplicates in transcript-file.
 |-- 17:50-INFO-profile : Checking how many genes where found in the transcript list.
 |-- 17:50-INFO-profile : Keeping 804 transcript out of 833.
 |-- 17:50-DEBUG-profile : Color order :['H3K79me', 'H3K4me3', 'H3K36me3']
 |-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Ceiling
 |-- 17:50-INFO-profile : Computing column ordering.
 |-- 17:50-INFO-profile : Preparing diagram
 |-- 17:50-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:50-INFO-profile : Preparing x axis
 |-- 17:50-INFO-profile : facet_col 2
 |-- 17:50-INFO-profile : Page width set to 6
 |-- 17:50-INFO-profile : Page height set to 5.0
 |-- 17:50-INFO-profile : Saving diagram to file : example_05.png
 |-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
example_05.png
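
The file given to -t is a two-column file with the transcripts of interest and their classes. The log above shows that duplicates are removed and that only transcripts found in the matrix are kept. A minimal sketch of such a filtering step follows. The function name, the tab separator, the "first occurrence wins" rule and the transcript IDs are hypothetical, chosen for illustration only.

```python
def load_tx_classes(lines, known_transcripts):
    """Parse 'transcript<TAB>class' lines, dropping duplicates and unknown IDs."""
    classes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        tx_id, tx_class = line.split("\t")[:2]
        # Assumption: the first occurrence of a transcript wins.
        if tx_id in known_transcripts and tx_id not in classes:
            classes[tx_id] = tx_class
    return classes


# Hypothetical transcript IDs and classes.
lines = [
    "tx_0001\tprotein_coding\n",
    "tx_0002\tlincRNA\n",
    "tx_0001\tprotein_coding\n",  # duplicate line
    "tx_0003\tantisense\n",       # not present in the matrix
]
kept = load_tx_classes(lines, known_transcripts={"tx_0001", "tx_0002"})
print(f"Keeping {len(kept)} transcripts out of {len(lines)} lines.")
```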

Alternatively, the groups can be set to chromosomes or transcript classes:

$ gtftk profile -D -i mini_real_promoter.zip -g tx_classes -f bwig -fo -t tx_classes.txt -o profile_prom  -pf png -if example_06.png -V 2 -nm ranging
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_9k2idgi4
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_9k2idgi4/mini_real_promoter
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-INFO-profile : Reading transcript file.
 |-- 17:50-INFO-profile : Deleting duplicates in transcript-file.
 |-- 17:50-INFO-profile : Checking how many genes where found in the transcript list.
 |-- 17:50-INFO-profile : Keeping 804 transcript out of 833.
 |-- 17:50-DEBUG-profile : Color order :['lincRNA', 'protein_coding', 'antisense']
 |-- 17:50-DEBUG-profile : Profile color :['#000000', '#00bb00', '#cccccc']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Ceiling
 |-- 17:50-INFO-profile : Normalizing (ranging)
 |-- 17:50-INFO-profile : Computing column ordering.
 |-- 17:50-INFO-profile : Preparing diagram
 |-- 17:50-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:50-INFO-profile : Preparing x axis
 |-- 17:50-INFO-profile : facet_col 3
 |-- 17:50-INFO-profile : Page width set to 9
 |-- 17:50-INFO-profile : Page height set to 2.0
 |-- 17:50-INFO-profile : Saving diagram to file : example_06.png
 |-- 17:50-INFO-profile : Be patient. This may be long for large datasets.
example_06.png
$ gtftk profile -D -i mini_real_promoter.zip -g chrom -f bwig -fo -t tx_classes.txt -o profile_prom  -pf png -if example_06b.png -V 2 -nm ranging
 |-- 17:50-DEBUG-profile : Using pandas version 0.23.4
 |-- 17:50-DEBUG-profile : Pandas location /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/pandas/__init__.py
 |-- 17:50-DEBUG-profile : Using numpy version 1.13.3
 |-- 17:50-DEBUG-profile : Pandas numpy /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/numpy/__init__.py
 |-- 17:50-DEBUG-profile : Using plotnine version 0.4.0
 |-- 17:50-DEBUG-profile : Pandas plotnine /Users/puthier/miniconda3/envs/pygtftk_py3k/lib/python3.6/site-packages/plotnine/__init__.py
 |-- 17:50-DEBUG-profile : Uncompressing : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_z89mi377
 |-- 17:50-DEBUG-profile : Reading : /var/folders/nl/stqvvbcn4pg9v65k8kyjv5yh0000gn/T/pygtftk_matrix_z89mi377/mini_real_promoter
 |-- 17:50-INFO-profile : Getting configuration info from input file.
 |-- 17:50-DEBUG-profile : Color order :['chr11', 'chr15', 'chr2', 'chr3', 'chr13', 'chr14', 'chr1', 'chr6', 'chr4', 'chr9', 'chr21', 'chr8', 'chr7', 'chr10', 'chr17', 'chr18', 'chrX', 'chr19', 'chr12', 'chr22', 'chr5', 'chr20', 'chr16']
 |-- 17:50-DEBUG-profile : Profile color :['#000000', '#6c007c', '#850096', '#2500a5', '#0000ca', '#0041dd', '#0086dd', '#009fca', '#00aaa1', '#00a76f', '#009c00', '#00bb00', '#00da00', '#00f900', '#88ff00', '#dbf400', '#f7db00', '#ffb500', '#ff6100', '#f60000', '#da0000', '#cc1313', '#cccccc']
 |-- 17:50-INFO-profile : Searching coverage columns.
 |-- 17:50-INFO-profile : Melting.
 |-- 17:50-INFO-profile : Ceiling
 |-- 17:51-INFO-profile : Normalizing (ranging)
 |-- 17:51-INFO-profile : Computing column ordering.
 |-- 17:51-INFO-profile : Preparing diagram
 |-- 17:51-INFO-profile : Theming and ordering. Please be patient...
 |-- 17:51-INFO-profile : Preparing x axis
 |-- 17:51-INFO-profile : facet_col 3
 |-- 17:51-INFO-profile : Page width set to 9
 |-- 17:51-INFO-profile : Page height set to 2.0
 |-- 17:51-INFO-profile : Saving diagram to file : example_06b.png
 |-- 17:51-INFO-profile : Be patient. This may be long for large datasets.
example_06b.png

Note that facets may also be associated with epigenetic marks. In this case, --group-by can be set to tx_classes or chrom.

$ gtftk profile -D -i mini_real_tx_2.zip -g tx_classes -t tx_classes.txt -f bwig  -o profile_tx -pf png -if example_07.png  -fo -w -nm ranging
example_07.png
$ gtftk profile -D -i mini_real_tx_2.zip -g chrom -f bwig  -o profile_tx -pf png -if example_08.png  -fo -w -nm ranging
example_08.png

Theming

The --theme-plotnine (-th) argument controls plotnine theming.

$ gtftk profile -th classic -D -i mini_real_promoter.zip -g bwig -f chrom  -o profile_prom  -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_09b.png
example_09b.png
$ gtftk profile -th seaborn -D -i mini_real_promoter.zip -g bwig -f chrom  -o profile_prom   -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_10.png
example_10.png
$ gtftk profile -th matplotlib -D -i mini_real_promoter.zip -g bwig -f chrom  -o profile_prom   -c "#66C2A5,#FC8D62,#8DA0CB,#6734AF" -pf png -if example_11.png
example_11.png

Playing with various commands

It is also possible to combine several of the previously seen commands to easily achieve more complex analyses. Here we plot the epigenetic signal according to RNA-seq counts.

$ gtftk join_attr -i mini_real_noov_rnd_tx.gtf  -j mini_real_counts_ENCFF630HEX.txt  -k gene_name -n counts | gtftk discretize_key -k  counts -n 6 -d count_levels -pu | gtftk tabulate -k transcript_id,count_levels -o tx_exp_classes.txt -Hun
 |-- 17:52-INFO-discretize_key : Categories: ['(-41.703_183.167]', '(183.167_574.667]', '(574.667_1035.0]', '(1035.0_1647.333]', '(1647.333_3212.667]', '(3212.667_41703.0]']
$ gtftk profile -D -i mini_real_tx.zip -o profile_tx -pf png -if example_12.png  -g tx_classes -t tx_exp_classes.txt -f bwig  -w -nm ranging -m viridis
example_12.png
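
The pipeline above turns numeric counts into six ordered classes before plotting. How discretize_key computes its breaks is not shown in this section, so the sketch below only illustrates one possible strategy: rank-based classes of roughly equal size. The function name and the count values are made up.

```python
def equal_frequency_classes(values, n_classes):
    """Assign each value a class index in 0..n_classes-1 based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, idx in enumerate(order):
        # Ranks are spread evenly across classes; ties are split by rank.
        classes[idx] = rank * n_classes // len(values)
    return classes


# Made-up counts keyed by hypothetical transcript IDs.
counts = {"tx1": 12, "tx2": 950, "tx3": 3, "tx4": 4100, "tx5": 230, "tx6": 1500}
labels = equal_frequency_classes(list(counts.values()), 3)
for tx, label in zip(counts, labels):
    print(f"{tx}\tclass_{label}")
```

The printed two-column table has the same shape as the transcript/class file expected by profile -t.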

Arguments:

$ gtftk profile -h
  Usage: gtftk profile -i MATRIX [-o DIR] [-t transcript_file] [-s {mean,median,sum,min,max}] [-e] [-c profile_colors] [-d color_order] [-g {bwig,tx_classes,chrom}] [-f {bwig,tx_classes,chrom}] [-pw page_width] [-ph page_height] [-pf {pdf,png}] [-lw line_width] [-bc border_color] [-x x_lab] [-at axis_text] [-st strip_text] [-u subset_bwig] [-fc facet_col] [-fo] [-w] [-if user_img_file] [-ul upper_limit] [-nm {none,ranging}] [-tl] [-ti title] [-dpi dpi] [-th {538,bw,grey,gray,linedraw,light,dark,minimal,classic,void,test,matplotlib,seaborn}] [-m palette] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Produces bigWig coverage profiles using calls to plotnine graphic package.

  Notes:
     *  The ranging normalization method [1] implies the following transformation:
     *  -  (x_i - min(x))/(max(x) - min(x)).
     *  Think about using normalized bigWig files as input to mk_matrix. This will limit the
     requirement for an additional normalization step (see Deeptools for a set of useful methods
     implemented in bamCoverage/bamCompare).

  References:
     *  [1] Numerical Ecology - second Edition - P. Legendre, L. Legendre (1998) Elsevier.

  Version:  2018-01-20

Arguments:
 -i, --inputfile              A zip file containing a matrix as produced by mk_matrix. (default: None)
 -o, --out-dir                Output directory name. (default: draw_profile)
 -t, --transcript-file        A two columns file with the transcripts of interest and their classes. (default: None)
 -s, --stat                   The statistics to be computed. (default: mean)
 -e, --confidence-interval    Add a confidence interval to estimate standard error of the mean. (default: False)
 -c, --profile-colors         Colors. (default: None)
 -d, --color-order            Factor ordering. Comma separated bwig labels or tx classes. (default: None)
 -g, --group-by               The variable used for grouping. (default: None)
 -f, --facet-var              The variable to be used for splitting into facets. (default: None)
 -pw, --page-width            Output pdf file width (e.g. 7 inches). (default: None)
 -ph, --page-height           Output file height (e.g. 5 inches). (default: None)
 -pf, --page-format           Output file format. (default: pdf)
 -lw, --line-width            Line width. (default: 1.25)
 -bc, --border-color          Border color for the plot. (default: #777777)
 -x, --x-lab                  X axis label. (default: Selected genomic regions)
 -at, --axis-text             Size of axis text. (default: 8)
 -st, --strip-text            Size of strip text. (default: 8)
 -u, --subset-bwig            Use only a subset of the bigwigs for plotting (default: None)
 -fc, --facet-col             Number of facet columns. (default: 4)
 -fo, --force-tx-class        Force even if some transcripts from --transcript-file were not found. (default: False)
 -w, --show-group-number      Show the number of element per group. (default: False)
 -if, --user-img-file         Provide an alternative path for the image. (default: None)
 -ul, --upper-limit           Upper limit based on quantile computed from unique values. (default: 0.95)
 -nm, --normalization-method  The normalization method performed on a per bigwig basis. (default: none)
 -tl, --to-log                Control whether the data should be log2-transform before plotting. (default: False)
 -ti, --title                 A title for the diagram. (default: )
 -dpi, --dpi                  Dpi to use. (default: 300)
 -th, --theme-plotnine        The theme for plotnine diagram. (default: bw)
 -m, --palette                A color palette (see: https://tinyurl.com/ydacyfxx). (default: nipy_spectral)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Increase output verbosity. (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
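
The ranging normalization given in the notes of the profile help, (x_i - min(x))/(max(x) - min(x)), rescales a signal to the [0, 1] interval. A minimal sketch follows. The function name and the handling of a constant signal are assumptions, not taken from the profile source.

```python
def ranging(values):
    """Rescale values to [0, 1] using (x_i - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant signal maps to 0.0 to avoid a division by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Made-up coverage values for one bigWig.
coverage = [2.0, 4.0, 6.0, 10.0]
print(ranging(coverage))  # [0.0, 0.25, 0.5, 1.0]
```

Since the help text says normalization is performed per bigWig, such a function would be applied to each bigWig's values separately.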

Commands from section ‘miscellaneous’

col_from_tab

Description: Select columns from a tabulated file based on their names.

Example:

$ gtftk get_example | gtftk tabulate -k all |gtftk col_from_tab -c start,end,seqid | head -n 20
start	end	seqid
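
For readers who want the same column selection outside gtftk, here is a minimal sketch using Python's csv module. It selects columns by name, in the requested order, from a tab-separated table. The function name and the table content are made up for illustration and are not part of gtftk.

```python
import csv
import io


def select_columns(tab_text, columns):
    """Return the requested columns, in the requested order, as tab-separated text."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)  # header line
    for row in reader:
        writer.writerow([row[name] for name in columns])
    return out.getvalue()


# Made-up table; real input would come from 'gtftk tabulate'.
table = (
    "seqid\tstart\tend\tfeature\n"
    "chr1\t125\t138\tgene\n"
    "chr1\t125\t138\ttranscript\n"
)
print(select_columns(table, ["start", "end", "seqid"]), end="")
```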

Arguments:

$ gtftk col_from_tab -h
  Usage: gtftk col_from_tab [-i GTF] [-o TXT] -c columns [-n] [-u] [-s SEP] [-H] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Select columns from a tabulated file based on their names.

  Version:  2018-01-20

Arguments:
 -i, --inputfile     The tabulated file. Default to STDIN (default: <stdin>)
 -o, --outputfile    Output file. (default: <stdout>)
 -c, --columns       The list (csv) of column names. (default: None)
 -n, --invert-match  Not/invert match. (default: False)
 -u, --unique        Write non redondant lines. (default: False)
 -s, --separator     The separator to be used for separating name elements (see -n). (default: )
 -H, --no-header     Don't print the header line. (default: False)

Command-wise optional arguments:
 -h, --help          Show this help message and exit.
 -V, --verbosity     Increase output verbosity. (default: 0)
 -D, --no-date       Do not add date to output file names. (default: False)
 -C, --add-chr       Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir       Keep all temporary files into this folder. (default: None)
 -A, --keep-all      Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file   Stores the arguments passed to the command into a file. (default: None)

control_list

Description: Returns a list of genes matched for expression based on reference values. Given a reference gene list (or, more generally, a list of IDs), this command tries to extract a set of other genes/IDs matched for signal/expression. The --referenceGeneFile (-r) argument provides the list of reference IDs, while --in-file (-i) provides a gene/signal pair for every gene.

Example:

$ gtftk control_list -i mini_real_counts_ENCFF630HEX.txt -r mini_real_control_1.txt -D -V 2 -s -l -p 1 -ju -if example_13.png -pf png
 |-- 17:52-INFO-control_list : 0 duplicate lines have been deleted in reference file.
 |-- 17:52-INFO-control_list : Found 50 genes of the reference in the provided signal file
 |-- 17:52-INFO-control_list : All reference genes were found.
 |-- 17:52-INFO-control_list : Searching for genes with matched signal.
 |-- 17:53-INFO-control_list : Preparing a dataframe for plotting.
 |-- 17:53-INFO-control_list : Saving diagram to file : example_13.png
 |-- 17:53-INFO-control_list : Be patient. This may be long for large datasets.
example_13.png
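
The idea of extracting control genes matched for signal can be sketched as follows. This is not the algorithm implemented by control_list, which is not described here. It is a simple greedy strategy, assumed for illustration, that picks for each reference ID the unused non-reference ID with the closest signal. The function name, gene names and values are hypothetical.

```python
def matched_controls(signal, reference_ids):
    """For each reference ID, pick the unused non-reference ID with the closest signal."""
    reference = set(reference_ids)
    candidates = {gid: val for gid, val in signal.items() if gid not in reference}
    controls = []
    for ref in reference_ids:
        if not candidates:
            break  # no more genes available for matching
        best = min(candidates, key=lambda gid: abs(candidates[gid] - signal[ref]))
        controls.append(best)
        del candidates[best]  # each control gene is used only once
    return controls


# Hypothetical gene/signal pairs, as in a two-column input file.
signal = {"geneA": 10.0, "geneB": 11.0, "geneC": 200.0, "geneD": 190.0, "geneE": 55.0}
print(matched_controls(signal, ["geneA", "geneC"]))  # ['geneB', 'geneD']
```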

Arguments:

$ gtftk control_list -h
  Usage: gtftk control_list --in-file TXT --referenceGeneFile TXT [--out-dir DIR] [--log2] [--pseudo-count pseudo_count] [-pw page_width] [-ph page_height] [-pf {pdf,png}] [-dpi dpi] [--skip-first] [--rug] [--jitter] [-if user_img_file] [-c set_colors] [-h] [-V ] [-D] [-C] [-K] [-A] [-L]

  Description: 
     *  Based on a reference gene list (or more generally IDs) this command tries to extract a set of
     other genes/IDs matched for signal/expression. The --reference-gene-file contains the list of
     reference IDs while the --inputfile contains a tuple gene/signal for all genes.

  Notes:
     *  --infile is a two columns tabulated file. The first column contains the list of ids
     (including reference IDs) and the second column contains the expression/signal values. This
     file should contain no header.
     *  Think about discarding any unwanted IDs from --infile before calling control_list.

  Version:  2018-01-20

optional arguments:
 --in-file, -i            A two columns tab-file. See notes. (default: None)
 --referenceGeneFile, -r  The file containing the reference gene list (1 column, transcript ids). No header. (default: None)
 --out-dir, -o            Name of the output directory. (default: control_list)
 --log2, -l               If selected, data will be log transformed. (default: False)
 --pseudo-count, -p       The value for a pseudo-count to be added. (default: 0)
 -pw, --page-width        Output pdf file width (e.g. 7 inches). (default: None)
 -ph, --page-height       Output file height (e.g. 5 inches). (default: None)
 -pf, --page-format       Output file format. (default: pdf)
 -dpi, --dpi              Dpi to use. (default: 300)
 --skip-first, -s         Indicates that infile hase a header. (default: False)
 --rug, -u                Add rugs to the diagram. (default: False)
 --jitter, -j             Add jittered points. (default: False)
 -if, --user-img-file     Provide an alternative path for the image. (default: None)
 -c, --set-colors         Colors for the two sets (comma separated). (default: #b2df8a,#6a3d9a)

Command-wise optional arguments:
 -h, --help               Show this help message and exit.
 -V, --verbosity          Increase output verbosity. (default: 0)
 -D, --no-date            Do not add date to output file names. (default: False)
 -C, --add-chr            Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir            Keep all temporary files into this folder. (default: None)
 -A, --keep-all           Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file        Stores the arguments passed to the command into a file. (default: None)