6.3.3.6. gtf2bed¶
The gtf2bed
script converts 1-based, closed [start, end]
Gene Transfer Format v2.2 (GTF2.2) to sorted, 0-based, half-open [start-1, end)
extended BED-formatted data.
For convenience, we also offer gtf2starch
, which performs the extra step of creating a Starch-formatted archive.
6.3.3.6.1. Dependencies¶
The gtf2bed
script requires convert2bed. The gtf2starch
script requires starch. Both dependencies are part of a typical BEDOPS installation.
This script is also dependent on input that follows the GTF 2.2 specification. A GTF-format validator is available here to ensure your input follows specification.
Tip
Conversion of data which are GTF-like, but which do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the GTF specification.
6.3.3.6.2. Source¶
The gtf2bed
and gtf2starch
conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.
6.3.3.6.3. Usage¶
The gtf2bed
script parses GTF from standard input and prints sorted BED to standard output. The gtf2starch
script uses an extra step to parse GTF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.
Tip
By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort
option. Run the script with the --help
option for more details.
Tip
If sorting converted data larger than system memory, use the --max-mem
option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G
or similar. See --help
for more details.
6.3.3.6.4. Example¶
To demonstrate these scripts, we use a sample GTF input called foo.gtf
(see the Downloads section to grab this file).
chr20 protein_coding exon 9874841 9874841 . + . gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36"; gene_name "ZNF366";
chr20 protein_coding CDS 9873504 9874841 . + 0 gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36"; gene_name "ZNF366";
chr20 protein_coding exon 9877488 9877679 . + . gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36";
We can convert it to sorted BED data in the following manner:
$ gtf2bed < foo.gtf
chr20 9874840 9874841 ZNF366 . + protein_coding exon . gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36"; gene_name "ZNF366"; zero_length_insertion "True";
chr20 9873503 9874841 ZNF366 . + protein_coding CDS 0 gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36"; gene_name "ZNF366";
chr20 9877487 9877679 ENSBTAG00000020601 . + protein_coding exon . gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT0000002.4.36";
Tip
After, say, performing set or statistical operations with bedops, bedmap etc., converting data back to GTF is accomplished through an awk
statement that re-orders columns and shifts the coordinate index:
$ awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' foo_subset.bed > foo_subset.gtf
Note
Zero-length insertion elements are given an extra attribute called zero_length_insertion
which lets a BED-to-GTF or other parser know that the element will require conversion back to a right-closed element [a, b]
, where a
and b
are equal.
Note
Note the conversion from 1- to 0-based coordinate indexing, in the transition from GTF to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here is believed to be convenient for most end users.
6.3.3.6.5. Column mapping¶
In this section, we describe how GTF2.2 columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:
GFF2.2 field | BED column index | BED field |
---|---|---|
seqname | 1 | chromosome |
start | 2 | start |
end | 3 | stop |
gene_id | 4 | id |
score | 5 | score |
strand | 6 | strand |
The remaining columns are mapped as follows:
GFF2.2 field | BED column index | BED field |
---|---|---|
source | 7 | |
feature | 8 | |
frame | 9 | |
attributes | 10 |
If present in the GTF2.2 input, the following column is also mapped:
GFF2.2 field | BED column index | BED field |
---|---|---|
comments | 11 |
If we encounter zero-length insertion elements (which are defined where the start
and stop
GFF3 field values are equivalent), the start
coordinate is decremented to convert to 0-based, half-open indexing, and a zero_length_insertion
attribute is added to the attributes
GTF2.2 field value.