6.3.3.5. gff2bed

The gff2bed script converts 1-based, closed [start, end] General Feature Format v3 (GFF3) to sorted, 0-based, half-open [start-1, end) extended BED-formatted data.

For convenience, we also offer gff2starch, which performs the extra step of creating a Starch-formatted archive.

6.3.3.5.1. Dependencies

The gff2bed script requires convert2bed. The gff2starch script requires starch. Both dependencies are part of a typical BEDOPS installation.

This script also depends on input that follows the GFF3 specification. A GFF3-format validator can be used to confirm that your input conforms to the specification before conversion.

Tip

Converting data that are GFF-like but do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the GFF3 specification. Tools such as the GFF3 Online Validator are useful for this task.

6.3.3.5.2. Source

The gff2bed and gff2starch conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.

6.3.3.5.3. Usage

The gff2bed script parses GFF3 from standard input and prints sorted BED to standard output. The gff2starch script uses an extra step to parse GFF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.
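As a minimal sketch of the two invocations described above, the following assumes that BEDOPS is installed and on your PATH, and that a GFF3 file named foo.gff3 sits in the working directory. The output names foo.bed and foo.starch are placeholders chosen for illustration. The guard only keeps the snippet from failing where those assumptions do not hold.

```shell
#!/bin/sh
# Assumptions: BEDOPS is on PATH and foo.gff3 exists in the current directory.
# The output file names (foo.bed, foo.starch) are illustrative only.
if command -v gff2bed >/dev/null 2>&1 && [ -f foo.gff3 ]; then
    # GFF3 on standard input, sorted BED on standard output
    gff2bed < foo.gff3 > foo.bed

    # Same conversion, written to standard output as a Starch archive
    gff2starch < foo.gff3 > foo.starch

    # unstarch extracts the archive back to BED on standard output
    unstarch foo.starch | head -n 2
else
    echo "BEDOPS or foo.gff3 not found; nothing converted" >&2
fi
```

Because both scripts read standard input and write standard output, they can also sit in the middle of a longer pipeline.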

Header data in a GFF file are discarded by default. If you add the --keep-header option, BED elements are created from these data, using the chromosome name _header to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).

Tip

By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort option. Run the script with the --help option for more details.

Tip

If sorting converted data larger than system memory, use the --max-mem option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G or similar. See --help for more details.
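The options mentioned in these tips might be combined as in the sketch below. It assumes a BEDOPS installation on your PATH and an input file named foo.gff3. The 2G limit and the output file names are placeholders, not recommendations. The deferred sort uses the BEDOPS sort-bed utility.

```shell
#!/bin/sh
# Assumptions: BEDOPS is on PATH and foo.gff3 exists in the current directory.
# The 2G limit and all output file names are illustrative only.
if command -v gff2bed >/dev/null 2>&1 && [ -f foo.gff3 ]; then
    # Convert and sort, capping the memory used by the sort step
    gff2bed --max-mem 2G < foo.gff3 > foo.sorted.bed

    # Convert without sorting, then sort later as a separate step
    gff2bed --do-not-sort < foo.gff3 > foo.unsorted.bed
    sort-bed --max-mem 2G foo.unsorted.bed > foo.sorted-later.bed
else
    echo "BEDOPS or foo.gff3 not found; nothing converted" >&2
fi
```

Unsorted output should be sorted before it is passed to BEDOPS utilities that expect sorted input.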

6.3.3.5.4. Example

To demonstrate these scripts, we use a sample GFF input called foo.gff3 (see the Downloads section to grab this file).

##gff-version 3
chr1    Canada  exon    1300    1300    .       +       .       ID=exon00001;score=1
chr1    USA     exon    1050    1500    .       -       0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
chr1    Canada  exon    3000    3902    .       ?       2       ID=exon00003;score=4;Name=foo
chr1    .       exon    5000    5500    .       .       .       ID=exon00004;Gap=M8 D3 M6 I1 M6
chr1    .       exon    7000    9000    10      +       1       ID=exon00005;Dbxref="NCBI_gi:10727410"

We can convert it to sorted BED data in the following manner:

$ gff2bed < foo.gff3
chr1    1049    1500    exon00002       .       -       USA     exon    0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
chr1    1299    1300    exon00001       .       +       Canada  exon    .       ID=exon00001;score=1;zero_length_insertion=True
chr1    2999    3902    exon00003       .       ?       Canada  exon    2       ID=exon00003;score=4;Name=foo
chr1    4999    5500    exon00004       .       .       .       exon    .       ID=exon00004;Gap=M8 D3 M6 I1 M6
chr1    6999    9000    exon00005       10      +       .       exon    1       ID=exon00005;Dbxref="NCBI_gi:10727410"

The default usage strips the leading pragma, or header (##gff-version 3), but adding the --keep-header option will preserve this as a BED element that uses _header as a chromosome name:

$ gff2bed --keep-header < foo.gff3
_header 0       1       ##gff-version 3
chr1    1049    1500    exon00002       .       -       USA     exon    0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
chr1    1299    1300    exon00001       .       +       Canada  exon    .       ID=exon00001;score=1;zero_length_insertion=True
chr1    2999    3902    exon00003       .       ?       Canada  exon    2       ID=exon00003;score=4;Name=foo
chr1    4999    5500    exon00004       .       .       .       exon    .       ID=exon00004;Gap=M8 D3 M6 I1 M6
chr1    6999    9000    exon00005       10      +       .       exon    1       ID=exon00005;Dbxref="NCBI_gi:10727410"
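If a downstream step expects only genomic elements, the _header rows can be set aside after conversion. The sketch below is one possible way to do that with awk. The function name split_header_rows and the file names header.bed and elements.bed are hypothetical and not part of BEDOPS. The sample rows are piped in with printf so that the snippet runs without a BEDOPS installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of BEDOPS): route rows whose first column
# is _header to one file and all other rows to another.
split_header_rows() {
    # $1: file for _header rows, $2: file for genomic elements
    : > "$1"
    : > "$2"
    awk -F'\t' -v hdr="$1" -v bed="$2" '
        $1 == "_header" { print > hdr; next }
        { print > bed }
    '
}

# Two tab-delimited rows taken from the --keep-header output above
# (the second row is shortened to its first four columns)
printf '_header\t0\t1\t##gff-version 3\nchr1\t1049\t1500\texon00002\n' |
    split_header_rows header.bed elements.bed

cat elements.bed
```

In practice the input would come from gff2bed --keep-header rather than from printf.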

Note

Zero-length insertion elements are given an extra attribute called zero_length_insertion which lets a BED-to-GFF or other parser know that the element will require conversion back to a right-closed element [a, b], where a and b are equal.
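To illustrate the reverse direction, here is a rough sketch of how a parser might turn a converted BED row back into GFF3 column order and 1-based, closed coordinates. The function bed_to_gff3_sketch is hypothetical and is not a BEDOPS tool. It assumes tab-delimited, ten-column input laid out as in the output above, and it assumes the marker attribute is the last item in the attributes column, written as zero_length_insertion=True.

```shell
#!/bin/sh
# Hypothetical helper (not a BEDOPS tool): reverse the gff2bed column order
# and restore 1-based, closed coordinates by adding 1 to the start.
bed_to_gff3_sketch() {
    awk -F'\t' -v OFS='\t' '
        $1 == "_header" { next }          # header rows carry no feature
        {
            attributes = $10
            # Assumption: the marker is the final attribute, as in the output above
            sub(/;zero_length_insertion=True$/, "", attributes)
            # GFF3 order: seqid, source, type, start, end, score, strand, phase, attributes
            print $1, $7, $8, $2 + 1, $3, $5, $6, $9, attributes
        }
    '
}

# The zero-length element from the example, [1299, 1300) in BED
printf 'chr1\t1299\t1300\texon00001\t.\t+\tCanada\texon\t.\tID=exon00001;score=1;zero_length_insertion=True\n' |
    bed_to_gff3_sketch
# → chr1  Canada  exon  1300  1300  .  +  .  ID=exon00001;score=1   (tab-delimited)
```

Adding 1 to the start and leaving the stop unchanged yields [1300, 1300], the right-closed element with equal bounds that the note describes.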

Note

Note the conversion from 1-based to 0-based coordinate indexing in the transition from GFF3 to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here is believed to be convenient for most end users.

6.3.3.5.5. Column mapping

In this section, we describe how GFF3 columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:

GFF3 field             BED column index    BED field
seqid                  1                   chromosome
start                  2                   start
end                    3                   stop
ID (via attributes)    4                   id
score                  5                   score
strand                 6                   strand

The remaining columns are mapped as follows:

GFF3 field             BED column index    BED field
source                 7
type                   8
phase                  9
attributes             10

If we encounter zero-length insertion elements (defined as elements whose GFF3 start and end field values are equal), the start coordinate is decremented to convert to 0-based, half-open indexing, and a zero_length_insertion attribute is added to the attributes GFF3 field value.
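The mapping above can be sketched in a few lines of awk. This is a simplified stand-in for illustration, not the convert2bed implementation: the function name gff3_to_bed_sketch is hypothetical, input is assumed to be tab-delimited with nine columns, the use of "." when no ID attribute is present is an assumption, and the final sort only approximates BEDOPS ordering (chromosome lexicographically, then start and stop numerically).

```shell
#!/bin/sh
# Hypothetical helper: a simplified stand-in for the column mapping described
# above. It is not the convert2bed implementation.
gff3_to_bed_sketch() {
    tab=$(printf '\t')
    awk -F'\t' -v OFS='\t' '
        /^#/   { next }                   # drop pragma and header lines
        NF < 9 { next }                   # skip lines without nine columns
        {
            id = "."                      # assumption: placeholder when ID is absent
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) { id = substr(attrs[i], 4); break }
            }
            attributes = $9
            # Equal start and end marks a zero-length insertion
            if ($4 == $5) { attributes = attributes ";zero_length_insertion=True" }
            # BED order: seqid, start-1, end, ID, score, strand, source, type, phase, attributes
            print $1, $4 - 1, $5, id, $6, $7, $2, $3, $8, attributes
        }
    ' | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n
}

# The pragma and the first two records of the sample input, tab-delimited
printf '##gff-version 3\n' > sketch_input.gff3
printf 'chr1\tCanada\texon\t1300\t1300\t.\t+\t.\tID=exon00001;score=1\n' >> sketch_input.gff3
printf 'chr1\tUSA\texon\t1050\t1500\t.\t-\t0\tID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"\n' >> sketch_input.gff3

gff3_to_bed_sketch < sketch_input.gff3
```

For these two records the sketch prints the exon00002 row first (start 1049), followed by the exon00001 row with start 1299 and the added zero_length_insertion=True attribute, in line with the example output earlier in this section.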

6.3.3.5.6. Downloads