6.3.3.7. gvf2bed

The gvf2bed script converts 1-based, closed [start, end] Genome Variation Format (GVF, a type of General Feature Format v3 or GFF3) to sorted, 0-based, half-open [start-1, end) extended BED-formatted data.

For convenience, we also offer gvf2starch, which performs the extra step of creating a Starch-formatted archive.

6.3.3.7.1. Dependencies

The gvf2bed script requires convert2bed. The gvf2starch script requires starch. Both dependencies are part of a typical BEDOPS installation.

This script also depends on input that follows the GFF3 specification. Running your input through a GFF3-format validator helps ensure that it follows the specification.

Tip

Converting data that are GFF-like but do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the GFF3 specification. Tools such as the GFF3 Online Validator are useful for this task.

6.3.3.7.2. Source

The gvf2bed and gvf2starch conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.

6.3.3.7.3. Usage

The gvf2bed script parses GVF from standard input and prints sorted BED to standard output. The gvf2starch script uses an extra step to parse GVF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.

By default, the header data of a GVF file are discarded. If you add the --keep-header option, BED elements are created from these data, using the chromosome name _header to denote header content. Line numbers are used as the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).
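The --keep-header behavior described above can be illustrated with a minimal Python sketch. This is not the convert2bed implementation: the function name header_to_bed is hypothetical, and the sketch assumes that the line number is the 0-based index of the header line in the input, which matches pragmas placed at the top of the file.

```python
def header_to_bed(gvf_lines):
    """Model --keep-header: turn '#' lines into _header BED elements.

    Hypothetical helper for illustration only. Assumes the start coordinate
    is the 0-based index of the line in the input and stop is start + 1.
    """
    bed_elements = []
    for line_index, line in enumerate(gvf_lines):
        line = line.rstrip("\n")
        if not line.startswith("#"):
            continue  # feature records are handled elsewhere
        # chromosome, start, stop, ID (the unmodified header text)
        fields = ["_header", str(line_index), str(line_index + 1), line]
        bed_elements.append("\t".join(fields))
    return bed_elements


if __name__ == "__main__":
    sample = [
        "##gvf-version 1.07\n",
        "##genome-build NCBI B36.3\n",
    ]
    for element in header_to_bed(sample):
        print(element)
```

Using _header as the chromosome name keeps the header text inside the BED stream, so downstream tools that expect BED input do not need a separate header channel.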

Tip

By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort option. Run the script with the --help option for more details.

Tip

If sorting converted data larger than system memory, use the --max-mem option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G or similar. See --help for more details.
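To make "sorted" concrete, here is a small Python sketch of the ordering, assuming it matches the order used by the BEDOPS sort-bed utility: chromosome names compared as plain strings, then start and stop compared numerically. The function name sort_bed_lines is hypothetical, and the sketch sorts in memory, so it does not model --max-mem.

```python
def sort_bed_lines(bed_lines):
    """Sort tab-delimited BED lines by chromosome (string), start, stop.

    Hypothetical helper; assumes the ordering convention described in the
    lead-in and that every line has at least three tab-delimited fields.
    """
    def sort_key(line):
        fields = line.rstrip("\n").split("\t")
        return (fields[0], int(fields[1]), int(fields[2]))

    return sorted(bed_lines, key=sort_key)


if __name__ == "__main__":
    unsorted = [
        "chr2\t10\t20",
        "chr16\t49302124\t49302125",
        "chr16\t49291359\t49291360",
    ]
    for line in sort_bed_lines(unsorted):
        print(line)
```

Because chromosome names are compared as strings, chr16 sorts before chr2 under this assumption, which differs from a "natural" numeric ordering of chromosomes.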

6.3.3.7.4. Example

To demonstrate these scripts, we use a sample GVF input called foo.gvf (see the Downloads section to grab this file).

##gvf-version 1.07
##feature-ontology http://www.sequenceontology.org/resources/obo_files/current_release.obo
##multi-individual NA19240,NA18507,NA12878,NA19238
##genome-build NCBI B36.3
##sequence-region chr16 1 88827254

chr16 dbSNP   SNV     49291360        49291360        .       +       .       ID=ID_2;Variant_seq=C,G;Individual=0,1,2,3;Genotype=0:1,0:0,1:1,0:1;
chr16 dbSNP   SNV     49302125        49302125        .       +       .       ID=ID_3;Variant_seq=C,T;Individual=0,1,3;Genotype=0:1,2:2,0:2;
chr16 dbSNP   SNV     49302365        49302365        .       +       .       ID=ID_4;Variant_seq=G;Individual=0,1;Genotype=0:0,0:0;
chr16 dbSNP   SNV     49302700        49302700        .       +       .       ID=ID_5;Variant_seq=C,T;Individual=2,3;Genotype=0:1,0:0;
chr16 dbSNP   SNV     49303084        49303084        .       +       .       ID=ID_6;Variant_seq=T,G,A;Individual=3;Genotype=1,2:;
chr16 dbSNP   SNV     49303427        49303427        .       +       .       ID=ID_8;Variant_seq=T;Individual=0;Genotype=0:0;
chr16 dbSNP   SNV     49303596        49303596        .       +       .       ID=ID_9;Variant_seq=A,G,T;Individual=0,1,3;Genotype=1:2,3:3,1:3;

We can convert it to sorted BED data in the following manner:

$ gvf2bed < foo.gvf
chr16 49291359        49291360        ID_2    .       +       dbSNP   SNV     .       ID=ID_2;Variant_seq=C,G;Individual=0,1,2,3;Genotype=0:1,0:0,1:1,0:1;zero_length_insertion=True
chr16 49302124        49302125        ID_3    .       +       dbSNP   SNV     .       ID=ID_3;Variant_seq=C,T;Individual=0,1,3;Genotype=0:1,2:2,0:2;zero_length_insertion=True
chr16 49302364        49302365        ID_4    .       +       dbSNP   SNV     .       ID=ID_4;Variant_seq=G;Individual=0,1;Genotype=0:0,0:0;zero_length_insertion=True
chr16 49302699        49302700        ID_5    .       +       dbSNP   SNV     .       ID=ID_5;Variant_seq=C,T;Individual=2,3;Genotype=0:1,0:0;zero_length_insertion=True
chr16 49303083        49303084        ID_6    .       +       dbSNP   SNV     .       ID=ID_6;Variant_seq=T,G,A;Individual=3;Genotype=1,2:;zero_length_insertion=True
chr16 49303426        49303427        ID_8    .       +       dbSNP   SNV     .       ID=ID_8;Variant_seq=T;Individual=0;Genotype=0:0;zero_length_insertion=True
chr16 49303595        49303596        ID_9    .       +       dbSNP   SNV     .       ID=ID_9;Variant_seq=A,G,T;Individual=0,1,3;Genotype=1:2,3:3,1:3;zero_length_insertion=True

As shown, the default usage strips the leading pragmas (##gvf-version 1.07, etc.), but adding the --keep-header option will preserve pragmas as BED elements that use _header as a chromosome name:

$ gvf2bed --keep-header < foo.gvf
_header       0       1       ##gvf-version 1.07
_header       1       2       ##feature-ontology http://www.sequenceontology.org/resources/obo_files/current_release.obo
_header       2       3       ##multi-individual NA19240,NA18507,NA12878,NA19238
_header       3       4       ##genome-build NCBI B36.3
_header       4       5       ##sequence-region chr16 1 88827254
chr16 49291359        49291360        ID_2    .       +       dbSNP   SNV     .       ID=ID_2;Variant_seq=C,G;Individual=0,1,2,3;Genotype=0:1,0:0,1:1,0:1;zero_length_insertion=True
chr16 49302124        49302125        ID_3    .       +       dbSNP   SNV     .       ID=ID_3;Variant_seq=C,T;Individual=0,1,3;Genotype=0:1,2:2,0:2;zero_length_insertion=True
chr16 49302364        49302365        ID_4    .       +       dbSNP   SNV     .       ID=ID_4;Variant_seq=G;Individual=0,1;Genotype=0:0,0:0;zero_length_insertion=True
chr16 49302699        49302700        ID_5    .       +       dbSNP   SNV     .       ID=ID_5;Variant_seq=C,T;Individual=2,3;Genotype=0:1,0:0;zero_length_insertion=True
chr16 49303083        49303084        ID_6    .       +       dbSNP   SNV     .       ID=ID_6;Variant_seq=T,G,A;Individual=3;Genotype=1,2:;zero_length_insertion=True
chr16 49303426        49303427        ID_8    .       +       dbSNP   SNV     .       ID=ID_8;Variant_seq=T;Individual=0;Genotype=0:0;zero_length_insertion=True
chr16 49303595        49303596        ID_9    .       +       dbSNP   SNV     .       ID=ID_9;Variant_seq=A,G,T;Individual=0,1,3;Genotype=1:2,3:3,1:3;zero_length_insertion=True

Note

Zero-length insertion elements are given an extra attribute called zero_length_insertion which lets a BED-to-GVF or other parser know that the element will require conversion back to a right-closed element [a, b], where a and b are equal.
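As a rough illustration of how a downstream parser might use this attribute, the Python sketch below converts a BED interval back to 1-based, closed coordinates. The function name bed_to_closed is hypothetical and is not part of BEDOPS. The sketch assumes attributes are separated by semicolons, and it does not restore a trailing semicolon that the original GVF record may have had.

```python
ZERO_LENGTH_FLAG = "zero_length_insertion=True"


def bed_to_closed(start, stop, attributes):
    """Convert a 0-based, half-open BED interval to 1-based, closed [a, b].

    Hypothetical helper for illustration. If the zero_length_insertion flag
    is present, it is removed and both coordinates are set to the BED stop.
    """
    parts = [part for part in attributes.split(";") if part]
    if ZERO_LENGTH_FLAG in parts:
        parts = [part for part in parts if part != ZERO_LENGTH_FLAG]
        a = b = stop  # right-closed element with equal coordinates
    else:
        a, b = start + 1, stop
    return a, b, ";".join(parts)


if __name__ == "__main__":
    print(bed_to_closed(49291359, 49291360,
                        "ID=ID_2;Variant_seq=C,G;zero_length_insertion=True"))
```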

Note

Note the conversion from 1-based to 0-based coordinate indexing in the transition from GVF to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here should be convenient for most end users.

6.3.3.7.5. Column mapping

In this section, we describe how GVF columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:

GVF field              BED column index   BED field
seqid                  1                  chromosome
start                  2                  start
end                    3                  stop
ID (via attributes)    4                  id
score                  5                  score
strand                 6                  strand

The remaining columns are mapped as follows:

GVF field     BED column index   BED field
source        7
type          8
phase         9
attributes    10

When we encounter zero-length insertion elements (defined as elements whose GVF start and end field values are equal), the start coordinate is decremented to convert to 0-based, half-open indexing, and a zero_length_insertion attribute is added to the attributes field value.
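The column mapping above can be sketched in Python for a single tab-delimited GVF record. This is an illustration under stated assumptions, not the convert2bed source. The function name gvf_record_to_bed is hypothetical, the "." fallback for a missing ID attribute is an assumption, and stripping a trailing semicolon before appending the flag is a guess chosen to reproduce the sample output shown earlier in this document.

```python
def gvf_record_to_bed(line):
    """Map one nine-column GVF record to a ten-column BED line.

    Hypothetical helper for illustration only; it expects exactly nine
    tab-delimited fields.
    """
    (seqid, source, feature_type, start, end,
     score, strand, phase, attributes) = line.rstrip("\n").split("\t")
    start, end = int(start), int(end)

    # BED column 4 comes from the ID attribute; "." is an assumed fallback.
    feature_id = "."
    for pair in attributes.split(";"):
        key, separator, value = pair.partition("=")
        if separator and key == "ID":
            feature_id = value
            break

    # Flag elements whose 1-based start and end are equal.
    if start == end:
        attributes = attributes.rstrip(";") + ";zero_length_insertion=True"

    # 1-based, closed [start, end] becomes 0-based, half-open [start-1, end).
    bed_fields = [seqid, str(start - 1), str(end), feature_id, score, strand,
                  source, feature_type, phase, attributes]
    return "\t".join(bed_fields)


if __name__ == "__main__":
    record = ("chr16\tdbSNP\tSNV\t49302365\t49302365\t.\t+\t.\t"
              "ID=ID_4;Variant_seq=G;Individual=0,1;Genotype=0:0,0:0;")
    print(gvf_record_to_bed(record))
```

Keeping source, type, phase, and attributes in columns 7 through 10 means no GVF information is lost, so the original record can in principle be rebuilt from the BED line.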

6.3.3.7.6. Downloads