6.1.2. bedextract

The bedextract utility performs three primary tasks, with the goal of doing them very quickly:

  1. Lists all the chromosomes in a sorted input BED file.
  2. Extracts all the elements in a sorted input BED file, for a given chromosome.
  3. Finds elements of one BED file, which overlap elements in a second, reference BED file (when specific element criteria are satisfied).

One might ask why use this utility, when the first two tasks can already be performed with common UNIX text processing tools, such as cut, sort, uniq, and awk, and the third task can be performed with bedops with the --element-of 1 options?

The bedextract utility does the work of all those tools without streaming through an entire BED file, resulting in massive performance improvements. By using the hints provided by sorted BED input, the bedextract tool can jump around, seeking very quick answers to these questions about your data.

6.1.2.1. How it works

Specifically, sorting with sort-bed allows us to perform a binary search:

  1. We jump to the middle byte of the BED file, stream to the nearest element, then parse and test the chromosome name.
  2. Either we have a match, or we jump to the middle of the remaining left or right half (decided by dictionary order), parse and test again.
  3. We repeat steps 1 and 2 until we have matches that define the bounds of the target chromosome.
../../../_images/reference_bedextract_mechanism.png

To indicate the kind of speed gain that the bedextract tool provides, in local testing, a naïve listing of chromosomes from a 36 GB BED input using UNIX cut and uniq utilities took approximately 20 minutes to complete on a typical Core 2 Duo-based Linux workstation. Retrieval of the same chromosome listing with bedextract --list-chr took only 2 seconds (cache flushed—no cheating!).

Tip

While listing chromosomes is perhaps a trivial task, 1200 seconds to 2 seconds is a 600-fold speedup. Similar improvements are gained from using --chrom and --faster options with other core BEDOPS tools like bedops and bedmap. If your data meet the criteria for using this approach—and a lot of genomic datasets do—we strongly encourage adding this to your toolkit.

6.1.2.2. Inputs and outputs

6.1.2.2.1. Input

Depending on specified options, bedextract requires one or two sorted BED files.

Note

It is critical that inputs are sorted as the information in a sorted file allows bedextract to do its work correctly. If your datasets are output from other BEDOPS tools, then they are already sorted!

6.1.2.2.2. Output

Depending on specified options, the bedextract program will send a list of chromosomes or BED elements to standard output.

Tip

The use of UNIX-like standard streams allows easy downstream analysis or post-processing with other tools and scripts, including other BEDOPS utilities.

6.1.2.3. Usage

The --help option describes the functionality available to the end user:

bedextract
  citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
  version:  2.4.27 (typical)
  authors:  Shane Neph & Alex Reynolds

    Every input file must be sorted per sort-bed.

 USAGE:
   0) --help or --version           Print requested info and exit successfully.
   1) --list-chr <input.bed>        Print all unique chromosome names found in <input.bed>.
   2) <chromosome> <input.bed>      Retrieve all rows for chr8 with:  bedextract chr8 <input.bed>.
   3) <query.bed> <target>          Grab elements from the <query.bed> that overlap elements in <target>. Same as
                                     `bedops -e 1 <query.bed> <target>`, except that this option fails silently
                                      if <query.bed> contains fully-nested BED elements.  If no fully-nested
                                      element exists, bedextract can vastly improve upon the performance of bedops.
                                      <target> may be a BED or Starch file (with or without fully-nested elements).
                                      Using '-' for <target> indicates input (in BED format) comes from stdin.

6.1.2.3.1. Listing chromosomes

Use the --list-chr option to quickly retrieve a listing of chromosomes from a given sorted BED input.

For example, the following lists the chromosomes in an example BED file of FIMO motif hits (see the Downloads section):

$ bedextract --list-chr motifs.bed
chr1
chr10
chr11
chr12
...
chr9
chrX

Note

The bedextract --list-chr operation only works on BED files. If you have a Starch file, use unstarch --list-chr to list its chromosomes.

6.1.2.3.2. Retrieving elements from a specific chromosome

To quickly retrieve the subset of elements from a sorted BED file associated with a given chromosome, apply the second usage case and specify the chromosome as the argument.

For example, to retrieve chrX from the same motif sample:

$ bedextract chrX motifs.bed
chrX    6775077 6775092 +V_SPZ1_01      4.92705e-06     +       GTTGGAGGGAAGGGC
chrX    6775168 6775179 +V_ELF5_01      8.57585e-06     +       TCAAGGAAGTA
chrX    6777790 6777799 +V_CKROX_Q2     8.90515e-06     +       TCCCTCCCC
...

Note

The bedextract <chromosome> operation only works on BED files. If you have a Starch file, use unstarch <chromosome> to list the elements associated with that chromosome.

6.1.2.3.3. Retrieving elements which overlap target elements

A common bedops query involves asking which elements overlap one or more bases between two BED datasets, which we will call here Query and Target.

One can already use bedops --element-of 1 to accomplish this task, but if certain specific criteria are met (which we will describe shortly) then a much faster result can often be obtained by instead using bedextract.

Three criteria make the use of bedextract in this mode very successful in practice, with potentially massive speed improvements:

  1. Query is a huge file.
  2. There are relatively few regions of interest in Target (say, roughly 30,000 or fewer).
  3. There are no fully-nested elements in Query (but duplicate coordinates are fine).

Note

With some extra work, it is possible to use this mode of bedextract with a huge Query BED file that includes fully-nested elements. The technique requires that you create a merged version of Query and keep that result, Query-Index, around along with Query.

$ bedops -m Query > Query-Index
$ bedextract Query-Index Target \
    | bedextract Query - \
    | bedops --element-of 1 - Target \
    > answer.bed

Note

You may change the final overlap criterion to the bedops –element-of as you see fit for your final answer. Adding –range 10, for example, to bedops -m Query > Query-Index can often greatly reduce the size of Query-Index without much of an increase in the running time of this approach.

6.1.2.3.3.1. What are nested elements?

For a precise definition of a nested element, refer to the documentation on nested elements.

For an example, we show the following sorted BED file:

chr1    1      100
chr1    100    200
chr1    125    150
chr1    150    1000

In this sorted dataset, the element chr1:125-150 is entirely nested within chr1:100-200:

../../../_images/reference_bedextract_nested_elements.png

Note

Fully-nested elements are not a problem for the other two bedextract features: 1) Listing all chromosomes, and 2) Retrieving all information for a single chromosome.

Fully-nested elements are only an issue for bedextract if they exist in the Query dataset. Results are not affected if the Target dataset contains nested elements. Overlapping (but not fully-nested) elements in the Query input file are fine, as are duplicated genomic positions.

Note

Our lab works with BED data of various types: cut-counts, hotspots, peaks, footprints, etc. These data generally do not contain nested elements and so are amenable to use with bedextract for extracting overlapping elements.

However, other types of Query datasets can be problematic. FIMO search results, for example, might cause trouble, where the boundaries of one motif hit can be contained within another larger hit. Or paired-end sequence data, where tags are not of a fixed length. Be sure to consider the makeup of your BED data before using bedextract.

6.1.2.3.3.2. Demonstration

To demonstrate this use of bedextract, for our Query dataset we will use the Map example from our bedmap documentation, which contains raw DNaseI hypersensitivity signal from a human K562 cell line (see the Downloads section for sample data):

$ cat query.bed
chr21   33031165        33031185        map-1   1.000000
chr21   33031185        33031205        map-2   3.000000
chr21   33031205        33031225        map-3   3.000000
chr21   33031225        33031245        map-4   3.000000
...
chr21   33032445        33032465        map-65  5.000000
chr21   33032465        33032485        map-66  6.000000

Our Target data is simply an ad-hoc BED region which overlaps part of the Query dataset, stored in a Starch-formatted archive:

$ unstarch target.starch
chr21   33031600        33031700

We can now ask which elements of Query overlap the element in Target:

$ bedextract query.bed target.starch
chr21   33031585        33031605        map-22  26.000000
chr21   33031605        33031625        map-23  27.000000
chr21   33031625        33031645        map-24  29.000000
chr21   33031645        33031665        map-25  31.000000
chr21   33031665        33031685        map-26  31.000000
chr21   33031685        33031705        map-27  37.000000

Our Target dataset is a Starch-formatted file. Note that we can also use “-” to denote standard input for the Target dataset, as well as a regular BED- or Starch-formatted file. In other words, we can pipe target elements from another process to bedextract, e.g. we can query for an ad-hoc element as follows:

$ echo -e "chr21\t33031590\t33031600" | bedextract query.bed -
chr21   33031585        33031605        map-22  26.000000

Instead of an ad-hoc element as in this example, however, target elements could just as easily be piped in from upstream bedmap or bedops operations, or extracted elements from a Starch archive, etc.

Tip

The output of this particular use of bedextract is made up of elements from the Query dataset and is therefore sorted BED data, which can be piped to bedops, bedmap and other BEDOPS utilities for further downstream processing.

Note

Though bedextract only supports the overlap equivalent of bedops --element-of 1, other overlap criteria are efficiently supported by combining bedextract with bedops.

Specifically, we can quickly filter through just the results given by bedextract and implement other overlap criteria with bedops, e.g.:

$ bedextract query.bed target.bed | bedops -e 50% - target.bed

6.1.2.4. Downloads