4.3. Measuring the frequency of signed distances between SNPs and nearest DHSes

In this example, we would like to find the signed distance between a single nucleotide repeat and the DNase-hypersensitive site nearest to it, as measured in base pairs (bp).

4.3.1. BEDOPS tools in use

To find nearest elements, we will use closest-features with the --dist, --closest, and --no-ref options.

4.3.2. Script

SNPs are in a BED-formatted file called SNPs.bed sorted lexicographically with sort-bed. The DNase-hypersensitive sites are stored in a sorted BED-formatted file called DHSs.bed. These two files are available in the Downloads section.

# author : Eric Rynes
closest-features --dist --closest --no-ref SNPs.bed DHSs.bed \
    | cut -f2 -d '|' \
    | grep -w -F -v -e "NA" \
    > answer.bed

4.3.3. Discussion

The --dist option returns signed distances between input elements and reference elements, --closest chooses the single closest element, and --no-ref keeps SNP coordinates from being printed out.

The output from closest-features contains coordinates and the signed distance to the closest DHS, separated by the pipe (|) character. Such output might look something like this:

chr1    2513240 2513390 MCV-11  97.201400|25

This type of result is chopped up with the standard UNIX utility cut to get at the distances to the closest elements. Finally, we use grep -v to throw out any non-distance, denoted by NA. This can occur if there exists some chromosome in the SNP dataset that does not exist in the DHSs.

Thus, for every SNP, we have a corresponding distance to nearest DHS. As an example, from this data we could build a histogram showing the frequencies of distances-to-nearest-DHS.

4.3.4. Downloads

The closest-features tool can operate directly on Starch-formatted archives. Alternatively, use the unstarch tool to decompress Starch data files to sorted BED format.