6.3.3.8. psl2bed

The psl2bed script converts 0-based, half-open [start-1, end) Pattern Space Layout (PSL) to sorted, 0-based, half-open [start-1, end) extended BED-formatted data.

For convenience, we also offer psl2starch, which performs the extra step of creating a Starch-formatted archive.

6.3.3.8.1. Dependencies

The psl2bed script requires convert2bed. The psl2starch script requires starch. Both dependencies are part of a typical BEDOPS installation.

This script is also dependent on input that follows the PSL specification.

Tip

Conversion of data which are PSL-like, but which do not follow the specification can cause parsing issues. If you run into problems, please check that your input follows the PSL specification.

6.3.3.8.2. Source

The psl2bed and psl2starch conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.

6.3.3.8.3. Usage

The psl2bed script parses PSL from standard input and prints sorted BED to standard output. The psl2starch script uses an extra step to parse GFF to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.

The header data of a headered PSL file is usually discarded, unless you add the --keep-header option. In this case, BED elements are created from these data, using the chromosome name _header to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).

If your data contains a record with multiple blocks (block count is greater than one, and the tStarts field has multiple target start positions), you can use the --split option to print that record to separate BED elements, each with a start position defined by tStarts and a length defined by the associated value in the blockSizes string.

Tip

By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort option. Run the script with the --help option for more details.

Tip

If you are sorting data larger than system memory, use the --max-mem option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G or similar. See --help for more details.

6.3.3.8.4. Example

To demonstrate these scripts, we use a sample PSL input called foo.psl (see the Downloads section to grab this file).

psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
35      0       0       0       0       0       0       0       +       foo     50      15      50      chrX    155270560       40535836        40535871        1       35,     15,     40535836,
34      2       0       0       0       0       0       0       +       foo     50      14      50      chrX    155270560       68019028        68019064        1       36,     14,     68019028,
33      2       0       0       0       0       0       0       +       foo     50      14      49      chrX    155270560       43068135        43068170        1       35,     14,     43068135,
35      2       0       0       0       0       0       0       +       foo     50      13      50      chr8    146364022       131572122       131572159       1       37,     13,     131572122,
30      0       0       0       0       0       0       0       +       foo     50      14      44      chr6    171115067       127685756       127685786       1       30,     14,     127685756,
30      0       0       0       0       0       0       0       +       foo     50      14      44      chr6    171115067       93161871        93161901        1       30,     14,     93161871,
31      0       0       0       0       0       0       0       +       foo     50      13      44      chr5    180915260       119897315       119897346       1       31,     13,     119897315,
30      0       0       0       0       0       0       0       +       foo     50      14      44      chr5    180915260       1232.4.29       1232.4.295       1       30,     14,     1232.4.29,
...

We can convert it to sorted BED data in the following manner:

$ psl2bed < foo.psl
chr1    30571100        30571135        foo     50      -       35      0       0       0       0       0       0       0       15      50      249250621       1       35,     0,      30571100,
chr1    69592160        69592195        foo     50      -       34      1       0       0       0       0       0       0       15      50      249250621       1       35,     0,      69592160,
chr1    107200050       107200100       foo     50      +       50      0       0       0       0       0       0       0       0       50      249250621       1       50,     0,      107200050,
chr11   12618347        12618389        foo     50      +       39      3       0       0       0       0       0       0       8       50      135006516       1       42,     8,      12618347,
chr11   32933028        32933063        foo     50      +       35      0       0       0       1       1       0       0       8       44      135006516       2       4,31,   8,13,   32933028,32933032,
chr11   80116421        80116457        foo     50      +       35      1       0       0       0       0       0       0       14      50      135006516       1       36,     14,     80116421,
chr11   133952291       133952327       foo     50      +       34      2       0       0       0       0       0       0       14      50      135006516       1       36,     14,     133952291,
chr13   99729482        99729523        foo     50      +       39      2       0       0       0       0       0       0       8       49      115169878       1       41,     8,      99729482,
chr13   111391852       111391888       foo     50      +       34      2       0       0       0       0       0       0       14      50      115169878       1       36,     14,     111391852,
chr16   8149657 8149694 foo     50      +       36      1       0       0       0       0       0       0       13      50      90354753        1       37,     13,     8149657,
...

As you see here, the header data of a headered PSL file is discarded, unless you add the --keep-header option. In this case, BED elements are created from these data, using the chromosome name _header to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).

Here is a demonstration of conversion of the same headered input, adding the --keep-header option:

$ psl2bed --keep-header < foo.psl
_header 0       1       psLayout version 3
_header 1       2
_header 2       3       match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes      qStarts  tStarts
_header 3       4       match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
_header 4       5       ---------------------------------------------------------------------------------------------------------------------------------------------------------------
chr1    30571100        30571135        foo     50      -       35      0       0       0       0       0       0       0       15      50      249250621       1       35,     0,      30571100,
chr1    69592160        69592195        foo     50      -       34      1       0       0       0       0       0       0       15      50      249250621       1       35,     0,      69592160,
chr1    107200050       107200100       foo     50      +       50      0       0       0       0       0       0       0       0       50      249250621       1       50,     0,      107200050,
chr11   12618347        12618389        foo     50      +       39      3       0       0       0       0       0       0       8       50      135006516       1       42,     8,      12618347,
chr11   32933028        32933063        foo     50      +       35      0       0       0       1       1       0       0       8       44      135006516       2       4,31,   8,13,   32933028,32933032,
chr11   80116421        80116457        foo     50      +       35      1       0       0       0       0       0       0       14      50      135006516       1       36,     14,     80116421,
chr11   133952291       133952327       foo     50      +       34      2       0       0       0       0       0       0       14      50      135006516       1       36,     14,     133952291,
chr13   99729482        99729523        foo     50      +       39      2       0       0       0       0       0       0       8       49      115169878       1       41,     8,      99729482,
chr13   111391852       111391888       foo     50      +       34      2       0       0       0       0       0       0       14      50      115169878       1       36,     14,     111391852,
chr16   8149657 8149694 foo     50      +       36      1       0       0       0       0       0       0       13      50      90354753        1       37,     13,     8149657,
...

With this option, the psl2bed and psl2starch scripts are completely “non-lossy”. Use of awk or other scripting tools can munge these data back into a PSL-formatted file.

This example PSL file contains one record with a block count of 2. If we were to add the --split option, this record would be split into two separate BED elements that have start positions 32933028 and 32933032, with lengths 4 and 31, respectively. These elements fall within the genomic range already defined by the tStart and tEnd fields (32933028 and 32933063).

Note

The psl2bed and psl2starch scripts work with headered or headerless PSL data.

Note

By default, the psl2bed and psl2starch scripts assume that PSL data do not need splitting. If you expect your data to contain multiple blocks, add the --split option.

6.3.3.8.5. Column mapping

In this section, we describe how PSL columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:

PSL field BED column index BED field
tName 1 chromosome
tStart(*) 2 start
tEnd(*) 3 stop
qName 4 id
matches 5 score
strand 6 strand

The remaining PSL columns are mapped, in order, to the remaining columns of the BED output:

PSL field BED column index BED field
qSize 7  
misMatches 8  
repMatches 9  
nCount 10  
qNumInsert 11  
qBaseInsert 12  
tNumInsert 13  
tBaseInsert 14  
qStart 15  
qEnd 16  
tSize 17  
blockCount 18  
blockSizes 19  
qStarts 20  
tStarts 21  

This is a lossless mapping. Because we have mapped all columns, we can translate converted BED data back to headerless PSL with a simple awk statement that permutes columns to PSL-based ordering:

$ awk 'BEGIN { OFS = "\t" } \
  { \
    print $5" "$8" "$9" "$10" "$11" "$12" "$13" "$14" "$6" "$4" "$7" "$15" "$16" "$1" "$17" "$2" "$3" "$18" "$19" "$20" "$21 }' converted.bed > original.psl

In the case where the --split option is added, the tStart and tEnd fields are replaced with each of the values in the larger tStarts string, added to the respective values in the larger blockSizes string. This is still a lossless conversion, but modifications to the awk script printed above would be required to rebuild the original PSL.

6.3.3.8.6. Downloads