6.3.3.9. rmsk2bed¶
The rmsk2bed
script converts 1-based, closed [start, end]
RepeatMasker annotation output (OUT) to sorted, 0-based, half-open [start-1, end)
extended BED-formatted data.
For convenience, we also offer rmsk2starch
, which performs the extra step of creating a Starch-formatted archive.
6.3.3.9.1. Dependencies¶
The rmsk2bed
script requires convert2bed. The rmsk2starch
script requires starch. Both dependencies are part of a typical BEDOPS installation.
This script is also dependent on input that follows the RepeatMasker annotation output specification, outlined here: http://www.repeatmasker.org/webrepeatmaskerhelp.html.
6.3.3.9.2. Source¶
The rmsk2bed
and rmsk2starch
conversion scripts are part of the binary and source downloads of BEDOPS. See the Installation documentation for more details.
6.3.3.9.3. Usage¶
The rmsk2bed
script parses RepeatMasker annotation output from standard input and prints sorted BED to standard output. The rmsk2starch
script uses an extra step to parse RepeatMasker annotation output to a compressed BEDOPS Starch-formatted archive, which is also directed to standard output.
The header data of a RepeatMasker annotation output file is usually discarded, unless you add the --keep-header
option. In this case, BED elements are created from these data, using the chromosome name _header
to denote content. Line numbers are specified in the start and stop coordinates, and unmodified header data are placed in the fourth column (ID field).
Tip
By default, all conversion scripts now output sorted BED data ready for use with BEDOPS utilities. If you do not want to sort converted output, use the --do-not-sort
option. Run the script with the --help
option for more details.
Tip
If sorting converted data larger than system memory, use the --max-mem
option to limit sort memory usage to a reasonable fraction of available memory, e.g., --max-mem 2G
or similar. See --help
for more details.
6.3.3.9.4. Example¶
To demonstrate these scripts, we use a sample RepeatMasker annotation output input called foo.out
(see the Downloads section to grab this file).
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
...
1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 337 104 20
12279 10.5 2.1 1.7 HSU08988 6782 7718 (21525) C Tigger1 DNA/MER2_type (0) 2418 1486 19
1769 12.9 6.6 1.9 HSU08988 7719 8022 (21221) C AluSx SINE/Alu (0) 317 1 17
12279 10.5 2.1 1.7 HSU08988 8023 8694 (20549) C Tigger1 DNA/MER2_type (932) 1486 818 19
2335 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg SINE/Alu (5) 305 1 18
12279 10.5 2.1 1.7 HSU08988 9001 9695 (19548) C Tigger1 DNA/MER2_type (1600) 818 2 19
721 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A DNA/MER2_type (224) 122 2 20
We can convert it to sorted BED data in the following manner:
$ rmsk2bed < foo.out
HSU08988 6562 6781 MER7A 1320 - 15.6 6.2 0.0 (22462) DNA/MER2_type (0) 337 104 20
HSU08988 6781 7718 Tigger1 12279 - 10.5 2.1 1.7 (21525) DNA/MER2_type (0) 2418 1486 19
HSU08988 7718 8022 AluSx 1769 - 12.9 6.6 1.9 (21221) SINE/Alu (0) 317 1 17
HSU08988 8022 8694 Tigger1 12279 - 10.5 2.1 1.7 (20549) DNA/MER2_type (932) 1486 818 19
HSU08988 8694 9000 AluSg 2335 - 11.1 0.3 0.7 (20243) SINE/Alu (5) 305 1 18
HSU08988 9000 9695 Tigger1 12279 - 10.5 2.1 1.7 (19548) DNA/MER2_type (1600) 818 2 19
HSU08988 9695 9816 MER7A 721 - 21.2 1.4 0.0 (19427) DNA/MER2_type (224) 122 2 20
Note
Use bedops –merge to merge elements, e.g.: rmsk2bed < foo.out | bedops --merge - > merged_repeatmasker_elements.bed
As shown above, we strip the header element, but adding the --keep-header
option will preserve this header as a BED element that uses _header
as a chromosome name:
$ rmsk2bed --keep-header < foo.out
HSU08988 6562 6781 MER7A 1320 - 15.6 6.2 0.0 (22462) DNA/MER2_type (0) 337 104 20
HSU08988 6781 7718 Tigger1 12279 - 10.5 2.1 1.7 (21525) DNA/MER2_type (0) 2418 1486 19
HSU08988 7718 8022 AluSx 1769 - 12.9 6.6 1.9 (21221) SINE/Alu (0) 317 1 17
HSU08988 8022 8694 Tigger1 12279 - 10.5 2.1 1.7 (20549) DNA/MER2_type (932) 1486 818 19
HSU08988 8694 9000 AluSg 2335 - 11.1 0.3 0.7 (20243) SINE/Alu (5) 305 1 18
HSU08988 9000 9695 Tigger1 12279 - 10.5 2.1 1.7 (19548) DNA/MER2_type (1600) 818 2 19
HSU08988 9695 9816 MER7A 721 - 21.2 1.4 0.0 (19427) DNA/MER2_type (224) 122 2 20
_header 0 1 SW perc perc perc query position in query matching repeat position in repeat
_header 1 2 score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
_header 2 3 ...
Note
Note the conversion from 1- to 0-based coordinate indexing, in the transition from RepeatMasker annotation output to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here is believed to be convenient for most end users.
6.3.3.9.5. Column mapping¶
In this section, we describe how RepeatMasker annotation columns are mapped to BED columns. We start with the first six UCSC BED columns as follows:
RepeatMasker annotation field | BED column index | BED field |
---|---|---|
Query sequence | 1 | chromosome |
Query start | 2 | start |
Query end | 3 | stop |
Repeat name | 4 | id |
Smith-Waterman score | 5 | score |
Strand | 6 | strand |
The remaining columns are mapped as follows:
RepeatMasker annotation field | BED column index | BED field |
---|---|---|
Percentage, substitutions | 7 | |
Percentage, deleted bases | 8 | |
Percentage, inserted bases | 9 | |
Bases in query, past match | 10 | |
Repeat class | 11 | |
Bases in complement of the repeat consensus sequence | 12 | |
Match start | 13 | |
Match end | 14 | |
Unique ID | 15 | |
Higher-scoring match (optional) | 16 |
6.3.3.9.6. Downloads¶
- Sample RepeatMasker annotation dataset:
foo.out