6.3.2.3. starchcat¶
The starchcat
utility efficiently merges per-chromosome records contained within one or more BEDOPS Starch-formatted archives. This is an equivalent operation to bedops --everything
or bedops -u
(a multiset union), but inputs are starch archives rather than uncompressed BED files.
As a further advantage to using this over bedops, in the case where a starch input contains BED elements exclusive to one chromosome, this utility will directly and quickly copy over compressed elements to a new archive, avoiding the need for costly and wasteful extraction and re-compression.
In the general case, where two or more starch inputs contain BED elements from the same chromosome, a sorted merge is performed and the stream reprocessed into a Starch-formatted archive.
6.3.2.3.1. Parallelization¶
Those with access to a computational cluster such as an Oracle/Sun Grid Engine or a group of hosts running SSH services should find starchcat highly useful, as this facilitates:
- Much faster compression of an entire genome of BED data, using nodes of a computational cluster to compress separate chromosomes, followed by a collation step with starchcat (see the Efficiently creating Starch-formatted archives with a cluster documentation).
- Extraction, manipulation and reintegration of a starch -ed chromosome into a larger starch archive
- Refreshing metadata or re-compressing the data within a lone starch archive.
To demonstrate the first application of this utility, we have packaged a helper script with the BEDOPS suite called starchcluster, which archives data much faster than starch alone. By distributing work across the nodes of a computational cluster, the upper bound on compression time is almost entirely determined by the largest chromosome, reducing compression time by an order of magnitude.
6.3.2.3.2. Inputs and outputs¶
6.3.2.3.2.1. Input¶
The input to starchcat consists of one or more BEDOPS Starch-formatted archive files.
Note
If a single archive is provided as input, it may be reprocessed with specified options. When two or more archives are specified, the output will be the equivalent of a multiset union of the inputs.
Note
This utility does not accept standard input.
6.3.2.3.2.2. Output¶
The starchcat tool outputs a starch -formatted archive to standard output, which is usually redirected to a file.
Additionally, an optional compression flag specifies if the final starch output should be compressed with either the bzip2
or gzip
method (the default being bzip2
).
Note
If starch inputs use a different backend compression method, the input stream is re-compressed before integrated into the larger archive. This will incur extra processing overhead.
6.3.2.3.3. Usage¶
Use the --help
option to list all options:
starchcat
citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
version: 2.4.36 (typical)
authors: Alex Reynolds and Shane Neph
USAGE: starchcat [ --note="..." ]
[ --bzip2 | --gzip ]
[ --omit-signature ]
[ --report-progress=N ] <starch-file-1> [<starch-file-2> ...]
* At least one lexicographically-sorted, headerless starch archive is
required.
* While two or more inputs make sense for a multiset union operation, you
can starchcat one file in order to update its metadata, recompress it
with a different backend method, or add a note annotation.
* Compressed data are sent to standard output. Use the '>' operator to
redirect to a file.
Process Flags
--------------------------------------------------------------------------
--note="foo bar..." Append note to output archive metadata (optional).
--bzip2 | --gzip Specify backend compression type (optional, default
is bzip2).
--omit-signature Skip generating per-chromosome data integrity signature
(optional, default is to generate signature).
--report-progress=N Report compression progress every N elements per
chromosome to standard error stream (optional)
--version Show binary version.
--help Show this usage message.
6.3.2.3.3.1. Per-chromosome data integrity signature¶
By default, a data integrity signature is generated for each chromosome. This can be used to verify if chromosome streams from two or more Starch archives are identical, or used to test the integrity of a chromosome, to identify potential data corruption.
Generating this signature adds to the computational cost of compression, or an integrity signature may not be useful for all archives. Add the --omit-signature
option, if the compression time is too high or the data integrity signature is not needed.
6.3.2.3.3.2. Example¶
Let’s say we have a set of 23 starch archives, one for each chromosome of the human genome: chr1.starch
, chr2.starch
, and so on, to chrY.starch
. (To simplify this example, we leave out mitochondrial, random, pseudo- and other chromosomes.) We would like to build a new starch archive from these 23 separate files:
$ starchcat chr1.starch chr2.starch ... chrY.starch > humanGenome.starch
The starchcat utility parses the metadata from each of the 23 inputs, determines what data to either simple copy or reprocess, and then it performs the merge. Cleanup is performed afterwards, as necessary, and the output is a brand new starch file, written to humanGenome.starch
.
Note
No filtering or processing is performed on extracted BED elements, before they are written to the final output. Thus, it is possible for duplicate BED elements to occur. It would be easy to use the --signature
option to validate the expected content of a new Starch archive.
However, the final archive is sorted per sort-bed ordering, so that data extracted from this archive will be ready for use with BEDOPS utilities.
Note
When input archives contain data on disjoint chromosomes, use of starchcat is very efficient as data are simply copied, instead of extracted and re-compressed.