6.3.2.3. starchcat

The starchcat utility efficiently merges per-chromosome records contained within one or more BEDOPS Starch-formatted archives. This is an equivalent operation to bedops --everything or bedops -u (a multiset union), but inputs are starch archives rather than uncompressed BED files.

As a further advantage to using this over bedops, in the case where a starch input contains BED elements exclusive to one chromosome, this utility will directly and quickly copy over compressed elements to a new archive, avoiding the need for costly and wasteful extraction and re-compression.

In the general case, where two or more starch inputs contain BED elements from the same chromosome, a sorted merge is performed and the stream reprocessed into a Starch-formatted archive.

6.3.2.3.1. Parallelization

Those with access to a computational cluster such as an Oracle/Sun Grid Engine or a group of hosts running SSH services should find starchcat highly useful, as this facilitates:

To demonstrate the first application of this utility, we have packaged a helper script with the BEDOPS suite called starchcluster, which archives data much faster than starch alone. By distributing work across the nodes of a computational cluster, the upper bound on compression time is almost entirely determined by the largest chromosome, reducing compression time by an order of magnitude.

6.3.2.3.2. Inputs and outputs

6.3.2.3.2.1. Input

The input to starchcat consists of one or more BEDOPS Starch-formatted archive files.

Note

If a single archive is provided as input, it may be reprocessed with specified options. When two or more archives are specified, the output will be the equivalent of a multiset union of the inputs.

Note

This utility does not accept standard input.

6.3.2.3.2.2. Output

The starchcat tool outputs a starch -formatted archive to standard output, which is usually redirected to a file.

Additionally, an optional compression flag specifies if the final starch output should be compressed with either the bzip2 or gzip method (the default being bzip2).

Note

If starch inputs use a different backend compression method, the input stream is re-compressed before integrated into the larger archive. This will incur extra processing overhead.

6.3.2.3.3. Usage

Use the --help option to list all options:

starchcat
 citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
 version:  2.4.36 (typical)
 authors:  Alex Reynolds and Shane Neph

USAGE: starchcat [ --note="..." ]
                 [ --bzip2 | --gzip ]
                 [ --omit-signature ]
                 [ --report-progress=N ] <starch-file-1> [<starch-file-2> ...]

    * At least one lexicographically-sorted, headerless starch archive is
      required.

    * While two or more inputs make sense for a multiset union operation, you
      can starchcat one file in order to update its metadata, recompress it
      with a different backend method, or add a note annotation.

    * Compressed data are sent to standard output. Use the '>' operator to
      redirect to a file.

    Process Flags
    --------------------------------------------------------------------------
    --note="foo bar..."   Append note to output archive metadata (optional).

    --bzip2 | --gzip      Specify backend compression type (optional, default
                          is bzip2).

    --omit-signature      Skip generating per-chromosome data integrity signature
                          (optional, default is to generate signature).

    --report-progress=N   Report compression progress every N elements per
                          chromosome to standard error stream (optional)

    --version             Show binary version.

    --help                Show this usage message.

6.3.2.3.3.1. Per-chromosome data integrity signature

By default, a data integrity signature is generated for each chromosome. This can be used to verify if chromosome streams from two or more Starch archives are identical, or used to test the integrity of a chromosome, to identify potential data corruption.

Generating this signature adds to the computational cost of compression, or an integrity signature may not be useful for all archives. Add the --omit-signature option, if the compression time is too high or the data integrity signature is not needed.

6.3.2.3.3.2. Example

Let’s say we have a set of 23 starch archives, one for each chromosome of the human genome: chr1.starch, chr2.starch, and so on, to chrY.starch. (To simplify this example, we leave out mitochondrial, random, pseudo- and other chromosomes.) We would like to build a new starch archive from these 23 separate files:

$ starchcat chr1.starch chr2.starch ... chrY.starch > humanGenome.starch

The starchcat utility parses the metadata from each of the 23 inputs, determines what data to either simple copy or reprocess, and then it performs the merge. Cleanup is performed afterwards, as necessary, and the output is a brand new starch file, written to humanGenome.starch.

Note

No filtering or processing is performed on extracted BED elements, before they are written to the final output. Thus, it is possible for duplicate BED elements to occur. It would be easy to use the --signature option to validate the expected content of a new Starch archive.

However, the final archive is sorted per sort-bed ordering, so that data extracted from this archive will be ready for use with BEDOPS utilities.

Note

When input archives contain data on disjoint chromosomes, use of starchcat is very efficient as data are simply copied, instead of extracted and re-compressed.