starchstrip

The starchstrip utility efficiently pulls out per-chromosome records contained within a BEDOPS Starch-formatted archive and writes the filtered result to a new Starch archive. This utility allows either exclusion or inclusion of one or more specified chromosome names.

Previously, it was necessary to extract records with unstarch, filter them down to the desired set with awk or a similar tool, and recompress the result with starch. In contrast, starchstrip identifies just the pieces of data of interest within an archive and writes them to a new archive with an updated metadata payload, avoiding costly and wasteful extraction and recompression.
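For comparison, the older three-step workflow might have looked something like the sketch below. The keep_chroms function is a hypothetical helper written for this illustration (it is not part of BEDOPS), and the file names in.starch and out.starch are placeholders. The BEDOPS commands are left as comments so that the snippet runs without an archive at hand.

```shell
# Hypothetical helper: read BED records on stdin and keep only those whose
# first column (the chromosome name) appears in the comma-separated list $1.
keep_chroms() {
    awk -v list="$1" '
        BEGIN {
            n = split(list, names, ",")
            for (i = 1; i <= n; i++) keep[names[i]] = 1
        }
        ($1 in keep)'
}

# Older workflow: extract, filter, recompress (requires BEDOPS):
#   unstarch in.starch | keep_chroms chr4,chr8 | starch - > out.starch
#
# The same result with starchstrip, with no extraction or recompression:
#   starchstrip --include chr4,chr8 in.starch > out.starch
```

The filter step only inspects the first column, but every record in the archive still has to be decompressed and recompressed around it, which is the cost that starchstrip avoids.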

Inputs and outputs

Input

The input to starchstrip consists of a BEDOPS Starch-formatted archive file, along with either the --include or the --exclude option to specify inclusion or exclusion of chromosome records from the archive. One or more chromosome names are provided as a comma-separated string.

Note

If the chromosome listing contains chromosome names not in the input archive, they will be ignored.
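When the chromosome names come from a script rather than being typed by hand, the comma-separated string can be assembled with a small helper. The join_chroms function below is a hypothetical convenience for this sketch, not something BEDOPS provides, and in.starch and out.starch are placeholder file names.

```shell
# Hypothetical helper: join all arguments with commas.
# The function body runs in a subshell, so the IFS change does not leak out.
join_chroms() (
    IFS=,
    printf '%s\n' "$*"
)

chroms=$(join_chroms chr4 chr8 chr17)
echo "$chroms"    # prints chr4,chr8,chr17

# With BEDOPS installed, the list can then be passed along:
#   starchstrip --include "$chroms" in.starch > out.starch
```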

Output

The starchstrip tool writes a Starch-formatted archive to the standard output stream, which is usually redirected to a regular file. The output contains the same compressed data as the original file (no extraction or recompression is performed) and so preserves the archive version, compression type, and other archive attributes.

Note

If an older archive’s metadata attributes need updating (for instance, to gain newer metadata features such as data integrity signatures), the starchcat utility should be used to update the archive.

Note

If the specified combination of operation and chromosome names would result in output that is identical to the original file, or output that would be an empty file, starchstrip will exit early with a fatal error.
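A script can anticipate this condition before calling starchstrip. The sketch below is only an approximation of the rule described in the note, not starchstrip's own logic: predict_outcome is a hypothetical helper, and it assumes the archive's chromosome names arrive on standard input one per line (for example, from unstarch --list-chromosomes, assuming that is how the listing is printed).

```shell
# Hypothetical helper: predict whether a filter would keep nothing ("empty"),
# keep everything ("identical"), or keep a proper subset ("ok").
#   $1    : "include" or "exclude"
#   $2    : comma-separated chromosome list
#   stdin : chromosome names present in the archive, one per line
predict_outcome() {
    awk -v mode="$1" -v list="$2" '
        BEGIN {
            n = split(list, names, ",")
            for (i = 1; i <= n; i++) wanted[names[i]] = 1
        }
        { total++; if ($1 in wanted) matched++ }
        END {
            # Names in the list that are absent from the archive never match,
            # so they have no effect on the counts.
            kept = (mode == "include") ? matched + 0 : total - matched
            if (kept == 0)                print "empty"
            else if (kept == total + 0)   print "identical"
            else                          print "ok"
        }'
}

# Possible use with BEDOPS installed (in.starch is a placeholder):
#   unstarch --list-chromosomes in.starch | predict_outcome include chr4,chr8
```

Only an "ok" result corresponds to a filter that removes some chromosomes and keeps others; the other two results correspond to the cases the note describes.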

Usage

Use the --help option to list all options:

starchstrip
  citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
  version:  2.4.29 (typical)
  authors:  Alex Reynolds and Shane Neph

USAGE: starchstrip [ --include | --exclude ] <chromosome-list> <starch-file>

    * Add either the --include or --exclude argument to filter the specified
      <starch-file> for chromosomes in <chromosome-list> for inclusion or
      exclusion, respectively. Note that you can only specify either inclusion
      or exclusion.

    * The <chromosome-list> argument is a comma-separated list of chromosome names
      to be included or excluded. This list is a *required* argument to either of the
      two --include and --exclude options.

    * The output is a Starch archive containing those chromosomes specified for inclusion
      or what chromosomes remain after exclusion from the original <starch-file>. A new
      metadata payload is appended to the output Starch archive.

    * The output is written to the standard output stream -- use the output redirection
      operator to write the result to a regular file, e.g.:

        $ starchstrip --exclude chrN in.starch > out.starch

    * Filtering simply copies over raw bytes from the input Starch archive and
      no extraction or recompression is performed. Use 'starchcat' to update the
      metadata, if new attributes are required.

    Process Flags
    --------------------------------------------------------------------------
    --include <chromosome-list>     Include specified chromosomes from <starch-file>.

    --exclude <chromosome-list>     Exclude specified chromosomes from <starch-file>.

    --version                       Show binary version.

    --help                          Show this usage message.

Example

Let’s say we have an archive containing 24 chromosomes, one for each chromosome of the human genome: chr1, chr2, and so on, through chrX and chrY. (To simplify this example, we leave out mitochondrial, random, pseudo- and other chromosomes.) Suppose we want a new Starch archive that contains only chromosomes chr4, chr8, and chr17. We can use starchstrip to efficiently write out a new archive with just those three chromosomes:

$ starchstrip --include chr4,chr8,chr17 humanGenome.starch > humanGenome.chrs4_8_and_17.starch

The starchstrip utility parses the metadata from the input humanGenome.starch and uses its details to decide how to write out the subset of chromosomes, along with a metadata payload specific to the three chromosomes. No extraction or recompression is performed; this is as fast as copying just the parts of the file we are interested in.

As a second example, we can instead use the --exclude option to copy over all chromosomes except those we choose. To continue the example above, we can get the “inverse” of humanGenome.chrs4_8_and_17.starch with the following:

$ starchstrip --exclude chr4,chr8,chr17 humanGenome.starch > humanGenome.all_chrs_except_chrs4_8_and_17.starch