  _________  ___  ____  _____
 / ___/ __ \/ _ \/ __ \/ ___/
/ /__/ /_/ / // / /_/ / /__  
Welcome to the CODOC project


CODOC is an open file format and API for the lossless and lossy compression of depth-of-coverage (DOC) signals stemming from high-throughput sequencing (HTS) experiments. DOC data is a one-dimensional signal representing the number of reads covering each reference nucleotide in an HTS data set. DOC is highly valuable information for interpreting (e.g., RNA-seq), filtering (e.g., SNP analysis) or detecting novel features (e.g., structural variants) in sequencing data sets.
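
To make the definition of a DOC signal concrete, here is a minimal Python sketch that counts how many reads cover each reference position. It is an illustration only and not CODOC code: the function name and the representation of reads as half-open (start, end) intervals are assumptions, and real alignments (CIGAR operations, spliced reads) need more care.

```python
def depth_of_coverage(reads, ref_length):
    """Return per-position coverage for reads given as half-open (start, end) intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, ref_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum over the difference array yields the coverage signal.
    coverage = []
    depth = 0
    for delta in diff[:ref_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Three hypothetical reads on a reference of length 8.
print(depth_of_coverage([(0, 4), (2, 6), (3, 5)], 8))  # [1, 1, 2, 3, 2, 1, 0, 0]
```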

CODOC exploits several characteristics of DOC data stemming from HTS data and uses a non-uniform quantization model that preserves DOC signals in low-coverage regions nearly perfectly while allowing more divergence in highly covered regions.
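
The idea of a non-uniform quantization can be illustrated with a small Python sketch. The power-of-two bucketing used here is an assumption chosen for brevity and is not claimed to be the model CODOC implements; it only shows how low coverage values can stay (nearly) exact while high values are allowed to diverge more.

```python
def quantize_log2(depth):
    """Map a coverage value to the lower bound of its power-of-two bucket."""
    if depth < 0:
        raise ValueError("coverage cannot be negative")
    if depth == 0:
        return 0
    # Buckets are [1, 2), [2, 4), [4, 8), ... so they widen as coverage grows.
    return 1 << (depth.bit_length() - 1)


for depth in (0, 1, 2, 3, 5, 100, 1000):
    quantized = quantize_log2(depth)
    print(depth, quantized, depth - quantized)  # value, quantized value, absolute error
```

In this sketch the absolute error is at most 1 for coverage values up to 3, while a coverage of 1000 is stored as 512.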

CODOC reduces required file sizes for DOC data by up to 3500X when compared to raw representations and by about 4-32X when compared to other methods. The CODOC API supports efficient sequential and random access to compressed data sets.

Some usage scenarios for CODOC:

  • extract strand-specific coverage signal (export as WIG file)
  • query coverage values at given genomic positions
  • extract genomic regions with a given minimum/maximum coverage
  • filter single-nucleotide variations in cancer/normal pairs
  • pairwise combinations of coverage files (e.g., subtract one signal from another, calculate minimum signal, etc.)
  • cross-correlation of coverage signals
  • compression of DOC data (e.g., for archiving or transmission). Note that successive compression/decompression of DOC data with CODOC can be used for intentional data quantization.
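
As an illustration of the pairwise-combination scenario, the following Python sketch combines two already decompressed coverage signals position by position. The function name, the plain-list representation and the sample values are assumptions made for this example; it does not use the CODOC API.

```python
def combine(signal_a, signal_b, operation):
    """Combine two equally long coverage signals position by position."""
    if len(signal_a) != len(signal_b):
        raise ValueError("signals must cover the same positions")
    return [operation(a, b) for a, b in zip(signal_a, signal_b)]


# Hypothetical coverage values for a cancer/normal pair.
cancer = [10, 12, 30, 8]
normal = [9, 12, 10, 9]

print(combine(cancer, normal, lambda a, b: a - b))  # subtract one signal from another
print(combine(cancer, normal, min))                 # minimum signal
```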

Getting CODOC

CODOC source code and releases can be downloaded from GitHub. CODOC uses Maven 2 as its build tool. For development, we recommend the Eclipse IDE for Java developers and the m2e Maven Integration for Eclipse.

The CODOC jars can be built with bin/ <VERSION> (version is, e.g., 0.0.1)

Usage Examples

Call the CODOC API without parameters to get short usage information for the individual commands:

$ java -jar bin/codoc-0.0.1.jar
$ java -jar bin/codoc-0.0.1.jar compress
$ java -jar bin/codoc-0.0.1.jar decompress
$ java -jar bin/codoc-0.0.1.jar tools

The following command compresses coverage data extracted from a SAM/BAM file using a regular grid of height 5 for quantization. Only reads that have the SAM flag for PCR or optical duplicates (1024) unset and a mapping quality > 20 are included.

$ java -jar bin/codoc-0.0.1.jar compress -bam <BAM> -qmethod GRID -qparam 5 -filter "FLAGS^^1024" -filter "MAPQ>20" -o <BAM>.comp
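
To give an intuition for why quantizing to a regular grid helps compression, here is a hedged Python sketch. It rounds each coverage value down to a multiple of the grid height and then run-length encodes the result. Both steps are simplifying assumptions for illustration; how CODOC actually places grid values and encodes them is not shown here, and the function names are hypothetical.

```python
def quantize_grid(coverage, height):
    """Round every coverage value down to a multiple of `height`."""
    if height <= 0:
        # Assumption in this sketch: a height of 0 leaves the signal unchanged.
        return list(coverage)
    return [(value // height) * height for value in coverage]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


coverage = [0, 1, 3, 4, 6, 7, 9, 12, 11, 8, 4, 2]
print(len(run_length_encode(coverage)))               # 12 runs without quantization
print(run_length_encode(quantize_grid(coverage, 5)))  # 5 runs on a grid of height 5
```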

The following command compresses coverage data extracted from all minus-strand reads in a BAM file losslessly. Only reads with a mapping quality > 20 are included.

$ java -jar bin/codoc-0.0.1.jar compress -bam <BAM> -qparam 0 -filter "STRAND=-" -filter "MAPQ>20" -o <BAM>.comp

The following command prints the header of a compressed file showing the metadata as well as the configuration used for compressing the file.

$ java -jar bin/codoc-0.0.1.jar decompress head -covFile <BAM>.comp

The following command starts an interactive query session for random access to the compressed file.

$ java -jar bin/codoc-0.0.1.jar decompress query -covFile <BAM>.comp
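
Conceptually, random access to a piecewise-constant coverage signal can be done with a binary search over run start positions. The Python sketch below illustrates that idea only; the class name and data layout are assumptions and do not describe the CODOC file format or its index structures.

```python
from bisect import bisect_right


class RunLengthSignal:
    """Piecewise-constant signal stored as sorted (start, value) runs."""

    def __init__(self, runs):
        # Each run extends from its start to the start of the next run.
        self.starts = [start for start, _ in runs]
        self.values = [value for _, value in runs]

    def query(self, position):
        """Return the coverage at `position` in O(log n) time."""
        index = bisect_right(self.starts, position) - 1
        if index < 0:
            raise IndexError("position precedes the first run")
        return self.values[index]


signal = RunLengthSignal([(0, 0), (4, 5), (7, 10), (9, 5), (10, 0)])
print(signal.query(5), signal.query(8), signal.query(100))  # 5 10 0
```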

The following command extracts all regions from the compressed file that have a coverage between MIN and MAX and stores them to a BED file. The given name/description are used for the BED track.

$ java -jar bin/codoc-0.0.1.jar decompress tobed -covFile <BAM>.comp -outFile <BED> -min <MIN> -max <MAX> -name <NAME> -description <DESCRIPTION>
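
The kind of region extraction described above can be sketched in Python as follows. This is not the CODOC implementation: the function names are hypothetical, the MIN/MAX bounds are assumed to be inclusive, and the output uses the usual 0-based, half-open BED coordinates.

```python
def regions_in_range(chrom, coverage, min_cov, max_cov):
    """Return (chrom, start, end) intervals whose coverage lies in [min_cov, max_cov]."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        inside = min_cov <= depth <= max_cov
        if inside and start is None:
            start = pos                      # a new region opens here
        elif not inside and start is not None:
            regions.append((chrom, start, pos))
            start = None                     # the region closed at pos (exclusive)
    if start is not None:
        regions.append((chrom, start, len(coverage)))
    return regions


def to_bed(regions, name, description):
    """Format regions as BED text with a leading track line."""
    lines = [f'track name="{name}" description="{description}"']
    lines += [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
    return "\n".join(lines)


coverage = [0, 2, 5, 7, 3, 0, 0, 6, 6, 1]
print(to_bed(regions_in_range("chr1", coverage, 2, 6), "example", "coverage 2-6"))
```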

The following command queries a file and scales the results by a linear factor.

$ java -jar bin/codoc-0.0.1.jar decompress query -covFile <BAM>.comp -scale 2.0

The following commands convert a CODOC file to the WIG format and then to the BigWig format.

$ java -jar bin/codoc-0.0.1.jar decompress towig -covFile <BAM>.comp -o <BAM>.comp.wig
$ wigToBigWig <BAM>.comp.wig <chrSizes> <BAM>.comp.bw

NOTE: if you run into OutOfMemory errors, you can increase the heap space available to the JVM (e.g., to 4 GB) using the following switch: $ java -Xmx4g -jar bin/codoc-0.0.1.jar ...

CODOC Filters

Filters can be used to restrict the coverage extraction from a SAM/BAM file to reads that match the given filter criteria. Multiple filters are combined by a logical AND. Filters are of the form <FIELD><OPERATOR><VALUE>.

Possible fields:

    MAPQ    	the mapping quality
    FLAGS   	the read flags
    STRAND  	the read strand (+/-)
    FOPSTRAND       the first-of-pair read strand (+/-)

Other names are mapped directly to the optional field of that name in the SAM file. For example, use NM for the 'number of mismatches' field. Reads that do not have the field set are included (see the SAM specification).

Possible operators: <, <=, =, >, >=, ^^ (flag unset), && (flag set)

Example (do NOT use reads with a mapping quality <= 20 or with multiple perfect hits):

    -filter 'MAPQ>20' -filter 'H0=1'
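
The filter semantics described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not CODOC's parser: reads are modeled as plain dictionaries, and the names parse_filter and passes are hypothetical.

```python
import operator
import re

# <FIELD><OPERATOR><VALUE>; two-character operators are listed first so they match first.
FILTER_PATTERN = re.compile(r"^([A-Za-z0-9]+)(<=|>=|\^\^|&&|<|>|=)(.+)$")
COMPARISONS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def parse_filter(text):
    """Split a filter string into a (field, operator, value) tuple."""
    match = FILTER_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid filter: {text!r}")
    return match.groups()


def passes(read, filters):
    """Return True if the read satisfies all filters (logical AND)."""
    for field, op, value in filters:
        if field not in read:
            continue  # reads that do not have the field set are included
        actual = read[field]
        if op == "^^":    # flag unset
            ok = (int(actual) & int(value)) == 0
        elif op == "&&":  # flag set
            ok = (int(actual) & int(value)) == int(value)
        elif op == "=":
            ok = str(actual) == value
        else:
            ok = COMPARISONS[op](float(actual), float(value))
        if not ok:
            return False
    return True


filters = [parse_filter("FLAGS^^1024"), parse_filter("MAPQ>20")]
print(passes({"FLAGS": 16, "MAPQ": 30}, filters))    # True
print(passes({"FLAGS": 1040, "MAPQ": 30}, filters))  # False: duplicate flag is set
```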


If you make use of CODOC, please cite:

Niko Popitsch, CODOC: efficient access, analysis and compression of depth of coverage signals, Bioinformatics, 2014, DOI