_________ ___ ____ _____
/ ___/ __ \/ _ \/ __ \/ ___/
/ /__/ /_/ / // / /_/ / /__
\___/\____/____/\____/\___/
Welcome to the CODOC project
----------------------------
CODOC is an open file format and API for the lossless and lossy compression of depth-of-coverage (DOC) signals stemming from high-throughput sequencing (HTS) experiments. DOC data is a one-dimensional signal representing the number of reads covering each reference nucleotide in an HTS data set. DOC is highly valuable information for interpreting (e.g., RNA-seq), filtering (e.g., SNP analysis) or detecting novel features (e.g., structural variants) in sequencing data sets.
CODOC exploits several characteristics of DOC data stemming from HTS data and uses a non-uniform quantization model that preserves DOC signals in low-coverage regions nearly perfectly while allowing more divergence in highly covered regions.
CODOC reduces the file size required for DOC data by up to 3500X when compared to raw representations and by about 4-32X when compared to other methods. The CODOC API supports efficient sequential and random access to compressed data sets.
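To make the notion of a DOC signal concrete, the following Python sketch derives per-base coverage from a list of read alignments. It is an illustration only and not part of the CODOC API: the function name `depth_of_coverage` and the 0-based, half-open `(start, end)` read intervals are assumptions made for this example.

```python
from itertools import accumulate


def depth_of_coverage(reads, ref_length):
    """Return the number of reads covering each reference position.

    `reads` holds 0-based, half-open (start, end) intervals; this input
    layout is an assumption of the sketch, not a CODOC data structure.
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        start = max(0, start)        # clip reads to the reference
        end = min(ref_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # The running sum of the differences is the per-base coverage.
    return list(accumulate(diff[:-1]))


reads = [(0, 4), (2, 6), (3, 5)]
print(depth_of_coverage(reads, 8))  # [1, 1, 2, 3, 2, 1, 0, 0]
```

The difference-array approach touches each read only twice, so the cost grows with the number of reads plus the reference length rather than with their product.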
Some usage scenarios for CODOC:
- extract strand-specific coverage signal (export as WIG file)
- query coverage values at given genomic positions
- extract genomic regions with a given minimum/maximum coverage
- filter single-nucleotide variations in cancer/normal pairs
- pairwise combinations of coverage files (e.g., subtract one signal from another, calculate minimum signal, etc.)
- cross-correlation of coverage signals
- compression of DOC data (e.g., for archiving or transmission). Note that successive compression/decompression of DOC data with CODOC can be used for intentional data quantization.
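The pairwise combination scenario from the list above can be pictured with a few lines of Python. This is a conceptual sketch on plain per-base lists, not the CODOC tools implementation; the function name `combine` and the sample values are made up for illustration.

```python
def combine(signal_a, signal_b, op):
    """Combine two per-base coverage signals position by position."""
    if len(signal_a) != len(signal_b):
        raise ValueError("signals must cover the same number of positions")
    return [op(a, b) for a, b in zip(signal_a, signal_b)]


# Hypothetical per-base coverage of two samples over the same region.
tumor = [4, 6, 9, 3]
normal = [5, 5, 2, 3]

difference = combine(tumor, normal, lambda a, b: a - b)
minimum = combine(tumor, normal, min)
print(difference)  # [-1, 1, 7, 0]
print(minimum)     # [4, 5, 2, 3]
```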
CODOC source code and releases can be downloaded from GitHub. CODOC uses Maven 2 as its build tool. For development, we recommend the Eclipse IDE for Java developers and the m2e Maven Integration for Eclipse.
The CODOC jars can be built with bin/build-java.sh <VERSION> (version is, e.g., 0.0.1)
Call the CODOC API without parameters to get short usage information for the individual commands:
$ java -jar bin/codoc-0.0.1.jar
$ java -jar bin/codoc-0.0.1.jar compress
$ java -jar bin/codoc-0.0.1.jar decompress
$ java -jar bin/codoc-0.0.1.jar tools
The following command compresses coverage data extracted from a SAM/BAM file using a regular grid of height 5 for quantization. Only reads that have the SAM flag for PCR or optical duplicates (1024) unset and a mapping quality > 20 are included.
$ java -jar bin/codoc-0.0.1.jar compress -bam <BAM> -qmethod GRID -qparam 5 -filter "FLAGS^^1024" -filter "MAPQ>20" -o <BAM>.comp
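What quantizing a coverage signal on a regular grid can mean is sketched below in Python. The sketch snaps each value down to the nearest multiple of the grid height and then run-length encodes the result. This is an assumed, simplified reading of grid quantization, not CODOC's actual GRID implementation, and both function names are hypothetical.

```python
def quantize_to_grid(coverage, height):
    """Snap every value down to a multiple of `height` (assumed model)."""
    if height < 1:
        raise ValueError("grid height must be at least 1 in this sketch")
    return [(value // height) * height for value in coverage]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


coverage = [0, 1, 3, 4, 6, 7, 12, 11, 9, 4]
quantized = quantize_to_grid(coverage, 5)
print(quantized)                     # [0, 0, 0, 0, 5, 5, 10, 10, 5, 0]
print(run_length_encode(quantized))  # [(0, 4), (5, 2), (10, 2), (5, 1), (0, 1)]
```

In this simplified model, the reconstruction error per position is always smaller than the grid height, and a coarser grid yields longer runs and thus fewer values to store.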
The following command compresses coverage data extracted from all minus-strand reads in a BAM file losslessly. Only reads with a mapping quality > 20 are included.
$ java -jar bin/codoc-0.0.1.jar compress -bam <BAM> -qparam 0 -filter "STRAND=-" -filter "MAPQ>20" -o <BAM>.comp
The following command prints the header of a compressed file showing the metadata as well as the configuration used for compressing the file.
$ java -jar bin/codoc-0.0.1.jar decompress head -covFile <BAM>.comp
The following command starts an interactive query session for random access to the compressed file.
$ java -jar bin/codoc-0.0.1.jar decompress query -covFile <BAM>.comp
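The idea behind random access to a compact coverage representation can be illustrated with a binary search over run start positions. The Python sketch below assumes the signal is stored as sorted `(start_position, coverage)` runs; this layout and the class name `RunIndex` are assumptions for illustration and do not describe the CODOC file format.

```python
from bisect import bisect_right


class RunIndex:
    """Look up coverage at a position in a run-based signal (assumed layout)."""

    def __init__(self, runs):
        # `runs` is a list of (start_position, coverage), sorted by start.
        self.starts = [start for start, _ in runs]
        self.values = [value for _, value in runs]

    def query(self, position):
        # Find the last run that starts at or before `position`.
        i = bisect_right(self.starts, position) - 1
        if i < 0:
            raise IndexError(f"position {position} precedes the first run")
        return self.values[i]


index = RunIndex([(0, 0), (100, 5), (250, 10), (400, 0)])
print(index.query(99))   # 0
print(index.query(100))  # 5
print(index.query(399))  # 10
```

Each lookup costs a logarithmic number of comparisons in the number of runs, independent of the length of the reference.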
The following command extracts all regions from the compressed file that have a coverage between MIN and MAX and stores them to a BED file. The given name/description are used for the BED track.
$ java -jar bin/codoc-0.0.1.jar decompress tobed -covFile <BAM>.comp -outFile <BED> -min <MIN> -max <MAX> -name <NAME> -description <DESCRIPTION>
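Conceptually, extracting such regions means scanning the signal for maximal stretches whose coverage lies within the given bounds. The Python sketch below shows this on a plain per-base list and emits minimal three-column BED lines (0-based, half-open coordinates). Treating both bounds as inclusive is an assumption, and the function names are hypothetical; the sketch does not reproduce the output of the tobed command.

```python
def regions_in_range(coverage, min_cov, max_cov):
    """Return maximal (start, end) stretches with min_cov <= value <= max_cov."""
    regions = []
    start = None
    for position, value in enumerate(coverage):
        inside = min_cov <= value <= max_cov
        if inside and start is None:
            start = position                    # a region opens here
        elif not inside and start is not None:
            regions.append((start, position))   # the region closed before here
            start = None
    if start is not None:
        regions.append((start, len(coverage)))  # region runs to the end
    return regions


def to_bed_lines(chrom, regions):
    """Format regions as tab-separated chrom/start/end BED lines."""
    return [f"{chrom}\t{start}\t{end}" for start, end in regions]


coverage = [0, 2, 3, 8, 9, 2, 0, 3, 3]
regions = regions_in_range(coverage, 2, 5)
print(regions)  # [(1, 3), (5, 6), (7, 9)]
print("\n".join(to_bed_lines("chr1", regions)))
```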
The following command queries a file and scales the results by a linear factor.
$ java -jar bin/codoc-0.0.1.jar decompress query -covFile <BAM>.comp -scale 2.0
The following commands convert a CODOC file to the WIG format and then to the BigWig format.
$ java -jar bin/codoc-0.0.1.jar decompress towig -covFile <BAM>.comp -o <BAM>.comp.wig
$ wigToBigWig <BAM>.comp.wig <chrSizes> <BAM>.comp.bw
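For readers unfamiliar with WIG files, the Python sketch below writes a per-base coverage list in the fixedStep WIG layout (one declaration line with a 1-based start, followed by one value per line). The exact layout produced by the towig command may differ (for example, it may use variableStep); the function name `to_fixed_step_wig` is hypothetical.

```python
def to_fixed_step_wig(chrom, coverage, start=1):
    """Render per-base coverage as fixedStep WIG text (1-based start)."""
    lines = [f"fixedStep chrom={chrom} start={start} step=1"]
    lines.extend(str(value) for value in coverage)
    return "\n".join(lines) + "\n"


print(to_fixed_step_wig("chr1", [0, 2, 3]), end="")
```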
NOTE: If you run into OutOfMemory errors, you can increase the heap space available to the JVM (e.g., to 4 GByte) using the -Xmx switch:
$ java -Xmx4g -jar bin/codoc-0.0.1.jar ...
Filters can be used to restrict the coverage extraction from a SAM/BAM file to reads that match the given filter criteria. Multiple filters are combined by a logical AND. Filters are of the form <FIELD><OPERATOR><VALUE>.
Possible fields:
MAPQ the mapping quality
FLAGS the read flags
STRAND the read strand (+/-)
FOPSTRAND the first-of-pair read strand (+/-)
Other names will be mapped directly to the optional field name in the SAM file. Use, e.g., NM for the 'number of mismatches' field. Reads that do not have a field set will be included (see the SAM specification).
Possible operators: <, <=, =, >, >=, ^^ (flag unset), && (flag set)
Example (do NOT use reads with a mapping quality <= 20 or with multiple perfect hits):
-filter 'MAPQ>20' -filter 'H0=1'
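The filter semantics described above (the <FIELD><OPERATOR><VALUE> form, AND combination, and inclusion of reads that lack a field) can be modelled in a few lines of Python. This is a sketch of the documented behaviour, not CODOC's parser: reads are represented as plain dictionaries, `&&` is assumed to mean "any of the given flag bits is set", and the names `parse_filter` and `read_passes` are hypothetical.

```python
import operator
import re

# Longer operators come first so that "<=" is not read as "<" followed by "=".
_FILTER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9]*)(<=|>=|\^\^|&&|<|>|=)(.+)$")
_COMPARE = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
            ">": operator.gt, ">=": operator.ge}


def parse_filter(text):
    """Split a filter string into (field, operator, value)."""
    match = _FILTER_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid filter: {text!r}")
    return match.groups()


def read_passes(read, filters):
    """Return True if the read (a dict, by assumption) matches ALL filters."""
    for field, op, raw in filters:
        if field not in read:
            continue  # reads without the field are included
        value = read[field]
        if op == "^^":                      # flag unset
            ok = (int(value) & int(raw)) == 0
        elif op == "&&":                    # flag set (assumed: any bit)
            ok = (int(value) & int(raw)) != 0
        elif isinstance(value, str):        # e.g. STRAND=-
            if op != "=":
                raise ValueError(f"operator {op!r} needs a numeric field")
            ok = value == raw
        else:
            ok = _COMPARE[op](value, float(raw))
        if not ok:
            return False
    return True


filters = [parse_filter("MAPQ>20"), parse_filter("H0=1")]
print(read_passes({"MAPQ": 30, "H0": 1}, filters))  # True
print(read_passes({"MAPQ": 30, "H0": 3}, filters))  # False
print(read_passes({"MAPQ": 30}, filters))           # True (H0 not set)
```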
If you make use of CODOC, please cite:
Niko Popitsch. CODOC: efficient access, analysis and compression of depth of coverage signals. Bioinformatics, 2014, DOI