Contrail
http://contrail-bio.sf.net
===================================================

The first step towards analyzing a previously unsequenced organism is 
to assemble the reads by merging similar reads into progressively 
longer sequences. New assemblers such as Velvet and Euler attempt to 
solve the assembly problem by constructing, simplifying, and traversing 
the de Bruijn graph of the read sequences. Nodes in the graph represent 
substrings of the reads, and directed edges connect consecutive substrings. 
Genome assembly is then modeled as finding an Eulerian tour through the 
graph, although repeats may lead to multiple possible tours. As such, 
assemblers primarily focus on correcting errors, reconstructing unambiguous 
regions, and resolving short repeats. These assemblers have successfully 
assembled small genomes from short reads, but have had limited success 
scaling to larger, mammalian-sized genomes, in part because they 
require constructing and manipulating graphs far larger than can fit into
memory.
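
As a rough illustration of the graph described above, the following Python
sketch builds a small in-memory de Bruijn graph and walks one unambiguous
path through it. It is not Contrail's code: the function names
(build_de_bruijn, walk_unambiguous) and the choice of k-mers as nodes are
assumptions made for this example, and it ignores reverse complements and
sequencing errors.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {kmer: set of successor kmers} for all reads.

    Nodes are k-mers; an edge links two k-mers that appear at
    consecutive positions in the same read.
    """
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # touch the key so nodes without successors exist
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the path is unambiguous.

    The walk stops at a branch (more than one successor), at a merge
    (successor has more than one predecessor), at a dead end, or when
    it would revisit a node.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1

    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]  # consecutive k-mers overlap by k-1 bases
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    g = build_de_bruijn(["ACGTC", "GTCAG"], k=3)
    print(walk_unambiguous(g, "ACG"))  # prints ACGTCAG
```

The two overlapping reads share the k-mer GTC, so the walk merges them into
a single contig. A repeat would show up as a node with several successors or
predecessors, which is where this simple walk stops.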


Addressing this limitation, we have developed a new assembly program, Contrail,
that uses Hadoop for de novo assembly of large genomes from short sequencing
reads. Like other leading short-read assemblers, Contrail relies on the
graph-theoretic framework of de Bruijn graphs. However, unlike these programs,
which require large RAM resources, Contrail relies on Hadoop to iteratively
transform an on-disk representation of the assembly graph, allowing an in-depth
analysis even for large genomes. Preliminary results show Contrail’s contigs
are of similar size and quality to those generated by Velvet when applied to
small (bacterial) genomes, but Contrail provides vastly superior scaling
capabilities when applied to large genomes. We are also developing extensions
to Contrail to efficiently compute a traditional overlap-graph based assembly
of large genomes within Hadoop, a strategy that will be especially valuable as
read lengths increase beyond 100 bp.
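
To give a feel for how graph construction can be phrased as a map/reduce
job, the sketch below simulates the three phases (map, shuffle, reduce) in
plain Python. It does not use the Hadoop API and is not taken from
Contrail's source: the function names, the record layout ("next",
"observations"), and the decision to key records by k-mer are all
assumptions for illustration. In a real Hadoop job the shuffle is performed
by the framework and the records live on disk rather than in a dictionary.

```python
from collections import defaultdict


def map_read(read, k):
    """Map phase: for each k-mer that has a successor in the read,
    emit (kmer, base that follows it)."""
    for i in range(len(read) - k):
        yield read[i:i + k], read[i + k]


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_node(kmer, next_bases):
    """Reduce phase: merge all emissions for one k-mer into a single
    node record listing its distinct successors."""
    return kmer, {
        "next": sorted(set(next_bases)),
        "observations": len(next_bases),
    }


def build_graph(reads, k):
    """Run map, shuffle, and reduce over all reads."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return dict(reduce_node(kmer, values)
                for kmer, values in shuffle(pairs).items())


if __name__ == "__main__":
    for kmer, record in sorted(build_graph(["ACGTC", "CGTCA"], k=3).items()):
        print(kmer, record)
```

Because each node record is produced independently from its own key, no
single machine ever needs to hold the whole graph. Note that this sketch
only emits k-mers that have a successor, so the last k-mer of a read appears
as a node only if some other read extends it.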


Contrail enables de novo assembly of large genomes from short reads by bridging
research in computational biology with research in high-performance computing.
This combination is essential in light of the large data sets involved, and has
the potential to unlock discoveries of critical magnitude. Whereas the
published analysis of the African and Asian human individuals used read mapping
to discover conserved regions and regions with small polymorphisms, de novo
assembly has the unique potential to also discover large-scale polymorphisms
between these individuals and the reference human genome. Mapping the
large-scale differences is an important step towards better understanding of
our own biology, and may reveal previously unknown characteristics of the human
genome related to health or disease. Furthermore, a short read assembler for
large genomes is also essential for sequencing the vast numbers of complex
organisms that have never been sequenced before, and will directly contribute
to new biological knowledge. 



Release History
===================================================


Version 0.8.2
Oct 13, 2010
===================================================
Initial public release
