Implementation of a Hadoop (MapReduce) Next-Generation Sequencing pipeline for fast single-sample genetic diagnostics.
Before using the tool, be sure that the following has been done:
- A `.tar.gz` archive containing the required tools (see below) has been uploaded to HDFS.
- The needed BWA index files are uploaded to HDFS. Note that these should be in the same directory and have the same prefix. The required files are:
<bwa_reference_file_prefix>.fasta
<bwa_reference_file_prefix>.fasta.amb
<bwa_reference_file_prefix>.fasta.ann
<bwa_reference_file_prefix>.fasta.bwt
<bwa_reference_file_prefix>.fasta.fai
<bwa_reference_file_prefix>.fasta.pac
<bwa_reference_file_prefix>.fasta.sa
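Before uploading, it can help to confirm that all seven index files exist for the chosen prefix. A minimal sketch; the helper function and paths below are illustrative and not part of the pipeline:

```shell
#!/usr/bin/env bash
# Sketch: verify that every BWA index file the pipeline needs exists
# for a given <bwa_reference_file_prefix>. Illustrative helper only.
check_bwa_index() {
    local prefix="$1" suffix missing=0
    for suffix in fasta fasta.amb fasta.ann fasta.bwt fasta.fai fasta.pac fasta.sa; do
        if [ ! -f "${prefix}.${suffix}" ]; then
            echo "missing: ${prefix}.${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical flow (bwa and samtools assumed installed; the HDFS path is an
# assumption, adjust it to your cluster):
#   bwa index reference.fasta        # creates .amb, .ann, .bwt, .pac and .sa
#   samtools faidx reference.fasta   # creates .fai
#   check_bwa_index reference && hdfs dfs -put reference.fasta* /path/on/hdfs/
```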
- Create a `tools` directory to store the tools in.
- Add the following tools to the created directory:
  - Burrows-Wheeler Aligner
    - Download it from http://bio-bwa.sourceforge.net/.
    - Extract the archive.
    - From inside the extracted archive, use `make` (GNU Make) to create an executable file.
    - Copy the created executable file from the extracted archive to the tools directory.
- From the directory storing the tools folder, create a `.tar.gz` archive using `tar -zcf tools.tar.gz tools/`.
  The final hierarchy of the created `tools.tar.gz` should look as follows:

  ```
  tools.tar.gz
  |- tools/
     |- bwa
  ```
  Note: The name used for the tools folder/archive is arbitrary, though both should have the same name. The rest of the structure must be left intact, however!
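The archiving step above can be sketched end-to-end. In this sketch a dummy file stands in for the `bwa` executable built earlier; replace it with the real binary:

```shell
#!/usr/bin/env bash
# Sketch of creating and inspecting tools.tar.gz with the layout shown
# above. The empty tools/bwa file is a stand-in for the real executable.
workdir="$(mktemp -d)"
cd "$workdir"
mkdir tools
: > tools/bwa            # replace with the bwa binary built using make
chmod +x tools/bwa
tar -zcf tools.tar.gz tools/
tar -tzf tools.tar.gz    # should list: tools/ and tools/bwa
```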
- Create a local clone of https://github.com/ddcap/halvade.git.
- From within the `halvade/halvade_upload_tools/` directory, use `ant` (Apache Ant) to create a jar file.
  - Optional: Before using `ant`, set `private static int LEVEL = 2;` in `src/be/ugent/intec/halvade/uploader/Logger.java` (line 32) to `0` for less output to stdout.
  The needed file can be found at: `dist/HalvadeUploaderWithLibs.jar`
- Create a local clone of https://github.com/molgenis/hadoop-pipeline (the most recent commit).
- From within the `hadoop-pipeline/hadoop-pipeline-application/` folder, use `mvn install` (Apache Maven) to create a jar file.
  The needed file can be found at: `target/HadoopPipelineApplicationWithDependencies.jar`
- Upload the fastq files to HDFS using the halvade upload tool.
  - Example: `hadoop jar HalvadeUploaderWithLibs.jar -D dfs.block.size=134217728 -1 reads1.fastq.gz -2 reads2.fastq.gz -O /path/to/hdfs/output/folder/ -size 124`
  - Note that a block size is set using `-D` (slightly larger than the file size defined for the halvade upload tool, and a multiple of 512) so that each created file is stored as a single block on HDFS.
- See https://github.com/ddcap/halvade/wiki/Halvade-Preprocessing for more information about the halvade upload tool.
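The relationship between the two sizes in the example can be verified with shell arithmetic. This assumes `-size 124` yields files of at most 124 MiB; the exact unit is documented on the Halvade wiki, and here it only matters that the files stay below the block size:

```shell
# Quick sanity check of the example's numbers: dfs.block.size must be a
# multiple of 512 and larger than the files the uploader creates, so that
# each file occupies exactly one HDFS block.
BLOCK_SIZE=134217728                  # 128 MiB, as passed via -D
FILE_SIZE=$((124 * 1024 * 1024))      # -size 124, assuming MiB here
echo "$((BLOCK_SIZE % 512))"          # 0 -> valid HDFS block size
echo "$((BLOCK_SIZE > FILE_SIZE))"    # 1 -> each file fits in one block
```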