GitHub - aashish24/bin2seq: A suite of code to demonstrate big binary data processing in Hadoop, e.g., images, netcdf, hdf5, etc.

Files in arbitrary binary formats must first be converted into the Hadoop SequenceFile format,
so that M/R jobs can split them and processing can be parallelized across distributed nodes.
 
bin2seq borrows ideas from 'tar2seq', developed by [Stuart Sierra](http://stuartsierra.com/2008/04/24/a-million-little-files).

Thanks to recent Hadoop developments such as multiple-NameNode support and HDFS federation (e.g., HDFS-1052),
the “large number of small files” problem does not pose the same challenge as it did when 'tar2seq' was developed (prior to Hadoop v0.22).
bin2seq is therefore much simpler: it performs a one-binary-file-to-one-SequenceFile (bin2seq)
conversion, and the benefit of doing so is that the same file structure and nomenclature are maintained across
the traditional file system and HDFS.
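
To illustrate what keeping the same file structure and nomenclature can look like, here is a minimal sketch that maps a local binary file to the HDFS path of its one-to-one SequenceFile. It uses only the Java standard library and does not touch Hadoop. The class `Bin2SeqPaths`, the method `toHdfsPath`, the optional suffix, and the example NameNode URI are all hypothetical and are not taken from the bin2seq source.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class Bin2SeqPaths {

    /**
     * Hypothetical helper (not part of bin2seq): mirrors a local file's
     * position under localRoot beneath hdfsRoot, so directory layout and
     * file names stay the same on both sides.
     *
     * @param suffix optional extension for the SequenceFile, e.g. ".seq";
     *               pass "" to keep the name unchanged (an assumption,
     *               the real tool may name its outputs differently)
     */
    public static String toHdfsPath(Path localRoot, Path localFile,
                                    String hdfsRoot, String suffix) {
        Path root = localRoot.toAbsolutePath().normalize();
        Path file = localFile.toAbsolutePath().normalize();

        // Path of the file relative to the local root, e.g. images/2012/a.png
        Path relative = root.relativize(file);
        if (relative.toString().isEmpty() || relative.startsWith("..")) {
            throw new IllegalArgumentException(
                    localFile + " is not a file under " + localRoot);
        }

        // HDFS paths always use forward slashes, whatever the local OS uses.
        String rel = relative.toString().replace('\\', '/');
        String base = hdfsRoot.endsWith("/")
                ? hdfsRoot.substring(0, hdfsRoot.length() - 1)
                : hdfsRoot;
        return base + "/" + rel + suffix;
    }

    public static void main(String[] args) {
        // Example values only; the host, port, and directories are made up.
        System.out.println(toHdfsPath(
                Paths.get("/data"),
                Paths.get("/data/images/2012/tile_001.png"),
                "hdfs://namenode:8020/archive",
                ".seq"));
    }
}
```

Because each input file maps to exactly one output path, a job that later reads the SequenceFiles can be pointed at the same relative directories that exist on the local file system.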

Code-wise, there is little similarity between bin2seq and tar2seq. bin2seq is built with Maven.