aashish24/bin2seq

Files in arbitrary binary formats must first be converted to the Hadoop SequenceFile format
so that M/R (MapReduce) jobs can split them and processing can be parallelized across distributed nodes.  
 
bin2seq borrows ideas from 'tar2seq', developed by [Stuart Sierra](http://stuartsierra.com/2008/04/24/a-million-little-files).

Thanks to more recent Hadoop developments such as support for multiple NameNodes and HDFS federation (e.g., HDFS-1052), 
the “large number of small files” problem no longer poses the same challenge it did when 'tar2seq' was developed (prior to Hadoop v0.22).  
bin2seq is therefore much simpler: it performs a one-binary-file-to-one-SequenceFile conversion. 
The benefit of doing so is that the same file structure and nomenclature are maintained across 
the traditional file system and HDFS.  
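
As a rough illustration of the one-to-one mapping described above, the sketch below derives an HDFS target path from a local file path while preserving the directory structure. It is not taken from the bin2seq source: the class name `SeqPathMapper`, the method `toSeqPath`, the `.seq` suffix, and the example paths are all assumptions made for illustration, and only the Java standard library is used (no Hadoop calls).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of bin2seq: maps one local binary file
// to the path of its corresponding SequenceFile under an HDFS root.
final class SeqPathMapper {

    // Assumed suffix for converted files; the real project may name them differently.
    static final String SUFFIX = ".seq";

    static String toSeqPath(Path localRoot, Path localFile, String hdfsRoot) {
        Path root = localRoot.toAbsolutePath().normalize();
        Path file = localFile.toAbsolutePath().normalize();
        Path relative = root.relativize(file);

        // Reject files outside the root, or the root itself.
        if (relative.toString().isEmpty() || relative.startsWith("..")) {
            throw new IllegalArgumentException(localFile + " is not inside " + localRoot);
        }

        // Drop trailing slashes so segments can be joined with a single '/'.
        StringBuilder target = new StringBuilder(hdfsRoot.replaceAll("/+$", ""));
        for (Path part : relative) {
            // HDFS paths always use '/', whatever the local separator is.
            target.append('/').append(part.toString());
        }
        return target.append(SUFFIX).toString();
    }

    public static void main(String[] args) {
        System.out.println(toSeqPath(
                Paths.get("/data"),
                Paths.get("/data/images/2013/scan01.tif"),
                "/user/hadoop/data"));
    }
}
```

Because each input file maps to exactly one output file, the relative path alone is enough to locate the converted copy; no index of packed archive members is needed.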

Code-wise, there is little similarity between bin2seq and tar2seq. bin2seq is built with Maven.

About

A suite of code demonstrating the processing of large binary data in Hadoop, e.g., images, NetCDF, and HDF5.

Languages

  • Java 94.6%
  • Shell 4.9%
  • Python 0.5%