forked from openreserach/bin2seq
A suite of code to demonstrate big binary data processing in Hadoop, e.g., images, netcdf, hdfs5, etc.
License
aashish24/bin2seq
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
It is a prerequisite to convert files in arbitrary binary formats into a Hadoop SequenceFile format, so they can be split by M/R jobs and processing can be parallelized to distribute nodes. The 'bin2seq' borrows ideas from 'tar2seq' developed by [Stuart Sierra](http://stuartsierra.com/2008/04/24/a-million-little-files) Thanks to the latest Hadoop development like multiple NameNode-support and HDFS-federation (e.g., HDFS-1052), the “large-number-of-small-files” does not pose the same challenge as when 'tar2seq' was developed (prior to Hadoop v0.22). Therefore, bin2seq is a much simpler because we are performing one-binary-file-to-one-sequencefile (bin2seq) conversion, and the benefit of do so is to maintain the same file structure and nomenclature across traditional file system and HDFS. Code-wise, there is little similarity between bin2seq and tar2seq. bin2seq is built by Maven.
About
A suite of code to demonstrate big binary data processing in Hadoop, e.g., images, netcdf, hdfs5, etc.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Java 94.6%
- Shell 4.9%
- Python 0.5%