Apache DataFu

Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.

It consists of two libraries:

Apache DataFu Pig: a collection of user-defined functions for Apache Pig
Apache DataFu Hourglass: an incremental processing framework for Apache Hadoop in MapReduce

DataFu is currently undergoing incubation with Apache. A mirror of the official git repository can be found on GitHub at https://github.com/apache/incubator-datafu.

For more information please visit the website:

http://datafu.incubator.apache.org/

If you'd like to jump in and get started, check out the corresponding guides for each library:

Blog Posts

Presentations

Videos

Introduction to Apache DataFu @ ApacheCon 2014

Other Resources

An interesting example of using Quantile from DataFu can be found in the Hadoop Real-World Solutions Cookbook.

From Around the Web

Papers

Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)

Getting Help

Bugs and feature requests can be filed here. For other help please see the discussion group.

Building the Code

The Apache DataFu Pig library can be built by running the command below. More information about working with the source code can be found in the DataFu Pig Contributing Guide.

ant jar

The Apache DataFu Hourglass library can be built by running the commands below. More information about working with the source code can be found in the DataFu Hourglass Contributing Guide.

cd contrib/hourglass
ant jar
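
The two builds above can be combined into one script. The following is a minimal sketch, assuming it is run from the repository root with Ant available on the PATH. The `build_all` function and the `ANT` variable are illustrative names introduced here and are not part of DataFu.

```shell
#!/bin/sh
# Sketch: build both DataFu libraries from the repository root.
set -e  # stop at the first failing build

# ANT is a hypothetical override so the steps can be dry-run (e.g. ANT=echo).
ANT="${ANT:-ant}"

build_all() {
  # DataFu Pig: built from the repository root.
  "$ANT" jar
  # DataFu Hourglass: built from contrib/hourglass. The subshell keeps
  # the caller's working directory unchanged after the build.
  (cd contrib/hourglass && "$ANT" jar)
}
```

Calling `build_all` from the repository root runs both builds in order. Setting `ANT=echo` first prints the Ant targets instead of running them.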

