Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
It consists of two libraries:
- Apache DataFu Pig: a collection of user-defined functions for Apache Pig
- Apache DataFu Hourglass: an incremental processing framework for MapReduce jobs on Apache Hadoop
DataFu is currently undergoing incubation with Apache. A mirror of the official git repository can be found on GitHub at https://github.com/apache/incubator-datafu.
For more information please visit the website:
If you'd like to jump in and get started, check out the corresponding guides for each library:
- Introducing DataFu
- DataFu: The WD-40 of Big Data
- DataFu 1.0
- DataFu's Hourglass: Incremental Data Processing in Hadoop
- A Brief Tour of DataFu
- Building Data Products at LinkedIn with DataFu
- Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)
- DataFu @ ApacheCon 2014
An interesting example of using Quantile from DataFu can be found in the Hadoop Real-World Solutions Cookbook.
- DataFu Enters Incubation Status at Apache
- DataFu: Open Source Apache Pig UDFs by LinkedIn
- LinkedIn Opens DataFu: A Library for Working with Hadoop and Pig
Bugs and feature requests can be filed here. For other help please see the discussion group.
The Apache DataFu Pig library can be built by running the command below. More information about working with the source code can be found in the DataFu Pig Contributing Guide.
    ant jar
The Apache DataFu Hourglass library can be built by running the commands below. More information about working with the source code can be found in the DataFu Hourglass Contributing Guide.
    cd contrib/hourglass
    ant jar