Skip to content
/ dataflux Public
forked from sumanthn/dataflux

Ingest , process, query, report large amounts of data

License

Notifications You must be signed in to change notification settings

NRaoV/dataflux

 
 

Repository files navigation

DataFlux

data flux is a POC for lambda architecture. flux is combination of various technologies to enable ingestion, processing, querying data @ scale. Apache Kafka is used to initially commit the incoming data, to be read later by batch processor to persist into HBase. Storm also consumes the data from Kafka for real time processing.

A data generator is provided which can feed http access log for processing. It supports conversations in web transactions enable better semantics.

Schematic Diagram

Modules

dataflux-datagenerator -- generate web transactions data for processing dataflux-producer -- Kafka producer dataflux-batchpersist -- batch layer for persisting data into HBase dataflux-persister -- a common layer to be used in persistence

Some key features

  • Datageneration tries to be as close as real world, zipfian distribution for response codes, conversation flows with unique session id, generate bursts
  • Flow control in batch persist to handle slow consumer problem
  • Replay failed batches

About

Ingest , process, query, report large amounts of data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%