Merging-Event-Datasets

This repository contains a MapReduce implementation of a merge between two event datasets, with records linked by item ID and time. It uses Hadoop MapReduce's efficient distributed sort to group records by item ID, performs a tertiary sort to order the records within each group, and then iterates through the events to derive insights.

Let's imagine that we have a dataset covering all the taxis in town, with each record showing whether a taxi is hired, available or off-shift during a time interval. It looks like this:

ItemID	datetimeStart	datetimeEnd	state
TaxiA	2015-10-11 19:02:19	2015-10-11 19:15:12	Hired
TaxiA	2015-10-11 19:15:12	2015-10-11 19:18:19	Available
TaxiA	2015-10-11 19:18:19	2015-10-11 19:40:18	Hired
TaxiA	2015-10-11 19:40:18	2015-10-11 20:05:01	Available
TaxiB	2015-10-11 19:11:00	2015-10-11 23:00:45	Off-Shift
TaxiC	2015-10-11 19:11:19	2015-10-11 19:30:27	Available
TaxiE	2015-10-11 19:16:50	2015-10-11 19:46:18	Hired
TaxiF	2015-10-11 19:46:18	2015-10-11 20:10:11	Available

Suppose we collected another dataset containing taxi speeds, with each speed mapped to an attribute as follows.

speedRange(km/h)	speedAttribute
<= 30	Low
between 30 and 60	Medium
>= 60	High
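
As a minimal sketch, the mapping above could be written as a small Python function. The function name `speed_attribute` is hypothetical, and the boundary handling is an assumption: exactly 30 km/h is treated as Low and exactly 60 km/h as High.

```python
def speed_attribute(speed_kmh: float) -> str:
    """Map a speed in km/h to its attribute bucket."""
    # Assumption: 30 km/h counts as Low and 60 km/h counts as High.
    if speed_kmh <= 30:
        return "Low"
    if speed_kmh < 60:
        return "Medium"
    return "High"
```

For example, `speed_attribute(45)` falls in the Medium bucket.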

The speed dataset looks like this.

datetimeStart	datetimeEnd	itemID	attribute
2015-10-11 19:20:19	2015-10-11 19:21:22	TaxiA	Medium
2015-10-11 19:21:22	2015-10-11 19:25:38	TaxiA	High
2015-10-11 19:25:38	2015-10-11 19:30:01	TaxiA	Medium
2015-10-11 19:30:01	2015-10-11 19:50:01	TaxiA	Low
2015-10-11 20:27:38	2015-10-11 20:32:19	TaxiB	Medium

We want to merge these two datasets in order to relate taxi speed to hired status, so we want an output like this.

datetimeStart	datetimeEnd	itemID	state	attribute
2015-10-11 19:20:19	2015-10-11 19:21:22	TaxiA	Hired	Medium
2015-10-11 19:21:22	2015-10-11 19:25:38	TaxiA	Hired	High
2015-10-11 19:25:38	2015-10-11 19:30:01	TaxiA	Hired	Medium
2015-10-11 19:30:01	2015-10-11 19:40:18	TaxiA	Hired	Low
2015-10-11 19:40:18	2015-10-11 19:50:01	TaxiA	Available	Low

With this merged dataset, we can begin to analyze whether taxis drive more slowly when they are available, and whether they spend more time at high, medium or low speeds when they have passengers.
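
As a rough illustration of that kind of analysis, the Python sketch below totals the seconds spent in each (state, attribute) combination. It is not part of the repository. The function name is hypothetical, and the state attached to each merged row is an assumption, looked up from TaxiA's intervals in the first dataset.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Merged rows for TaxiA: (start, end, itemID, state, attribute).
# Assumption: the state values are looked up from the first dataset.
merged = [
    ("2015-10-11 19:20:19", "2015-10-11 19:21:22", "TaxiA", "Hired", "Medium"),
    ("2015-10-11 19:21:22", "2015-10-11 19:25:38", "TaxiA", "Hired", "High"),
    ("2015-10-11 19:25:38", "2015-10-11 19:30:01", "TaxiA", "Hired", "Medium"),
    ("2015-10-11 19:30:01", "2015-10-11 19:40:18", "TaxiA", "Hired", "Low"),
    ("2015-10-11 19:40:18", "2015-10-11 19:50:01", "TaxiA", "Available", "Low"),
]

def seconds_by_state_and_attribute(rows):
    """Sum the duration of each row, keyed by (state, attribute)."""
    totals = defaultdict(float)
    for start, end, _item, state, attribute in rows:
        duration = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        totals[(state, attribute)] += duration.total_seconds()
    return dict(totals)

totals = seconds_by_state_and_attribute(merged)
```

Here `totals[("Hired", "Low")]` is 617.0 seconds and `totals[("Available", "Low")]` is 583.0 seconds.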

MapReduce is a natural fit for the merge logic, because the work can be distributed by itemID (for example, by taxi). Furthermore, we can take advantage of the efficient distributed sort in MapReduce. Here, I have implemented a tertiary sort.
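
To make the idea concrete, here is a small in-memory Python sketch that imitates what the shuffle phase provides: records are sorted on a full composite key but grouped on itemID alone. It is not the Hadoop code in this repository. The key layout (itemID, datetime, source) and the function name `shuffle` are assumptions made for illustration.

```python
from itertools import groupby

# (itemID, datetime, source) tuples standing in for mapper output keys.
# Assumption: this three-part key layout is illustrative only.
events = [
    ("TaxiB", "2015-10-11 20:27:38", "speed"),
    ("TaxiA", "2015-10-11 19:18:19", "state"),
    ("TaxiA", "2015-10-11 19:20:19", "speed"),
    ("TaxiA", "2015-10-11 19:15:12", "state"),
    ("TaxiB", "2015-10-11 19:11:00", "state"),
]

def shuffle(records):
    """Sort by the full composite key, then group by itemID only."""
    # Tuples compare field by field: itemID, then datetime, then source.
    # Timestamps in this fixed format sort chronologically as strings.
    ordered = sorted(records)
    return {item: list(group)
            for item, group in groupby(ordered, key=lambda r: r[0])}

groups = shuffle(events)
```

Each value in `groups` is one item's events in time order, which is the shape a reducer would see.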

The basic logic is as follows:

1. Group by itemID.
2. For each record, generate two records: one for the start datetime and one for the end datetime.
3. Sort by datetime within the itemID partitions.
4. For each sorted group of itemID, keep two variables, currentState and currentAttribute.
5. Iterate over the sorted list of events, updating the variables and writing the records to disk one at a time.
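
The steps above can be sketched in plain Python as a single-process sweep. This is a minimal illustration rather than the repository's Hadoop implementation. The names `to_events` and `merge` are hypothetical, a dictionary stands in for currentState and currentAttribute, and the rule that end events sort before start events at the same timestamp is an assumption.

```python
from itertools import groupby

END, START = 0, 1  # assumption: ends sort before starts at the same timestamp

def to_events(item, start, end, field, value):
    """Split one interval record into a start event and an end event."""
    return [(item, start, START, field, value), (item, end, END, field, value)]

def merge(state_rows, speed_rows):
    """Sweep each item's events in time order and emit merged intervals."""
    events = []
    for item, start, end, state in state_rows:
        events += to_events(item, start, end, "state", state)
    for start, end, item, attribute in speed_rows:
        events += to_events(item, start, end, "attribute", attribute)

    merged = []
    # Sorting the tuples orders by itemID, then datetime, then event kind.
    for item, group in groupby(sorted(events), key=lambda e: e[0]):
        current = {"state": None, "attribute": None}
        previous_time = None
        for _item, time, kind, field, value in group:
            # Emit the interval that just ended if both values were known.
            if None not in current.values() and previous_time < time:
                merged.append((previous_time, time, item,
                               current["state"], current["attribute"]))
            current[field] = value if kind == START else None
            previous_time = time
    return merged

state_rows = [
    ("TaxiA", "2015-10-11 19:02:19", "2015-10-11 19:15:12", "Hired"),
    ("TaxiA", "2015-10-11 19:15:12", "2015-10-11 19:18:19", "Available"),
    ("TaxiA", "2015-10-11 19:18:19", "2015-10-11 19:40:18", "Hired"),
    ("TaxiA", "2015-10-11 19:40:18", "2015-10-11 20:05:01", "Available"),
]
speed_rows = [
    ("2015-10-11 19:20:19", "2015-10-11 19:21:22", "TaxiA", "Medium"),
    ("2015-10-11 19:21:22", "2015-10-11 19:25:38", "TaxiA", "High"),
    ("2015-10-11 19:25:38", "2015-10-11 19:30:01", "TaxiA", "Medium"),
    ("2015-10-11 19:30:01", "2015-10-11 19:50:01", "TaxiA", "Low"),
]
rows = merge(state_rows, speed_rows)
```

On the TaxiA records from the two sample datasets, `rows` contains five intervals, with the Low speed interval split at 19:40:18 where the state changes from Hired to Available.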

All the best! 😃

chengyongyeo/Merging-Event-Datasets

Folders and files

Latest commit

History

src/org/myorg

src/org/myorg

README.md

README.md

Repository files navigation

Merging-Event-Datasets

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages