This repository contains a MapReduce implementation of a merge between two event datasets, with records linked by item ID and time. We use Hadoop MapReduce's efficient distributed sort to group records by item ID, perform a tertiary sort to order the events within each group, and then iterate through the events to derive insights.
Let's imagine that we have a dataset covering all the taxis in town, with each record showing whether a taxi is hired, available or off-shift during a time interval. It looks like this:
ItemID | datetimeStart | datetimeEnd | state |
---|---|---|---|
TaxiA | 2015-10-11 19:02:19 | 2015-10-11 19:15:12 | Hired |
TaxiA | 2015-10-11 19:15:12 | 2015-10-11 19:18:19 | Available |
TaxiA | 2015-10-11 19:18:19 | 2015-10-11 19:40:18 | Hired |
TaxiA | 2015-10-11 19:40:18 | 2015-10-11 20:05:01 | Available |
TaxiB | 2015-10-11 19:11:00 | 2015-10-11 23:00:45 | Off-Shift |
TaxiC | 2015-10-11 19:11:19 | 2015-10-11 19:30:27 | Available |
TaxiE | 2015-10-11 19:16:50 | 2015-10-11 19:46:18 | Hired |
TaxiF | 2015-10-11 19:46:18 | 2015-10-11 20:10:11 | Available |
Suppose we collected another dataset that contains the taxi speeds, with each speed mapped to an attribute as follows.
speedRange(km/h) | speedAttribute |
---|---|
<= 30 | Low |
between 30 and 60 | Medium |
>= 60 | High |
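
The mapping above can be expressed as a small function. This is a minimal sketch, not code from this repository: the name `speed_attribute` is hypothetical, and it assumes that the last row of the table means 60 km/h and above and that exactly 30 km/h counts as Low.

```python
def speed_attribute(speed_kmh: float) -> str:
    """Map a speed in km/h to the Low / Medium / High attribute.

    Assumed boundaries: <= 30 is Low, above 30 and below 60 is Medium,
    60 and above is High.
    """
    if speed_kmh <= 30:
        return "Low"
    if speed_kmh < 60:
        return "Medium"
    return "High"
```

For example, `speed_attribute(45)` falls in the Medium range.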
The speed dataset looks like this.
datetimeStart | datetimeEnd | itemID | attribute |
---|---|---|---|
2015-10-11 19:20:19 | 2015-10-11 19:21:22 | TaxiA | Medium |
2015-10-11 19:21:22 | 2015-10-11 19:25:38 | TaxiA | High |
2015-10-11 19:25:38 | 2015-10-11 19:30:01 | TaxiA | Medium |
2015-10-11 19:30:01 | 2015-10-11 19:50:01 | TaxiA | Low |
2015-10-11 20:27:38 | 2015-10-11 20:32:19 | TaxiB | Medium |
We want to merge these two datasets in order to relate the taxi speeds to the hired status. Each speed interval is split wherever the taxi's state changes, so for TaxiA we want an output like this.
datetimeStart | datetimeEnd | itemID | state | attribute |
---|---|---|---|---|
2015-10-11 19:20:19 | 2015-10-11 19:21:22 | TaxiA | Hired | Medium |
2015-10-11 19:21:22 | 2015-10-11 19:25:38 | TaxiA | Hired | High |
2015-10-11 19:25:38 | 2015-10-11 19:30:01 | TaxiA | Hired | Medium |
2015-10-11 19:30:01 | 2015-10-11 19:40:18 | TaxiA | Hired | Low |
2015-10-11 19:40:18 | 2015-10-11 19:50:01 | TaxiA | Available | Low |
With this merged dataset, we can begin to analyze whether taxis drive more slowly when they are available, and whether they spend more time at high, medium or low speeds when they have passengers.
MapReduce is a natural way to implement the merge logic, because the work can be distributed by itemID (for example, by taxi). Furthermore, we can take advantage of the efficient distributed sort in MapReduce. Here, I have implemented a tertiary sort.
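
To illustrate the idea of sorting on a composite key while grouping on itemID alone, here is a minimal local simulation in Python. It is a sketch under assumptions, not this repository's code: the `mapped` pairs and the `shuffle` function are hypothetical, and timestamps are kept as strings because this fixed format sorts chronologically. In Hadoop itself, this pattern is typically wired up with `Job.setPartitionerClass`, `Job.setSortComparatorClass` and `Job.setGroupingComparatorClass`; the classes used here may differ.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output: ((itemID, datetime), value) pairs in arbitrary order.
mapped = [
    (("TaxiB", "2015-10-11 19:11:00"), "state=Off-Shift"),
    (("TaxiA", "2015-10-11 19:18:19"), "state=Hired"),
    (("TaxiA", "2015-10-11 19:02:19"), "state=Hired"),
    (("TaxiA", "2015-10-11 19:20:19"), "attribute=Medium"),
]

def shuffle(pairs):
    """Sort on the full composite key, then group on itemID only."""
    ordered = sorted(pairs, key=itemgetter(0))
    for item_id, group in groupby(ordered, key=lambda kv: kv[0][0]):
        # Each reducer call sees one itemID with its values already in time order.
        yield item_id, [(key[1], value) for key, value in group]

grouped = dict(shuffle(mapped))
```

Each entry of `grouped` plays the role of one reducer call: a single itemID together with its events in datetime order.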
The basic logic is as follows:
- Group by itemID
- For each record, generate two records: one for the start datetime and one for the end datetime.
- Sort by datetime within the itemID partitions.
- For each sorted itemID group, keep two variables, currentState and currentAttribute.
- Iterate over the sorted list of events, updating the variables and writing the merged records to disk one at a time.
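
The steps above can be sketched for a single itemID in plain Python. This is an illustrative sketch rather than the Hadoop code in this repository: the function `merge_item` and its tuple layout are hypothetical, timestamps are compared as strings (which works for this fixed format), and end events are assumed to sort before start events that share the same timestamp.

```python
def merge_item(item_id, states, attributes):
    """Merge (start, end, value) records of one itemID into joined intervals."""
    events = []
    for kind, records in (("state", states), ("attribute", attributes)):
        for start, end, value in records:
            events.append((start, 1, kind, value))  # start event sets the value
            events.append((end, 0, kind, None))     # end event clears the value
    # Sort by datetime; at equal datetimes, end events (0) precede start events (1).
    events.sort(key=lambda e: (e[0], e[1]))

    current_state = None
    current_attribute = None
    previous_time = None
    merged = []
    for time, _, kind, value in events:
        # Emit the interval that just finished if both variables were set.
        if (previous_time is not None and previous_time < time
                and current_state is not None
                and current_attribute is not None):
            merged.append((previous_time, time, item_id,
                           current_state, current_attribute))
        if kind == "state":
            current_state = value
        else:
            current_attribute = value
        previous_time = time
    return merged

taxi_a_states = [
    ("2015-10-11 19:02:19", "2015-10-11 19:15:12", "Hired"),
    ("2015-10-11 19:15:12", "2015-10-11 19:18:19", "Available"),
    ("2015-10-11 19:18:19", "2015-10-11 19:40:18", "Hired"),
    ("2015-10-11 19:40:18", "2015-10-11 20:05:01", "Available"),
]
taxi_a_speeds = [
    ("2015-10-11 19:20:19", "2015-10-11 19:21:22", "Medium"),
    ("2015-10-11 19:21:22", "2015-10-11 19:25:38", "High"),
    ("2015-10-11 19:25:38", "2015-10-11 19:30:01", "Medium"),
    ("2015-10-11 19:30:01", "2015-10-11 19:50:01", "Low"),
]
rows = merge_item("TaxiA", taxi_a_states, taxi_a_speeds)
```

Run on the TaxiA records from the tables above, the sketch yields five intervals and splits the Low speed interval at 19:40:18, where TaxiA changes from Hired to Available.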
All the best! 😃