
spark-sorted

spark-sorted is a library that aims to make non-reduce-type operations on very large groups possible in Spark, including support for processing the values of a group in order. To do so it relies on Spark's new sort-based shuffle, and it never materializes the group for a given key: instead, the group is represented by consecutive rows within a partition, which are processed with a map-like (iterator-based, streaming) operation.

GroupSorted

GroupSorted is a trait for key-value RDDs that satisfy the following criteria:

  • all rows (key, value pairs) for a given key are consecutive and in the same partition
  • the values can optionally be ordered per key

GroupSorted can be created from an RDD[(K, V)] using the rdd.groupSort operator. To enable the groupSort operator, add the following import:

import com.tresata.spark.sorted.PairRDDFunctions._
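
With the import in place, a GroupSorted RDD can be created by providing an ordering for the values. A minimal sketch, assuming sc is an available SparkContext:

val rdd = sc.parallelize(Seq(("a", 2), ("a", 1), ("b", 3)))
// after groupSort, all rows for a key are consecutive within a partition,
// and the values for each key are sorted ascending
val grouped = rdd.groupSort(Ordering[Int])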

GroupSorted adds two methods to a key-value RDD to process all values for a given key: mapStreamByKey and foldLeftByKey (a sketch of foldLeftByKey follows the example below).

For example, say you have a dataset of stock prices, represented as follows:

type Ticker = String
case class Quote(time: Int, price: Double)
val prices: RDD[(Ticker, Quote)] = ...

Assuming you have a function that calculates exponential moving averages (EMAs), you could produce a time series of EMAs for every ticker as follows:

val emas: Iterator[Double] => Iterator[Double] = ...
prices.groupSort(Ordering.by[Quote, Int](_.time)).mapStreamByKey{ iter => emas(iter.map(_.price)) }
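
The emas function is left unspecified above. As a minimal sketch (not part of the library), a standard exponential moving average with a hypothetical smoothing factor alpha could be written like this:

val alpha = 0.3 // hypothetical smoothing factor, chosen for illustration
val emas: Iterator[Double] => Iterator[Double] = { priceIter =>
  // seed the EMA with the first price, then blend each new price into the running average
  priceIter.scanLeft(Option.empty[Double]) {
    case (None, price)       => Some(price)
    case (Some(prev), price) => Some(alpha * price + (1.0 - alpha) * prev)
  }.collect { case Some(ema) => ema } // drop the initial None seed
}

foldLeftByKey folds the ordered values for each key into a single result per key. As a sketch, assuming a signature of foldLeftByKey(zero)(f) with f: (W, V) => W that returns an RDD[(K, W)], the maximum price per ticker could be computed as follows:

val maxPrices: RDD[(Ticker, Double)] = prices
  .groupSort(Ordering.by[Quote, Int](_.time))
  .foldLeftByKey(Double.MinValue){ (max, quote) => math.max(max, quote.price) }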

A Java API is available in the package com.tresata.spark.sorted.api.java. Please see the unit tests for usage examples.

This library is currently in alpha stage.

Have fun! Team @ Tresata
