
Halflife - in-process channels and streams

Halflife is an evolution and extension of our recent work and ideas on Reactor and Meltdown.

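Main features include named and anonymous stream topologies, Matched Lazy Streams, independent per-entity streams, atomic state operations, persistent-collection based handlers, and uni- and bi-directional Channels.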

Why Halflife?

Current state-of-the-art Stream Processing solutions bring enormous development and maintenance overhead and can mostly serve as a replacement for Extract-Load-Transform / Batch Processing jobs (think Hadoop).

Halflife is a lightweight Stream Processing system designed with performance and with simplicity of development and support in mind. You can scale it across one or many machines and build complex processing topologies that are as easy to test as normal operations on Lists. After all, Streams are nothing but never-ending sequences.

Halflife can also be used to make asynchronous and concurrent programs simpler to construct and easier to maintain. For example, you could use Channels with Matched Streams to build a lightweight websocket server implementation. Another example is asynchronous handlers in your HTTP server, which would work with any driver, whether or not it offers an asynchronous API. Other examples include IoT, Machine Learning, Business Analytics and more.

The problem with Machine Learning on Streams is that most algorithms assume some kind of state: for example, Classification assumes that a trained model is available. Keeping this state in a Database or Cache brings additional deserialization overhead, and keeping it in memory can be hard if the system doesn't give you partitioning guarantees (that is, that requests for the same logical entity will end up on the same node).
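
The partitioning guarantee can be illustrated with a minimal sketch. It assumes a fixed number of nodes and routes by a hash of the entity key. `Partitioner` and `nodeFor` are hypothetical names used only for illustration and are not part of Halflife.

```java
// Hypothetical illustration, not part of the Halflife API.
final class Partitioner {
    private final int nodeCount;

    Partitioner(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        this.nodeCount = nodeCount;
    }

    // The same entity key always maps to the same node index,
    // so per-entity state can stay in that node's memory.
    int nodeFor(String entityKey) {
        return Math.floorMod(entityKey.hashCode(), nodeCount);
    }
}
```

With such a scheme a trained model for one entity only has to live on the node its key hashes to. Changing the node count remaps keys, which this sketch does not address.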

The Halflife approach is simple: develop on one box, for one box, break processing into logical steps for distribution, and scale up when needed. Because of the nature and layout of streams and data, you will be able to scale it up.

Terminology

Stream is a term coming from Reactive Programming. From the consumer's perspective a Stream looks a little like a collection, with the only difference being that a collection is a finished set of events, whereas a stream is an infinite collection. If you apply a map operation to the stream, the map function will see each and every element coming into the stream.

Publisher (generator or producer in some terminologies) is a function or entity that publishes items to the stream. Consumer (or listener, in some terminologies) is a function that is subscribed to the stream and asynchronously receives the items that the publisher publishes to the stream.

In many cases, a function can simultaneously be a consumer and a producer. For example, map subscribes to the events coming into the stream and publishes the modified events back to the stream.

Topology is a stream with a chain of publishers and consumers attached to it. For example, you can have a stream that maps items, incrementing each one of them, followed by a filter that picks only the even incremented numbers. Of course, in real-life applications topologies are much more complex.

Upstream / downstream describe the order of functions in your topologies. For example, if you have a stream that maps items, incrementing each one of them, it serves as the upstream for the following filter function, which consumes events from it. The filter serves as the downstream in this example.
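
To make the terms concrete, here is a minimal synchronous sketch of a topology that increments items and then keeps the even ones. `MiniStream` is a hypothetical stand-in written with the JDK only. It is not the Halflife `Stream` class, and it delivers items on the calling thread rather than asynchronously.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, synchronous stand-in for a stream; not the Halflife API.
final class MiniStream<T> {
    private final List<Consumer<T>> consumers = new ArrayList<>();

    // A consumer subscribes and sees every item published afterwards.
    void consume(Consumer<T> consumer) {
        consumers.add(consumer);
    }

    // A publisher pushes an item to all subscribed consumers.
    void publish(T item) {
        for (Consumer<T> c : consumers) c.accept(item);
    }

    // map is a consumer of this stream and a publisher to the downstream one.
    <R> MiniStream<R> map(Function<T, R> fn) {
        MiniStream<R> downstream = new MiniStream<>();
        consume(item -> downstream.publish(fn.apply(item)));
        return downstream;
    }

    // filter forwards only the items that pass the predicate.
    MiniStream<T> filter(Predicate<T> predicate) {
        MiniStream<T> downstream = new MiniStream<>();
        consume(item -> { if (predicate.test(item)) downstream.publish(item); });
        return downstream;
    }

    public static void main(String[] args) {
        MiniStream<Integer> source = new MiniStream<>();
        source.map(i -> i + 1)           // upstream of the filter
              .filter(i -> i % 2 == 0)   // downstream of the map
              .consume(System.out::println);
        source.publish(1); // prints 2
        source.publish(2); // 3 is odd, nothing is printed
        source.publish(3); // prints 4
    }
}
```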

Named and anonymous streams are just ways of describing the wiring between the parts of a topology. In a named stream, each part has to know where to subscribe and where to publish the resulting events. In anonymous topologies the connection between stream parts is implicit: the parts are wired together by the system with unique, randomly generated keys.

Here is a little more detail about each of the features:

Named stream topologies

These are the "traditional" key/value subscriptions: you can subscribe to any named Stream. As soon as a message with a certain Key arrives at the stream, it is matched and passed to all subscribed handlers.

Stream<Integer> intStream = new Stream<>();

intStream.map(Key.wrap("key1"), // subscribe to
              Key.wrap("key2"), // downstream result to
              (i) -> i + 1);    // mapper function
              
intStream.consume(Key.wrap("key2"), // subscribe to result of mapper 
                  (i) -> System.out.println(i));

// send a couple of payloads
intStream.notify(Key.wrap("key1"), 1);
// => 2
intStream.notify(Key.wrap("key1"), 2);
// => 3
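
One way to picture key-based dispatch is a map from key to a list of handlers. The sketch below is an assumption about how such wiring could look, written with the JDK only. `KeyedRegistry` is a hypothetical name, keys are plain strings, and delivery is synchronous, none of which is claimed to match the Halflife implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of key-based dispatch; not the Halflife implementation.
final class KeyedRegistry<T> {
    private final Map<String, List<Consumer<T>>> handlers = new HashMap<>();

    // Subscribe a handler to a named key.
    void consume(String key, Consumer<T> handler) {
        handlers.computeIfAbsent(key, k -> new ArrayList<>()).add(handler);
    }

    // Subscribe to `from` and republish the mapped value under `to`.
    void map(String from, String to, Function<T, T> fn) {
        consume(from, value -> notify(to, fn.apply(value)));
    }

    // Deliver a value to every handler subscribed to the key.
    void notify(String key, T value) {
        for (Consumer<T> h : handlers.getOrDefault(key, new ArrayList<>())) {
            h.accept(value);
        }
    }
}
```

A value sent to a key with no subscribers is simply dropped in this sketch.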

Anonymous stream topologies

Anonymous streams are chains of decoupled async stream operations that represent a single logical operation. An Anonymous Stream is subscribed to a particular named stream and creates all the additional wiring between its handlers automatically.

Stream<Integer> stream = new Stream<>();

stream.anonymous(Key.wrap("source"))        // create an anonymous stream subscribed to "source"
      .map(i -> i + 1)                      // add 1 to each incoming value
      .map(i -> i * 2)                      // multiply all incoming values by 2
      .consume(i -> System.out.println(i)); // output every incoming value to stdout

stream.notify(Key.wrap("source"), 1);
// => 4

stream.notify(Key.wrap("source"), 2);
// => 6
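
The automatic wiring can be pictured as each step publishing to a freshly generated key that only the next step subscribes to. The following is a sketch under that assumption, using the JDK only. `AnonymousWiring`, `Step`, `subscribe` and `send` are hypothetical names and not Halflife classes or methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: anonymous steps wired through generated keys.
final class AnonymousWiring {
    private final Map<String, List<Consumer<Integer>>> handlers = new HashMap<>();

    void subscribe(String key, Consumer<Integer> handler) {
        handlers.computeIfAbsent(key, k -> new ArrayList<>()).add(handler);
    }

    void send(String key, Integer value) {
        for (Consumer<Integer> h : handlers.getOrDefault(key, new ArrayList<>())) {
            h.accept(value);
        }
    }

    // Each step in the chain listens on `upstreamKey`.
    final class Step {
        private final String upstreamKey;

        Step(String upstreamKey) {
            this.upstreamKey = upstreamKey;
        }

        // Wire this step to the next one through a key nobody else knows.
        Step map(Function<Integer, Integer> fn) {
            String generated = UUID.randomUUID().toString();
            subscribe(upstreamKey, v -> send(generated, fn.apply(v)));
            return new Step(generated);
        }

        void consume(Consumer<Integer> handler) {
            subscribe(upstreamKey, handler);
        }
    }

    Step anonymous(String sourceKey) {
        return new Step(sourceKey);
    }
}
```

Only the source key is visible to the caller. The intermediate keys exist but never have to be named.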

The main difference between Java streams and Halflife streams is the dispatch flexibility and the combination of multiple streaming paradigms. You can pick the underlying dispatcher depending on, for example, whether your processing pipeline consists of short-lived or long-lived functions.
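
The idea of a pluggable dispatcher can be sketched with the JDK's `java.util.concurrent.Executor` interface. This is only an analogy: `Dispatching`, `SAME_THREAD` and `dispatchOn` are hypothetical names, and how Halflife actually configures its dispatchers is not shown here.

```java
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical sketch: the dispatcher decides where a handler runs.
final class Dispatching {
    // Runs the handler on the publishing thread; suits short-lived functions.
    static final Executor SAME_THREAD = Runnable::run;

    // Wrap a handler so that each delivery goes through the chosen dispatcher.
    static <T> Consumer<T> dispatchOn(Executor dispatcher, Consumer<T> handler) {
        return value -> dispatcher.execute(() -> handler.accept(value));
    }
}
```

For long-lived functions the same wrapper could be given a thread pool, for example one created with `Executors.newFixedThreadPool`, so that a slow handler does not block the publisher.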

Matched Lazy Streams

Fully flexible key subscription isn't supported, because matching every incoming key against every subscription would take linear time. It is still necessary to be able to subscribe to keys based on some logical function, and a Matched Lazy Stream is the way to do it: as soon as the first key/value pair for a key is sent, all Matchers are queried for subscription. Handlers that match are subscribed to the stream, and all subsequent calls have O(1) lookup time on average.

stream.matched(key -> key.getPart(0).equals("source")) // create a stream for every key whose first part is "source"
      .map(i -> i + 1)                                 // add 1 to each incoming value
      .map(i -> i * 2)                                 // multiply all incoming values by 2
      .consume(i -> System.out.println(i));            // output every incoming value to stdout

stream.notify(Key.wrap("source", "first"), 1);
// => 4

stream.notify(Key.wrap("source", "second"), 2);
// => 6

As you can see, streams are created per entity, which opens up a lot of opportunities for independent stream processing, entity matching and storing per-entity state.
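
The lazy matching step can be sketched as a cache in front of the matcher list: the list is scanned once per new key, and the result is stored in a hash map. `LazyMatcher` and its members are hypothetical names, keys are plain strings, and the sketch ignores matchers registered after a key was first seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of lazy matching; not the Halflife implementation.
final class LazyMatcher<T> {
    private static final class Matcher<T> {
        final Predicate<String> predicate;
        final Consumer<T> handler;

        Matcher(Predicate<String> predicate, Consumer<T> handler) {
            this.predicate = predicate;
            this.handler = handler;
        }
    }

    private final List<Matcher<T>> matchers = new ArrayList<>();
    private final Map<String, List<Consumer<T>>> subscribed = new HashMap<>();
    int matcherScans = 0; // how many times the matcher list was scanned

    void matched(Predicate<String> predicate, Consumer<T> handler) {
        matchers.add(new Matcher<>(predicate, handler));
    }

    void notify(String key, T value) {
        List<Consumer<T>> handlers = subscribed.get(key);
        if (handlers == null) {
            // First time this key is seen: query every matcher once (linear).
            matcherScans++;
            handlers = new ArrayList<>();
            for (Matcher<T> m : matchers) {
                if (m.predicate.test(key)) handlers.add(m.handler);
            }
            subscribed.put(key, handlers);
        }
        // Later deliveries for the same key are a plain hash lookup.
        for (Consumer<T> h : handlers) h.accept(value);
    }
}
```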

Independent Per-Entity Streams

There is no more need to manage streams for multiple logical entities in the same handler; we'll do that for you. This feature is used together with Matched Lazy Streams, so every entity stream has its own in-memory state.
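
A minimal way to picture per-entity state is a map from entity key to a handler instance, where each handler is created on first use and keeps its state in a closure. `PerEntity` and `handle` are hypothetical names, and the sketch is single-threaded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: one handler instance, and so one state, per entity key.
final class PerEntity<T, R> {
    private final Supplier<Function<T, R>> handlerFactory;
    private final Map<String, Function<T, R>> handlers = new HashMap<>();

    PerEntity(Supplier<Function<T, R>> handlerFactory) {
        this.handlerFactory = handlerFactory;
    }

    // The first event for an entity creates its handler; later events reuse it.
    R handle(String entityKey, T event) {
        return handlers.computeIfAbsent(entityKey, k -> handlerFactory.get()).apply(event);
    }

    public static void main(String[] args) {
        // Each entity gets its own running sum, kept in the handler's closure.
        PerEntity<Integer, Integer> sums = new PerEntity<Integer, Integer>(() -> {
            int[] total = {0};
            return value -> total[0] += value;
        });
        System.out.println(sums.handle("sensor-a", 1));  // 1
        System.out.println(sums.handle("sensor-a", 2));  // 3
        System.out.println(sums.handle("sensor-b", 10)); // 10
    }
}
```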

Atomic State operations

Since entity streams are independent and locks are expensive, it's important to keep the operations lock-free. For that you're provided with an Atom<T> which will ensure lock-free atomic updates to the state.

And since the state is split per entity, you're able to save and restore it between restarts of your processing topologies.

Stream<Integer> intStream = new Stream<>(firehose);

intStream.map(Key.wrap("key1"), Key.wrap("key2"), (i) -> i + 1);
intStream.map(Key.wrap("key2"), Key.wrap("key3"),
              (Atom<Integer> state) -> {               // Use a supplier to capture state in closure
                return (i) -> {                        // Return a function, just as a "regular" map would do
                  return state.swap(old -> old + i);   // Access internal state
                };
              },
              0);                                      // Pass the initial value for state
intStream.consume(Key.wrap("key3"), (i) -> System.out.println(i));

intStream.notify(Key.wrap("key1"), 1);
// => 2
intStream.notify(Key.wrap("key1"), 2);
// => 5 
intStream.notify(Key.wrap("key1"), 3);
// => 9
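
For readers unfamiliar with the idea, a lock-free atom can be sketched on top of the JDK's `AtomicReference`. `SimpleAtom` and `deref` are hypothetical names. Halflife's own `Atom<T>` may be implemented differently, and the sketch assumes that `swap` returns the new value, which is what the outputs above suggest.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an atom; Halflife's own Atom<T> may differ.
final class SimpleAtom<T> {
    private final AtomicReference<T> ref;

    SimpleAtom(T initial) {
        this.ref = new AtomicReference<>(initial);
    }

    // Read the current value.
    T deref() {
        return ref.get();
    }

    // Apply fn to the current value and store the result without locking.
    // Under contention fn may be re-applied, so it should be side-effect free.
    T swap(UnaryOperator<T> fn) {
        return ref.updateAndGet(fn);
    }
}
```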

Persistent-collection based handlers

Window, batch and streaming grouping operations are saved in persistent collections. This ensures a small, no-copy memory footprint for all the shared state. Every collection used is Persistent, and its immutable internal entries are shared between instances.
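
Structural sharing is easiest to see on an immutable linked list, where adding an element creates one new node and reuses the rest. `PList` is a hypothetical class written for illustration. It is not the collection library Halflife uses.

```java
// Hypothetical sketch of structural sharing with an immutable linked list.
final class PList<T> {
    final T head;
    final PList<T> tail; // null marks the end of the list

    private PList(T head, PList<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    static <T> PList<T> of(T value) {
        return new PList<>(value, null);
    }

    // Returns a new list; the existing nodes are reused, not copied.
    PList<T> prepend(T value) {
        return new PList<>(value, this);
    }

    int size() {
        int n = 0;
        for (PList<T> p = this; p != null; p = p.tail) n++;
        return n;
    }
}
```

Two lists built from the same base share that base in memory, and neither can change what the other sees because nodes are never mutated.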

Uni- and Bi-directional Channels

A Channel is much like a queue: you publish values to it and pull them back out. This feature is particularly useful in scenarios where you need neither a subscription key nor a publish key, and only need async or sync uni- or bi-directional communication.

Stream<Integer> stream = new Stream<>(firehose);
Channel<Integer> chan = stream.channel();

chan.tell(1);
chan.tell(2);

chan.get();
// => 1

chan.get();
// => 2

chan.get();
// => null
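
The queue-like behaviour shown above can be approximated with a JDK queue. `QueueChannel` is a hypothetical class that only mirrors the `tell` and `get` calls from the example. It says nothing about how Halflife channels are backed.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of a channel backed by a queue; not the Halflife implementation.
final class QueueChannel<T> {
    private final Queue<T> queue = new ConcurrentLinkedQueue<>();

    // Publish a value into the channel.
    void tell(T value) {
        queue.offer(value);
    }

    // Take the oldest value, or null when the channel is empty.
    T get() {
        return queue.poll();
    }
}
```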

Channels can be consumed as streams, too. It is important to remember, though, that in order to avoid unnecessary state accumulation a Channel can be used either as a stream or as a channel, but not both:

Stream<Integer> stream = new Stream<>(firehose);
Channel<Integer> chan = stream.channel();

chan.stream()
    .map(i -> i + 1)
    .consume(i -> System.out.println(i));

chan.tell(1);
// => 2

chan.get();
// throws RuntimeException, since channel is already drained by the stream
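
The either-or rule can be sketched as a channel that stops buffering once a consumer is attached. `ExclusiveChannel` is a hypothetical, single-threaded class. Its `consume` method stands in for the `stream()` call above, and the exception type is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical sketch: a channel that is either pulled from or drained by a consumer.
final class ExclusiveChannel<T> {
    private final Queue<T> buffer = new ArrayDeque<>();
    private Consumer<T> drain; // set once the channel is used as a stream

    // Switch to stream mode: buffered and future values go to the consumer.
    void consume(Consumer<T> consumer) {
        drain = consumer;
        while (!buffer.isEmpty()) consumer.accept(buffer.poll());
    }

    void tell(T value) {
        if (drain != null) drain.accept(value); // nothing accumulates in stream mode
        else buffer.add(value);
    }

    T get() {
        if (drain != null) {
            throw new IllegalStateException("channel is already drained by a stream");
        }
        return buffer.poll();
    }
}
```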

Channels can also be split into publishing and consuming channels for type safety, if you need to ensure that the consuming part can't publish messages and the publishing part can't accidentally consume them:

Stream<Integer> stream = new Stream<>(firehose);
Channel<Integer> chan = stream.channel();

PublishingChannel<Integer> publishingChannel = chan.publishingChannel();
ConsumingChannel<Integer> consumingChannel = chan.consumingChannel();

publishingChannel.tell(1);
publishingChannel.tell(2);

consumingChannel.get();
// => 1

consumingChannel.get();
// => 2
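
The type-safety argument can be illustrated with two narrow interfaces over one queue. `Publishing`, `Consuming` and `SplitChannel` are hypothetical names chosen for this sketch and are not the Halflife types.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: two narrow views over one queue.
interface Publishing<T> {
    void tell(T value);
}

interface Consuming<T> {
    T get();
}

final class SplitChannel<T> {
    private final Queue<T> queue = new ArrayDeque<>();

    // The publishing view can only add values.
    Publishing<T> publishingChannel() {
        return queue::add;
    }

    // The consuming view can only take values; it returns null when empty.
    Consuming<T> consumingChannel() {
        return queue::poll;
    }
}
```

Code that receives only a `Publishing<T>` has no `get` method to call, so misuse is rejected by the compiler rather than found at runtime.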

License

Copyright © 2014 Alex Petrov. Dual-licensed under the Eclipse Public License or the Apache Public License 2.0.
