
I need to process data from a set of streams, applying the same processing to each stream independently of the others.

I've already looked at frameworks like Storm, but it appears to allow the processing of static streams only (e.g. tweets from Twitter), while I need to process the data of each user separately.

A simple example of what I mean would be a system where each user can track his GPS location and see statistics such as average velocity, acceleration, burnt calories and so on in real time. Of course, each user would have his own stream(s), and the system should process each user's stream separately, as if every user had his own dedicated topology processing his data.

Is there a way to achieve this with a framework like Storm, Spark Streaming or Samza?

It would be even better if Python were supported, since I already have a lot of code I'd like to reuse.

Thank you very much for your help.


3 Answers


Using Storm, you can group data with the fields-grouping connection pattern if you have a user ID in your tuples. This ensures that data is partitioned by user ID, so you get logical substreams. Your code only needs to be able to handle multiple groups/substreams, because a single bolt instance receives multiple groups for processing. But Storm supports your use case for sure. It can also run Python code. A Java sketch of this pattern follows below.
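As a rough sketch of what this could look like in Java (using the pre-1.0 backtype.storm package names; the GpsSpout parameter and the tuple fields are assumptions, not part of the question), the bolt keeps its per-user state in a map because a single instance may receive several users, and fieldsGrouping routes all tuples of one user to the same instance:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Keeps an independent running total per user; one bolt instance may be
// responsible for many users, so all state is looked up by user ID.
public class SpeedStatsBolt extends BaseBasicBolt {
    private final Map<String, Double> distancePerUser = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String userId = tuple.getStringByField("user-id");
        double distance = tuple.getDoubleByField("distance");
        double total = distancePerUser.getOrDefault(userId, 0.0) + distance;
        distancePerUser.put(userId, total);
        collector.emit(new Values(userId, total));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user-id", "total-distance"));
    }

    // Wiring: fieldsGrouping on "user-id" partitions the spout's output so
    // that every tuple of a given user reaches the same bolt instance.
    public static TopologyBuilder buildTopology(IRichSpout gpsSpout) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("gps-spout", gpsSpout, 1);
        builder.setBolt("stats", new SpeedStatsBolt(), 4)
               .fieldsGrouping("gps-spout", new Fields("user-id"));
        return builder;
    }
}
```

With a parallelism of 4, each of the four bolt instances handles a disjoint subset of users, which is the "logical substreams" effect described above.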

Matthias J. Sax
  • Thank you very much for your answer. Would it be possible to do some processing on a sliding window of a user's data? Can I be sure that data is processed ordered by timestamp (IIRC ordered processing can be guaranteed only using Trident, which does not appear to support Python)? – Marco DallaG Jun 17 '15 at 10:19
  • I personally use Storm with Java only. For Java, you can do sliding window processing, but you need to implement all the logic yourself. You could also use Trident; however, you cannot correlate tuples from different batches as far as I know, so it's not real sliding windows. Storm does not give any ordering guarantees either (you might fall back on code in my GitHub repo). I guess that using Python you would need to implement ordering and sliding windows in your own code. (Spark Streaming and Flink Streaming do not support ordered processing either -- I don't know about Samza) – Matthias J. Sax Jun 17 '15 at 10:50

In Samza, similar to Storm, one would partition the individual streams on some user ID. This guarantees that the same processor sees all the events for a particular user (as well as for the other user IDs that the partition function [a hash, for instance] assigns to that processor). Your description sounds like something that would more likely run on the client's system rather than being a server-side operation, however. A sketch of such a task is shown below.

Non-JVM language support has been proposed for Samza, but not yet implemented.
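For completeness, a minimal Java sketch of such a task (the message format and the "distance" field are assumptions; the actual partitioning by user ID would be done by the producer and the job configuration, not in this class):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// One task instance consumes one partition of the input stream. If messages
// are keyed by user ID, every event of a given user ends up in the same
// task, so per-user state can live in a simple in-memory map.
public class GpsStatsTask implements StreamTask {
    private final Map<String, Double> distancePerUser = new HashMap<>();

    @Override
    @SuppressWarnings("unchecked")
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String userId = (String) envelope.getKey();
        Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
        double distance = ((Number) event.get("distance")).doubleValue();

        double total = distancePerUser.getOrDefault(userId, 0.0) + distance;
        distancePerUser.put(userId, total);
        // collector.send(...) with an OutgoingMessageEnvelope would publish
        // the updated statistics to an output stream.
    }
}
```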

Jakob Homan
  • Thank you for your answer. We do the processing server-side for a variety of reasons; client-side processing is not an option in our case. Samza would indeed be great, if only it supported multilang... If multilang support ever lands in Samza, I'll surely consider using it. Thank you very much again. – Marco DallaG Jul 15 '15 at 12:11

You can use WSO2 Stream Processor to achieve this. You can partition the input stream by user name and process the events pertaining to each user separately. The processing logic has to be written in Siddhi QL, which is an SQL-like language.

WSO2 SP also has a Python wrapper that allows you to perform administrative tasks such as submitting and editing jobs, but you can't write the processing logic in Python. A sketch of the partitioning idea follows below.
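As a rough sketch of the partitioning idea (assuming the Siddhi 4.x Java API under the org.wso2.siddhi packages; the stream definition and attribute names are made up for illustration), the `partition with (userName of GpsStream)` block gives every user an isolated instance of the enclosed query, including its own window:

```java
import org.wso2.siddhi.core.SiddhiAppRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.stream.input.InputHandler;

public class PerUserSpeedApp {
    public static void main(String[] args) throws InterruptedException {
        // Siddhi QL: the partition block is instantiated once per userName,
        // so each user gets an independent 1-minute average-speed window.
        String app =
            "define stream GpsStream (userName string, speed double);\n" +
            "partition with (userName of GpsStream)\n" +
            "begin\n" +
            "    from GpsStream#window.time(1 min)\n" +
            "    select userName, avg(speed) as avgSpeed\n" +
            "    insert into AvgSpeedStream;\n" +
            "end;";

        SiddhiManager manager = new SiddhiManager();
        SiddhiAppRuntime runtime = manager.createSiddhiAppRuntime(app);
        runtime.start();

        // Feed a couple of sample events; in WSO2 SP the input would instead
        // come from a configured source such as Kafka or HTTP.
        InputHandler input = runtime.getInputHandler("GpsStream");
        input.send(new Object[]{"alice", 4.2});
        input.send(new Object[]{"bob", 7.5});

        runtime.shutdown();
        manager.shutdown();
    }
}
```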

Sajith Eshan