
Gurus - I am new to Apache Beam and am trying to implement what seems to be a pretty straightforward use case: I have stock data and need to find the rolling average price of a stock over the past 10 transactions.

Now, since there is no fixed time duration within which 10 transactions occur (sometimes it may be a few milliseconds, other times several seconds), I don't think I can use time-based windowing. I have two questions:

  1. Is this a valid use case for Beam or am I missing a point here?
  2. Is there a reasonably simple/legitimate/non-hack way to write a windowing function/class (in the Python SDK) that can window data based on the number of records?

I have seen recommendations to fake the timestamp data on the records so that each arriving record appears to have been created, say, one second apart, but I see two problems with this:

a. This is truly a hack solution, which seems like a misfit for something like Beam that is supposed to be so powerful and elegantly architected.

b. What is the point of using a high-performance (serverless) Beam pipeline if you stifle the performance up front by running a program to sequentially add the fake timestamps?

I wonder if windowing within Beam may be a more elegant solution.
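For reference, independent of any pipeline framework, the rolling computation itself is just a bounded buffer. A minimal pure-Python sketch (the function name and `window` parameter are mine, not from any SDK):

```python
from collections import deque

def rolling_averages(prices, window=10):
    """Yield the average of the last `window` prices, emitting one
    value per incoming price once the buffer is full."""
    buf = deque(maxlen=window)  # oldest price falls off automatically
    for p in prices:
        buf.append(p)
        if len(buf) == window:
            yield sum(buf) / window
```

This is essentially what a count-based window must do inside a pipeline: keep the last N elements per key and emit an aggregate each time a new element arrives.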

  • Would this be a batch or streaming job? If it's the former, I think the custom-timestamp approach would be easier than writing a custom windowing function: the default Beam sliding windows and sessions only accept time (duration/gap) specifications. If it's streaming, take into account that Python SDK support is experimental as of now, and you might need to use the Java one depending on the implemented use case. I'm thinking of a global window with a repeated, data-driven, discarding-panes trigger that fires after at least 10 elements, but it might not guarantee exactly 10 transactions. – Guillem Xercavins Jun 06 '18 at 10:19
  • Hi Guillem - I am trying to stay away from distinguishing between whether the data received is batch or streaming, and that is for two reasons: a. it could be either, and b. isn't it one of Beam's key value propositions that, batch or streaming, you should be able to process both consistently? – hpep Jun 10 '18 at 05:57
  • 1
    I would suggest using the [`GroupIntoBatches`](https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html) transform that will buffer values and output them when it receives 10 (batchSize) elements per key and window. It can work with batch or streaming jobs. Assign the same key to all elements if you just want a global average. The downside is that it does not seem to be available with the Python SDK. You can find a recent example [here](https://stackoverflow.com/questions/50817107/). – Guillem Xercavins Jun 15 '18 at 11:37

0 Answers