
I have a datastream (from a CSV file) that contains a string attribute and a double value in every row. I use keyBy() in Flink to group the rows by a specific attribute (country), so I have a separate group of tuples (stratum) for every distinct country. For each stratum I compute the mean and the variance of the values and return the quantity mean/variance (μ/σ²). As my algorithm runs, I need to sum the most recent value of this quantity γ = mean/variance from each stratum, i.e. the sum over all strata of the latest γ each one has produced. Can anyone help me solve this, perhaps with a specific Flink operator?

T.D.

1 Answer


When you perform computations on streams, you never know if or when more data may arrive, so the typical approach is to treat every event as though it might be the last: produce a result for every event, and let that result be made obsolete, or updated, by the result produced in response to the next event.

Unless you are doing windowing, in which case each window can be treated as a finite batch.

In your case, since the input is a CSV file, why not treat this as a batch computation?

But regardless of whether you want batch or streaming, I would suggest looking at Flink's Table and SQL APIs, which have support for computing mean and variance as built-in aggregate functions. You can use the filesystem connector with the old CSV format.
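For example, here is a minimal sketch of the SQL route. The table and column names (`input`, `country`, `val`) and the file path are assumptions, and it uses the `executeSql` style introduced in Flink 1.11; on 1.10 the connector properties look slightly different:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class GammaBatchJob {
    public static void main(String[] args) {
        // Batch mode: the CSV file is a bounded input.
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Register the CSV file as a table via the filesystem connector.
        tEnv.executeSql(
            "CREATE TABLE input (" +
            "  country STRING," +
            "  val DOUBLE" +
            ") WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 'file:///path/to/input.csv'," +
            "  'format' = 'csv'" +
            ")");

        // gamma = mean / variance per stratum; summing the per-stratum gammas
        // is then just another aggregation over that result.
        tEnv.executeSql(
            "SELECT SUM(gamma) AS total FROM (" +
            "  SELECT country, AVG(val) / VAR_POP(val) AS gamma" +
            "  FROM input GROUP BY country" +
            ") AS per_country").print();
    }
}
```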

Could you do this with the DataStream API? Yes, but ...

If you are doing this computation in windows, then yes, this is straightforward. Just implement your business logic in a ProcessWindowFunction. Its process method will be passed an Iterable containing all of the events assigned to the window, and from there you can compute the mean, variance, etc.
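A sketch of what that could look like, assuming the events are (country, value) pairs modeled as `Tuple2<String, Double>` (the class name and types are illustrative, not from the question):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Emits (country, mean / variance) once per key and window.
public class GammaPerStratum
        extends ProcessWindowFunction<Tuple2<String, Double>, Tuple2<String, Double>, String, TimeWindow> {

    @Override
    public void process(String country,
                        Context ctx,
                        Iterable<Tuple2<String, Double>> events,
                        Collector<Tuple2<String, Double>> out) {
        long n = 0;
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (Tuple2<String, Double> e : events) {
            n++;
            sum += e.f1;
            sumOfSquares += e.f1 * e.f1;
        }
        double mean = sum / n;
        double variance = sumOfSquares / n - mean * mean;  // population variance
        out.collect(Tuple2.of(country, mean / variance));
    }
}
```

You would apply it after keying and windowing the stream, e.g. (the window choice here is arbitrary):

```java
stream.keyBy(t -> t.f0)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .process(new GammaPerStratum());
```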

But, without windowing (or batching), no, not really. Computing variance in a purely streaming fashion on unbounded inputs doesn't scale: you would have to store all of the events in state, and after each event, update the mean and then recompute all of the squared differences between each event and the new mean.
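To make the cost concrete, here is a sketch of that naive approach as a KeyedProcessFunction (the class, type, and state names are mine); every arriving element triggers a full pass over everything in state:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits an updated gamma = mean / variance for the stratum after every event.
// All values seen so far are kept in state, which is why this does not scale.
public class NaiveGamma
        extends KeyedProcessFunction<String, Tuple2<String, Double>, Tuple2<String, Double>> {

    private transient ListState<Double> values;

    @Override
    public void open(Configuration parameters) {
        values = getRuntimeContext().getListState(
                new ListStateDescriptor<>("values", Double.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> event,
                               Context ctx,
                               Collector<Tuple2<String, Double>> out) throws Exception {
        values.add(event.f1);

        // First pass: recompute the mean over every stored value.
        long n = 0;
        double sum = 0.0;
        for (double v : values.get()) {
            n++;
            sum += v;
        }
        double mean = sum / n;

        // Second pass: recompute all squared differences from the new mean.
        double sumSqDiffs = 0.0;
        for (double v : values.get()) {
            sumSqDiffs += (v - mean) * (v - mean);
        }
        double variance = sumSqDiffs / n;

        out.collect(Tuple2.of(ctx.getCurrentKey(), mean / variance));
    }
}
```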

David Anderson
  • Thank you very much! Could I succeed that using the Datastream API instead of the Table and SQL APIs? – T.D. Feb 19 '20 at 12:23
  • Table API currently builds on DataStream API, so everything can also be implemented manually with respective [windows and aggregations](https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#aggregatefunction). – Arvid Heise Feb 19 '20 at 12:29