
I am working on a simple aggregation that sums totals of events happening on a given resource (see: Calculate totals and emit periodically in flink). With some help I got this to work, but I am now hitting another issue.

I am trying to calculate totals for the lifetime of a resource, but I am reading events from a Kinesis stream that has a retention period of 24 hours. Because I don't have access to events that happened before that, I need to bootstrap my state from a legacy (batch) system that calculates totals once a day.

Essentially, I'd like to bootstrap the state from the legacy system (loading the stats for yesterday), then apply today's data from the Kinesis stream on top of that, avoiding duplication in the process. Ideally this would be a one-off process, and the application would run from Kinesis alone from then onwards.

I'm happy to provide more details if I missed something.

Thanks

Dalibor Novak

2 Answers


I am facing a similar problem. My current solution is to have two sources: one for the historical data and one for the current data. I would then combine the sources with a CoFlatMap function. This function must keep track of the incoming records, buffer them, and output them in the correct order. Unfortunately, this approach requires some work.
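
To make the buffering idea concrete, here is a minimal sketch of the merge logic in plain Python, deliberately independent of the Flink API. It assumes things the question does not state: that each legacy snapshot carries a cutoff timestamp marking the last event it includes, and that every live event carries a timestamp. The names `BootstrapMerger`, `on_snapshot`, and `on_event` are hypothetical; in a real job the two methods would roughly correspond to the two inputs of the co-flat-map function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ResourceState:
    total: int = 0
    cutoff_ts: int = -1        # assumption: events at or before this are already in the snapshot
    bootstrapped: bool = False
    buffered: List[Tuple[int, int]] = field(default_factory=list)  # (timestamp, amount)


class BootstrapMerger:
    """Hypothetical helper: merges a one-off legacy snapshot with live events per resource."""

    def __init__(self) -> None:
        self.state: Dict[str, ResourceState] = {}

    def _get(self, resource: str) -> ResourceState:
        return self.state.setdefault(resource, ResourceState())

    def on_snapshot(self, resource: str, total: int, cutoff_ts: int) -> List[Tuple[str, int]]:
        """Historical input: seed the total, then replay buffered live events in timestamp order."""
        st = self._get(resource)
        st.total = total
        st.cutoff_ts = cutoff_ts
        st.bootstrapped = True
        out = [(resource, st.total)]
        for ts, amount in sorted(st.buffered):
            if ts > cutoff_ts:  # skip events the snapshot already counted
                st.total += amount
                out.append((resource, st.total))
        st.buffered.clear()
        return out

    def on_event(self, resource: str, ts: int, amount: int) -> List[Tuple[str, int]]:
        """Live input: buffer until the snapshot arrives, then update the running total."""
        st = self._get(resource)
        if not st.bootstrapped:
            st.buffered.append((ts, amount))
            return []
        if ts <= st.cutoff_ts:  # duplicate of something in the snapshot
            return []
        st.total += amount
        return [(resource, st.total)]
```

One gap in this sketch: a resource that never appears in the legacy data would buffer forever, so a real implementation would also need some end-of-bootstrap signal to flush such buffers.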

Tzanko Matev

What I would recommend is using Flink's state to do this (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html). You can have a ValueState object that holds the total for the resource and is updated on every event, or you can use ListState to hold all the values that come through and recalculate over all of them whenever a new event arrives. Obviously ListState would use more memory than a single running value, but you know your needs better than I do.
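
The trade-off between the two kinds of state can be illustrated with a small plain-Python sketch. This does not use the Flink state API; `RunningTotal` and `EventLog` are hypothetical stand-ins for a ValueState-style single value and a ListState-style list, kept per resource. Seeding the running total with an initial value is an assumption about how the legacy total might be loaded, not something the answer spells out.

```python
from typing import List


class RunningTotal:
    """ValueState-style: keeps a single number, constant memory per resource."""

    def __init__(self, initial: int = 0) -> None:
        self.total = initial  # assumption: could be seeded from the legacy system

    def add(self, amount: int) -> int:
        self.total += amount
        return self.total


class EventLog:
    """ListState-style: keeps every value and recomputes the sum on each event."""

    def __init__(self) -> None:
        self.amounts: List[int] = []

    def add(self, amount: int) -> int:
        self.amounts.append(amount)
        return sum(self.amounts)  # memory and work grow with the number of events
```

Both produce the same totals for the same events; the list version only pays off if the individual values are needed later, for example to recompute after a correction.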

Jicaar