
I've been using Flink on Kinesis Data Analytics recently. I have a stream of data, and I also need a cache to be shared with the stream.

To share the cache data with the Kinesis stream, it's connected as a broadcast stream. The cache source extends SourceFunction and implements ProcessingTimeCallback. It reads the data from DynamoDB every 300 seconds and broadcasts it to the downstream operator, a KeyedBroadcastProcessFunction.
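A minimal sketch of a periodically polling cache source like the one described above. The class name, the `CacheEntry` type, and the `loadFromDynamoDb()` helper are all hypothetical stand-ins; the actual DynamoDB access is omitted:

```java
// Sketch of a source that reloads cache data on a fixed interval.
// CacheEntry and loadFromDynamoDb() are placeholder names.
public class DynamoDbCacheSource extends RichSourceFunction<CacheEntry> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<CacheEntry> ctx) throws Exception {
        while (running) {
            for (CacheEntry entry : loadFromDynamoDb()) {
                // Emit under the checkpoint lock so emitted records
                // do not interleave with checkpoint barriers.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(entry);
                }
            }
            Thread.sleep(300_000L); // poll every 300 seconds
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private List<CacheEntry> loadFromDynamoDb() {
        // A DynamoDB scan/query would go here.
        return Collections.emptyList();
    }
}
```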

But after adding the broadcast stream (in the previous version I had no cache and used a KeyedProcessFunction for the Kinesis stream), when I run it in Kinesis Data Analytics, it keeps restarting about every 1000 seconds, without any exception!

I have no configuration with this value, and the scenario works fine between restarts!

Could anybody help me figure out what the issue could be?

  • Maybe you can use another framework, such as Apache Ignite, to share your cache across the Flink cluster. This can be a solution if DynamoDB creates backpressure or the cache data is too big. – monstereo May 27 '20 at 10:22

1 Answer


My first thought is to wonder if this might be related to checkpointing. Do you have access to the server logs? Flink's logging should make it somewhat clear what's causing the restart.

The reason why I suspect checkpointing is that it occurs at predictable times (and with a long timeout), and using broadcast state can put a lot of pressure on checkpointing. Each parallel instance will checkpoint a full copy of the broadcast state.

Broadcast state has to be kept on-heap, so another possibility is that you are running out of memory.
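For context, broadcast state is declared through a `MapStateDescriptor` and wired up roughly as below. The stream variables, `CacheEntry`, `Event`, `Output`, and `MyKeyedBroadcastProcessFunction` are placeholder names for illustration:

```java
// Descriptor for the broadcast (cache) state; each parallel subtask of the
// connected operator holds and checkpoints its own full copy of this map.
MapStateDescriptor<String, CacheEntry> cacheDescriptor =
    new MapStateDescriptor<>(
        "cache",
        BasicTypeInfo.STRING_TYPE_INFO,
        TypeInformation.of(CacheEntry.class));

BroadcastStream<CacheEntry> cacheBroadcast =
    cacheStream.broadcast(cacheDescriptor);

mainKinesisStream
    .keyBy(event -> event.getKey())
    .connect(cacheBroadcast)
    .process(new MyKeyedBroadcastProcessFunction());
```

Because every parallel instance snapshots the entire map behind `cacheDescriptor`, checkpoint size grows linearly with parallelism, which is why large broadcast state can make checkpoints slow or time out.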

David Anderson
  • I don't have access to the server log files, but I can see them through the CloudWatch dashboard. There are no error-level exceptions. But I think you are right: I can see that the restart happens shortly after it tries to trigger checkpoints. Do you have any other suggestions for implementing a simple cache? – Sara Arshad May 20 '20 at 17:39
  • If you can key-partition the cache and store it in keyed state instead, that would help. But if every worker needs all the data, then that won't work. – David Anderson May 20 '20 at 19:50
  • No, I cannot; every worker needs part of the cache! – Sara Arshad May 20 '20 at 21:37