I'm reading data from a Kafka topic that holds a lot of data. The Flink job starts up fine, but after some time, when backpressure hits 100%, it crashes and then goes into an endless cycle of restarts.

My question: shouldn't Flink's backpressure mechanism kick in and reduce consumption from the topic until the in-flight data is consumed or the backpressure drops, as described in this article: https://www.ververica.com/blog/how-flink-handles-backpressure? Or is there some configuration I'm missing? Is there any other way to avoid this restart cycle when backpressure increases?

I've tried these settings:

taskmanager.network.memory.buffer-debloat.enabled: true
taskmanager.network.memory.buffer-debloat.samples: 5

My modules.yaml has this transport configuration:

spec:
  functions: function_name
  urlPathTemplate: http://nginx:8888/function_name
  maxNumBatchRequests: 50
  transport:
    timeouts:
      call: 2 min
      connect: 2 min
      read: 2 min
      write: 2 min
David Anderson
Singh3y

1 Answer

You should look in the logs to determine exactly what is causing the crash and restart, but typically when backpressure is involved in a restart, it's because a checkpoint timed out. You could increase the checkpoint timeout.
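
As a rough sketch, raising the timeout in flink-conf.yaml could look like the following. The values are illustrative assumptions rather than recommendations, and it's worth confirming that your Flink/StateFun version honors these keys:

```yaml
# flink-conf.yaml (illustrative values, not tuned for any particular workload)
execution.checkpointing.interval: 1 min
# Give slow checkpoints more time before they are declared failed (default: 10 min)
execution.checkpointing.timeout: 30 min
# Tolerate a few failed checkpoints before the job itself is failed and restarted
execution.checkpointing.tolerable-failed-checkpoints: 3
```

A longer timeout only buys time. If the job stays backpressured, checkpoints will still be slow.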

However, it's better if you can reduce/eliminate the backpressure. One common cause of backpressure is not providing Flink with enough resources to keep up with the load. If this is happening regularly, you should consider scaling up the parallelism. Or it may be that the egress is under-provisioned.
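
For example, scaling up might look something like this in flink-conf.yaml. The numbers are placeholders, assuming a cluster that actually has this many slots available:

```yaml
# flink-conf.yaml (placeholder numbers; size these to your cluster)
# Default parallelism for operators that don't set their own
parallelism.default: 8
# Slots offered by each TaskManager; total slots across all
# TaskManagers must be at least the job's parallelism
taskmanager.numberOfTaskSlots: 4
```

Keep in mind that a Kafka source can't usefully run with more parallel instances than the topic has partitions, so raising parallelism mostly helps the operators downstream of the source.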

I see you've already tried buffer debloating (which should help). You can also try enabling unaligned checkpoints.
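
For a plain Flink job, a minimal sketch of that combination in flink-conf.yaml might be the following. The keys are as documented for Flink 1.15, the timeout value is an arbitrary assumption, and not every framework built on top of Flink accepts the unaligned option:

```yaml
# flink-conf.yaml (sketch based on Flink 1.15 key names)
taskmanager.network.memory.buffer-debloat.enabled: true
# Let checkpoint barriers overtake buffered in-flight data
execution.checkpointing.unaligned: true
# Start each checkpoint aligned and switch to unaligned only if
# alignment takes longer than this (value chosen arbitrarily)
execution.checkpointing.aligned-checkpoint-timeout: 30 s
```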

See https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpointing_under_backpressure/ for more information.

David Anderson
  • Hi David, thanks for the suggestion. But when I enable unaligned checkpointing, I get this error: `The main method caused an error: Invalid configuration: execution.checkpointing.unaligned; StateFun currently does not support unaligned checkpointing`. I'm using StateFun 3.2. Also, as stated in Ververica's article, shouldn't Flink reduce consumption under backpressure? – Singh3y Sep 14 '22 at 11:33
  • I wasn't aware of this limitation, but it isn't too surprising, given StateFun's reliance on iterations. – David Anderson Sep 14 '22 at 16:29
  • Flink does naturally throttle the sources in response to backpressure. But this doesn't guarantee that checkpoints will be able to complete in a timely fashion. Have you been able to verify that the failures are caused by checkpoint timeouts? – David Anderson Sep 14 '22 at 16:31
  • Yes, the failures were caused by checkpoint timeouts. So I think I have to increase resources, as you suggested. I'll try it. Isn't there any way to configure Flink so that, once backpressure reaches a certain level, it stops consuming until the backpressure clears? – Singh3y Sep 21 '22 at 07:18
  • "Isn't there any way to configure Flink so that, once backpressure reaches a certain level, it stops consuming until the backpressure clears?" That happens automatically. – David Anderson Sep 22 '22 at 09:19