I am using Flink's intervalJoin to join two streams within a 10-minute interval, as shown below:

labelStream.intervalJoin(adLogStream)
           .between(Time.milliseconds(0), Time.milliseconds(600000))
           .process(new processFunction())
           .addSink(kafkaProducer)

labelStream and adLogStream carry protobuf objects and are both keyed by a Long id.
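
For readers unfamiliar with the operator, here is a minimal plain-Java sketch (no Flink dependencies) of which timestamp pairs `between(0, 600000)` matches for a single key. The class name, method names, and sample timestamps are made up for illustration; only the two bounds come from the snippet above, and the bounds are treated as inclusive, which is Flink's default.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalJoinSketch {
    // Bounds taken from the question: between(0 ms, 600000 ms).
    public static final long LOWER_MS = 0L;
    public static final long UPPER_MS = 600_000L;

    // A right-side (adLogStream) record joins a left-side (labelStream) record
    // with the same key when leftTs + LOWER_MS <= rightTs <= leftTs + UPPER_MS.
    public static boolean matches(long leftTs, long rightTs) {
        return rightTs >= leftTs + LOWER_MS && rightTs <= leftTs + UPPER_MS;
    }

    // Naive join for one key: pair every left timestamp with every right
    // timestamp that falls inside its interval.
    public static List<long[]> join(long[] leftTs, long[] rightTs) {
        List<long[]> pairs = new ArrayList<>();
        for (long l : leftTs) {
            for (long r : rightTs) {
                if (matches(l, r)) {
                    pairs.add(new long[] {l, r});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        long[] labels = {0L, 300_000L};                   // hypothetical event times (ms)
        long[] adLogs = {100_000L, 700_000L, 1_000_000L}; // hypothetical event times (ms)
        for (long[] p : join(labels, adLogs)) {
            System.out.println("label@" + p[0] + " joins adLog@" + p[1]);
        }
    }
}
```

Flink has to keep records in keyed state until the watermark shows that no partner can still arrive, so the amount of state grows with both the interval width and the input rate.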

Both input streams are huge. After the job has been running for about 30 minutes, the output rate to Kafka slowly drops. [screenshot]

When the output started to drop, I ran jstack and pstack several times to capture stack traces. [screenshots]

The program appears to be stuck in RocksDB's seek, and I found that some of RocksDB's SST files are being iterated over very slowly. [screenshot]

I have tried several ways:

1) Reduce the input volume by half. This works well.
2) Replace the payloads of labelStream and adLogStream with simple Strings, keeping the amount of data the same. This works well.
3) Use PredefinedOptions such as SPINNING_DISK_OPTIMIZED and SPINNING_DISK_OPTIMIZED_HIGH_MEM. This still fails.
4) Use newer versions of rocksdbjni. This still fails.

Can anyone give me some suggestions? Thank you very much.

user2928444

1 Answer

A few thoughts:

  • You could ask on the flink-user mailing list. In general, operational questions like this are more likely to get knowledgeable responses there than on Stack Overflow.

  • I've heard that giving RocksDB more off-heap memory can help, because RocksDB will use it for caching. Sorry, but I don't know the details of how to configure this.

  • Perhaps increasing the parallelism would help.

  • If it's possible to do so, it might be interesting to try running with the heap-based state backend instead, just to see how much of the pain is caused by RocksDB.
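
As a rough sketch of where the last three suggestions would be configured, the fragment below uses flink-conf.yaml keys as they exist in Flink 1.13 and later; older releases name some of these differently or lack managed memory for RocksDB altogether. The numeric values are illustrative assumptions, not tuned recommendations for this job.

```yaml
# flink-conf.yaml (sketch; values are illustrative, not tuned recommendations)

# Keep RocksDB, but let Flink size its block cache and write buffers
# from managed (off-heap) memory.
state.backend: rocksdb
state.backend.rocksdb.memory.managed: true
# Fraction of TaskManager memory reserved as managed memory (Flink's default is 0.4).
taskmanager.memory.managed.fraction: 0.6

# Spread the keyed state over more parallel subtasks.
parallelism.default: 16

# For comparison only: switch to the heap-based backend to see how much
# of the slowdown comes from RocksDB.
# state.backend: hashmap
```

The heap-based backend keeps all state on the JVM heap, so it is only a fair experiment if the join state for the 10-minute interval fits in memory.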

David Anderson