
I want to buffer a DataStream in Flink. My initial idea is to cache 100 records in a list or tuple and then use an INSERT INTO ... VALUES (???) statement to insert the data into ClickHouse in bulk. Is there a better way to do this?

YT Q
  • What do you mean exactly by buffer datastream? Do you want to collect data in a window (based on time or the number of events) and then flush the events without aggregating them? – Felipe Sep 16 '21 at 06:03
  • Exactly... collect in a window and then flush them – YT Q Sep 16 '21 at 08:33

3 Answers

The first solution that you posted works, but it is flaky. It can lead to starvation because the logic is too simplistic. For instance, say you use a counter of 100 to create a batch. It is possible that your stream never receives 100 events, or that it takes hours to receive the 100th event. In that case your basic (and otherwise working) solution leaves events stuck in the window, because it is a count window. In other words, a count window may close after 30 seconds under high throughput, or after 1 hour when throughput is very low.

DataStream<User> stream = ...;
// Count window: the batch is emitted only after 100 elements have arrived.
DataStream<Tuple2<User, Long>> stream1 = stream
    .countWindowAll(100)
    .process(new MyProcessWindowFunction());

In general, it depends on your use case. However, I would use a time window to make sure that the job always flushes the batch, even if there are few or no events in the window.

// Time window: the batch is flushed every 30 seconds, even if it is small or empty.
DataStream<Tuple2<User, Long>> stream1 = stream
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .process(new MyProcessWindowFunction());
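
MyProcessWindowFunction is not shown above. A minimal sketch of what it could look like for the time-window variant, assuming it simply emits each User together with the number of elements seen in the window (to match the Tuple2<User, Long> type above):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, Tuple2<User, Long>, TimeWindow> {

    @Override
    public void process(Context context,
                        Iterable<User> elements,
                        Collector<Tuple2<User, Long>> out) {
        // Buffer the window contents so the batch size is known before emitting.
        List<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        long count = batch.size();
        for (User user : batch) {
            out.collect(Tuple2.of(user, count));
        }
    }
}

Note that countWindowAll produces GlobalWindow windows, so the count-window variant would need GlobalWindow as the window type parameter instead of TimeWindow.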
Felipe

Thanks for all the answers. I used a window function to solve this problem.

SingleOutputStreamOperator<ArrayList<User>> stream2 = 
     stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());

Then I override the process function, in which a batch of batchSize records is buffered into an ArrayList.
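
A minimal sketch of that process function, assuming countWindowAll(batchSize) is used (so the window type is GlobalWindow) and the whole window content is emitted as one ArrayList<User>:

import java.util.ArrayList;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, ArrayList<User>, GlobalWindow> {

    @Override
    public void process(Context context,
                        Iterable<User> elements,
                        Collector<ArrayList<User>> out) {
        // Collect every User of the count window into a single list and emit it as one batch.
        ArrayList<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        out.collect(batch);
    }
}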

YT Q

If you want to import data into the database in batches, you can use a window (countWindow or timeWindow) to aggregate the data.
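
One possible way to then write an aggregated batch to ClickHouse is a plain JDBC batch insert inside a RichSinkFunction. The sketch below assumes the stream already emits ArrayList<User> batches (as in the accepted answer) and that a ClickHouse JDBC driver is on the classpath; the connection URL, the users table, its columns and the User getters are hypothetical placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class ClickHouseBatchSink extends RichSinkFunction<ArrayList<User>> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Placeholder URL, table and columns; adjust to your ClickHouse setup.
        connection = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
        statement = connection.prepareStatement("INSERT INTO users (id, name) VALUES (?, ?)");
    }

    @Override
    public void invoke(ArrayList<User> batch, Context context) throws Exception {
        for (User user : batch) {
            statement.setLong(1, user.getId());      // assumes User exposes getId()
            statement.setString(2, user.getName());  // assumes User exposes getName()
            statement.addBatch();
        }
        // One round trip per window: the whole batch is inserted at once.
        statement.executeBatch();
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}

The batched stream can then be wired up with stream2.addSink(new ClickHouseBatchSink());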

liliwei