I want to buffer a datastream in Flink. My initial idea is caching 100 records into a list or tuple and then using INSERT INTO ... VALUES (???)
to insert the data into ClickHouse in bulk. Do you have better ways to do this?
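For reference, the idea described above (buffer a fixed number of records, then issue one bulk INSERT) could be sketched as a custom sink. This is only a rough sketch under assumptions: the JDBC URL, table name, and the User fields (id, name) are placeholders, and a ClickHouse JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Buffers rows and flushes them to ClickHouse as one JDBC batch.
// URL, table, and User getters are hypothetical placeholders.
public class ClickHouseBatchSink extends RichSinkFunction<User> {
    private static final int BATCH_SIZE = 100;
    private transient Connection connection;
    private transient PreparedStatement statement;
    private int buffered = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
        statement = connection.prepareStatement("INSERT INTO users (id, name) VALUES (?, ?)");
    }

    @Override
    public void invoke(User value, Context context) throws Exception {
        statement.setLong(1, value.getId());       // assumed getter
        statement.setString(2, value.getName());   // assumed getter
        statement.addBatch();
        if (++buffered >= BATCH_SIZE) {
            statement.executeBatch();              // one bulk INSERT for the buffered rows
            buffered = 0;
        }
    }

    @Override
    public void close() throws Exception {
        if (buffered > 0) {
            statement.executeBatch();              // flush the remainder on shutdown
        }
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}

Note that a plain counter like this has the weakness discussed in the answers below: if the stream goes quiet, the last partial batch can sit unflushed until the job stops.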

- What do you mean exactly by buffer the datastream? Do you want to collect data in a window (based on time or on a number of events) and then flush the events without aggregating them? – Felipe Sep 16 '21 at 06:03
- exactly...collect in a window and then flush them – YT Q Sep 16 '21 at 08:33
3 Answers
The first solution that you posted works, but it is flaky: it can lead to starvation because of its simplistic logic. For instance, say you use a counter of 100 to create a batch. It is possible that your stream never receives 100 events, or that it takes hours to receive the 100th event. Then your basic, working solution can leave events stuck in the window because it is a count window. In other words, your batching can produce windows of 30 seconds under high throughput, or windows of 1 hour when throughput is very low.
DataStream<User> stream = ...;
DataStream<Tuple2<User, Long>> stream1 = stream
    .countWindowAll(100)
    .process(new MyProcessWindowFunction());
In general, it depends on your use case. However, I would use a time window to make sure that the job always flushes a batch, even when there are few or no events in the window.
DataStream<Tuple2<User, Long>> stream1 = stream
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .process(new MyProcessWindowFunction());
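MyProcessWindowFunction is not shown in the snippets; a minimal sketch compatible with the time-window variant above could look like the following. The output type Tuple2<User, Long> comes from the snippet, but what the Long represents is not stated, so here it is assumed to be the window end timestamp.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Emits every element buffered in the 30-second window, tagged with the window end.
public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, Tuple2<User, Long>, TimeWindow> {

    @Override
    public void process(Context context, Iterable<User> elements, Collector<Tuple2<User, Long>> out) {
        long windowEnd = context.window().getEnd();   // assumption: the Long is the window end
        for (User user : elements) {
            out.collect(Tuple2.of(user, windowEnd));
        }
    }
}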

Thanks for all the answers. I used a window function to solve this problem.
SingleOutputStreamOperator<ArrayList<User>> stream2 =
    stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());
Then I override the process function, in which batchSize records are buffered in an ArrayList.
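A minimal sketch of that overridden process function, assuming the elements are of type User and the window is the count window from the snippet (countWindowAll uses GlobalWindow):

import java.util.ArrayList;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// Buffers all elements of one count window into an ArrayList and emits it as a single batch.
public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, ArrayList<User>, GlobalWindow> {

    @Override
    public void process(Context context, Iterable<User> elements, Collector<ArrayList<User>> out) {
        ArrayList<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        out.collect(batch);   // downstream receives one list (one batch) per window
    }
}

Each emitted ArrayList<User> can then be written to ClickHouse in a single bulk INSERT by the downstream sink.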


If you want to import data into the database in batches, you can use a window (countWindow or timeWindow) to aggregate the data.
