Looking for thoughts on the kind of processing I want to do on messages in a topic. I want to process messages (events, in my case) in batches of, say, 10,000, because I am inserting them into our Snowflake warehouse after transformation, and Snowflake performs better on batch loads. What are some thoughts on building a consumer that only pulls messages from the topic once there are 10,000 messages available — in other words, pull from the topic once consumer lag hits 10,000?
2 Answers
You'd be better off using the Kafka Connect connector for Snowflake: https://docs.snowflake.net/manuals/user-guide/kafka-connector.html.
Kafka Connect is built specifically for streaming integration, and this connector is written by Snowflake itself.
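If you go this route, the batching is handled for you: the connector buffers records and flushes them to Snowflake in bulk. Below is a minimal sink configuration sketch. The property names are the ones documented for the Snowflake connector at the link above, but the account URL, credentials, database, and topic are placeholders — verify everything against the docs for your connector version:

```
name=snowflake-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=1
topics=events
snowflake.url.name=myaccount.snowflakecomputing.com:443
snowflake.user.name=kafka_connector_user
snowflake.private.key=<private-key-here>
snowflake.database.name=MY_DB
snowflake.schema.name=MY_SCHEMA
# Flush once 10,000 records have buffered (or when the time/size limits below hit)
buffer.count.records=10000
buffer.flush.time=120
buffer.size.bytes=5000000
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
```

buffer.count.records is essentially the "wait for N records" knob you're describing, with buffer.flush.time acting as a latency cap so a slow topic still gets flushed eventually.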

Thank you for the suggestion. Would this technique suffer from the "probably once" delivery semantics of S3 events? http://www.hydrogen18.com/blog/aws-s3-event-notifications-probably-once.html – Vish Aug 09 '19 at 14:38
Waiting for 10K records is feasible, but keep in mind that the more records you wait for, the higher your latency will be. Also, if your individual records are large, you may generate bursts of traffic.
For this you will have to tune a few parameters, on both the client side and the cluster side.
Client side, you'll have to play with:
- max.poll.records
- fetch.max.bytes
- max.partition.fetch.bytes
Cluster side:
- message.max.bytes (broker config; the topic-level equivalent is spelled max.message.bytes, as shown below)
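For the topic-level override, you can use the kafka-configs tool that ships with Kafka (on recent versions; older releases take --zookeeper instead of --bootstrap-server). The topic name and size here are placeholders:

```
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config max.message.bytes=10485760
```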
You'll find all the details about these parameters in the Kafka documentation: https://kafka.apache.org/documentation/
Another Stack Overflow post deals with the same kind of question: Increase the number of messages read by a Kafka consumer in a single poll
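Putting the client-side settings together: Kafka has no "wait for N records" option on the consumer. fetch.min.bytes plus fetch.max.wait.ms make the broker hold a fetch until enough bytes (not records) have accumulated, so the 10K record threshold still has to be enforced in your own code. A minimal Java sketch, with placeholder broker/topic/group names and the Snowflake load stubbed out:

```
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "snowflake-batch-loader");    // placeholder group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");    // commit only after a successful load
        props.put("max.poll.records", "10000");      // allow up to 10K records per poll()
        props.put("fetch.min.bytes", "10485760");    // broker holds the fetch until ~10 MB accumulate...
        props.put("fetch.max.wait.ms", "30000");     // ...or 30 s pass, whichever comes first
        props.put("max.poll.interval.ms", "600000"); // leave room for the Snowflake load between polls

        List<ConsumerRecord<String, String>> batch = new ArrayList<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(30));
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record);
                }
                // Enforce the record-count threshold client-side
                if (batch.size() >= 10_000) {
                    loadIntoSnowflake(batch);  // your transformation + bulk insert
                    consumer.commitSync();     // commit offsets only after the load succeeds
                    batch.clear();
                }
            }
        }
    }

    private static void loadIntoSnowflake(List<ConsumerRecord<String, String>> batch) {
        // Placeholder: transform the batch and bulk-load it into Snowflake here.
    }
}
```

Committing offsets only after the load succeeds gives you at-least-once delivery into Snowflake; keep an eye on max.poll.interval.ms so a long-running load doesn't get the consumer kicked out of the group.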
Yannick
