
While adapting Java's KafkaIOIT to work with a large dataset, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness, and at the same time check the performance of KafkaIO.Write and KafkaIO.Read.

To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo (here).

The expected flow is: first, the records are generated in a deterministic way and written to Kafka - this concludes the write pipeline. As for reading and correctness checking: the data is read from the topic and, after being decoded into String representations, a hashcode of the whole PCollection is calculated (for details, check KafkaIOIT.java).
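To make that concrete, the two pipelines look roughly like this. This is a simplified sketch, not the actual KafkaIOIT code: the broker address, topic name, and record format are placeholders, and the hash comparison is only indicated by a comment.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaRoundTripSketch {
  private static final long NUM_RECORDS = 100_000_000L;

  public static void main(String[] args) {
    // Write pipeline: deterministically generated records -> Kafka topic.
    Pipeline writePipeline = Pipeline.create();
    writePipeline
        .apply("Generate", GenerateSequence.from(0).to(NUM_RECORDS))
        .apply("ToKV", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via((Long i) -> KV.of(Long.toString(i), "record-" + i)))
        .apply("WriteToKafka", KafkaIO.<String, String>write()
            .withBootstrapServers("kafka:9092")      // placeholder broker address
            .withTopic("beam-it-topic")              // placeholder topic
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class));
    writePipeline.run().waitUntilFinish();

    // Read pipeline: bounded read of the same number of records, then verification.
    Pipeline readPipeline = Pipeline.create();
    readPipeline
        .apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("beam-it-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withMaxNumRecords(NUM_RECORDS)          // bound the otherwise unbounded source
            .withoutMetadata())
        .apply("Values", Values.create());
        // ... compute a hashcode of the whole PCollection here and compare it
        // against the expected value, as KafkaIOIT does ...
    readPipeline.run().waitUntilFinish();
  }
}
```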

During the testing I ran into several problems:

  1. When the predetermined number of records is read from the Kafka topic, the hash is different each time.

  2. Sometimes not all the records are read and the Dataflow task waits for the input indefinitely, occasionally throwing exceptions.

I believe there are two possible causes of this behavior: either there is something wrong with the Kafka cluster configuration, or KafkaIO behaves erratically on high data volumes, duplicating and/or dropping records.

I found a Stack Overflow answer that I believe might explain the first behavior: link - if messages are delivered more than once, the hash of the whole collection will obviously change.

In that case, though, I don't know how to configure KafkaIO.Write in Beam to produce each record exactly once.
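For reference, the closest thing I've found so far is KafkaIO's exactly-once sink mode (withEOS, which needs Kafka 0.11+ and, as far as I understand, is only supported on some runners) combined with reading only committed records. A rough sketch of what I mean - the broker address, topic, and shard count are placeholders, and I'm not sure this is the right or complete setup:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceSketch {

  // Exactly-once write: withEOS needs Kafka 0.11+ and runner support.
  static KafkaIO.Write<String, String> exactlyOnceWrite() {
    return KafkaIO.<String, String>write()
        .withBootstrapServers("kafka:9092")        // placeholder broker address
        .withTopic("beam-it-topic")                // placeholder topic
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class)
        .withEOS(20, "kafka-io-it-eos-group");     // shard count and group id are arbitrary here
  }

  // The matching read should only see committed records; otherwise aborted
  // transactional writes could still show up as duplicates.
  static KafkaIO.Read<String, String> readCommittedOnly() {
    Map<String, Object> consumerConfig = new HashMap<>();
    consumerConfig.put("isolation.level", "read_committed");
    return KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("beam-it-topic")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // called updateConsumerProperties in older Beam versions
        .withConsumerConfigUpdates(consumerConfig);
  }
}
```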

This leaves the issue of messages being dropped unsolved. Can you help?

Pablo
Michael

1 Answer


As mentioned in the comments, a practical approach would be to start small and see whether this is a problem that only appears when scaling up, e.g. start with 10 messages and keep multiplying the number until you see something strange.
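One way to make that easy is to expose the record count as a pipeline option, so the exact same pipeline can be run with 10, 100, 1,000, ... records just by changing a flag. A rough sketch (the option name is made up):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ScaleUpTest {

  /** Hypothetical option so the same test can be rerun with a growing record count. */
  public interface ScaleOptions extends PipelineOptions {
    @Description("Number of records to generate")
    @Default.Long(10)
    Long getNumRecords();
    void setNumRecords(Long value);
  }

  public static void main(String[] args) {
    ScaleOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ScaleOptions.class);
    Pipeline p = Pipeline.create(options);
    p.apply("Generate", GenerateSequence.from(0).to(options.getNumRecords()));
    // ... same write/read/verify steps as the full test, just with fewer records ...
    p.run().waitUntilFinish();
  }
}
```

Run it with --numRecords=10, then keep multiplying until either the hash mismatch or the stalled read shows up.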

Furthermore, one thing that stands out is that you send data to a topic and check the hash after reading from the topic. However, you do not mention partitions; is it possible that you are in fact seeing different results because there are multiple partitions?

Kafka guarantees order within a partition.
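This matters for your verification step: if the hash you compute depends on the order in which elements arrive, two runs reading the same records from several partitions can legitimately produce different hashes. One way around that (a sketch, not the actual KafkaIOIT code) is to hash each record individually and combine the hashes with an operation that is commutative and associative, such as a sum:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

/** Sketch: a fingerprint that does not depend on the order in which records arrive. */
class OrderInsensitiveHash {
  static PCollection<Long> fingerprint(PCollection<String> records) {
    return records
        // Hash each record on its own...
        .apply("HashEach",
            MapElements.into(TypeDescriptors.longs()).via((String s) -> (long) s.hashCode()))
        // ...then combine with a commutative, associative operation (a sum), so the
        // result is the same no matter how the partitions interleave the records.
        .apply("CombineHashes", Sum.longsGlobally());
  }
}
```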

Dennis Jaheruddin