
I am facing a problem with the consumer.poll() method. After fetching data via poll(), the consumer won't have any data to commit, so please help me remove a specific number of lines (messages) from the Kafka topic.

surya
  • Don't understand your question. However, Kafka topics are append only and you cannot delete anything manually. The only way data is deleted is via log retention or log compaction. – Matthias J. Sax Nov 21 '16 at 21:13
  • Thank you for responding @Matthias J. Sax. My actual problem is that consumer.poll() fetches a certain amount of data, but if our program fails, the new server starts reading from the first line onwards; and if I set auto commit to true, data will be lost if one server fails. – surya Nov 22 '16 at 06:03

1 Answer


You need to make sure that the data is fully processed before you commit it, to avoid "data loss" in case of consumer failure.

Thus, if you enable auto.commit, make sure you process all data from a poll() completely before you issue the next poll(), because each poll() implicitly commits all data from the previous poll().
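A minimal sketch of this auto-commit pattern using the Java consumer; the topic name "my-topic", group id "my-group", and broker address are placeholders, not taken from the question:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "true"); // offsets are committed automatically
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Finish processing every record of this batch before calling poll() again,
                // because a later poll() may implicitly commit the offsets of this batch.
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```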

If this is not possible, you should disable auto.commit and commit manually, after the data has been completely processed, via consumer.commit(...). Keep in mind that you do not need to commit each message individually: a commit of offset X implicitly commits all messages with offsets < X. The committed offset is not the last successfully processed message, but the next message you want to process; for example, after processing the message at offset 5, you commit offset 6. A commit of offset 6 therefore commits all messages with offsets 0 to 5, so you should not commit offset 6 before all messages with smaller offsets have been completely processed.
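A sketch of manual committing with the "offset + 1" semantics described above, assuming enable.auto.commit is set to false and using commitSync() from the Java consumer (topic, group, and broker names are again placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "false"); // we commit explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (TopicPartition partition : records.partitions()) {
                    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
                    for (ConsumerRecord<String, String> record : partitionRecords) {
                        // process the record completely before it is covered by a commit
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
                    // The committed offset is the *next* record to read, hence lastOffset + 1:
                    // committing offset 6 marks offsets 0..5 as fully processed.
                    consumer.commitSync(Collections.singletonMap(
                            partition, new OffsetAndMetadata(lastOffset + 1)));
                }
            }
        }
    }
}
```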

Matthias J. Sax
  • Thank you, but let's assume my poll() fetches 1000 lines and they are automatically removed from the consumer. If the server fails at the 500th line, another server will process the first 500 lines again. In my situation we divide the data into buckets: after every 100 lines we create a bucket and send it for processing. In this case the data in the output will be duplicated. – surya Nov 22 '16 at 10:55
  • Yes. Kafka only guarantees at-least-once processing and there might be duplicates in the case of failure. There is no exactly-once processing yet. IIRC, this is also discussed here: http://docs.confluent.io/current/clients/consumer.html – Matthias J. Sax Nov 22 '16 at 17:52
  • Btw: you are not "deleting" anything... The data is still in Kafka after you commit, and you can reread it if you `seek()` to the corresponding offset(s). – Matthias J. Sax Nov 22 '16 at 17:54
  • How do we know the particular offset value to seek() to? If the consumer fails unexpectedly, how can another consumer find that offset value and use it in the seek() method? – surya Nov 23 '16 at 06:15
  • That is exactly the problem... You would need to remember the offset somehow and make two operations atomic: remembering the offset and writing the result record. There is no built-in support in Kafka to do this. One workaround would be to embed the input message offset into the result message. On recovery, you read the latest written result record to get the latest successfully processed record. This "pollutes" your result records -- however, if you can hide this from the real downstream consumers you are good to go. – Matthias J. Sax Nov 23 '16 at 21:49
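
A rough sketch of the workaround described in the last comment, assuming the output system can return the input offset that was embedded in the last written result record; `loadLastProcessedInputOffset()` is a hypothetical helper for that lookup, not a Kafka API:

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekOnRecovery {
    // Hypothetical: reads the input offset embedded in the latest result record written to
    // the output system; returns -1 if no result has been written yet.
    static long loadLastProcessedInputOffset(TopicPartition tp) {
        return -1L; // placeholder
    }

    static void resume(KafkaConsumer<String, String> consumer, TopicPartition tp) {
        consumer.assign(Collections.singletonList(tp)); // manual assignment so seek() is allowed
        long last = loadLastProcessedInputOffset(tp);
        if (last >= 0) {
            consumer.seek(tp, last + 1); // resume right after the last successfully processed record
        } else {
            consumer.seekToBeginning(Collections.singletonList(tp));
        }
        // ...continue with the normal poll() loop from here
    }
}
```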