
I have more data in a Kafka topic, but when I extract it with my PySpark application (the same one I use to extract from other Kafka topics), only 1 row is returned. Previously I extracted data from this same topic using the same PySpark application/code without any issues.

One thing I want to highlight: I have tried extracting data from this topic multiple times, both from the same Databricks notebook and from a different one. My suspicion is that I may have extracted from the topic in two different notebooks at the same time on the same Databricks instance, and that this is what caused the problem. How do I troubleshoot and fix this issue?

I am new to Kafka and PySpark.

rakk
  • Do you see any errors in the logs? When you say a different Databricks notebook, it depends on whether both start with the same group.id. If they use the same group.id, each will read only a subset of the topic's partitions (in case you have a multi-partition topic); if they use different group.ids, each will read the entire data. – Karim Tawfik Mar 01 '23 at 10:10
  • Currently I am not writing the extracted data to any location (just extracting and viewing the output), and there are no errors; my PySpark application just extracts one row from the topic. But I have used the same PySpark application to extract data from a different topic and it extracts data perfectly. As you said, I can investigate the partition angle, since the topic I am having the issue with has 2 partitions. Thanks for answering!! – rakk Mar 01 '23 at 10:18
  • Please show your code as a [mcve] – OneCricketeer Mar 01 '23 at 15:05

1 Answer


Previously I had extracted data from the same topic using the same pyspark application/code without any issues.

If you're using the same kafka.group.id, then consumed offsets are being tracked by this value, and you'll need to reset the consumer group offsets using Kafka tools. Otherwise, you'll only consume new data after the offsets that were previously consumed and committed.
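
Alternatively, if you just need to see the older records again, you can start the read from the beginning of the topic rather than resetting the group's offsets. A minimal sketch of such a read (the broker address, topic name, and group id below are placeholders, not values from the question):

```python
# Minimal sketch -- "broker:9092", "my_topic", and "my_group" are placeholders.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my_topic")
    .option("kafka.group.id", "my_group")     # explicit group id (optional)
    .option("startingOffsets", "earliest")    # start from the beginning of each partition
    .load()
    # Kafka keys/values arrive as binary; cast them to strings to inspect them
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "partition", "offset")
)

display(df)  # Databricks notebook display of a streaming DataFrame
```

Note that `startingOffsets` only applies when a query starts fresh; a query resuming from an existing checkpoint continues from the checkpointed offsets.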

OneCricketeer
  • Thanks for your input. I am new to Kafka and I get your point about group.id. My issue got resolved when I created a job out of the same code (that I used in the Databricks notebook) using dbx for Databricks. I just changed the stream name and the destination path. I think that because I changed the stream name and destination, and also created the job from a new notebook (using dbx), it became a new consumer with a different group.id. – rakk Mar 13 '23 at 04:44
  • By default, Spark will create a new group id for every execution; Databricks/notebooks don't matter (see the sketch after these comments). – OneCricketeer Mar 13 '23 at 20:08
  • Ok, thanks again, I will make a note of this. – rakk Mar 14 '23 at 07:28
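
Following up on the comment above about Spark generating a new group id for every execution: if the goal is only to inspect what is in the topic, a one-off batch read is a simple option, since for batch Kafka queries `startingOffsets` defaults to `earliest` and `endingOffsets` to `latest`, so each run re-reads all retained data regardless of previously committed offsets. A minimal sketch with placeholder broker and topic names:

```python
# Minimal sketch -- "broker:9092" and "my_topic" are placeholders.
batch_df = (
    spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my_topic")
    # Batch defaults: startingOffsets="earliest", endingOffsets="latest",
    # so every run re-reads everything still retained in the topic.
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "partition", "offset")
)

batch_df.show(truncate=False)
```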