
I am stuck with a very weird issue in Spark Structured Streaming. Whenever I shut down the stream and restart it, it processes already processed records again.

I tried using spark.conf.set("spark.streaming.stopGracefullyOnShutdown", True) but I still have the issue.

Any suggestion on how to get rid of this issue?

Thanks,

Deepak

1 Answer


spark.conf.set("spark.streaming.stopGracefullyOnShutdown", True) only controls whether the StreamingContext is shut down gracefully on JVM shutdown rather than immediately (and it is a legacy Spark Streaming/DStream setting anyway). It has nothing to do with which stream data gets processed.

Since you haven't mentioned the nature of the stream data or how it is being delivered (at a certain interval or in batches), you will need to manage this yourself.

You can try the approach below:

Process the data in batches and give the Spark job sufficient time to process each batch. For example, if a batch of 100 records takes 60 seconds to process, allow an extra 5-10 seconds to be on the safe side.
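In PySpark Structured Streaming you can space out micro-batches with a processing-time trigger. A minimal sketch, assuming a rate source and console sink purely for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-interval-demo").getOrCreate()

    # Hypothetical source for illustration; replace with your real source
    # (Kafka, files, etc.).
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # If a batch of ~100 records takes ~60 seconds to process, trigger a new
    # micro-batch only every 70 seconds to leave a 5-10 second safety margin.
    query = (
        events.writeStream
        .format("console")
        .trigger(processingTime="70 seconds")
        .start()
    )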

Terminate the streaming query after a certain period of time.
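A sketch of stopping the stream after a fixed window, continuing the hypothetical query object from the previous snippet:

    # Run the stream for at most 10 minutes (600 seconds), then stop it cleanly.
    # awaitTermination(timeout) returns False if the timeout elapsed while the
    # query was still running, and True if the query had already terminated.
    finished = query.awaitTermination(timeout=600)
    if not finished:
        query.stop()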

Most importantly, make sure that the data coming from the source for batch processing is kept in one container (location) and the data that has already been processed is kept in a different one. Do not keep pre- and post-processing data in the same container.
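If your source happens to be files, one way to keep "before" and "after" data apart is the file source's cleanSource / sourceArchiveDir options (Spark 3.0+), which move already-read input files out of the input directory. A sketch with hypothetical paths (/data/incoming, /data/archived, /data/processed, /data/checkpoints):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("separate-containers-demo").getOrCreate()

    # A file source needs an explicit schema for streaming reads.
    schema = StructType([StructField("value", StringType())])

    # Raw input lands in /data/incoming; files Spark has finished reading are
    # moved to /data/archived so they cannot be picked up again.
    raw = (
        spark.readStream
        .schema(schema)
        .option("cleanSource", "archive")              # Spark 3.0+ file-source option
        .option("sourceArchiveDir", "/data/archived")  # processed inputs move here
        .csv("/data/incoming")
    )

    # Results go to a third location, never back into /data/incoming.
    # checkpointLocation is required for a file sink; it is what lets Spark
    # remember which input it has already committed across restarts.
    query = (
        raw.writeStream
        .format("parquet")
        .option("path", "/data/processed")
        .option("checkpointLocation", "/data/checkpoints")
        .start()
    )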

I hope this works for you.

Utkarsh Pal