
I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) If so, how: per worker or per core?

It seems that in previous versions of Spark it was per worker, but I have no idea about the current version. I know that it is backed by HDFS, but I found nothing explaining how the in-memory store actually works.

Is it indeed a distributed in-memory store? I am particularly interested in deduplication: if data is streamed from, say, a large data set, this needs to be planned for, because the whole "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to size the workers or the master depending on how that state store works.
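To make the sizing concern concrete, here is a minimal sketch (plain Python, not Spark code) of why deduplication state grows with the number of distinct keys — the function name is illustrative:

```python
# Conceptual sketch (not Spark): streaming deduplication must remember every
# distinct key it has ever seen, so the state grows with unique keys.
def dedup_stream(records):
    seen = set()            # this is what the state store must hold in memory
    for r in records:
        if r not in seen:   # emit only the first occurrence of each key
            seen.add(r)
            yield r
    # len(seen) == number of distinct keys, hence the capacity-planning concern

stream = ["a", "b", "a", "c", "b", "d"]
print(list(dedup_stream(stream)))  # ['a', 'b', 'c', 'd']
```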

halfer
MaatDeamon
  • http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming (disclaimer: I'm the author). – Yuval Itzchakov Aug 17 '18 at 12:24
  • But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation. – MaatDeamon Aug 17 '18 at 14:34
  • I tried to read this doc: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-StateStore.html but it contradicts the blog, which states that there is no per-executor store anymore. – MaatDeamon Aug 17 '18 at 14:39
  • Could you please explain a little how the key-value store works? If it is HDFS-backed, then things are stored on disk in a distributed way, I think; but once it is in memory, I wonder how the different cores on the respective executors can access the same view of the data if it is not distributed while in memory. Can you help with this? – MaatDeamon Aug 17 '18 at 14:41
  • It works as follows: you have a `ConcurrentHashMap` placed on each of your executors. Data is partitioned across these maps by the partitioner defined in Spark. Every micro-batch, the HDFS-backed state store takes all the updated keys and stores them asynchronously in HDFS. Every once in a while compaction happens, and the saved delta files are turned into a "snapshot" file. Additionally, deletion happens. – Yuval Itzchakov Aug 18 '18 at 10:57
  • @yuval So each executor does not see the complete state across executors, but only its local state? Say, for instance, I have a state that stores a count of each word seen in a stream of words. According to your explanation, each executor will store a local count per word in its ConcurrentHashMap, but will not be able to "see" the global count per word across all executors. – AbhinavChoudhury Mar 15 '19 at 16:35
  • @AbhinavChoudhury That is correct. – Yuval Itzchakov Mar 15 '19 at 20:10
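The partitioning described in the comments above can be sketched conceptually (plain Python, not the actual Spark implementation; the number of partitions and the word-count workload are illustrative). The key point is that a given key always hashes to the same partition, so its state is complete within one executor-local map, while no single map holds the global view:

```python
# Conceptual sketch of hash-partitioned, executor-local state maps.
NUM_PARTITIONS = 3

def partition_for(key):
    # same key -> same partition within a run, as with a Spark partitioner
    return hash(key) % NUM_PARTITIONS

# one local key->count map per partition, as if held on separate executors
executor_state = [dict() for _ in range(NUM_PARTITIONS)]

for word in ["spark", "state", "spark", "store", "state", "spark"]:
    local = executor_state[partition_for(word)]
    local[word] = local.get(word, 0) + 1

# Every occurrence of a key lands in the same partition, so each per-key
# count is complete in exactly one local map; no map sees all the keys.
combined = {}
for local in executor_state:
    combined.update(local)
print(combined)  # per-key counts, assembled only for display
```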

1 Answer


There is only one implementation of the State Store in Structured Streaming, and it is backed by an in-memory HashMap and HDFS. The in-memory HashMap is used for data storage, while HDFS provides fault tolerance. The HashMap occupies executor memory on the worker, and each HashMap holds a versioned key-value view of an aggregated partition (generated after stateful operators such as deduplication, groupBy, etc.).
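The "versioned key-value" idea can be sketched as follows (a simplified Python model, not the actual Spark source): each micro-batch commits a new version of the state, and only the delta of updated keys is what would be persisted to HDFS for fault tolerance:

```python
# Simplified model of a versioned key-value state store. Class and method
# names are illustrative, not Spark's actual API.
class VersionedStateStore:
    def __init__(self):
        self.committed = {}   # version -> full key/value map for that version
        self.deltas = {}      # version -> only the keys updated in that batch

    def update_and_commit(self, version, updates):
        # start from the previous committed version, apply this batch's updates
        base = dict(self.committed.get(version - 1, {}))
        base.update(updates)
        self.committed[version] = base
        self.deltas[version] = dict(updates)  # the delta that would go to HDFS
        return base

store = VersionedStateStore()
store.update_and_commit(1, {"cat": 1, "dog": 1})  # micro-batch 1
store.update_and_commit(2, {"cat": 2})            # only "cat" changed in batch 2
print(store.committed[2])  # {'cat': 2, 'dog': 1}
print(store.deltas[2])     # {'cat': 2}
```

Compaction, as mentioned in the comments, would periodically merge a run of such deltas into a single snapshot file so that recovery does not have to replay every delta.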

But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.

You are correct that there is no such documentation available. I had to read the code (2.3.1) myself, and I wrote an article on how the State Store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

chandan prakash