
I have a very large number of keys and a limited cluster size. I am using mapWithState to update my state. As new data comes in, the number of keys increases. When I look at the Storage tab of the Spark UI, MapWithStateRDD is always stored in memory.

At line 109 of the source file MapWithStateDStream.scala, persist is called with the storage level set to MEMORY_ONLY. Does this mean my application will crash if I have too many keys?

Yuval Itzchakov
Rishi

1 Answer


When I went to the Storage tab of the Spark UI, MapWithStateRDD is always stored in memory

Spark uses its own hash-map implementation, OpenHashMapBasedStateMap, to store the state internally. This means the values are kept in memory on the executors, not in a persistent store.
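To make that concrete, here is a minimal mapWithState sketch (hypothetical names; assumes a running StreamingContext and a local socket source). Every distinct key ever seen keeps an entry in the in-memory state map backing MapWithStateRDD:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Running count per key: each new key adds an entry to the
// in-memory OpenHashMapBasedStateMap inside MapWithStateRDD.
def updateCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
  state.update(newCount) // state lives in executor memory
  (key, newCount)
}

val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint") // mapWithState requires checkpointing

// 'events' is a hypothetical DStream[(String, Int)] of (key, increment) pairs
val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))
val counts = events.mapWithState(StateSpec.function(updateCount _))
```

Because the state map is never spilled to disk by default, the total number of live keys times the size of each state entry must fit in the executors' storage memory.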

Does this mean my application will crash if I have too many keys?

It means your cluster needs sufficient resources to hold all the keys simultaneously, since the state is persisted in memory. If the keys outgrow available memory, the job will fail with out-of-memory errors rather than gracefully degrading. If you're resource-limited, you'll need to trim the state you save per key so it all fits, or evict keys you no longer need. Otherwise, consider an external persistent store for your state.
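If the working key set can be bounded, one common mitigation (a sketch under assumed names, not your code) is to set an idle timeout on the StateSpec so Spark evicts keys that stop receiving data, and to call state.remove() once a key is finished:

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

def updateCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  if (state.isTimingOut()) {
    // Key was idle past the timeout; Spark drops its state entry after this call.
    (key, state.get)
  } else {
    val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
    if (newCount >= 100) state.remove()   // hypothetical "done" condition frees the entry
    else state.update(newCount)
    (key, newCount)
  }
}

// Evict state for any key that receives no data for 30 minutes.
val spec = StateSpec.function(updateCount _).timeout(Minutes(30))
```

With a timeout configured, memory usage tracks the recently active key set rather than every key ever observed, which is usually what keeps a long-running mapWithState job from growing without bound.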

Yuval Itzchakov