
I am using Spark to perform some operations on my data. I need to use an auxiliary dictionary to help with my data operations.

streamData = sc.textFile("path/to/stream")
dict = sc.textFile("path/to/static/file")
# some logic like:
# if streamData["field"] exists in dict:
#     do something
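
For example, something like this (a rough sketch, assuming the static file holds one key per line and is small enough to collect to the driver):

# Collect the static file into a Python set for fast membership checks.
lookup_keys = set(sc.textFile("path/to/static/file").collect())

# Keep only the records whose value appears in the static dictionary.
matches = streamData.filter(lambda record: record in lookup_keys)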

My question is: does the dict stay in memory the whole time, or does it need to be loaded and unloaded each time Spark works on a batch?

Thanks

derek

1 Answer


The dict will remain in memory unless it needs to be evicted to make room for other objects that need memory at runtime. If you need to reuse it later, call dict.cache() after initializing it. If the RDD is too large to cache in memory, you can instead persist it to disk with .persist(StorageLevel.DISK_ONLY). This post has a useful summary of RDD mechanics.
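
A minimal sketch of that pattern in PySpark (the app name and variable names here are just illustrative):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="dict-lookup")
lookup = sc.textFile("path/to/static/file")

# Keep the RDD in memory across batches; Spark may still evict
# partitions under memory pressure and recompute them when needed.
lookup.cache()

# Alternative for an RDD too large to hold in memory:
# lookup.persist(StorageLevel.DISK_ONLY)

lookup.count()  # any action materializes the cached RDD

Note that cache() only marks the RDD for caching; it is actually stored the first time an action computes it.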

Paul Back