I have a question about what happens when we broadcast a DataFrame.
Copies of the broadcast DataFrame are sent to each executor.
So, when does Spark evict these copies from each executor?
I find this topic functionally easy to understand, but the manuals are harder to follow technically, and improvements are always in the offing.
My take:

- There is a `ContextCleaner` running on the Driver for every Spark application. It is created and started immediately when the `SparkContext` starts, and it handles all sorts of objects in Spark, not just broadcasts.
- The `ContextCleaner` thread cleans RDD, shuffle, broadcast, and accumulator state using its `keepCleaning` method, which runs for the lifetime of the class. It decides which objects are eligible for eviction because they are no longer referenced, and these get placed on a list via registration methods such as `registerShuffleForCleanup`.
- That is to say, a check is made to see whether there are no alive root objects pointing to a given object; if so, that object is eligible for clean-up (eviction).
- A `context-cleaner-periodic-gc` task asynchronously requests the standard JVM garbage collector. Periodic runs of this start when the `ContextCleaner` starts and stop when the `ContextCleaner` terminates.
- Spark makes use of the standard Java GC.
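The reachability check described above can be illustrated with a small analogy in plain Python (this is not Spark code, just a sketch of the idea): a tracked object stays alive while a root reference points to it, and once the last root is gone, an asynchronous GC pass reclaims it.

```python
import gc
import weakref


class BroadcastState:
    """Stands in for a tracked object (e.g. broadcast state) held by a cleaner."""
    pass


state = BroadcastState()        # a "root" reference keeps the object alive
tracker = weakref.ref(state)    # the cleaner holds only a weak reference

assert tracker() is not None    # still reachable from a root

del state                       # drop the last strong (root) reference
gc.collect()                    # analogous to the periodic GC request

# No roots left, so the object was collected: the tracker now sees nothing
evicted = tracker() is None
print("evicted:", evicted)
```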
This is a good reference alongside the official Spark docs: https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html