I found this link https://gist.github.com/BenFradet/c47c5c7247c5d5d0f076 which shows a Spark implementation where a broadcast variable is updated. Is this a valid implementation, i.e. will executors see the latest value of the broadcast variable?
1 Answer
The code you are referring to uses the Broadcast.unpersist() method. The Spark API docs for Broadcast.unpersist() say: "Asynchronously delete cached copies of this broadcast on the executors. If the broadcast is used after this is called, it will need to be re-sent to each executor." There is an overloaded unpersist(boolean blocking) method that blocks until unpersisting has completed, so the behavior depends on how you are using the broadcast variable in your Spark application. Spark does not automatically re-broadcast when you mutate a broadcast variable; the driver has to resend it. The Spark documentation says you should not modify a broadcast variable (it is meant to be immutable) to avoid any inconsistency in processing at the executor nodes, but the unpersist() and destroy() methods are available if you want to control the broadcast variable's life cycle. Please refer to this Spark JIRA: https://issues.apache.org/jira/browse/SPARK-6404
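The unpersist-then-re-broadcast pattern described above can be sketched as follows. Note that `Broadcast` here is a minimal stand-in class, not the real pyspark API, so the driver-side lifecycle can be shown without a running cluster; `refresh_broadcast` is a hypothetical helper name.

```python
# Sketch of the driver-side "unpersist and re-broadcast" pattern.
# Broadcast is a minimal stand-in (NOT the real Spark API) so the
# lifecycle can be illustrated without a SparkContext.

class Broadcast:
    """Stand-in for pyspark.Broadcast: wraps a read-only value."""
    def __init__(self, value):
        self.value = value
        self._valid = True

    def unpersist(self, blocking=False):
        # Real Spark deletes cached copies on the executors; if the
        # variable is used again it is re-sent. Here we only mark it.
        self._valid = False


def refresh_broadcast(broadcast_fn, old_bc, new_data):
    """Driver-side refresh: drop executor copies of the old variable,
    then create a brand-new broadcast carrying the new data."""
    if old_bc is not None:
        old_bc.unpersist(blocking=True)  # wait until caches are cleared
    return broadcast_fn(new_data)        # new variable, re-sent on use


# Usage: the driver periodically replaces the broadcast variable.
bc = refresh_broadcast(Broadcast, None, {"a": 1})
bc = refresh_broadcast(Broadcast, bc, {"a": 1, "b": 2})
print(bc.value)  # {'a': 1, 'b': 2}
```

With the real API, `broadcast_fn` would be `sc.broadcast`, and tasks that reference the new variable pick up the new data because it is a different broadcast object, not a mutation of the old one.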

- Thanks for this info. My use-case: I want to download some key-value data from a remote server, store it as a hash map, and broadcast it to all executors for local lookup. Then, after say 2 minutes, check the remote server for new data; if there is new data, fetch it, add it to the hash map, and broadcast it to all executors again so they can do local lookups with the NEW data. I think I can achieve this using 2 broadcast variables. – sunillp Sep 28 '16 at 11:17
- First, broadcast_var1 with the initial data will be broadcast. Later, when the data changes, broadcast_var2 with the new data will be broadcast and broadcast_var1 will be unpersisted with blocking=true. Executors that were using broadcast_var1 will then get an exception and, based on that, switch to broadcast_var2. The next time the data changes, the same thing repeats with the roles of the two broadcast variables swapped. Do you think this is possible / valid? – sunillp Sep 28 '16 at 11:23
- Broadcast variables should be used to send large data to executors once, not frequently. Calling unpersist(blocking=true) every 2 minutes will hamper the performance of stream processing. It also depends on the logic you apply to the broadcast variable's value. How big is your key-value data? Can't you put it inside a closure so it gets serialized to the executors? – abaghel Sep 28 '16 at 14:48
- I have around 50K keys, and the values are nested structures, altogether around 50 MB. I am not aware of the closure option you mentioned. Can you shed more light on it, and do you think it will solve my problem, i.e. sending updated key-value data to all workers at regular intervals? Also, can you give me an example or pointer that explains this? – sunillp Sep 28 '16 at 15:50
- Also, updates will happen rarely, maybe once an hour, not every 2 minutes. – sunillp Sep 28 '16 at 17:03
- That should work. I have edited my answer and added the Spark JIRA link. Hope this helps. – abaghel Sep 28 '16 at 23:57
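The two-broadcast-variable swap scheme from the comments can be sketched like this. Again, `Broadcast` is a stand-in class rather than the real Spark API, and `lookup` is a hypothetical helper. One caveat worth hedging: in real Spark, using a broadcast after unpersist() typically causes it to be re-sent rather than raising an error, whereas destroy() makes further use fail, so the exception-based switch described in the comments is closer to destroy() semantics.

```python
# Sketch of the two-broadcast-variable swap scheme, using a stand-in
# Broadcast class (NOT the real Spark API). The executor-side lookup
# tries the primary variable and falls back to the secondary one when
# the primary has been retired (simulated here as an exception).

class Broadcast:
    def __init__(self, value):
        self._value = value
        self._valid = True

    @property
    def value(self):
        if not self._valid:
            raise RuntimeError("broadcast was retired")
        return self._value

    def unpersist(self, blocking=False):
        self._valid = False


def lookup(key, primary, secondary):
    """Executor-side lookup: use primary; switch to secondary on failure."""
    try:
        return primary.value.get(key)
    except RuntimeError:
        return secondary.value.get(key)


# Driver: broadcast_var1 carries the initial data ...
var1 = Broadcast({"k1": "old"})

# ... later, new data arrives: broadcast var2 first, then retire var1.
var2 = Broadcast({"k1": "new", "k2": "extra"})
var1.unpersist(blocking=True)

print(lookup("k1", var1, var2))  # new
```

On the next refresh the roles of var1 and var2 swap, as the commenter describes. Ordering matters in this sketch: var2 must be live before var1 is retired, so a lookup never finds both variables unavailable.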