
I want to cache some data (ndarrays) locally on the worker nodes so I can compare it with the ndarrays arriving in RDDs from Spark Streaming. What is the best way to do this?

I want to compare the ndarrays stored in my files against each individual ndarray passed in from Spark Streaming. It doesn't seem like I can load that data into an RDD, since I cannot access one RDD inside the map function of another RDD. I also tried loading the arrays into a list on the master node and broadcasting it to the worker nodes, but I got an error that the broadcast variable is not iterable when I tried to loop over it and compare with the incoming data.
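The failure I'm seeing can be reproduced without a Spark cluster. `FakeBroadcast` below is a tiny stand-in for `pyspark.Broadcast` (not the real class) that I'm using only to show the behaviour: the wrapper object itself is not iterable, while its payload is:

```python
import numpy as np

class FakeBroadcast:
    """Minimal stand-in for pyspark.Broadcast: the payload sits behind .value."""
    def __init__(self, value):
        self.value = value

bd_array = FakeBroadcast([np.arange(3), np.arange(3) + 1])

try:
    for arr in bd_array:  # iterating the wrapper itself, as in my code
        pass
except TypeError as e:
    print("not iterable:", e)

for arr in bd_array.value:  # iterating the payload works
    print(arr.shape)
```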

BVBC
  • "But I got an error that broadcast variable is not iterable when I try to go through them and comparing with the incoming data" - did you forget to use `value` (`bd_array = sc.broadcast(np.arange(100)); bd_array.value.shape`)? – Alper t. Turker May 04 '18 at 17:21
  • @user9613318 Oh, yeah, I totally forgot that. Thanks! – BVBC May 04 '18 at 21:40
  • Was comment by @user9613318 your answer? In that case he can add it as an answer. – VictorGGl May 15 '18 at 11:34

1 Answer


The issue here is that you need to read the `value` attribute to get the actual payload of the broadcast variable. Following the example in the comment by @user9613318:

import numpy as np

bd_array = sc.broadcast(np.arange(100))

This creates a NumPy array for that range and broadcasts it to all workers. If you use the variable as just `bd_array`, you get a `Broadcast` object, which exposes methods such as `persist` and `destroy` but is not iterable. If you read it with `bd_array.value`, you get back the broadcast NumPy array, which can be iterated over (see the PySpark `Broadcast` docs).
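To make the comparison step concrete, here is a minimal NumPy-only sketch. The names `cached` and `incoming` are placeholders, not from the question: in the real job, `cached` would be `bd_array.value` (the broadcast list of ndarrays, read inside the function you pass to the stream's map), and `incoming` one ndarray from the stream:

```python
import numpy as np

# Stand-ins: in the real job, `cached` is bd_array.value and
# `incoming` is a single ndarray arriving from the stream.
cached = [np.arange(5), np.arange(5) * 2, np.ones(5)]
incoming = np.array([0, 2, 4, 6, 8])

# Iterate the *payload* (a plain Python list of ndarrays),
# not the Broadcast wrapper, and compare element-wise.
matches = [i for i, arr in enumerate(cached) if np.array_equal(arr, incoming)]
print(matches)  # -> [1]
```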

VictorGGl