I want to cache some data (NumPy ndarrays) locally on the worker nodes so I can compare it against the ndarrays arriving in the RDDs of a Spark Streaming job. What is the best way to do this?
I need to compare the ndarrays stored in my files against each individual ndarray that arrives from Spark Streaming. Loading the stored arrays into their own RDD doesn't seem to work, because I can't iterate over one RDD inside the map function of another. I also tried loading them into a list on the master node and broadcasting that list to the workers, but when I iterate over the broadcast variable to compare against the incoming data, I get an error saying the broadcast variable is not iterable.
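Roughly what I tried looks like this (simplified; the file paths, socket source, and the distance metric are just placeholders for illustration):

```python
import numpy as np
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="compare-ndarrays")
ssc = StreamingContext(sc, batchDuration=5)

# Load the reference ndarrays on the driver and broadcast them to the workers
reference_arrays = [np.load(p) for p in ["ref1.npy", "ref2.npy"]]
bc_refs = sc.broadcast(reference_arrays)

def compare(incoming):
    # incoming is a single ndarray parsed from the stream;
    # iterating over the broadcast variable here is where it fails
    return [float(np.linalg.norm(incoming - ref)) for ref in bc_refs]

lines = ssc.socketTextStream("localhost", 9999)
arrays = lines.map(lambda s: np.array([float(x) for x in s.split(",")]))
distances = arrays.map(compare)
distances.pprint()

ssc.start()
ssc.awaitTermination()
```

The `for ref in bc_refs` line is where I get the "broadcast variable is not iterable" error.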