I have a Spark job that generates a set of results together with statistics. The number of work items is larger than the number of slaves, so each slave processes more than one item.
I cache the results after generating the RDD objects so I can reuse them, since I have multiple write operations: one for the result objects and another for the statistics. Both write operations use saveAsHadoopFile.
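For context, the flow looks roughly like the sketch below; the input path, output paths, and the per-item processing are placeholders rather than my actual code, the point is one cached RDD feeding two saveAsHadoopFile calls:

```scala
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("results-and-stats"))

// Stand-in for the real per-item processing (the actual job is much more expensive).
val results = sc.textFile("/input/items")
  .map(item => (item, item.reverse))

results.cache() // keep the computed results so the second write can reuse them

// Write 1: the result objects.
results.saveAsHadoopFile("/output/results",
  classOf[String], classOf[String],
  classOf[TextOutputFormat[String, String]])

// Write 2: statistics derived from the same cached RDD.
val stats = results.map { case (item, result) => (item, result.length.toString) }
stats.saveAsHadoopFile("/output/stats",
  classOf[String], classOf[String],
  classOf[TextOutputFormat[String, String]])
```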
Without caching, Spark re-runs the job for each write operation, which takes a long time and repeats the same computation twice (more if I had additional writes).
With caching I hit the memory limit. Some of the previously computed results are lost during caching and I see "CacheManager:58 - Partition rdd_1_0 not found, computing it" messages. Spark eventually ends up in an infinite loop, trying to cache more results while losing others.
I am aware that Spark has different storage levels for caching, and a memory-plus-disk level (e.g. MEMORY_AND_DISK) would probably solve my problem. But I am wondering whether we can write the files directly from the workers, without generating RDD objects at all. I am not sure if that is possible. Is it?
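To make the question concrete, what I have in mind is something along these lines: each task writes its partition's output straight to HDFS as the items are processed, instead of building RDDs and calling saveAsHadoopFile twice. This is just a sketch of the idea with placeholder paths and processing; I do not know whether it is a supported or safe pattern:

```scala
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

sc.textFile("/input/items").foreachPartition { items =>
  // Each task opens its own output file and writes results as they are produced.
  val fs = FileSystem.get(new Configuration())
  val partId = TaskContext.get().partitionId()
  val out = new PrintWriter(fs.create(new Path(s"/output/direct/part-$partId")))
  try {
    items.foreach { item =>
      val result = item.reverse // stand-in for the real processing
      out.println(item + "\t" + result)
    }
  } finally {
    out.close()
  }
}
```

The statistics could presumably be written to a second file inside the same loop, which is why this approach looks attractive to me.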