
I have a Spark job that generates a set of results along with statistics. The number of work items is larger than the number of slaves, so each slave processes more than one work item.

I cache the results after generating the RDD so I can reuse them, because I have multiple write operations: one for the result objects and another for the statistics. Both writes use saveAsHadoopFile.

Without caching, Spark reruns the job for each write operation, which takes a long time and repeats the same execution twice (more if I had more writes).

With caching I hit the memory limit. Some previously computed partitions are lost during caching, and I see "CacheManager:58 - Partition rdd_1_0 not found, computing it" messages. Spark eventually goes into an infinite loop as it tries to cache more results while losing others.

I am aware that Spark has different storage levels for caching, and using memory + disk would solve our problem. But I am wondering whether we can write the files directly from the worker, without generating RDD objects. I am not sure whether that is possible. Is it?
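For reference, a minimal sketch of the memory + disk caching approach mentioned above, assuming a keyed RDD built from text input; the input path, keying logic, statistics computation, and output paths are all placeholders:

    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: persist with MEMORY_AND_DISK so partitions evicted
    // from memory spill to local disk instead of being recomputed for the
    // second write.
    val sc = new SparkContext(new SparkConf().setAppName("cache-and-write"))

    val results = sc.textFile("hdfs:///input/data")
      .map(line => (line.split(",")(0), line))   // (key, record) pairs

    results.persist(StorageLevel.MEMORY_AND_DISK)

    // First write: the result objects themselves.
    results.saveAsHadoopFile[TextOutputFormat[String, String]]("hdfs:///output/results")

    // Second write: statistics derived from the same persisted partitions.
    val stats = results.mapValues(record => record.length.toString)
    stats.saveAsHadoopFile[TextOutputFormat[String, String]]("hdfs:///output/stats")

    results.unpersist()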


1 Answer


It turns out that writing files inside a Spark worker process is no different from writing a file in any Java process. The write operation just requires code that serializes the data and saves the files to HDFS. This question has several answers on how to do it.

saveAsHadoopFile is just a convenient way of doing it.
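For illustration, a minimal sketch of such a direct write, using the standard Hadoop FileSystem API inside foreachPartition. It assumes `results` is an RDD of (key, value) string pairs; the output directory and record format are hypothetical:

    import java.io.{BufferedWriter, OutputStreamWriter}

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.TaskContext

    // Hypothetical sketch: each worker writes its own partition straight to
    // HDFS, with no second pass over an RDD and no reliance on caching.
    results.foreachPartition { records =>
      val conf = new Configuration()
      // One file per partition, named after the partition id.
      val path = new Path(s"hdfs:///output/manual/part-${TaskContext.get.partitionId()}")
      val fs   = path.getFileSystem(conf)
      val out  = new BufferedWriter(new OutputStreamWriter(fs.create(path, true)))
      try {
        records.foreach { case (key, value) =>
          out.write(s"$key\t$value")
          out.newLine()
        }
      } finally {
        out.close()
      }
    }

Note that, unlike saveAsHadoopFile, this hand-rolled version gives you no job/task commit protocol, so handling partial output from failed or speculative tasks is up to you.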
