I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B, so each node holds a complete copy of the dataset.
When Spark starts processing the data, I have seen two ways in which it loads the dataset as input. It either loads the entire dataset into memory on one node and performs most of the tasks there, or it loads the dataset onto both nodes and spreads the tasks across them (based on what I observed in the history server). In both cases there is sufficient memory to keep the whole dataset in memory.
I repeated the same experiment multiple times, and Spark seemed to alternate between these two behaviors. As far as I know, Spark inherits the input split locations the same way a MapReduce job does, and MapReduce should be able to take advantage of both nodes. I don't understand why Spark (or MapReduce) would alternate between the two cases.
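To check what locality information Spark actually sees for each input split, I ran a minimal sketch like the one below in spark-shell (the HDFS path is a placeholder, not my real one). With one partition per HDFS block and replication on both A and B, I would expect both hosts to be listed as preferred locations for every partition:

```scala
// Assumed placeholder path; replace with the real CSV location.
val rdd = sc.textFile("hdfs:///data/input.csv")

// One partition is normally created per HDFS block, so this should print 2.
println(s"partitions: ${rdd.partitions.length}")

// Print the hosts Spark would prefer for each partition; with the blocks
// replicated on nodes A and B, both should appear for every partition.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}
```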
When only one node is used for processing, performance is worse.