I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B, so each node holds a complete copy of the dataset.
When Spark starts processing the data, I have seen two ways in which it loads the dataset as input. It either loads the entire dataset into memory on one node and performs most of the tasks there, or it loads the dataset onto both nodes and spreads the tasks across them (based on what I observed in the history server). In both cases there is sufficient memory to keep the whole dataset in memory.
I repeated the same experiment multiple times, and Spark seemed to alternate between these two behaviors. As far as I know, Spark inherits the input split locations the same way a MapReduce job does, and MapReduce should be able to take advantage of both nodes. I don't understand why Spark (or MapReduce) would alternate between the two cases.
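To check what locality information Spark actually sees for each input split, I ran a minimal sketch like the one below in spark-shell (the HDFS path is a placeholder, not my real one). With one partition per HDFS block and replication on both A and B, I would expect both hosts to be listed as preferred locations for every partition:

```scala
// Assumed placeholder path; replace with the real CSV location.
val rdd = sc.textFile("hdfs:///data/input.csv")

// One partition is normally created per HDFS block, so this should print 2.
println(s"partitions: ${rdd.partitions.length}")

// Print the hosts Spark would prefer for each partition; with the blocks
// replicated on nodes A and B, both should appear for every partition.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}
```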
When only one node is used for processing, performance is worse.