Hadoop on Azure - file processing on larger number of nodes takes the same amount of time

Question

I ran a wordcount program in python on HDInsight clusters of different size and every time it took the same amount of time. The file size is 600 MB and I ran it on 2, 4 and 8 nodes - every time the same amount of time (not to the second but very close).

I expected the time to change since the file is processed by larger number of nodes as the cluster grows in size... I am wondering if this is the case with a file which is relatively small? Or is there a way to define number of nodes on which the job should be done? - I personally don't think so since the cluster size is set in advance.

Or is it the nature of the wordcount application and the fact that the reducer does the same amount of work?

Or is because it's python - I read somewhere it is said to be slower than java (or scala on spark)?

The same thing happens on Spark clusters - although the nodes number goes up the time does not go down.

Can you please edit your question with information on how large your dataset is? Also: leave out comments like "I heard language x is slower than language y." — David Makogon, Mar 07 '16 at 14:34
How large a dataset would be good? At least 1 gb? Several gb? — piterd, Mar 07 '16 at 14:58
I wrote about python being slow because I would like somebody to refer to this as well — piterd, Mar 07 '16 at 14:58
It depends on so much but if you are using 8 worker nodes, 4 cores each, and a block size of 512mb you'll need a minimum of 16 gb of data to fully utilize each core for one map task.. Python will be marginally slower than scala as after writing your python code the program will convert it to Scala before running. Slow enough to matter? I'm not sure — Andrew Moll, Mar 08 '16 at 05:18
you can look in the core-site.xml file, but I believe it is 512mb for HDI — Andrew Moll, Mar 08 '16 at 21:25

score 0 · Answer 1 · edited May 23 '17 at 10:28

Per my experience, 600MB data size for processing on Hadoop is small. Not all time cost for processing files, because Hadoop need some time to prepare startup for M/R job & data on HDFS.

For a small dataset, it's not necessary for using too more compute nodes. Even, the performance got by a single computer would be higher than the cluster on Hadoop, such as the Hadoop sample wordcount for several small text files.

As I known, the dataset size on Hadoop need to over hundreds of GB level generally for performance advantage, and performance increase with an increase in the number of nodes.

As references, there is a SO thread (Why submitting job to mapreduce takes so much time in General?) that you can know.

Hadoop on Azure - file processing on larger number of nodes takes the same amount of time

1 Answers1