I ran a word-count program in Python on HDInsight clusters of different sizes, and it took roughly the same amount of time every run. The input file is 600 MB, and I ran the job on 2, 4, and 8 nodes - the runtime was practically identical each time (not to the second, but very close).
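For reference, the job is essentially the classic Hadoop Streaming word count; my code is slightly different, but roughly equivalent to this simplified mapper/reducer pair (file names and details are just illustrative):

```python
# mapper.py - emits one "<word>\t1" line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - sums counts per word; streaming delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

# flush the last key
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```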
I expected the runtime to drop, since the file is processed by a larger number of nodes as the cluster grows... Is this simply what happens with a relatively small file? Or is there a way to specify the number of nodes a job should run on? I personally don't think there is, since the cluster size is set in advance.
Or is it the nature of the word-count application itself, where the reducer does the same amount of work regardless of cluster size?
Or is it because it's Python - I have read it is said to be slower than Java (or Scala on Spark)?
The same thing happens on Spark clusters - even though the number of nodes goes up, the runtime does not go down.
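For completeness, the Spark version is essentially the standard PySpark word count; a simplified sketch (the storage paths are placeholders) is below. I assume the number of input partitions - which you can inspect with `getNumPartitions()` - is what actually determines the parallelism rather than the node count, but I'm not sure:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# 600 MB input file on the cluster's default storage (placeholder path)
text = sc.textFile("wasb:///example/data/input.txt")
print("input partitions:", text.getNumPartitions())  # how many tasks the read produces

counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("wasb:///example/wcout")
sc.stop()
```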