WordCount is a simple but inefficient example. Use it to validate that your cluster is working, but NEVER for performance tests.
Let me explain why.
WordCount parses each line of text and, for each word found, writes the record (WORD, 1) to the mapper output. As you can see, the full output of the mappers will be bigger than the input, and that larger mapper output becomes the input of the reducers. So you end up reading more than twice the amount of the input data, and writing the original input plus the counters to disk.
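For reference, here is what the classic WordCount mapper looks like (a minimal sketch based on the standard Hadoop example; the class name is illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // One (WORD, 1) record per word: the mapper output carries
            // every word of the input plus a counter for each occurrence,
            // which is why it is bigger than the input itself.
            context.write(word, ONE);
        }
    }
}
```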
On top of that, you need to transfer the mapper output to the reducers over the network (the shuffle phase). And if you are using only one reducer, that last step does essentially the same work as your sequential job.
The job can be optimized, for example by using a combiner (to pre-aggregate counts on the map side, shrinking the shuffle) and multiple reducers, as in the sketch below.
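A minimal driver sketch showing both optimizations (class names like `WordCountDriver` and `IntSumReducer` are illustrative, taken from the standard Hadoop examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        // Summing counts is associative and commutative, so the reducer
        // can double as a combiner: partial sums are computed on the map
        // side, so far less data crosses the network in the shuffle.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // Multiple reducers spread the final aggregation across the
        // cluster instead of funneling everything through one task.
        job.setNumReduceTasks(4);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```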
Hadoop will be faster than a local sequential job when the amount of data is bigger than what your local resources (RAM, disk, CPU) can handle, and/or when the cost of initializing the containers and transferring data among them is amortized by the number of nodes working in parallel.