
I'm a newbie to Hadoop.

I recently implemented the WordCount example.

But when I run this program on my single node with 2 input files containing just 9 words in total, it takes nearly 33 seconds! That seems crazy to me, and I find it very confusing.

Can anyone tell me whether this is normal?

How can I fix this problem? Remember, I only created 2 input files with 9 words in them.

Submit Host Address: 127.0.0.1
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Fri Aug 05 14:27:22 CST 2011
Finished at: Fri Aug 05 14:27:53 CST 2011
Finished in: 30sec

jackalope

2 Answers


This is not unusual. Hadoop comes into its own with large datasets. What you are seeing is mostly the fixed startup overhead of a Hadoop job.

Otto Allmendinger
  • But why is it so slow for very small input? – jackalope Aug 05 '11 at 07:55
  • Think about it; it might take 30s to set up the platform, but this is neither here nor there when you are processing gigabytes or terabytes of data. It's not designed for small amounts of data. – Adrian Mouat Aug 05 '11 at 08:00
  • OK, this time I ran it with 15 files, about 28.9K in total, and it took 1 min 11 sec! Is that still just setup time? – jackalope Aug 05 '11 at 08:10
  • Real testing starts with sizes over multiple gigabytes. I started my tests with files that were at least 20 GiB, because below that size it's not a real challenge for Hadoop. Are you working in pseudo-distributed mode, or have you set up a small cluster with at least three data nodes and one name node? – khmarbaise Aug 05 '11 at 08:23
  • I don't believe this either! I had already started the Hadoop daemons before submitting a job. What is the start-mapred.sh script for otherwise? And it **still** takes more than 30 seconds :( – akira Aug 26 '11 at 13:27

Hadoop is not efficient for very small jobs, because a large share of the time goes into JVM startup, process initialization and other per-job overhead. It can be optimized to some extent by enabling task JVM reuse:

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
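
As a rough sketch (assuming the old org.apache.hadoop.mapred API that the 0.20 tutorial above uses; the driver class name here is just illustrative), JVM reuse can be turned on from the job driver, or equivalently by setting mapred.job.reuse.jvm.num.tasks in mapred-site.xml:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Let each task JVM run an unlimited number of tasks (-1)
            // instead of forking a fresh JVM per task; this sets
            // mapred.job.reuse.jvm.num.tasks under the hood.
            conf.setNumTasksToExecutePerJvm(-1);

            // ... configure mapper, reducer and input/output paths
            // exactly as in the WordCount tutorial, then submit:
            // JobClient.runJob(conf);
        }
    }

With only 9 words of input this will shave off some per-task overhead, but the fixed job setup cost you measured will still dominate.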

There is also ongoing work on this in Apache Hadoop:

https://issues.apache.org/jira/browse/MAPREDUCE-1220

Not sure in which release this will be included or what the state of the JIRA is.

Praveen Sripati