
I'm running Nutch on Elastic MapReduce with 3 worker nodes. I'm using Nutch 1.4 with the default configuration it ships with (after adding a user agent).

However, even though I'm crawling a list of 30,000 domains, the fetch step only runs on one worker node, while the parse step runs on all three.

How do I get the fetch step to run on all three nodes?

*EDIT*: The problem was that I needed to set the mapred.map.tasks property to the size of my Hadoop cluster. You can find this documented here.
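For reference, a minimal sketch of that setting, assuming it goes into conf/mapred-site.xml on the cluster (the value 3 here simply matches the cluster size mentioned above; Hadoop treats this property as a hint rather than a hard limit):

<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
  <!-- Hint for the default number of map tasks per job;
       Hadoop may still adjust it based on input splits. -->
</property>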

cberner

1 Answer


By default, Nutch partitions URLs based on their hosts. The corresponding property in nutch-default.xml is:

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is 'byHost', 
  also takes 'byDomain' or 'byIP'. 
  </description>
</property>

Please verify the value on your setup.
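For context, local overrides go in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch, assuming you wanted to switch to domain-level partitioning instead of the byHost default shown above:

<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
  <!-- Overrides the byHost default; valid values per the
       description above are byHost, byDomain, and byIP. -->
</property>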

I think your problem can be diagnosed by answering these questions:

  1. How many mappers were created for the fetch job? It's possible that multiple mappers were spawned and all of them finished early except one.
  2. What topN value was used in the generate command? If it's low, then despite having 30K pages, very few will be sent to the fetch phase.
  3. Did you use the numFetchers option in the generate command? It controls the number of maps created for the fetch job (see the sketch after this list).
  4. How many reducers ran for the generate-partition job? If that number is 1, only a single map will be created in the fetch phase: the output of generate-partition is fed to the fetch phase, and the number of part files created by generate (i.e., its reducer count) equals the number of maps created for the fetch job.
  5. What is mapred.map.tasks set to on your Hadoop cluster? And what is the corresponding value for reduces?
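If you run the steps individually with the generate command (rather than the all-in-one Crawl class), a minimal sketch of an invocation that requests multiple fetch lists looks like this; the paths and numbers are illustrative:

# -topN caps how many URLs are selected into the segment;
# -numFetchers sets the number of fetch lists, and hence the
# number of map tasks the subsequent fetch job will get.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 3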
Tejas Patil
  • 1) It appears that only 1 mapper is being generated for the fetch step. 2) I didn't specify a topN. 3) It doesn't seem like the numFetchers option is valid for the Crawl class in the nutch-1.4.job archive. 4) About 200. 5) I didn't set a value, so it's just the default that EMR has. – cberner Apr 22 '12 at 18:56
  • What if, as you mentioned in #4, there is only a single reducer in the generate-partition job? I'm also running into a situation where the fetch runs in only a single map task, preceded by a single reducer in the generate-partition job. How can I force Nutch to do the fetch in multiple map tasks? Is there a setting to force more than one reducer in the generate-partition job? – user1965449 Aug 28 '14 at 05:06