I have configured Apache Nutch 2.3.1 with a Hadoop/HBase ecosystem. The configuration is as follows.
<configuration>
  <property>
    <name>db.score.link.internal</name>
    <value>5.0</value>
  </property>
  <property>
    <name>enable.domain.check</name>
    <value>true</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>30000</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>200</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Private Spider Bot</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>My Private Spider Bot</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
The Nutch job runs on 3 compute nodes. The problem is that, after starting with 5000 domains as seeds, Nutch fetches from only a few domains, and there are a lot of new domains for which only a single document has been fetched. I want Nutch to fetch all domains fairly. I have also given a score of 5 to inlinks (db.score.link.internal), but my tweaking shows that this property has no impact at all.
I post-processed the crawled data and found that there are 14000 domains in total in the database (HBase), and more than 50% of these domains have not been crawled by Nutch (their documents have fetch status code 0x01). Why is that so? How can I change Nutch so that it also considers new domains, i.e., so that fetching is somehow fair to all domains?
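The kind of per-domain tally described above can be sketched roughly as follows. This is only an illustrative sketch, not the script that was actually run: the sample rows are made up, the function name `uncrawled_hosts` is hypothetical, and the URL host is used as a simple stand-in for the domain. In a real run the (URL, status) pairs would be read from the HBase table that Nutch writes to.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Status code 0x01 is the "unfetched" status referred to above.
STATUS_UNFETCHED = 0x01


def uncrawled_hosts(rows):
    """Return (all_hosts, hosts whose documents are all still unfetched).

    rows is an iterable of (url, status) pairs. The host part of the URL
    is used as an approximation of the domain (an assumption made here).
    """
    statuses_by_host = defaultdict(set)
    for url, status in rows:
        host = urlparse(url).hostname
        if host:  # skip rows whose URL has no parsable host
            statuses_by_host[host].add(status)

    all_hosts = set(statuses_by_host)
    # A host counts as "not crawled" when every document seen for it
    # carries the unfetched status and nothing else.
    never_fetched = {
        host
        for host, statuses in statuses_by_host.items()
        if statuses == {STATUS_UNFETCHED}
    }
    return all_hosts, never_fetched


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("http://a.example/index.html", 0x02),
    ("http://a.example/about.html", 0x01),
    ("http://b.example/", 0x01),
    ("http://c.example/", 0x01),
]

all_hosts, never_fetched = uncrawled_hosts(sample_rows)
share = len(never_fetched) / len(all_hosts)
print(f"{len(never_fetched)} of {len(all_hosts)} hosts not crawled ({share:.0%})")
```

Grouping by host rather than by registered domain keeps the sketch dependency-free; a public-suffix-aware split would be needed to count true domains.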