
I want to crawl only specific domains with Nutch. For this I set db.ignore.external.links to true, as suggested in this FAQ link.

The problem is that Nutch now crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).

I get this result by running the crawl script with a depth of 200. It finishes after one cycle and generates the output below.

How can I solve this problem?

I'm using Apache Nutch 1.11.

Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

Best Regards

Yigit Alan

3 Answers


You want to fetch only pages from a specific domain.

You already tried db.ignore.external.links, but that restricts the crawl to nothing but the seed.txt URLs.

You should instead try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:

+^http://([a-z0-9]*\.)*your.specific.domain.org/
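
For instance, assuming the seed host is nutch.apache.org, the tail of conf/regex-urlfilter.txt could look like the sketch below, with the dots escaped so they match literal dots and the default "accept anything else" rule disabled:

# accept URLs on the target host and its subdomains
+^http://([a-z0-9-]+\.)*nutch\.apache\.org/

# comment out or remove the default catch-all rule so everything else is rejected
# +.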
Karsten R.

Are you using the "crawl" script? If yes, make sure you give a number of rounds greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
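
For example, a sketch of the same call with five rounds (the directory names and the Solr URL are just the placeholders from above, substitute your own):

bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 5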

And to crawl a specific domain you can use the regex-urlfilter.txt file.

AVINASH

Add the following property in nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>
Hafiz Muhammad Shafiq