
I want to crawl only specific domains with Nutch. For this I set db.ignore.external.links to true, as suggested in this FAQ link.

The problem is that Nutch now crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).

I get this result by running the crawl script with a depth of 200. It finishes after one cycle and generates the output below.

How can I solve this problem?

I'm using Apache Nutch 1.11.

Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

Best Regards

Yigit Alan

3 Answers


You want to fetch only pages from a specific domain.

You already tried db.ignore.external.links, but that restricts the crawl to nothing but the seed.txt URLs.

You should instead try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:

+^http://([a-z0-9]*\.)*your.specific.domain.org/
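
For instance, assuming the seed host is nutch.apache.org, the tail of conf/regex-urlfilter.txt could look like the sketch below, with the dots escaped so they match literal dots and the default "accept anything else" rule disabled:

# accept URLs on the target host and its subdomains
+^http://([a-z0-9-]+\.)*nutch\.apache\.org/

# comment out or remove the default catch-all rule so everything else is rejected
# +.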
Karsten R.

Are you using the "crawl" script? If yes, make sure you give a number of rounds greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
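
For example, a sketch of the same call with five rounds (the directory names and the Solr URL are just the placeholders from above, substitute your own):

bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 5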

And to crawl a specific domain you can use the regex-urlfilter.txt file.

AVINASH

Add the following property in nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>
Hafiz Muhammad Shafiq