I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:

http://1.a.b/
http://2.a.b/

and my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.
+^http://1.a.b/*
+^http://2.a.b/*

I also tried replacing the last two rules with a single pattern:

+^http://([a-z0-9]*\.)*a.b/*

The only site crawled is the first one. All other configuration is default.
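As a sanity check on the filter rules above, here is a minimal Python sketch that approximates how they are applied. It assumes the behavior described in the header of the default regex-urlfilter.txt (the first matching rule decides, and a URL that matches no rule is ignored) and assumes unanchored, search-style matching. Python's `re` stands in for Java regex here, and the names `RULES` and `accepts` as well as the test URLs other than the two seeds are hypothetical.

```python
import re

# Rules copied from the regex-urlfilter.txt in the question, in file order.
# "-" means reject, "+" means accept.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://1.a.b/*"),
    ("+", r"^http://2.a.b/*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is an accept rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in [
        "http://1.a.b/",
        "http://2.a.b/",
        "http://2.a.b/page.html",
        "http://2.a.b/logo.png",
        "http://3.a.b/",
    ]:
        print(url, accepts(url))
```

Under these assumptions both seed URLs pass the filter, which suggests looking beyond the filter rules for the cause. Note also that the dots in `1.a.b` are unescaped, so they match any character; `\.` would match a literal dot only.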

I run the following command:

bin/nutch crawl urls -solr http://localhost:8984/solr/ -dir crawl -depth 10 -topN 10

Any ideas?!

Thank you!

marrop
  • It appears that my -depth and -topN values were set too low. I also added the property db.ignore.external.links to nutch-site.xml and set it to true. Its description reads: "If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters." – marrop Jun 01 '12 at 11:52
  • Yes, insofar as all pages were indexed after I played around mainly with the -depth and -topN values. I made a bunch of other changes to the configuration file as well, which gave major speed improvements. – marrop Jun 26 '12 at 19:25
  • @marrop Can you tell me what other changes you made to the configuration files to improve the speed? How can I crawl two separate domains at the same time? Is that possible? – peter Dec 27 '12 at 06:17
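
On the -depth and -topN point raised in the comments, a rough back-of-the-envelope sketch may help. It assumes that -depth is the number of generate/fetch rounds and that -topN caps the number of URLs fetched per round, as the Nutch 1.x crawl command describes them; the function name `max_fetched` is hypothetical.

```python
def max_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched by one crawl run.

    Assumption: one round per depth level, at most top_n URLs per round.
    """
    return depth * top_n


if __name__ == "__main__":
    # Values from the command in the question: -depth 10 -topN 10
    print(max_fetched(10, 10))  # 100
```

With only ten slots per round shared across all hosts, links from one site can plausibly fill every slot, leaving little or nothing for the other.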

1 Answer

Try this in regex-urlfilter.txt:

Old Settings:

# accept anything else
#+.
+^http://1.a.b/*
+^http://2.a.b/*

New setting:

# accept anything else
+.
Ravi Singh