Nutch regex-urlfilter: crawl multiple websites

Question

I've seen this link, but my problem is quite different.

My seed.txt looks like:

http://a.b.c/ 
http://d.e.f/

And my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://a.b.c/*

I want to crawl URLs like these:

http://a.b.c/index.php?id=1
http://a.b.c/about.php
http://a.b.c/help.html
http://a.b.c/test1/test2/
http://a.b.c/index.php?usv=contact
http://a.b.c/index.php?usv=vdetailpro&id=104&sid=74

and other URLs like them.

I tested with the command bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined and found that the regex doesn't match.

Thank you!

Note that at the very least, `[?*!@=]` will match the first line due to the question mark. Is this what you were expecting? — Jordan Robinson, Sep 17 '14 at 15:58
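
To illustrate the point about `[?*!@=]`, here is a minimal Python sketch. It is not Nutch itself, and it assumes the rule rejects a URL whenever one of those characters is found anywhere in it. The list `wanted` is simply the URLs from the question.

```python
import re

# The question's rule "-[?*!@=]": assumed to reject any URL that
# contains one of these characters anywhere.
skip_chars = re.compile(r"[?*!@=]")

# URLs the asker wants to crawl (copied from the question).
wanted = [
    "http://a.b.c/index.php?id=1",
    "http://a.b.c/about.php",
    "http://a.b.c/help.html",
    "http://a.b.c/test1/test2/",
    "http://a.b.c/index.php?usv=contact",
    "http://a.b.c/index.php?usv=vdetailpro&id=104&sid=74",
]

# Collect the wanted URLs that this single rule would throw away.
rejected = [url for url in wanted if skip_chars.search(url)]

for url in rejected:
    print(url)
```

This prints the three index.php URLs that carry a query string, so those are filtered out before the final `+` rule is ever considered.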

score 0 · Answer 1 · answered Dec 04 '14 at 15:07

Use these regular expressions in regex-urlfilter.txt:

Solution 1

+^http://([a-z0-9]*\.)*a\.b\.c/
+^http://([a-z0-9]*\.)*d\.e\.f/

Solution 2

+^http://([a-z0-9]*\.)*(a\.b\.c|d\.e\.f)/
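
How a combined host rule like this interacts with the earlier `-` rules can be tried out with a small Python sketch. It is only an approximation of the filter, under the assumption that rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches nothing is rejected. The helpers `parse_rules` and `accepts` are made up for this illustration, and the dots in the host names are escaped here so that they match only a literal dot.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected

RULES = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept both seed hosts and their subdomains
+^http://([a-z0-9]*\.)*(a\.b\.c|d\.e\.f)/
"""

rules = parse_rules(RULES)
# Same rules without the query-character rule, for comparison.
relaxed = [rule for rule in rules if rule[1].pattern != "[?*!@=]"]

for url in ["http://a.b.c/about.php", "http://www.d.e.f/x/",
            "http://a.b.c/index.php?id=1", "http://x.y.z/"]:
    print(url, accepts(rules, url), accepts(relaxed, url))
```

Under these assumptions both hosts are accepted, but http://a.b.c/index.php?id=1 only passes in the `relaxed` set, that is, once the `-[?*!@=]` line is removed or commented out.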

Nwawel A Iroume
