Nutch skip url containing #

Question

I am learning Nutch. I have set up Nutch and started crawling sites, but I can't figure out how to exclude URLs containing #, which are causing a lot of duplicates. I have checked regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

Conceptually, adding # to this line should work, but after adding it the filter still doesn't take effect. Is that because # is also used to comment out lines? If so, how do I fix it?

I was just gonna say that. Added backslash and bingo. How silly. Thanks for your response. :) — Jay Chakra, Feb 14 '15 at 23:26
Nice one @JayChakra. I've formalised the answer if you're happy to accept. — Robert Bain, Feb 14 '15 at 23:28
@RobertBain: Is there some way in Nutch to parse the HTML selectively, say, to index the body in the body field of Solr, the title in the title field of Solr, and so on? Any lead is highly appreciated. — Jay Chakra, Feb 14 '15 at 23:35
I'm afraid I came to this question through the regex topic, rather than nutch. It's definitely worth asking a new question. — Robert Bain, Feb 14 '15 at 23:44
@RobertBain: Thanks, I have found a solution on SO here: http://stackoverflow.com/questions/12338967/how-to-parse-html-with-nutch-and-index-specific-tag-to-solr — Jay Chakra, Feb 15 '15 at 11:03

score 3 · Accepted Answer · answered Feb 14 '15 at 23:27


Escape the # with a backslash, i.e. write \# instead of a bare # in the pattern.
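To check the escaped pattern outside of a crawl, here is a minimal sketch in Java. It assumes the rule from the question ends up as -[*!@\#] and that the filter applies the part after the leading - as a java.util.regex pattern that may match anywhere in the URL. The class name, method name, and example URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class SkipHashUrls {
    // Regex portion of the assumed rule "-[*!@\#]" (the leading '-' is the
    // exclude marker in the filter file and is not part of the regex).
    // In Java source the backslash itself must be doubled.
    static final Pattern SKIP = Pattern.compile("[*!@\\#]");

    // Hypothetical helper: true if the URL contains any of the listed characters.
    static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isSkipped("http://example.com/page#section")); // true
        System.out.println(isSkipped("http://example.com/page"));         // false
    }
}
```

In a Java regex, a backslash before a non-alphabetic character such as # simply makes it a literal, so the escaped form matches the same URLs a bare # inside the brackets would.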


Robert Bain
