I am learning Nutch. I have set it up and started crawling sites, but I cannot figure out how to exclude URLs containing #, which are causing a lot of duplicate pages. I checked regex-urlfilter.txt and found this rule:

# skip URLs containing certain characters as probable queries, etc.
-[*!@] 

Conceptually, adding # to this character class should work, but after I add it the URLs are still not filtered. Is that because # is also used to comment out lines? If so, how do I fix it?

Jay Chakra

1 Answer

Escape the # with a backslash, i.e. write \# instead of # inside the character class.
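With the escape applied, the rule from the question would presumably read -[*!@\#]. The snippet below is only a rough sketch for checking that such a pattern catches URLs with a fragment. It uses plain java.util.regex rather than Nutch itself, and it assumes that a rule applies when its pattern is found anywhere in the URL. The class name HashFilterSketch, the method isSkipped, and the example.com URLs are made up for illustration.

```java
import java.util.regex.Pattern;

class HashFilterSketch {
    // Regex part of the hypothetical rule "-[*!@\#]". The leading '-' in the
    // filter file is the exclude sign, not part of the regex, so it is omitted.
    // "\\#" in Java source is the regex \# (an escaped, literal #).
    static final Pattern SKIP = Pattern.compile("[*!@\\#]");

    // Assumption: a rule applies if its pattern is found anywhere in the URL.
    static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isSkipped("http://example.com/page#section")); // true
        System.out.println(isSkipped("http://example.com/page"));         // false
    }
}
```

If the pattern behaves as expected here but Nutch still fetches such URLs, the problem likely lies elsewhere in the crawl configuration rather than in the regex itself.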

Robert Bain