
I want to be able to block web crawlers from accessing pages other than page1.

The following directive should block all directories/file names containing the word "page". So something like /localhost/myApp/page2.xhtml should be blocked.

    Disallow: /*page

The following directive should allow all directories/file names containing "page1" to remain accessible. So something like /localhost/myApp/page1.xhtml should not be blocked.

    Allow: /*page1
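
Put together, the intended robots.txt would be the following (the User-agent line is an assumption; with Google-style wildcard matching, the longer Allow rule would take precedence over the shorter Disallow):

    User-agent: *
    Allow: /*page1
    Disallow: /*page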

The problem is that crawler4j seems to be ignoring the asterisks used as wildcards. Is something wrong with my robots.txt, or is the asterisk something crawler4j does not interpret by default?

Andy T

1 Answer


I looked through the crawler4j source code, and it looks like crawler4j does not support wildcards in Allow or Disallow directives, except in the special case where the asterisk is the last character of the directive (and even then the asterisk is simply ignored).
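
A practical workaround is to enforce the rule in the crawler itself rather than relying on robots.txt, for example by overriding shouldVisit. A minimal sketch, assuming crawler4j 4.x (where shouldVisit also receives the referring Page); the class name is hypothetical:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    // Applies the question's rules directly, since crawler4j's
    // robots.txt parser will not honor the wildcards.
    public class Page1OnlyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            // Equivalent of "Allow: /*page1": visit anything containing "page1".
            if (href.contains("page1")) {
                return true;
            }
            // Equivalent of "Disallow: /*page": skip everything else containing "page".
            return !href.contains("page");
        }
    }

Note that this only filters what the crawler visits; it does not change how crawler4j parses robots.txt.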

plasticinsect