
I'm looking for a robots.txt parser in Java, which supports the same pattern matching rules as the Googlebot.

I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:

  • Heritrix (there is an open issue on this subject)
  • Crawler4j (looks like the same implementation as Heritrix)
  • jrobotx

Does anyone know of a Java library that can do this?
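For reference, the matching rules Google documents for Googlebot are: `*` matches any sequence of characters, a trailing `$` anchors the pattern to the end of the URL path, and otherwise the pattern is a prefix match. A minimal sketch of those rules (class and method names are my own, not from any of the libraries above) could look like:

```java
import java.util.regex.Pattern;

public class RobotsPatternMatcher {

    // Convert a Googlebot-style robots.txt path pattern to a Java regex:
    // '*' matches any character sequence, a trailing '$' anchors the end,
    // and otherwise the pattern is treated as a prefix match.
    static Pattern compile(String pattern) {
        boolean anchored = pattern.endsWith("$");
        String body = anchored ? pattern.substring(0, pattern.length() - 1) : pattern;
        StringBuilder sb = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (!anchored) {
            sb.append(".*"); // prefix match: any suffix is allowed
        }
        return Pattern.compile(sb.toString());
    }

    static boolean matches(String pattern, String path) {
        return compile(pattern).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/private*", "/private/data"));  // true
        System.out.println(matches("/*.php$", "/index.php"));       // true
        System.out.println(matches("/*.php$", "/index.php?x=1"));   // false
    }
}
```

A full implementation would also need the "most specific (longest) rule wins" tie-breaking between Allow and Disallow lines, which this sketch leaves out.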

Brent Worden
clement

1 Answer


Nutch seems to be using a combination of crawler-commons with some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.

In particular, issue NUTCH-1455 looks closely related to your needs:

If the user-agent name(s) configured in http.robots.agents contains spaces, it is not matched even if it is exactly contained in the robots.txt. http.robots.agents = "Download Ninja,*"

Perhaps it's worth trying, patching, and submitting a fix :)
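If plain crawler-commons is enough for your case, its SimpleRobotRulesParser is the piece Nutch builds on. A rough sketch of using it directly (the robots.txt URL and agent name here are placeholders; check the crawler-commons javadoc for the exact version you depend on):

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.nio.charset.StandardCharsets;

public class RobotsCheck {
    public static void main(String[] args) {
        // Hypothetical robots.txt content fetched from a site.
        byte[] content = ("User-agent: *\n"
                + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parseContent(url, content, contentType, robotNames)
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt",
                content,
                "text/plain",
                "mybot");

        System.out.println(rules.isAllowed("http://example.com/private/x.html"));
        System.out.println(rules.isAllowed("http://example.com/public/x.html"));
    }
}
```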

aldrinleal