I am writing a crawler, and as part of it I need to handle robots.txt, so I am using the standard library robotparser module.
It seems that robotparser is not parsing correctly; I am debugging my crawler against Google's robots.txt.
(The following examples are from an IPython session.)
In [1]: import robotparser
In [2]: x = robotparser.RobotFileParser()
In [3]: x.set_url("http://www.google.com/robots.txt")
In [4]: x.read()
In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on the Disallow list
Out[5]: False
In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's explicitly allowed
Out[6]: False
In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False
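To rule out anything network-related, I also tried a minimal sketch like the one below: it feeds a trimmed-down rule set (my own assumption of the relevant Disallow/Allow pair, not Google's full file) straight into parse() and checks the same two paths.

import robotparser

# Minimal rule set with the same Disallow/Allow pair, passed directly
# to parse() so no network fetch is involved.
lines = """\
User-agent: *
Disallow: /catalogs
Allow: /catalogs/p?
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)

print rp.can_fetch("My_Crawler", "/catalogs")     # expected: False (Disallow)
print rp.can_fetch("My_Crawler", "/catalogs/p?")  # expected: True (Allow), but I still get False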
The odd thing is that it sometimes seems to work and sometimes seems to fail; I also tried the same thing with the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser
module, or am I doing something wrong here? If so, what?
I was also wondering whether this behavior might be related to a known bug in the module.
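For reference, here is a rough sketch of how I dump the relevant lines of the raw file (just filtering on "catalogs") to see which rules should apply:

import urllib2

# Download the raw robots.txt and print only the lines that mention
# "catalogs", so the Disallow/Allow pair in question is easy to eyeball.
robots_txt = urllib2.urlopen("http://www.google.com/robots.txt").read()
for line in robots_txt.splitlines():
    if "catalogs" in line.lower():
        print line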