
The Scrapy framework has a RobotsTxtMiddleware. Its purpose is to make sure Scrapy respects robots.txt. You need to set ROBOTSTXT_OBEY = True in the settings, and then Scrapy will respect robots.txt policies. I did that and ran the spider. In the debug output I saw a request to http://site_url/robots.txt.
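For reference, enabling this is a single setting in the project's settings.py (a minimal sketch; the project and bot names are hypothetical):

# settings.py
BOT_NAME = "mybot"                         # hypothetical bot name
ROBOTSTXT_OBEY = True                      # fetch and obey each site's robots.txt
# Optional: identify the crawler so sites can match User-Agent specific rules
USER_AGENT = "mybot (+http://www.example.com)"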

  1. What does this mean, and how does it work?
  2. How can I work with the response?
  3. How can I see and understand the rules from robots.txt?

2 Answers


It's normal that the spider requests robots.txt; that's where the rules are.

robots.txt is basically a blacklist of URLs that you should not visit/crawl, and it uses a glob/regex-like syntax to specify the forbidden URLs.

Scrapy reads robots.txt and translates those rules into code. During the crawl, when the spider encounters a URL, it first checks against the rules generated from robots.txt whether that URL may be visited. If the URL is not blacklisted by robots.txt, Scrapy visits it and delivers a Response.

robots.txt does not only blacklist URLs; it can also specify the speed at which the crawl may happen. Here is an example robots.txt:

User-Agent: * 
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
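To get a feel for how such rules are evaluated, here is a minimal sketch using Python's standard urllib.robotparser (used here only for illustration; Scrapy ships its own robots.txt parser internally, but the matching idea is the same):

from urllib.robotparser import RobotFileParser

# A subset of the example rules above
robots_txt = """\
User-Agent: *
Disallow: /vote?
Disallow: /submitted?
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed: the path matches the "Disallow: /vote?" rule
print(parser.can_fetch("*", "https://example.com/vote?id=123"))   # False
# Allowed: no rule matches this path
print(parser.can_fetch("*", "https://example.com/item?id=123"))   # True
# Crawl delay requested by the site, in seconds
print(parser.crawl_delay("*"))                                     # 30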

My answer is based on what the Scrapy docs state:

  1. It detects and filters out requests to paths that are specified in robots.txt as not allowed (Disallow) for the spider's User-Agent.

  2. Response processing is the same. You just won't receive Response objects for those URLs in your callback functions, since no Requests are made for them (those requests were already filtered out). See the spider sketch after this list.

  3. You can look at the RobotsTxtMiddleware code here: https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py to understand how it parses robots.txt files, but if you want to understand how robots.txt rules work you should take a look at:

    http://www.robotstxt.org/norobots-rfc.txt
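A minimal sketch of a spider with robots.txt obedience turned on (the domain and URLs are hypothetical; the point is that disallowed URLs never reach the callback):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Enable robots.txt filtering for this spider only;
    # ROBOTSTXT_OBEY = True in settings.py enables it project-wide.
    custom_settings = {"ROBOTSTXT_OBEY": True}

    # Hypothetical URLs: if the site's robots.txt disallows /vote,
    # that request is dropped by the middleware and parse() is never
    # called for it.
    start_urls = [
        "https://example.com/",
        "https://example.com/vote?id=1",
    ]

    def parse(self, response):
        # Only responses for allowed URLs arrive here.
        self.logger.info("Got allowed page: %s", response.url)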
