
I'm trying to crawl user-defined websites, but I'm not able to crawl sites where robots.txt prevents crawling. That's fine, but I want to get a response so that I can show the user a message like "the site you have entered doesn't allow crawling due to robots.txt".

There are three other types of prevention for which I have code and handle accordingly; it's only this case (prevention by robots.txt) that I cannot handle. So please let me know if there is any way to handle this case and show an appropriate error message.

I'm using Python 3.5.2 and Scrapy 1.5.
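
To make the question concrete, here is a minimal sketch of the kind of errback-based handling I mean (the spider name, URLs, and messages are placeholders, and I'm assuming the robots.txt block reaches the errback as an IgnoreRequest):

import scrapy
from scrapy.exceptions import IgnoreRequest
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


class ExampleSpider(scrapy.Spider):
    # Placeholder name and start URL, for illustration only.
    name = "example"
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

    def on_error(self, failure):
        # The kinds of failures I already handle.
        if failure.check(HttpError):
            self.logger.error("HTTP error on %s", failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error("DNS lookup failed for %s", failure.request.url)
        elif failure.check(TimeoutError):
            self.logger.error("Request to %s timed out", failure.request.url)
        elif failure.check(IgnoreRequest):
            # Assumption: when ROBOTSTXT_OBEY is enabled, RobotsTxtMiddleware
            # drops the request with IgnoreRequest, which should end up here.
            self.logger.error("The site you have entered doesn't allow crawling "
                              "due to robots.txt: %s", failure.request.url)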

Dhaval

1 Answer


You should use the ROBOTSTXT_OBEY setting; setting it to False stops Scrapy from filtering requests that robots.txt would forbid:

ROBOTSTXT_OBEY = False
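
For example, a minimal sketch (the spider name is a placeholder): you can put that line in your project's settings.py, or set it per spider via custom_settings:

import scrapy


class ExampleSpider(scrapy.Spider):
    # Placeholder spider name; custom_settings overrides the project-wide
    # value from settings.py for this spider only.
    name = "example"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # do not filter requests based on robots.txt
    }

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)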

More information about RobotsTxtMiddleware, from the Scrapy documentation:

This middleware filters out requests forbidden by the robots.txt exclusion standard.

To make sure Scrapy respects robots.txt, make sure the middleware is enabled and the ROBOTSTXT_OBEY setting is enabled.

If Request.meta has the dont_obey_robotstxt key set to True, the request will be ignored by this middleware even if ROBOTSTXT_OBEY is enabled.
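
A short sketch of that per-request override (spider name and URL are placeholders):

import scrapy


class ExampleSpider(scrapy.Spider):
    # Placeholder spider name and URL.
    name = "example"

    def start_requests(self):
        # With dont_obey_robotstxt in the request meta, RobotsTxtMiddleware
        # lets this request through even when ROBOTSTXT_OBEY is enabled.
        yield scrapy.Request(
            "https://example.com/page",
            callback=self.parse,
            meta={"dont_obey_robotstxt": True},
        )

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)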

parik