
I'm trying to crawl user-defined websites, but I'm not able to crawl sites where robots.txt prevents crawling. That's fine, but I want to get a response so that I can show the user a message like "the site you have entered doesn't allow crawling due to robots.txt".

There are three other types of prevention for which I have code and handle accordingly; it's only this case (prevention by robots.txt) that I cannot handle. So please let me know if there is any way to handle this case and show an appropriate error message.

I'm using Python 3.5.2 and Scrapy 1.5.
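
To make the question concrete, here is a minimal sketch of the kind of errback-based handling I mean (the spider name, URLs, and messages are placeholders, and I'm assuming the robots.txt block reaches the errback as an IgnoreRequest):

import scrapy
from scrapy.exceptions import IgnoreRequest
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


class ExampleSpider(scrapy.Spider):
    # Placeholder name and start URL, for illustration only.
    name = "example"
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

    def on_error(self, failure):
        # The kinds of failures I already handle.
        if failure.check(HttpError):
            self.logger.error("HTTP error on %s", failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error("DNS lookup failed for %s", failure.request.url)
        elif failure.check(TimeoutError):
            self.logger.error("Request to %s timed out", failure.request.url)
        elif failure.check(IgnoreRequest):
            # Assumption: when ROBOTSTXT_OBEY is enabled, RobotsTxtMiddleware
            # drops the request with IgnoreRequest, which should end up here.
            self.logger.error("The site you have entered doesn't allow crawling "
                              "due to robots.txt: %s", failure.request.url)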

Dhaval

1 Answer


You should use the ROBOTSTXT_OBEY setting; setting it to False stops Scrapy from filtering requests that robots.txt would forbid:

ROBOTSTXT_OBEY = False
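
For example, a minimal sketch (the spider name is a placeholder): you can put that line in your project's settings.py, or set it per spider via custom_settings:

import scrapy


class ExampleSpider(scrapy.Spider):
    # Placeholder spider name; custom_settings overrides the project-wide
    # value from settings.py for this spider only.
    name = "example"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # do not filter requests based on robots.txt
    }

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)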

More information about RobotsTxtMiddleware, from the Scrapy documentation:

This middleware filters out requests forbidden by the robots.txt exclusion standard.

To make sure Scrapy respects robots.txt, make sure the middleware is enabled and the ROBOTSTXT_OBEY setting is enabled.

If Request.meta has the dont_obey_robotstxt key set to True, the request will be ignored by this middleware even if ROBOTSTXT_OBEY is enabled.
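
A short sketch of that per-request override (spider name and URL are placeholders):

import scrapy


class ExampleSpider(scrapy.Spider):
    # Placeholder spider name and URL.
    name = "example"

    def start_requests(self):
        # With dont_obey_robotstxt in the request meta, RobotsTxtMiddleware
        # lets this request through even when ROBOTSTXT_OBEY is enabled.
        yield scrapy.Request(
            "https://example.com/page",
            callback=self.parse,
            meta={"dont_obey_robotstxt": True},
        )

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)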

parik