
With Scrapy, I want to scrape a single page (via a script, not from the console) and check whether all the links on that page are allowed by the robots.txt file.

In the scrapy.robotstxt.RobotParser abstract base class, I found the method allowed(url, user_agent), but I don't see how to use it.

import scrapy
from scrapy.linkextractors import LinkExtractor

class TestSpider(scrapy.Spider):
    name = "TestSpider"

    def __init__(self):
        super(TestSpider, self).__init__()
               
    def start_requests(self):
        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = LinkExtractor().extract_links(response)
            for idx, link in enumerate(links):
                # How can I check whether each link is allowed by the robots.txt file?
                # => allowed(link.url, '*')

                # self.crawler.engine.downloader.middleware.middlewares
                # self.crawler AttributeError: 'TestSpider' object has no attribute 'crawler'
                pass


To run the TestSpider spider, set the following in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Go to the project’s top level directory and run:

scrapy crawl TestSpider

Appreciate any help.

My solution:

import scrapy
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.httpobj import urlparse_cached
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TestSpider(CrawlSpider):
    name = "TestSpider"

    def __init__(self):
        super(TestSpider, self).__init__()
        self.le = LinkExtractor(unique=True, allow_domains=self.allowed_domains)
        self._rules = [
            Rule(self.le, callback=self.parse)
        ]

    def start_requests(self):
        self._robotstxt_middleware = None
        for middleware in self.crawler.engine.downloader.middleware.middlewares:
            if isinstance(middleware, RobotsTxtMiddleware):
                self._robotstxt_middleware = middleware
                break

        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse_robotstxt)

    def parse_robotstxt(self, response):
        robotstxt_middleware = None
        for middleware in self.crawler.engine.downloader.middleware.middlewares:
            if isinstance(middleware, RobotsTxtMiddleware):
                robotstxt_middleware = middleware
                break

        url = urlparse_cached(response)
        netloc = url.netloc
        self._robotsTxtParser = None
        if robotstxt_middleware and netloc in robotstxt_middleware._parsers:
            self._robotsTxtParser = robotstxt_middleware._parsers[netloc]

        return self.parse(response)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = self.le.extract_links(response)
            for idx, link in enumerate(links):
                # Check if link target is forbidden by robots.txt
                if self._robotsTxtParser:
                    if not self._robotsTxtParser.allowed(link.url, "*"):
                        print(link.url, 'disallowed by robots.txt')
– LeMoussel

1 Answer


Parser implementations are listed a bit higher on the page than the link you posted.

Protego parser

Based on Protego:

  • implemented in Python
  • is compliant with Google’s Robots.txt Specification
  • supports wildcard matching
  • uses the length based rule

Scrapy uses this parser by default.

So, if you want the same results Scrapy gives by default, use Protego.

The usage is as follows (robotstxt being the contents of a robots.txt file):

>>> from protego import Protego
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
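
Protego itself does no fetching, so if you don't already have the robots.txt contents you need to download them first. A minimal sketch using only the standard library (the robots.txt URL construction and the httpbin.org example are my assumptions, not part of Protego's API):

from urllib.parse import urljoin
from urllib.request import urlopen

from protego import Protego

def robots_parser_for(site_url):
    # robots.txt conventionally lives at the site root
    # (assumption: a plain HTTP GET is acceptable here)
    robots_url = urljoin(site_url, '/robots.txt')
    robotstxt = urlopen(robots_url).read().decode('utf-8')
    return Protego.parse(robotstxt)

rp = robots_parser_for('http://httpbin.org/')
print(rp.can_fetch('http://httpbin.org/deny', '*'))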

It is also possible to identify and reuse the robots middleware currently in use, but it's probably more trouble than it's worth for most use cases.

Edit:

If you really want to reuse the middleware, your spider has access to downloader middlewares through self.crawler.engine.downloader.middleware.middlewares.
From there, you need to identify the robots middleware (possibly by class name?) and the parser you need (from the middleware's _parsers attribute).
Finally, you'd use that parser's allowed() method (the RobotParser interface mentioned in the question) to check your links.
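
Roughly, that lookup could look like the sketch below when called from a spider callback (the helper name robots_allowed is made up for illustration, and it assumes the middleware has already fetched and parsed robots.txt for the target netloc):

from urllib.parse import urlparse

from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware

def robots_allowed(spider, url, user_agent='*'):
    # Find the robots middleware among the enabled downloader middlewares.
    for mw in spider.crawler.engine.downloader.middleware.middlewares:
        if isinstance(mw, RobotsTxtMiddleware):
            parser = mw._parsers.get(urlparse(url).netloc)
            # _parsers may still hold a Deferred while robots.txt is being fetched,
            # so only use the entry once it exposes the RobotParser interface.
            if parser is not None and hasattr(parser, 'allowed'):
                return parser.allowed(url, user_agent)
    return True  # no parser available yet: fall back to allowing the URL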

– stranac
  • Yes, I want the same results as Scrapy gives by default with Protego. But Protego doesn't have a fetch method like reppy to get the contents of a robots.txt file. That's why I want to reuse the robots middleware to get the same result as Scrapy. – LeMoussel Oct 23 '20 at 08:01
  • Edited to include an overview of how to reuse the middleware. – stranac Oct 23 '20 at 08:33
  • `self.crawler` AttributeError: 'TestSpider' object has no attribute 'crawler' – LeMoussel Oct 23 '20 at 09:38
  • Weird, it works for me, and I can't see anything in scrapy's changelog that would suggest this was changed at any point... – stranac Oct 23 '20 at 10:00
  • You should update your question with your current code, at this point I don't know what exactly doesn't work. – stranac Oct 23 '20 at 10:17
  • On my setup, the code in the question throws a `TypeError` when I run it directly and runs fine after I fix the error, so I'm afraid I can't help you more. – stranac Oct 23 '20 at 12:02
  • I finally found out why I had the `self.crawler` AttributeError. I'm continuing my investigation in order to propose a solution. – LeMoussel Oct 26 '20 at 09:43