
I am writing a spider using the Scrapy framework (I am using CrawlSpider to crawl every link in a domain) to pull certain files from a given domain. I want to block certain URLs where the spider is not finding files. For example, if the spider visits URLs with /news/ in the path one hundred times and doesn't find a file, I want it to stop looking in /news/.

I have already tried updating the self.rules variable when the spider finds a path that doesn't yield files, but this did not work and it continued crawling URLs with that path.

This is the function that I am trying to use to update the rules:

    # assumes, at module level:
    # from scrapy.spiders import Rule
    # from scrapy.linkextractors import LinkExtractor

    def add_block_rule(self, match):
        # escape the slashes so the path can be dropped into a regex
        new_rule = match.replace('/', '\\/')
        new_rule = f'/(.*{new_rule}.*)'
        if new_rule in self.deny_rules:
            return
        print(f'visited {match} too many times without finding a file')
        self.deny_rules.append(new_rule)
        # rebuild the rules tuple with the updated deny list
        self.rules = (
            Rule(
                LinkExtractor(
                    allow_domains=self.allowed_domains,
                    unique=True,
                    deny=self.deny_rules),
                callback='parse_page',
                follow=True),
        )
        print(self.deny_rules)

I know that this function is being called when certain paths are visited one hundred times without finding a file, but the new rule is not being used. I also know that the regex works, because when I define the same pattern in `__init__` it blocks the desired path.

I would expect every path that is visited over 100 times without finding a file to be blocked and not visited further.
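
For context, here is a minimal sketch of the kind of spider setup involved (the class name, the domain, and the miss-counting logic are illustrative placeholders, not the exact production code):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class FileSpider(CrawlSpider):
        name = 'file_spider'
        allowed_domains = ['example.com']        # placeholder domain
        start_urls = ['https://example.com/']

        def __init__(self, *args, **kwargs):
            self.deny_rules = []                 # regex strings added at runtime
            self.rules = (
                Rule(
                    LinkExtractor(
                        allow_domains=self.allowed_domains,
                        unique=True,
                        deny=self.deny_rules),
                    callback='parse_page',
                    follow=True),
            )
            # CrawlSpider.__init__ compiles self.rules into self._rules
            super().__init__(*args, **kwargs)

        def parse_page(self, response):
            # look for the wanted files here; count misses per path and call
            # self.add_block_rule(path) once a path has missed 100 times
            pass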

Logan Anderson
  • I believe that the issue here is that the requests that you may want to avoid may already be queued within Scrapy's Scheduler. I believe that the rule will only work for whatever requests are being discovered from that point on. – phrfpeixoto Jul 31 '19 at 21:09
  • Could you please share the spider code as well? Have you defined any class-level _rules_ variable? – phrfpeixoto Jul 31 '19 at 21:19
  • Seems like the `cls.rules` value is used to populate `self._rules` during class initialization: `cls.__init__` calls `self._compile_rules`, which in turn populates `self._rules`. Because of that, I believe overriding `cls.rules` without recompiling them is worthless (considering the class being used is CrawlSpider); see the recompile sketch after these comments. – Victor Torres Jul 31 '19 at 21:20
  • @VictorTorres Is there a way to recompile during runtime without losing valuable information? – Logan Anderson Aug 01 '19 at 12:00
  • @phrfpeixoto I thought that it could be blocking new requests but still following old ones, but I tried letting the spider run for over 15 minutes after the URL was "blocked" and it was still requesting the blocked path. – Logan Anderson Aug 01 '19 at 12:03
  • What I have been doing to fix this issue, for the time being, is closing the spider when it has visited a path too many times and storing that path in a database, so that the next time the spider starts it gets the blocked paths from the database and uses them. – Logan Anderson Aug 01 '19 at 12:07
  • I'm not sure you can modify these rules at runtime. I'm still checking the code to try to make sense of this. But it seems that Scrapy's `Rule` has built-in `process_links` and `process_request` callbacks for filtering purposes. Have you tried using those? – phrfpeixoto Aug 01 '19 at 22:03
  • @LoganAnderson, before trying to override private attributes and playing with the CrawlSpider implementation, take a look at @phrfpeixoto's idea as it seems a little safer and more correct. Check if you're able to work with the `process_links` callback on the `Rule` class (see the `process_links` sketch after these comments). https://docs.scrapy.org/en/latest/topics/spiders.html?highlight=process_links#scrapy.spiders.Rule – Victor Torres Aug 02 '19 at 00:08
  • @VictorTorres yes I will take a look at that. Thank you so much for your help – Logan Anderson Aug 02 '19 at 11:28
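
Regarding the recompile question above, an untested sketch of what that could look like: the same add_block_rule as in the question, plus a call to CrawlSpider's private `_compile_rules()` method so the rebuilt self.rules is copied into self._rules (the compiled copy the spider actually consults). Because the method is private it may change between Scrapy versions, and, as noted in the first comment, it will not affect requests already queued in the scheduler.

    def add_block_rule(self, match):
        new_rule = match.replace('/', '\\/')
        new_rule = f'/(.*{new_rule}.*)'
        if new_rule in self.deny_rules:
            return
        self.deny_rules.append(new_rule)
        self.rules = (
            Rule(
                LinkExtractor(
                    allow_domains=self.allowed_domains,
                    unique=True,
                    deny=self.deny_rules),
                callback='parse_page',
                follow=True),
        )
        # re-run CrawlSpider's rule compilation so the new deny pattern
        # is applied to links extracted from this point on
        self._compile_rules()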
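
And a sketch of the `process_links` approach suggested in the comments (also untested; it assumes the spider keeps the same `self.deny_rules` list of regex strings that `add_block_rule` appends to). Because the filter runs on every batch of extracted links, patterns added at runtime take effect without touching `self._rules` at all:

    import re

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class FileSpider(CrawlSpider):
        # name, allowed_domains, start_urls as in the sketch above

        rules = (
            Rule(
                LinkExtractor(allow_domains=['example.com'], unique=True),
                callback='parse_page',
                follow=True,
                # string form: resolved to the spider method of that name
                process_links='filter_blocked_links'),
        )

        def __init__(self, *args, **kwargs):
            self.deny_rules = []    # add_block_rule appends regex strings here
            super().__init__(*args, **kwargs)

        def filter_blocked_links(self, links):
            # drop any extracted link whose URL matches a deny pattern
            return [
                link for link in links
                if not any(re.search(pattern, link.url) for pattern in self.deny_rules)
            ]

        def parse_page(self, response):
            # same file-hunting / miss-counting logic as before
            pass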
