I've created a script using Scrapy that recursively retries some links from a list even when those links are invalid and return a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current behavior. What I'm trying to do now is have the script make at most five such attempts, stopping early if a 200 status turns up in between. I've included "max_retry_times": 5 within meta, expecting it to cap the retries at five, but it just retries infinitely.
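For clarity, this is essentially how each request is built (a stripped-down sketch with the same meta keys; build_request is just an illustrative helper, not part of the spider below):

import scrapy

def build_request(url, callback):
    # sketch only: same meta keys as described above
    return scrapy.Request(
        url,
        callback=callback,
        dont_filter=True,                     # keep duplicate start URLs
        meta={
            "start_url": url,                 # remembered so the callback can re-queue it
            "handle_httpstatus_list": [404],  # let 404 responses reach the callback
            "max_retry_times": 5,             # what I expected to cap the retries at five
        },
    )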
This is what I've tried so far:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"start_url": start_url, "handle_httpstatus_list": [404], "max_retry_times": 5},
                dont_filter=True,  # keep the duplicate start URLs
            )

    def parse(self, response):
        if response.meta.get("start_url"):
            start_url = response.meta.get("start_url")

        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            print("++++++++++" * 20)  # to be sure about the recursion
            # re-queue the same URL when no links were found (e.g. on a 404 page)
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"start_url": start_url, "handle_httpstatus_list": [404], "max_retry_times": 5},
                dont_filter=True,
            )


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
How can I make the script retry at most five times?
Note: there are multiple URLs in the list which are identical. I don't wish to filter out the duplicate links; I would like Scrapy to use all of the URLs.
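To make "at most five times" concrete, here is roughly the behavior I'm after, sketched with a hand-rolled counter in meta (retry_count and max_manual_retries are my own hypothetical names, not Scrapy settings, and the same start_urls and meta keys as above are assumed); ideally I'd like to get the same effect through something built in, such as max_retry_times:

import scrapy

class SketchSpider(scrapy.Spider):
    # sketch only: start_requests is assumed to be the same as in the spider above
    name = "sketch"
    max_manual_retries = 5  # my own cap, not a Scrapy setting

    def parse(self, response):
        retries = response.meta.get("retry_count", 0)  # hypothetical counter key
        if response.status == 200:
            # normal extraction, as in the real spider above
            for href in response.css(".summary .question-hyperlink::attr(href)").getall():
                print(response.urljoin(href))
        elif retries < self.max_manual_retries:
            # re-queue the same request with the counter bumped by one
            yield response.request.replace(
                meta={**response.meta, "retry_count": retries + 1},
                dont_filter=True,
            )
        # else: five attempts without a 200, so stop re-queueing this URL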