I've created a script using Scrapy that recursively retries some links from a list even when those links are invalid and return a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current behavior. What I'm trying to do now is have the script make at most five such attempts, stopping early if a 200 status turns up in between. I've included "max_retry_times": 5 within meta, expecting it to cap the retries at five, but it just retries infinitely.
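For clarity, this is essentially how each request is built (a stripped-down sketch with the same meta keys; build_request is just an illustrative helper, not part of the spider below):

import scrapy

def build_request(url, callback):
    # sketch only: same meta keys as described above
    return scrapy.Request(
        url,
        callback=callback,
        dont_filter=True,                     # keep duplicate start URLs
        meta={
            "start_url": url,                 # remembered so the callback can re-queue it
            "handle_httpstatus_list": [404],  # let 404 responses reach the callback
            "max_retry_times": 5,             # what I expected to cap the retries at five
        },
    )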
This is what I've tried so far:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"start_url": start_url, "handle_httpstatus_list": [404], "max_retry_times": 5},
                dont_filter=True,  # keep the duplicate start URLs
            )

    def parse(self, response):
        if response.meta.get("start_url"):
            start_url = response.meta.get("start_url")

        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            print("++++++++++" * 20)  # to be sure about the recursion
            # re-queue the same URL when no links were found (e.g. on a 404 page)
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"start_url": start_url, "handle_httpstatus_list": [404], "max_retry_times": 5},
                dont_filter=True,
            )


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
How can I make the script retry at most five times?
Note: there are multiple URLs in the list which are identical. I don't wish to filter out the duplicate links; I would like Scrapy to use all of the URLs.
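To make "at most five times" concrete, here is roughly the behavior I'm after, sketched with a hand-rolled counter in meta (retry_count and max_manual_retries are my own hypothetical names, not Scrapy settings, and the same start_urls and meta keys as above are assumed); ideally I'd like to get the same effect through something built in, such as max_retry_times:

import scrapy

class SketchSpider(scrapy.Spider):
    # sketch only: start_requests is assumed to be the same as in the spider above
    name = "sketch"
    max_manual_retries = 5  # my own cap, not a Scrapy setting

    def parse(self, response):
        retries = response.meta.get("retry_count", 0)  # hypothetical counter key
        if response.status == 200:
            # normal extraction, as in the real spider above
            for href in response.css(".summary .question-hyperlink::attr(href)").getall():
                print(response.urljoin(href))
        elif retries < self.max_manual_retries:
            # re-queue the same request with the counter bumped by one
            yield response.request.replace(
                meta={**response.meta, "retry_count": retries + 1},
                dont_filter=True,
            )
        # else: five attempts without a 200, so stop re-queueing this URL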