I was wondering whether it is possible to use a scrapy Request to check the validity of URLs before proceeding to the actual processing of a page (the URLs are not known in advance, but the different patterns in which they appear can be tested).
Example code which fails is below.
(I used the retries variable for simplicity; the test condition could also be something like if response.status != 200.)
The code fails because when the second callback (parse_page2) finishes, control is not returned to the first callback (parse_page1), even though a new request with parse_page1 as its callback is issued.
Why does this happen?
I am aware of the urllib2-based solution indicated here; I am just checking whether this can be accomplished strictly within the Scrapy context.
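As a side note on the test condition: in Scrapy the HTTP status code of a response is exposed as response.status (there is no response.code attribute). A minimal sketch of the validity check I have in mind, with url_looks_valid as a hypothetical helper name:

```python
def url_looks_valid(status):
    """Treat any 2xx status code as a valid URL, everything else as invalid."""
    return 200 <= status < 300

# Inside a Scrapy callback this would be used roughly as:
#     if not url_looks_valid(response.status):
#         ...  # try the next URL pattern
```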
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = 'alexa'
    allowed_domains = ['alexa.com']
    start_urls = ['http://www.alexa.com']
    retries = 0

    rules = (
        # Follow links matching 'topsites' without a callback
        # (no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('topsites', ))),

        # Extract links matching 'topsites' and parse them with parse_page1.
        Rule(LinkExtractor(allow=('topsites', )), callback='parse_page1'),
    )

    def parse_page1(self, response):
        if self.retries < 5:
            self.retries += 1
            print 'Retries in 1: ', self.retries
            return scrapy.Request("http://www.alexa.com/siteieekeknfo/google.com",
                                  meta={'dont_merge_cookies': True,
                                        'dont_redirect': False,
                                        'handle_httpstatus_list': [301, 302, 303, 404]},
                                  callback=self.parse_page2)
        else:
            print "Finished in 1"

    def parse_page2(self, response):
        if self.retries < 5:
            self.retries += 1
            print 'Retries in 2: ', self.retries
            return scrapy.Request("http://www.alexa.com/siteieekeknfo/google.com",
                                  meta={'dont_merge_cookies': True,
                                        'dont_redirect': False,
                                        'handle_httpstatus_list': [301, 302, 303, 404]},
                                  callback=self.parse_page1)
        else:
            print "Finished in 2"
The crawl result is pasted here.