
I was wondering whether it is possible to use a scrapy Request to check the validity of URLs before proceeding to the actual processing of a page (the URLs are not known in advance, but the different patterns in which they appear may be tested). Example code which fails is below. (I used the retries variable for simplicity; the test condition could also be something like if response.status != 200.)

The code fails because, when the second callback (parse_page2) finishes, control is not returned to the first callback (parse_page1), even though a new request is issued with parse_page1 as its callback. Why does this happen? I am aware of the urllib2-based solution indicated here; I am just checking whether this can be accomplished strictly within a Scrapy context.

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'alexa'
    allowed_domains = ['alexa.com']
    start_urls = ['http://www.alexa.com']
    retries = 0
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('topsites', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('topsites', )), callback='parse_page1'),
    )

    def parse_page1(self, response):
        if self.retries < 5:
            self.retries += 1
            print 'Retries in 1: ', self.retries
            return scrapy.Request("http://www.alexa.com/siteieekeknfo/google.com",
                                 meta={'dont_merge_cookies': True,
                                'dont_redirect': False,
                                "handle_httpstatus_list": [301, 302, 303, 404]},
                               callback=self.parse_page2)
        else:
            print "Finished in 1"

    def parse_page2(self, response):
        if self.retries < 5:
            self.retries += 1
            print 'Retries in 2: ', self.retries
            return scrapy.Request("http://www.alexa.com/siteieekeknfo/google.com",
                                 meta={'dont_merge_cookies': True,
                                'dont_redirect': False,
                                "handle_httpstatus_list": [301, 302, 303, 404]},
                               callback=self.parse_page1)
        else:
            print "Finished in 2"

The crawl result is pasted here.

  • You are probably using the wrong spider. Try spiders.Spider and yield directly from start_requests, since you already know your URLs. http://scrapy.readthedocs.org/en/latest/topics/spiders.html#scrapy-spider – digenishjkl Oct 07 '15 at 10:16
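
For reference, a minimal sketch of what that comment suggests, assuming a Scrapy 1.x install (scrapy.Spider and spider.logger available) and that the candidate URL patterns can be enumerated up front; UrlCheckSpider, check_url and the paths in candidate_urls are placeholders, not part of the original question:

import scrapy


class UrlCheckSpider(scrapy.Spider):
    name = 'urlcheck'
    allowed_domains = ['alexa.com']

    # Placeholder URL patterns to validate; substitute the patterns under test.
    candidate_urls = [
        'http://www.alexa.com/siteinfo/google.com',
        'http://www.alexa.com/siteieekeknfo/google.com',
    ]

    def start_requests(self):
        for url in self.candidate_urls:
            # Let non-200 responses reach the callback instead of being
            # dropped by HttpErrorMiddleware.
            yield scrapy.Request(url,
                                 meta={'handle_httpstatus_list': [301, 302, 303, 404]},
                                 callback=self.check_url)

    def check_url(self, response):
        if response.status == 200:
            self.logger.info('Valid URL: %s', response.url)
            # ... hand the response over to the actual parsing logic here ...
        else:
            self.logger.info('Invalid URL (%s): %s', response.status, response.url)

Each yielded request is scheduled independently, so every pattern gets checked regardless of the order in which responses arrive.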

1 Answer


Recursive callback seems to work:

import random


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'alexa'
    allowed_domains = ['alexa.com']
    start_urls = ['http://www.alexa.com']    
    rules = (
        Rule(LinkExtractor(allow=('topsites', )), callback='parse_page1'),
    )

    _retries = 0

    _random_urls = [
        'http://www.alexa.com/probablydoesnotexist',
        'http://www.alexa.com/neitherdoesthis',
        'http://www.alexa.com/siteinfo/google.com'
    ]

    def parse_page1(self, response):
        print "Got status: ", response.status
        if self._retries == 0 or response.status != 200:
            self._retries += 1
            print 'Retries in 1: ', self._retries
            return Request(random.choice(self._random_urls),
                           meta={'handle_httpstatus_list': [301, 302, 303, 404]},
                           callback=self.parse_page1)
        else:
            print "Exiting"