
I am trying to crawl Amazon Grocery UK, and to get the grocery categories I am using the Associates Product Advertising API. My requests get enqueued, but since each signed request expires after 15 minutes, some of them are only crawled more than 15 minutes after being enqueued, so they have already expired and yield a 400 error by the time they are fetched. I was thinking of enqueueing requests in batches, but even that will fail if the implementation only controls processing them in batches, because the real problem is preparing the requests in batches as opposed to processing them in batches. Unfortunately, Scrapy has little documentation for this use case, so how can requests be prepared in batches?

from scrapy.spiders import XMLFeedSpider
from scrapy.utils.misc import arg_to_iter
from scrapy.loader.processors import TakeFirst


from crawlers.http import AmazonApiRequest
from crawlers.items import (AmazonCategoryItemLoader)
from crawlers.spiders import MySpider


class AmazonCategorySpider(XMLFeedSpider, MySpider):
    name = 'amazon_categories'
    allowed_domains = ['amazon.co.uk', 'ecs.amazonaws.co.uk']
    marketplace_domain_name = 'amazon.co.uk'
    download_delay = 1
    rotate_user_agent = 1

    grocery_node_id = 344155031

    # XMLSpider attributes
    iterator = 'xml'
    itertag = 'BrowseNodes/BrowseNode/Children/BrowseNode'

    def start_requests(self):
        return arg_to_iter(
            AmazonApiRequest(
                qargs=dict(Operation='BrowseNodeLookup',
                           BrowseNodeId=self.grocery_node_id),
                meta=dict(ancestor_node_id=self.grocery_node_id)
            ))

    def parse(self, response):
        response.selector.remove_namespaces()
        has_children = bool(response.xpath('//BrowseNodes/BrowseNode/Children'))
        if not has_children:
            return response.meta['category']
        # here the request should be configurable to allow batching
        return super(AmazonCategorySpider, self).parse(response)

    def parse_node(self, response, node):
        category = response.meta.get('category')
        l = AmazonCategoryItemLoader(selector=node)
        l.add_xpath('name', 'Name/text()')
        l.add_value('parent', category)
        node_id = l.get_xpath('BrowseNodeId/text()', TakeFirst(), lambda x: int(x))
        l.add_value('node_id', node_id)
        category_item = l.load_item()
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item)
        )
  • Could you post some spider code? Usually people batch requests with the `spider_idle` signal - when the spider goes idle, pop a batch and schedule some requests; see my related answer: http://stackoverflow.com/questions/43532976/scrapy-limit-on-start-url/43537446?s=2%7C0.1085#43537446 – Granitosaurus Apr 24 '17 at 19:56
  • I have updated the question with reference code @Granitosaurus – pranavsharma Apr 24 '17 at 20:17
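
For reference, here is a minimal sketch of the spider_idle batching idea mentioned in the comment above (it is not part of the question's spider): the spider keeps a plain list of pending node ids and only builds the expiring API requests when the scheduler runs dry, so each request is fetched shortly after it is prepared. The names pending_node_ids, batch_size and build_request are placeholders, and engine.crawl(request, spider) is the Scrapy 1.x call used in the linked answer (newer Scrapy versions take only the request).

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class BatchingSpider(scrapy.Spider):
    name = 'batching_example'
    batch_size = 50  # placeholder value

    def __init__(self, *args, **kwargs):
        super(BatchingSpider, self).__init__(*args, **kwargs)
        self.pending_node_ids = []  # filled in as the crawl discovers nodes

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(BatchingSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.schedule_next_batch,
                                signal=signals.spider_idle)
        return spider

    def schedule_next_batch(self, spider):
        # Build requests only when the scheduler has drained, so they
        # are crawled well within their 15-minute expiry window.
        batch = self.pending_node_ids[:self.batch_size]
        self.pending_node_ids = self.pending_node_ids[self.batch_size:]
        for node_id in batch:
            self.crawler.engine.crawl(self.build_request(node_id), spider)
        if batch:
            # keep the spider alive until the fresh batch has been processed
            raise DontCloseSpider

    def build_request(self, node_id):
        # placeholder: sign and build the actual API request here
        return scrapy.Request('https://example.com/node/{}'.format(node_id),
                              callback=self.parse)

    def parse(self, response):
        # handle the API response and append newly discovered node ids
        # to self.pending_node_ids
        pass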

1 Answer


One way of doing this:

Since there are two places where you yield requests, you can leverage the priority attribute to prioritise the requests coming from the parse method:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        # the long backlog of requests, scheduled with the default priority of 0
        for url in very_long_list:
            yield Request(url)

    def parse(self, response):
        # follow-up requests get a high priority so the scheduler
        # sends them before anything left over from start_requests
        for url in short_list:
            yield Request(url, self.parse_item, priority=1000)

    def parse_item(self, response):
        # parse item
        pass

In this example Scrapy will prioritize the requests yielded from parse, which will allow you to stay within the time limit.

See more on Request.priority:

priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

(from the Scrapy docs)
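
Applied to the spider in the question, that would mean something like the following in parse_node. This is only a sketch; it assumes AmazonApiRequest passes extra keyword arguments such as priority through to scrapy.Request:

    def parse_node(self, response, node):
        # ... build node_id and category_item as in the question ...
        return AmazonApiRequest(
            qargs=dict(Operation='BrowseNodeLookup',
                       BrowseNodeId=node_id),
            meta=dict(ancestor_node_id=node_id,
                      category=category_item),
            # follow-up lookups jump ahead of whatever is still waiting in
            # the scheduler, so they are fetched before their signature expires
            priority=1000,
        )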

Granitosaurus