Ignoring requests while scraping two pages

Question

I scrape a website on a daily basis and use DeltaFetch to skip pages that have already been visited (there are a lot of them).

The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to retrieve additional information about the item. DeltaFetch correctly skips requests to page B, but every time the scraper runs it still sends the requests to page A, regardless of whether it has visited that page before.

This is how my code is structured right now:

# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href), 
                                callback=self.parse_A,
                                meta={'item':item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age',u"//td[contains(@class,\"age\")]/text()")
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B), 
                            callback=self.parse_B,
                            meta={'item':item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B',response.url)
    yield item.load_item()

How can I use DeltaFetch to skip the first request, the one to page A, when that page has already been visited? Any help would be appreciated.

score 4 · Accepted Answer · answered Mar 01 '18 at 23:23

DeltaFetch only keeps a record in its database of the requests whose callbacks yield items, which means only those will be skipped by default. In your spider, parse_A() yields another request rather than an item, so the requests to page A are never recorded.

However, you can customize the key under which a record is stored by setting the deltafetch_key meta key. If the requests created inside parse_A() use the same key as the requests that call parse_A(), you should be able to achieve the effect you want.

Something like this should work (untested):

from scrapy.utils.request import request_fingerprint

# (...)

    def parse_A(self, response):
        # (...)
        yield scrapy.Request(
            response.urljoin(page_B),
            callback=self.parse_B,
            meta={
                'item': item.load_item(),
                'deltafetch_key': request_fingerprint(response.request)
            }
        )

Note: the example above effectively replaces the filtering of requests to page B URLs (handled by parse_B()) with the filtering of requests to page A URLs (handled by parse_A()). You might need to use a different key depending on your needs.
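To see why sharing the key has that effect, here is a small standalone model of the bookkeeping. It is only a sketch under stated assumptions, not DeltaFetch's actual code: fingerprint(), get_key(), run_crawl(), two_runs() and the example.com URLs are all made up for illustration, a plain set stands in for the database, and the fingerprint is simplified to a hash of the URL.

```python
import hashlib

# Hypothetical URLs standing in for one page A and its page B.
URL_A = "https://example.com/a/1"
URL_B = "https://example.com/b/1"


def fingerprint(url):
    """Simplified stand-in for a request fingerprint: a hash of the URL only."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def get_key(url, meta):
    """An explicit deltafetch_key in meta takes precedence over the fingerprint."""
    return meta.get("deltafetch_key") or fingerprint(url)


def run_crawl(seen, share_key):
    """Model one crawl and return the list of URLs actually requested."""
    fetched = []
    if get_key(URL_A, {}) in seen:
        return fetched  # the request to page A is dropped
    fetched.append(URL_A)
    # The request to page B optionally reuses page A's fingerprint as its key.
    meta_b = {"deltafetch_key": fingerprint(URL_A)} if share_key else {}
    if get_key(URL_B, meta_b) in seen:
        return fetched  # the request to page B is dropped
    fetched.append(URL_B)
    # Only the request whose callback yields the item gets recorded.
    seen.add(get_key(URL_B, meta_b))
    return fetched


def two_runs(share_key):
    """Run the modelled crawl twice against the same record store."""
    seen = set()
    return run_crawl(seen, share_key), run_crawl(seen, share_key)


print(two_runs(share_key=False))
print(two_runs(share_key=True))
```

Without the shared key, the second run in this model still requests page A and only skips page B, which is the behaviour described in the question. With the shared key, the second run requests nothing, because page A's own fingerprint is what ended up in the store.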

This works perfectly! Thanks a lot for the explanation. – Abel Riboulot, Mar 03 '18 at 00:44
