In my Scrapy spider I have overridden the start_requests() method in order to retrieve some additional URLs from a database. These represent items potentially missed in the crawl (orphaned items), and the extra requests should happen at the end of the crawling process. Something like (pseudo code):

import MySQLdb
import MySQLdb.cursors
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

    # attempt to crawl orphaned items: products with no data scraped today
    db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                         port=self.settings['AWS_RDS_PORT'],
                         user=self.settings['AWS_RDS_USER'],
                         passwd=self.settings['AWS_RDS_PASSWD'],
                         db=self.settings['AWS_RDS_DB'],
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True,
                         charset="utf8")
    c = db.cursor()

    c.execute("""SELECT p.url FROM products p
                 LEFT JOIN product_data pd
                   ON p.id = pd.product_id AND pd.scrape_date = CURDATE()
                 WHERE p.website_id = %s AND pd.id IS NULL""",
              (self.website_id,))

    while True:
        row = c.fetchone()
        if row is None:
            break
        # record orphaned product
        self.crawler.stats.inc_value('orphaned_count')
        yield Request(row['url'], callback=self.parse_item)
    db.close()

Unfortunately, it appears that the crawler queues up these orphaned-item requests during the rest of the crawl, so, in effect, too many items are regarded as orphaned: the database query runs before the normal crawl has had a chance to retrieve them.

I need this orphan handling to happen at the end of the crawl, so I believe I need to use the spider_idle signal.

However, my understanding is that I can't simply yield requests from my spider-idle handler; instead, I should use self.crawler.engine.crawl?

I need the requests to be processed by my spider's parse_item() method (and for my configured middlewares, extensions and pipelines to be applied). How can I achieve this?
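
For what it's worth, here is how I understand the signal would be wired up in the spider (a minimal sketch; the handler name idle_method is illustrative):

from scrapy import signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    # fire idle_method whenever the scheduler has drained
    crawler.signals.connect(spider.idle_method, signal=signals.spider_idle)
    return spider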

BrynJ

1 Answer


The idle handler that you connected to the spider_idle signal (let's say it is called idle_method) receives the spider as an argument, so you can do something like:

def idle_method(self, spider):
    self.crawler.engine.crawl(Request(url=myurl, callback=spider.parse_item), spider)
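
To make this concrete, here is a minimal sketch of the whole handler, under the assumption that the database query from the question is moved into it (so it runs only once the normal crawl has drained) and wrapped in a hypothetical get_orphaned_urls() helper. Raising DontCloseSpider keeps the spider alive while the newly scheduled requests are processed:

from scrapy import Request
from scrapy.exceptions import DontCloseSpider

def idle_method(self, spider):
    # spider_idle fires every time the scheduler drains, so only run once
    if getattr(self, '_orphans_scheduled', False):
        return
    self._orphans_scheduled = True

    # get_orphaned_urls() is a hypothetical helper wrapping the SQL query
    # from the question; by this point the normal crawl has finished,
    # so the orphan count is accurate
    for url in self.get_orphaned_urls():
        self.crawler.stats.inc_value('orphaned_count')
        # engine.crawl() schedules through the normal machinery, so the
        # requests pass through middlewares, extensions and pipelines
        self.crawler.engine.crawl(Request(url, callback=spider.parse_item), spider)

    # keep the spider open while the new requests are crawled
    raise DontCloseSpider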
eLRuLL
  • thanks, just what I was looking for. Related to this question, could you take a look here - https://stackoverflow.com/questions/46073577/scrapy-spider-idle-signal-not-received-in-my-extension - I can't get the `spider_idle` signal to fire in my extension. – BrynJ Sep 06 '17 at 11:06