
From the Scrapy media pipeline documentation (http://doc.scrapy.org/en/latest/topics/media-pipeline.html):

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finished downloading (or failed for some reason).
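
For reference, a minimal setup that produces this behavior looks roughly like the following (ITEM_PIPELINES, FILES_STORE, and the file_urls/files field names are the documented defaults; the MediaItem class name is made up):

    # settings.py: enable the stock FilesPipeline
    ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
    FILES_STORE = '/path/to/files'  # where downloaded files end up

    # items.py: the pipeline reads file_urls and writes results into files
    import scrapy

    class MediaItem(scrapy.Item):
        file_urls = scrapy.Field()  # input: media URLs to fetch
        files = scrapy.Field()      # output: populated by FilesPipeline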

I want to do the exact opposite: scrape all HTML URLs first, then download all media files at once. How can I do that?

Antoine Brunel

1 Answer


Not an answer, but if you're curious how this behavior is implemented, check the MediaPipeline source code, especially the process_item method:

    def process_item(self, item, spider):
        info = self.spiderinfo
        # build one request per media URL returned by get_media_requests()
        requests = arg_to_iter(self.get_media_requests(item, info))
        dlist = [self._process_request(r, info) for r in requests]
        # fire item_completed only once every download has succeeded or failed
        dfd = DeferredList(dlist, consumeErrors=1)
        return dfd.addCallback(self.item_completed, item, info)

You can see that a batch of requests is queued and processed (each request sent and its response downloaded) BEFORE item_completed is eventually called, returning the original item plus the downloaded media info.
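
If you want to hook into that step, item_completed is the method to override in a FilesPipeline subclass; a minimal sketch (the StrictFilesPipeline name is made up, the results format is the documented one):

    from scrapy.exceptions import DropItem
    from scrapy.pipelines.files import FilesPipeline

    class StrictFilesPipeline(FilesPipeline):
        def item_completed(self, results, item, info):
            # results is a list of (success, file_info_or_failure) tuples,
            # one per media request queued by process_item
            files = [data for ok, data in results if ok]
            if not files:
                raise DropItem('no files downloaded for %r' % item)
            item['files'] = files  # each dict holds 'url', 'path', 'checksum'
            return item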

In the nominal case, requests generated by MediaPipeline subclasses are sent for download immediately, using crawler.engine.download directly:

        (...)
        else:
            # bypass the scheduler: hand the request straight to the downloader
            request.meta['handle_httpstatus_all'] = True
            dfd = self.crawler.engine.download(request, info.spider)
            dfd.addCallbacks(
                callback=self.media_downloaded, callbackArgs=(request, info),
                errback=self.media_failed, errbackArgs=(request, info))
        return dfd
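
To get the behavior asked about above (HTML first, media afterwards), one approach is to collect media URLs during the crawl and only schedule them from a spider_idle handler once the HTML queue is empty. A rough sketch with made-up names; note that engine.crawl() also required a spider argument in older Scrapy versions:

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class DeferredMediaSpider(scrapy.Spider):
        name = 'deferred_media'
        start_urls = ['http://example.com/']

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
            spider.pending_media = []
            return spider

        def parse(self, response):
            # HTML phase: record media URLs instead of downloading them
            for src in response.css('img::attr(src)').getall():
                self.pending_media.append(response.urljoin(src))
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)

        def on_idle(self, spider):
            # fires when no HTML requests are left; flush the media queue
            if self.pending_media:
                while self.pending_media:
                    url = self.pending_media.pop()
                    self.crawler.engine.crawl(scrapy.Request(url, callback=self.save_media))
                raise DontCloseSpider  # keep the spider alive for the new requests

        def save_media(self, response):
            self.logger.info('downloaded %s (%d bytes)', response.url, len(response.body))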
paul trmbrth
  • Thank you, I am trying to evaluate the "best" way to go: 1. during the HTML crawl, store all file URLs in a queue, then launch another Scrapy spider to handle all the files at once (simple solution), or 2. derive from the media pipeline (elegant solution), but in that case I also need to store all file URLs in a persistent queue... Can you point me in a direction? I am not that interested in downloading the files, but rather in detecting whether they are available (200 or 404) and getting their size in KB – Antoine Brunel Apr 23 '16 at 07:45
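
For the availability-and-size check mentioned in the comment, HEAD requests avoid downloading the file bodies entirely; a rough sketch (spider name and input URLs are made up):

    import scrapy

    class FileCheckSpider(scrapy.Spider):
        name = 'filecheck'
        # made-up input; in practice these URLs would come from the HTML crawl
        file_urls = ['http://example.com/a.pdf', 'http://example.com/b.jpg']

        def start_requests(self):
            for url in self.file_urls:
                # HEAD fetches headers only; handle_httpstatus_all lets 404s
                # reach the callback instead of being filtered out
                yield scrapy.Request(url, method='HEAD',
                                     meta={'handle_httpstatus_all': True},
                                     callback=self.check)

        def check(self, response):
            raw = response.headers.get('Content-Length')  # bytes or None
            yield {
                'url': response.url,
                'status': response.status,  # e.g. 200 or 404
                'kb': int(raw.decode()) / 1024.0 if raw else None,
            }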