
I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I submit a job schedule to scrapyd, I can see the output file being created, and it grows as the spider scrapes.

My problem is that I can't be sure when the output file is ready, i.e. when the spider has completed. One way to do it would be to rename the output file to something like "output.done" so my other programs can list these files and process them.

My current method is to check the file's modify time; if it doesn't change for five minutes, I assume the spider is done. However, five minutes sometimes isn't enough, and I'd really rather not extend it to 30 minutes.
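
In code, the check I run now is roughly this (the five-minute threshold is the part that keeps failing me):

import os
import time

STABLE_SECONDS = 5 * 60

def looks_done(path):
  # Heuristic: assume the spider is done if the file has not been
  # modified for STABLE_SECONDS.
  return time.time() - os.path.getmtime(path) >= STABLE_SECONDS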

Andy

2 Answers


You may want to use Scrapy signals, particularly spider_opened and spider_closed, to know when the spider is using the file. More info can be found here: http://doc.scrapy.org/en/latest/topics/signals.html

spider_opened could rename the file to "output.progress" and spider_closed could rename it to "output.done" to indicate the file is no longer in use by the spider.
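
For example, a minimal extension sketch along these lines, showing the spider_closed half (OUTPUT_PATH is a made-up setting here; plug in however your project knows where the feed is written):

import os

from scrapy import signals


class MarkDoneExtension(object):
  """Rename the feed file to "<path>.done" once the spider finishes."""

  def __init__(self, path):
    self.path = path

  @classmethod
  def from_crawler(cls, crawler):
    # OUTPUT_PATH is a hypothetical setting holding the feed file path.
    ext = cls(crawler.settings.get("OUTPUT_PATH"))
    crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
    return ext

  def spider_closed(self, spider):
    # Even if the feed exporter has not closed the file yet, the rename
    # is safe on POSIX, since open handles follow the renamed file.
    os.rename(self.path, self.path + ".done")

The extension would be enabled through the EXTENSIONS setting, and a spider_opened handler renaming to ".progress" would work the same way.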

If the output file is written by an item pipeline, then the open_spider and close_spider callbacks could be used instead, with the same logic as the signals approach. More info about the item pipeline callbacks: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline.
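
A minimal sketch of that pipeline variant (spider.output_path is a hypothetical attribute; adapt it to wherever your spider stores its output location):

import os


class RenameOutputPipeline(object):
  """Write items to "<path>.inprogress", rename to the real path when done."""

  def open_spider(self, spider):
    # spider.output_path is assumed here; adapt to your setup.
    self.final_path = spider.output_path
    self.tmp_path = self.final_path + ".inprogress"
    self.file = open(self.tmp_path, "wb")

  def process_item(self, item, spider):
    # ... serialize the item into self.file here ...
    return item

  def close_spider(self, spider):
    self.file.close()
    # Rename only after closing, so consumers never see a partial file.
    os.rename(self.tmp_path, self.final_path)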

sardok
  • Thank you. Using signals is a good idea, but any thoughts on how it works with an existing pipeline? Specifically, I am not sure I can just rename the file at the beginning and the end of the spider. Won't that break the exporter's output file handle? – Andy May 06 '15 at 13:37
  • If you are using an item pipeline to write the output file, you may want to use the open_spider/close_spider callbacks: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline – sardok May 06 '15 at 13:53

I got a working solution after trying different approaches. Since in my particular case I dump the output into files, specifically bz2 files, I customized FileFeedStorage to do the job before opening and after closing the file. See the code below:

from scrapy.contrib.feedexport import FileFeedStorage
import os
import bz2

MB = 1024 * 1024


class Bz2FileFeedStorage(FileFeedStorage):
  IN_PROGRESS_MARKER = ".inprogress"

  def __init__(self, uri):
    super(Bz2FileFeedStorage, self).__init__(uri)
    # Write to "<path>.inprogress" while the spider is running.
    self.in_progress_file = self.path + Bz2FileFeedStorage.IN_PROGRESS_MARKER

  def open(self, spider):
    # Make sure the target directory exists before opening the feed.
    dirname = os.path.dirname(self.path)
    if dirname and not os.path.exists(dirname):
      os.makedirs(dirname)
    # Compress on the fly, with a 10 MB buffer to cut down on small writes.
    return bz2.BZ2File(self.in_progress_file, "w", 10 * MB)

  def store(self, file):
    # Let the base class close the file, then atomically rename it to its
    # final name so consumers only ever see complete files.
    super(Bz2FileFeedStorage, self).store(file)
    os.rename(self.in_progress_file, self.path)
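
To have Scrapy actually use this class for local file feeds, it can be registered through the FEED_STORAGES setting; the module path below is an assumption about where the class lives:

# settings.py -- assuming Bz2FileFeedStorage lives in myproject/feedstorage.py
FEED_STORAGES = {
  "": "myproject.feedstorage.Bz2FileFeedStorage",      # plain paths, no scheme
  "file": "myproject.feedstorage.Bz2FileFeedStorage",  # file:// URIs
}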
Andy