17

I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.

From what I can gather I need to make a pipeline, but from what I understand pipelines save "items", and items are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?

Pawel Miech
John Lotacs

3 Answers

18

Yes and no[1]. If you fetch a PDF it will be stored in memory, but as long as the PDFs are not big enough to fill up your available memory, that is OK.

You could save the pdf in the spider callback:

# requires: from scrapy.http import Request
def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    # get_path() is a helper you implement to map a URL to a local file path
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
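`get_path` is not a Scrapy API; it's a helper you implement yourself (see the comments below). A minimal sketch, assuming you want to mirror the site's URL structure under a local `downloads/` directory (both the helper and the directory name are illustrative):

import os
from urllib.parse import urlparse

def get_path(self, url):
    # Mirror the URL path under a local base directory, e.g.
    # http://example.com/docs/report.pdf -> downloads/docs/report.pdf
    relative = urlparse(url).path.lstrip("/")
    path = os.path.join("downloads", relative)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path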

If you choose to do it in a pipeline:

# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines. ie. db store
    return item
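For the pipeline variant to work, the item needs the fields the code above uses (`body`, `url`, `path`) and the pipeline must be enabled in the project settings. A minimal sketch; the item, class, and module names here are illustrative, not from the answer:

# items.py
import scrapy

class MyItem(scrapy.Item):
    body = scrapy.Field()
    url = scrapy.Field()
    path = scrapy.Field()

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.SavePdfPipeline": 300,
}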

[1] Another approach could be to store only the PDFs' URLs and use another process to fetch the documents without buffering them into memory (e.g. wget).
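A rough sketch of that approach, assuming a hypothetical listing callback that only yields the URLs; the exported URLs can then be fed to an external downloader such as `wget -i urls.txt`, which streams each file to disk without holding it in memory:

def parse_listing(self, response):
    # collect only the PDF URLs instead of downloading the bodies
    for href in response.css("a[href$='.pdf']::attr(href)").getall():
        yield {"pdf_url": response.urljoin(href)}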

R. Max
  • Does this work with CrawlSpider, because I spent hours trying to implement the saving of the pdf in the spider & the callback function never gets called. – Kex Nov 12 '11 at 19:57
  • @Kex hard to tell what's wrong without seeing your code. A common pitfall is overriding `parse` callback or not using the right pattern in the link extractors. – R. Max Nov 13 '11 at 20:06
  • I solved the problem without this, now I download the pdf files using SgmlLinkExtractor within the rules & save the response into a pdf file. – Kex Nov 13 '11 at 23:02
  • @Kex: I am trying to build a similar system. Can you tell me how exactly did you make the SgmlLinkExtractor to do that for you? – kidd0 Jan 17 '14 at 12:39
  • 1
    @bi0s.kidd0, maybe your are looking for something like `Rule(SgmlLinkExtractor(allow=r"\.pdf"), callback="save_pdf")`. – R. Max Jan 17 '14 at 13:01
  • @Rho: Thanks, yeah I understood that. My goal is to download exe files, and I am playing around with scrapy. I am confused in many ways: which is the best way to do this, in the pipelines, in the spider, or using a link extractor? – kidd0 Jan 17 '14 at 14:29
  • @Rho: Also can you comment on the answer posted by @Deming? Is that something scrapy provides? or some custom made code? – kidd0 Jan 17 '14 at 14:32
  • @bi0s.kidd0 the link extractor + crawl spider only saves you a few steps of extracting the links from the page and building the requests. The simplest way is to do it like in this answer, but the right way would be using the `FilesPipeline`: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/pipeline/files.py (Unfortunately it is a bit undocumented, but basically you enable the pipeline and use the `file_urls` field in your item.) – R. Max Jan 17 '14 at 15:47
  • `self.get_path` doesn't work on my machine. And it's not needed anyway. We can just type the path on our own – Aminah Nuraini Nov 14 '15 at 02:50
  • @AminahNuraini Sorry for not being clear, `get_path` is a method you should implement in order to convert an URL into a suitable file path. – R. Max Nov 18 '15 at 01:14
9

There is a FilesPipeline that you can use directly, assuming you already have the file URL. This link shows how to use FilesPipeline:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
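The gist of that thread, in current Scrapy terms: enable `FilesPipeline`, point `FILES_STORE` at a directory, and put the download URLs in a `file_urls` field (the results land in `files`). A minimal sketch, with illustrative item and callback names:

# settings.py
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"

# items.py
import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()   # URLs for FilesPipeline to download
    files = scrapy.Field()       # filled in by FilesPipeline after download

# in the spider
def parse_listing(self, response):
    urls = response.css("a[href$='.pdf']::attr(href)").getall()
    yield PdfItem(file_urls=[response.urljoin(u) for u in urls])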

Paweł Szczur
Deming
  • FilesPipeline link is deprecated. Use this one instead: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py – Guillaume Nov 04 '16 at 09:30
4

It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are post-processors, but they use the same asynchronous infrastructure as spiders, so they are perfect for fetching media files.

In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline save the items.
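As a concrete illustration of that split, pipelines run in ascending priority order, so a fetching pipeline can hand the item on to a storing pipeline. A sketch, with hypothetical class names:

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.FetchPdfPipeline": 100,   # downloads the PDF, adds a 'path' field
    "myproject.pipelines.StoreItemPipeline": 200,  # e.g. writes the item metadata to a database
}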

Seb