
I've been trying to get Scrapy to download all of the PDFs from a website, but I can't get it to actually download the files. The crawler works fine and visits every page of the site, yet nothing is downloaded.

I'm quite new to Python and web scraping, so I'm not sure whether I'm just overlooking something or not understanding how to relate other people's issues to mine. I've followed a few tutorials and walkthroughs from the Scrapy website and elsewhere, but I just can't get my head around it.

In addition, I would like to download only the files containing "spec_sheet" if possible (these are located in the downloads section of any of the lights on the website), and to name the PDFs as they are named on the website rather than the string of random letters and numbers they get when they download.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/Users/lukewoods/Desktop/AppDev/LightingPDFs/Martin'

class ZipfilesItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()


class MartinSpider(CrawlSpider):
    name = 'martinspider'
    start_urls = ['https://www.martin.com/en/']


    rules = (
        Rule(LinkExtractor(allow=r'products/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'product_families/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'discontinued_products/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        file_url = response.xpath(("//tr[@class='download-item-row']/a/@href")).get()
        file_url = response.urljoin(file_url)
        item = ZipfilesItem()
        item['file_urls'] = [file_url]
        yield item

1 Answer

After a lot of trying, I got it working. The file_url you extract in your parse_item is not actually the URL of the PDF, which is why nothing gets downloaded. Please check the code below, which follows the correct download link. Hope it helps, and thanks for your question.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MartinSpider(CrawlSpider):
    name = 'martinspider'
    start_urls = ['https://www.martin.com/en/',]
    rules = (
        Rule(LinkExtractor(allow=r'products/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'product_families/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'discontinued_products/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)

        # The link inside the download section points at the actual PDF file.
        file_url = response.css("div.small-11.columns a::attr(href)").get(default='')
        if file_url:
            file_url = response.urljoin(file_url)
            yield scrapy.Request(url=file_url, callback=self.download_pdf)

    def download_pdf(self, response):
        # Name the file after the last segment of the URL, adding the .pdf
        # extension only if it is not already there.
        path = response.url.split('/')[-1]
        if not path.endswith('.pdf'):
            path += '.pdf'
        with open(path, 'wb') as f:
            f.write(response.body)
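
If you also want to keep only the spec sheet PDFs and have them saved under their original file names (the second part of the question), Scrapy's built-in FilesPipeline can do both. Below is a rough sketch of that approach; the CSS selector and the "spec_sheet" substring check are assumptions taken from the question and not verified against the site, and the file_path signature shown is the one used by recent Scrapy versions.

import os
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.pipelines.files import FilesPipeline
from scrapy.spiders import CrawlSpider, Rule


class SpecSheetPipeline(FilesPipeline):
    # Save files under the name they have in the URL instead of the
    # SHA1 hash that FilesPipeline uses by default.
    def file_path(self, request, response=None, info=None, *, item=None):
        return os.path.basename(request.url)


class MartinSpecSheetSpider(CrawlSpider):
    name = 'martinspecsheets'
    start_urls = ['https://www.martin.com/en/']

    # Pipeline settings are ignored as plain module-level variables in the
    # spider file; they have to live in settings.py or in custom_settings.
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.SpecSheetPipeline': 1},  # adjust the dotted path to your project
        'FILES_STORE': '/Users/lukewoods/Desktop/AppDev/LightingPDFs/Martin',
    }

    rules = (
        Rule(LinkExtractor(allow=r'products/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'product_families/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'discontinued_products/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Collect every download link on the page and keep only the ones
        # whose URL contains "spec_sheet".
        urls = response.css("div.small-11.columns a::attr(href)").getall()
        file_urls = [response.urljoin(u) for u in urls if 'spec_sheet' in u]
        if file_urls:
            yield {'file_urls': file_urls}

The pipeline then downloads everything listed in file_urls into FILES_STORE, skips files it has already fetched, and the file_path override keeps the original file name from the URL.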