I've been trying to get Scrapy to download all of the PDFs from a website, but I can't get it to actually download the files. The crawler itself works fine and visits all of the pages of the site, but nothing is downloaded.
I'm quite new to Python and web scraping, so I'm not sure whether I'm overlooking something or just not understanding how other people's issues relate to mine. I've followed a few tutorials and walkthroughs from the Scrapy site and elsewhere, but I can't get my head around it.
In addition, I would ideally like to download only the files containing "spec_sheet" (they're located in the downloads section of each of the lights on the website), and to name the PDFs as they are named on the website rather than a string of random letters and numbers when they download. Here is my spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/Users/lukewoods/Desktop/AppDev/LightingPDFs/Martin'

class ZipfilesItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class MartinSpider(CrawlSpider):
    name = 'martinspider'
    start_urls = ['https://www.martin.com/en/']

    rules = (
        Rule(LinkExtractor(allow=r'products/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'product_families/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'discontinued_products/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        file_url = response.xpath("//tr[@class='download-item-row']/a/@href").get()
        file_url = response.urljoin(file_url)
        item = ZipfilesItem()
        item['file_urls'] = [file_url]
        yield item
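
From reading the docs and other answers, my guess is that ITEM_PIPELINES and FILES_STORE do nothing as module-level variables in the spider file and have to go in settings.py (or in custom_settings on the spider), something like this (untested, just my understanding):

class MartinSpider(CrawlSpider):
    name = 'martinspider'
    # guessing the settings need to be registered here (or in settings.py),
    # not as plain variables at the top of the spider file
    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': '/Users/lukewoods/Desktop/AppDev/LightingPDFs/Martin',
    }

For keeping the original filenames and only grabbing the "spec_sheet" files, my best guess from the FilesPipeline docs is a subclass that overrides file_path, plus filtering the URLs in parse_item, roughly like this ("lightingpdfs" is just a placeholder for my project name, and I've loosened the XPath to //a since I'm not sure the links are direct children of the row):

# pipelines.py -- untested sketch
import os
from urllib.parse import urlparse
from scrapy.pipelines.files import FilesPipeline

class SpecSheetPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # keep the filename from the URL instead of the default SHA1 hash
        return os.path.basename(urlparse(request.url).path)

and in the spider:

    def parse_item(self, response):
        file_urls = response.xpath("//tr[@class='download-item-row']//a/@href").getall()
        # keep only the spec sheet links and make them absolute
        file_urls = [response.urljoin(u) for u in file_urls if 'spec_sheet' in u]
        if file_urls:
            item = ZipfilesItem()
            item['file_urls'] = file_urls
            yield item

with 'ITEM_PIPELINES' pointing at 'lightingpdfs.pipelines.SpecSheetPipeline' instead of the built-in FilesPipeline. Is that roughly the right direction, and would it explain why nothing is downloading at the moment?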