Downloading files with ItemLoaders() in Scrapy

Question

I created a crawl spider to download files. However, the spider downloaded only the URLs of the files and not the files themselves. I posted an earlier question about this: "Scrapy crawl spider does not download files?". The basic yield-based spider kindly suggested in the answers there works perfectly, but when I attempt to download files with items or item loaders, the spider does not work. The original question did not include items.py, so here it is:

ITEMS

import scrapy
from scrapy.item import Item, Field


class DepositsusaItem(Item):
    # main fields
    name = Field()
    file_urls = Field()
    files = Field()
    # Housekeeping Fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()

EDIT: added original code EDIT: further corrections

SPIDER

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from us_deposits.items import DepositsusaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urllib.parse import urljoin


class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_xpath('file_urls', '//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url',
                    # resolve each relative data-url against the page URL
                    MapCompose(lambda url: urljoin(response.url, url))
                    )
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()

SETTINGS

BOT_NAME = 'us_deposits'
SPIDER_MODULES = ['us_deposits.spiders']
NEWSPIDER_MODULE = 'us_deposits.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
    'us_deposits.pipelines.FilesPipeline': 2
}

FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'

PIPELINES

class UsDepositsPipeline(object):
    def process_item(self, item, spider):
        return item


class FilesPipeline(object):
    def process_item(self, item, spider):
        return item

Did you actually modify your code to include the `file_urls` field? (Even if you have the same code, you should still include it in the question instead of linking to another one) — stranac, Dec 08 '18 at 15:12

score 2 · Accepted Answer · answered Dec 08 '18 at 16:27

It seems to me that using items and/or item loaders has nothing to do with your problem.

The only problems I see are in your settings file:

FilesPipeline is not activated (only us_deposits.pipelines.UsDepositsPipeline is)
FILES_STORE should be a string, not a set (an exception is raised when you activate the files pipeline)
ROBOTSTXT_OBEY = True will prevent the downloading of files

If I correct all of those issues, the file download works as expected.
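
A minimal sketch of a settings.py with the three points above applied. It assumes Scrapy's built-in files pipeline, whose import path is scrapy.pipelines.files.FilesPipeline; the FILES_STORE directory name is a placeholder, not the asker's actual path.

```python
# settings.py (sketch; only the settings relevant to file downloads)
BOT_NAME = 'us_deposits'
SPIDER_MODULES = ['us_deposits.spiders']
NEWSPIDER_MODULE = 'us_deposits.spiders'

# With True, requests disallowed by the site's robots.txt are dropped,
# and that includes the file download requests.
ROBOTSTXT_OBEY = False

# Activate Scrapy's built-in files pipeline. It reads URLs from the item's
# `file_urls` field and stores the download results in `files`.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# A plain string path, not a set such as {'downloads'}.
# 'downloads' is a placeholder directory name (assumption).
FILES_STORE = 'downloads'
```

With settings along these lines, the item only needs the file_urls and files fields already declared in items.py; no custom pipeline class is required for the download itself.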

stranac

Thank you very much for the response. I followed your corrections and edited the code above accordingly. For some reason it is still not working! – GKV Dec 08 '18 at 17:06
`FilesPipeline` should be scrapy's `scrapy.pipelines.files.FilesPipeline`, not a thing you write yourself. – stranac Dec 08 '18 at 17:33
Thank you! Thank you! Thank you! I have been trying for more than 10 days to get it to work! You are the best! – GKV Dec 08 '18 at 18:10
