
I know this must be a newbie question, but I can't figure out how to follow an href link pointing to an mp3 file and download that mp3 file (or any file, for that matter). I have tried the documentation and various Stack Overflow questions, but can't seem to work it out.

Here is my code setup:

settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'

ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'audio_files'

items.py

from scrapy.item import Item, Field


class Mp3projectItem(Item):
    title = Field()
    mp3_link = Field()
    file_urls = Field()

    # calculated fields
    files = Field()
    # Log fields
    url = Field()
    date = Field()

spider.py

import scrapy
from scrapy import Request

from mp3_project.items import Mp3projectItem


class Mp3pipeSpider(scrapy.Spider):
    name = 'mp3pipe'
    allowed_domains = ['<thewebsite>.com']
    start_urls = ['https://<thewebsite>.com/foo/bar/']

    def parse(self, response):
        item = Mp3projectItem()
        item['title'] = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
        item['mp3_link'] = response.xpath("//*[@class='spam-content']//a/@href").extract_first()
        item['url'] = response.url
        return item

        for url in response.xpath("//*[@class='spam-content']//a/@href").extract_first():
            # Could I have used: for url in item['mp3_link']?
            yield Request(url, callback=self.parse_item)
            # in scrapy shell the response brings back the absolute url, so no need
            # for urlparse.urljoin(response.url, url)
            # this also throws up SyntaxError: 'return' with argument inside generator


    # this is obviously wrong and feels like overkill given the pipeline,
    # but I don't know where to put the file_urls = Field() because the mp3
    # file is behind an embedded link
    def parse_item(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)

I am almost positive I don't even need the filename open/write implementation, because the pipeline is designed to do this. But I can't figure out where in the spider to set the file_urls field, or how to get the pipeline to work. Any help would be very much appreciated.

R.Zane
  • I have figured out how to amend the code to download the files by using the solution provided by @granitosaurus in [this Stack Overflow post](https://stackoverflow.com/questions/45475184/scrapy-media-pipeline-files-not-downloading). However, I had to amend the code to extract all files (`extract()`) instead of `extract_first()`, and I don't know why. – R.Zane Sep 26 '18 at 21:15
  • 3
    it's because scrapy's media/file pipeline expects `file_urls` to be a list of urls rather than just one url. `extract()` returns list of all values matching your xpath/css expression, while `extract_first()` returns only the first one. – Granitosaurus Sep 27 '18 at 03:32
  • 1
    @granitosaurus Very Helpful (the second time if I reflect your very insightful solution I aforementioned). Thanks. I am a newbie...came across this in various documentation/web resources...this comment confirmed my suspicion. I saw these two links [number one] (https://stackoverflow.com/questions/51543561/understandin-how-rename-images-scrapy-works) and [number two] (https://stackoverflow.com/questions/48353047/scrapy-image-pipeline-how-to-rename-images) which incorporate 'extract_first()' so I am confused. Will do some newbie thinking .Thx again for the clean/articulate solution and headsup! – R.Zane Sep 28 '18 at 04:58

0 Answers