I know this must be a newbie question, but I can't work out how to take an href that points to an mp3 file and have Scrapy follow that link and download the file (or any file, for that matter). I have read the documentation and various Stack Overflow questions, but still can't figure it out.
Here is my code setup:
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'audio_files'
items.py
from scrapy.item import Item, Field


class Mp3projectItem(Item):
    title = Field()
    mp3_link = Field()
    file_urls = Field()
    # calculated fields
    files = Field()
    # Log fields
    url = Field()
    date = Field()
spider.py
import scrapy
from scrapy import Request  # needed for the Request(...) call in parse()
from mp3_project.items import Mp3projectItem


class Mp3pipeSpider(scrapy.Spider):
    name = 'mp3pipe'
    allowed_domains = ['<thewebsite>.com']
    start_urls = ['https://<thewebsite>.com/foo/bar/']

    def parse(self, response):
        item = Mp3projectItem()
        item['title'] = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
        item['mp3_link'] = response.xpath("//*[@class='spam-content']//a/@href").extract_first()
        item['url'] = response.url
        return item
        for url in response.xpath("//*[@class='spam-content']//a/@href").extract_first():
            # Could I have used: for url in item['mp3_link']?
            yield Request(url, callback=self.parse_item)
        # In scrapy shell the response brings back the absolute url, so no need
        # for urlparse.urljoin(response.url, url).
        # This also throws up SyntaxError: 'return' with argument inside generator.
        # This is obviously wrong and feels like overkill given the pipeline,
        # but I don't know where to put the file_urls = Field() because the mp3 file is
        # in an embedded link.

    def parse_item(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
I am almost positive I don't need the manual open/write code in parse_item at all, because the FilesPipeline is designed to do this. But I can't figure out where in the spider to populate the file_urls field so that the pipeline picks it up and downloads the file. Any help would be very much appreciated.
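
For reference, here is a minimal sketch of what I understand the FilesPipeline to expect: an item whose file_urls field is a list of absolute URLs, yielded from parse() with no separate download callback. The build_item helper, the example.com URLs, and the use of a plain dict instead of Mp3projectItem are assumptions made only for illustration; they are not from my project.

```python
from urllib.parse import urljoin


def build_item(page_url, title, href):
    """Build a plain-dict item for FilesPipeline (hypothetical helper)."""
    # urljoin leaves an absolute href unchanged and resolves a relative one
    # against the page URL, so it is safe to call in both cases.
    mp3_url = urljoin(page_url, href)
    return {
        "title": title,
        "mp3_link": mp3_url,
        # FilesPipeline reads the URLs to download from 'file_urls';
        # the value must be a list, even when there is only one file.
        "file_urls": [mp3_url],
        "url": page_url,
    }


# Inside the spider, parse() would then only need to yield the item:
#
#     def parse(self, response):
#         title = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
#         href = response.xpath("//*[@class='spam-content']//a/@href").extract_first()
#         if href:
#             yield build_item(response.url, title, href)

# Placeholder values, used only to show the shape of the resulting item.
item = build_item(
    "https://example.com/foo/bar/",
    "Some title",
    "/media/track01.mp3",
)
print(item["file_urls"])  # ['https://example.com/media/track01.mp3']
```

If this is right, then with ITEM_PIPELINES and FILES_STORE set as in settings.py above, the pipeline should download each URL into the FILES_STORE directory and record the results in the item's files field, which would make parse_item unnecessary.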