
I am new to Scrapy. I am trying to download files using the media pipeline, but when I run the spider no files are stored in the folder.

spider:

import scrapy
from scrapy import Request
from pagalworld.items import PagalworldItem

class JobsSpider(scrapy.Spider):
    name = "songs"
    allowed_domains = ["pagalworld.me"]
    start_urls =['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    def parse(self, response):
        urls = response.xpath('//div[@class="pageLinkList"]/ul/li/a/@href').extract()

        for link in urls:
            yield Request(link, callback=self.parse_page)




    def parse_page(self, response):
        songName = response.xpath('//li/b/a/@href').extract()
        for song in songName:
            yield Request(song, callback=self.parsing_link)


    def parsing_link(self, response):
        item = PagalworldItem()
        item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
        yield {"download_link": item['file_urls']}

Item file:

import scrapy


class PagalworldItem(scrapy.Item):
    file_urls = scrapy.Field()

Settings File:

BOT_NAME = 'pagalworld'

SPIDER_MODULES = ['pagalworld.spiders']
NEWSPIDER_MODULE = 'pagalworld.spiders'
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 5
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/tmp/media/'
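As a side note on where to look for the downloaded files: with this configuration, Scrapy's FilesPipeline saves each file under FILES_STORE in a full/ subdirectory, named by the SHA-1 hash of the request URL. A minimal standard-library sketch of how that stored path is derived (the URL below is a made-up example, not taken from the site):

```python
import hashlib
import os

# FilesPipeline derives the stored filename from the SHA-1 hash of the
# request URL plus the URL's file extension. Example URL is hypothetical.
url = "https://pagalworld.me/files/song.mp3"
digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
stored_path = os.path.join("full", digest + ".mp3")
print(stored_path)  # e.g. full/<40-char-hex-digest>.mp3
```

So with FILES_STORE = '/tmp/media/', a successful download would land at a path like /tmp/media/full/&lt;sha1&gt;.mp3.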

The output looks like this: [screenshot]

emon
  • You have not written any code to download/save the files. Go here and get some ideas: https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website Hope this helps – Nabin Aug 03 '17 at 05:06

1 Answer

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield {"download_link": item['file_urls']}

You are yielding:

yield {"download_link": ['http://someurl.com']}

but for Scrapy's Media/File pipeline to work, you need to yield an item that contains a file_urls field. So try this instead:

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield item
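
One more thing worth checking: when you yield a scrapy.Item (rather than a plain dict), the FilesPipeline also writes its download results into a files field on the item, so the item class should declare both fields or the pipeline will fail when it tries to store the results. A sketch of the item definition under that assumption:

```python
import scrapy

class PagalworldItem(scrapy.Item):
    file_urls = scrapy.Field()  # input: list of URLs for FilesPipeline to fetch
    files = scrapy.Field()      # output: populated by FilesPipeline (path, url, checksum)
```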
– Granitosaurus
  • Earlier I tried a CrawlSpider for parsing but it didn't work: https://stackoverflow.com/questions/45447451/scrapy-results-are-repeating – could you take a look? – emon Aug 03 '17 at 05:59