I want each quote from http://quotes.toscrape.com/ saved into a CSV file (two fields: author, quote). Another requirement is to save these quotes in different files separated by the page they reside on, i.e. page1.csv, page2.csv, and so on. I tried to achieve this by declaring feed exports in the custom_settings
attribute of my spider, as shown below. This, however, doesn't even produce a file called page-1.csv
. I am a total beginner with Scrapy, so please try to explain assuming I know little to nothing.
import scrapy
import urllib.parse

class spidey(scrapy.Spider):
    name = "idk"
    start_urls = [
        "http://quotes.toscrape.com/"
    ]

    custom_settings = {
        'FEEDS' : {
            'file://page-1.csv' : {  # edit: uri needs to be absolute path
                'format' : 'csv',
                'store_empty' : True
            }
        },
        'FEED_EXPORT_ENCODING' : 'utf-8',
        'FEED_EXPORT_FIELDS' : ['author', 'quote']
    }

    def parse(self, response):
        for qts in response.xpath("//*[@class=\"quote\"]"):
            author = qts.xpath("./span[2]/small/text()").get()
            quote = qts.xpath("./*[@class=\"text\"]/text()").get()
            yield {
                'author' : author,
                'quote' : quote
            }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()
        if next_pg is not None:
            next_pg = urllib.parse.urljoin(self.start_urls[0], next_pg)
            yield scrapy.Request(next_pg, self.parse)
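To clarify the edit note in the snippet: with the file:// scheme the URI apparently has to be an absolute path. A minimal sketch of what that key would look like (the /home/user/scrapy_output/ path is just a placeholder, not my real path):

    custom_settings = {
        'FEEDS' : {
            # absolute path after file:// (placeholder path)
            'file:///home/user/scrapy_output/page-1.csv' : {
                'format' : 'csv',
                'store_empty' : True
            }
        },
        'FEED_EXPORT_ENCODING' : 'utf-8',
        'FEED_EXPORT_FIELDS' : ['author', 'quote']
    }

Even if that produces the file, as far as I understand a single FEEDS entry would collect the quotes from every page into that one page-1.csv, which still isn't the one-file-per-page behaviour I'm after.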
How I ran the crawler: scrapy crawl idk
As an added question: I need my files to be overwritten rather than appended to, as happens when specifying the -o
flag. Is it possible to do this without having to manually check for and delete preexisting files from the spider?
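To make the end goal concrete, below is the kind of configuration I'm imagining. I don't know whether FEEDS actually accepts a per-feed overwrite option in my Scrapy version, and the per-page file names are purely wishful thinking on my part, since the number of pages isn't known before the crawl:

    custom_settings = {
        'FEEDS' : {
            # wishful thinking: one entry per page, overwritten on every run
            'file:///home/user/scrapy_output/page-1.csv' : {
                'format' : 'csv',
                'overwrite' : True,  # guessing at an option name here
            },
            'file:///home/user/scrapy_output/page-2.csv' : {
                'format' : 'csv',
                'overwrite' : True,
            },
            # ...and so on, except I can't list the pages up front
        },
    }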