I'm scraping this page:
http://www.mymcpl.org/cfapps/botb/movie.cfm
I'm extracting four fields: book, author, movie, and movie_year.
I want to save them in a CSV file where each row contains the record of one movie.
This is the spider I wrote:
import string

import scrapy


class simple_spider(scrapy.Spider):
    name = 'movies_spider'
    allowed_domains = ['mymcpl.org']
    download_delay = 1
    # One start URL per letter, A to Z
    start_urls = ['http://www.mymcpl.org/cfapps/botb/movie.cfm?browse={}'.format(letter) for letter in string.ascii_uppercase] # ['http://www.mymcpl.org/cfapps/botb/movie.cfm']

    def parse(self, response):
        # XPath templates; {} is filled in with the table row number
        xpaths = {'book':'//*[@id="main"]/tr[{}]/td[2]/text()[1]',
                  'author':'//*[@id="main"]/tr[{}]/td[2]/a/text()',
                  'movie':'//*[@id="main"]/tr[{}]/td[1]/text()[1]',
                  'movie_year':'//*[@id="main"]/tr[{}]/td[1]/a/text()'}
        data = {key:[] for key in xpaths}
        # Row 1 is the table header, so start at row 2
        for row in range(2,len(response.xpath('//*[@id="main"]/tr').extract()) + 1):
            for key in xpaths.keys():
                value = response.xpath(xpaths[key].format(row)).extract_first()
                data[key] = (value)
            yield data.values()
To run the spider:
scrapy runspider m_spider.py -o output.csv
I'm having two problems here:
1) Each row of the CSV file contains not only the current record but all the previous records too, even though I'm not appending the values to the dictionary.
2) The spider is scraping only the first page of start_urls.
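
To make problem 1 concrete, here is a small stand-alone sketch of the output shape I expect: one record per CSV row, with nothing carried over from earlier rows. It does not use Scrapy at all. It assumes Python 3 and only the standard library; the helper names records and to_csv and the sample rows are made up for illustration and are not data from the site.

```python
import csv
import io

FIELDS = ['book', 'author', 'movie', 'movie_year']


def records(table_rows):
    """Yield one new dict per table row, so no state is shared between rows."""
    for cells in table_rows:
        # Build a fresh dict on every iteration instead of reusing one
        yield dict(zip(FIELDS, cells))


def to_csv(table_rows):
    """Write the records to an in-memory CSV and return its text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    for record in records(table_rows):
        writer.writerow(record)
    return buffer.getvalue()


# Placeholder rows (hypothetical values, not scraped from the page)
sample_rows = [
    ('Book A', 'Author A', 'Movie A', '2001'),
    ('Book B', 'Author B', 'Movie B', '2002'),
]

print(to_csv(sample_rows))
```

Each line after the header should hold exactly one movie, which is what I would like the spider's CSV export to look like.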