
I'm scraping this page:

http://www.mymcpl.org/cfapps/botb/movie.cfm

I'm extracting four items: book, author, movie, movie_year

I want to save this in a CSV file where each row contains the record for one movie.

This is the spider I wrote:

class simple_spider(scrapy.Spider):
    name = 'movies_spider'
    allowed_domains = ['mymcpl.org']
    download_delay = 1


    start_urls = ['http://www.mymcpl.org/cfapps/botb/movie.cfm?browse={}'.format(letter) for letter in string.uppercase] # ['http://www.mymcpl.org/cfapps/botb/movie.cfm']


    def parse(self, response):
        xpaths = {'book':'//*[@id="main"]/tr[{}]/td[2]/text()[1]',
                  'author':'//*[@id="main"]/tr[{}]/td[2]/a/text()',
                  'movie':'//*[@id="main"]/tr[{}]/td[1]/text()[1]',
                  'movie_year':'//*[@id="main"]/tr[{}]/td[1]/a/text()'}

        data  = {key:[] for key in xpaths}
        for row in range(2,len(response.xpath('//*[@id="main"]/tr').extract()) + 1):
            for key in xpaths.keys():
                value = response.xpath(xpaths[key].format(row)).extract_first()
                data[key] = (value) 
        yield data.values()

To run the spider:

scrapy runspider m_spider.py output.csv

I'm having two problems here:

1) Each row of the CSV file contains not only the current record but all the previous records too, even though I'm not appending the values to the dictionary.

2) The spider is scraping only the first page of start_urls.

Luis Ramon Ramirez Rodriguez

1 Answer


Scrapy already has a built-in CSV exporter. All you have to do is yield items, and Scrapy will write them to a CSV file.

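def parse(self, response):
    xpaths = {'book':'//*[@id="main"]/tr[{}]/td[2]/text()[1]',
              'author':'//*[@id="main"]/tr[{}]/td[2]/a/text()',
              'movie':'//*[@id="main"]/tr[{}]/td[1]/text()[1]',
              'movie_year':'//*[@id="main"]/tr[{}]/td[1]/a/text()'}
    # Start at row 2, as in the question, and yield one item (dict) per table row
    rows = len(response.xpath('//*[@id="main"]/tr'))
    for row in range(2, rows + 1):
        yield {key: response.xpath(path.format(row)).extract_first()
               for key, path in xpaths.items()}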

Then just:

scrapy crawl myspider --output results.csv 

* Note the .csv extension: Scrapy can also output to .json and .jl (JSON lines) instead of CSV; just change the file extension in the argument.
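
To see why yielding one dict per row gives one CSV line per movie, here is a rough sketch in plain Python, without Scrapy. Everything in it is an assumption for illustration: `ROWS` holds placeholder values rather than data from the site, `parse_rows` and `to_csv` are hypothetical helper names, and `csv.DictWriter` only stands in for Scrapy's exporter.

```python
import csv
import io

# Hypothetical rows standing in for the table on one page; the values are
# placeholders, not data scraped from the site.
ROWS = [
    {'book': 'Book A', 'author': 'Author A', 'movie': 'Movie A', 'movie_year': '1999'},
    {'book': 'Book B', 'author': 'Author B', 'movie': 'Movie B', 'movie_year': '2005'},
]
FIELDS = ['book', 'author', 'movie', 'movie_year']


def parse_rows(rows):
    """Yield one dict per row, the way a spider's parse() yields one item per row."""
    for row in rows:
        # Build a fresh dict inside the loop so no state carries over between rows.
        yield {key: row[key] for key in FIELDS}


def to_csv(items):
    """Write each yielded dict as one CSV line (a stand-in for Scrapy's exporter)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


csv_text = to_csv(parse_rows(ROWS))
print(csv_text)
```

The point is that the `yield` sits inside the loop and a new dict is built on each pass, so every row becomes its own record instead of being collected into one shared structure.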

Granitosaurus