
I made the improvements suggested by alecxe below. What I need is like the picture below: each row/line should be one review, with date, rating, review text and link.

I need the item processor to handle every review on every page. Currently TakeFirst() only takes the first review of each page, so with 10 pages I get only 10 lines/rows, as in the picture below.

[Screenshot: desired spreadsheet output, one review per row]

Spider code is below:

import scrapy
from amazon.items import AmazonItem

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    # Note: the original URL string had no placeholder, so .format(page)
    # produced 113 identical URLs; pageNumber is the review-page query parameter.
    start_urls = [
        'http://www.amazon.co.uk/product-reviews/B0042EU3A2/?pageNumber={0}'.format(page)
        for page in xrange(1, 114)
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="productReviews"]//tr/td[1]'):
            item = AmazonItem()
            item['rating'] = sel.xpath('div/div[2]/span[1]/span/@title').extract()
            item['date'] = sel.xpath('div/div[2]/span[2]/nobr/text()').extract()
            item['review'] = sel.xpath('div/div[6]/text()').extract()
            item['link'] = sel.xpath('div/div[7]/div[2]/div/div[1]/span[3]/a/@href').extract()
            yield item
  • You want only the review text to be in the output, right? – alecxe Apr 29 '15 at 12:00
  • @alecxe no sir. just as an example. I would like to have rating, date, review, link as 4 different columns in excel. Thank you! – W.S. Apr 29 '15 at 12:33
  • @alecxe this is my attempt below. it did not work. probably because i do not understand the mechanic for pipeline. `import csv class CsvWriterPipeline(object): def __init__(self): self.csvwriter = csv.writer(open('amazon.csv', 'wb')) def process_item(self, item, spider): self.csvwriter.writenow(item['rating'], item['date'], item['review'], item['link']) return item` – W.S. Apr 29 '15 at 14:19
  • Why do you want to care for the CSV export yourself? You could also use `scrapy crawl amazon -t csv -o Output_File.csv` to get a CSV file with your fields. This can then be imported into your favorite spreadsheet program. – Frank Martin Apr 29 '15 at 14:44
  • @frankmartin I need to export data into columns for post data processing. the command line by default is in xml format. so it is not in columns that are needed. – W.S. Apr 29 '15 at 14:51
  • If you use the `-t csv` option on the command line the format will be CSV ... maybe you want to give it a try!? And have a look at the [documentation](http://doc.scrapy.org/en/stable/topics/feed-exports.html#csv). – Frank Martin Apr 29 '15 at 15:03
  • @frankmartin thx. but the issue is not that I am not able to export to CSV file. it is that I am not able to export to CSV with right formatting, which allows me to open with column view, not standard xml view. btw...I did try before I post. – W.S. Apr 29 '15 at 18:40
  • @alecxe I am thinking to use class scrapy.contrib.exporter.CsvItemExporter(file, include_headers_line=True, join_multivalued=', ', **kwargs).. but I am not able to set it up properly. your help is highly appreciated! – W.S. Apr 29 '15 at 19:02
  • Can you edit your question and add an exact example of the expected output? I simply don't get what you want - Sorry – Frank Martin Apr 29 '15 at 21:16
  • @frankmartin thanks for trying to help! I just added a picture, hopefully it is clear for you now. let me know. So I would like the data in column - vertical. the standard csv output does not allow to have that, it is more a horizontal view. – W.S. Apr 29 '15 at 21:44
  • @alecxe Could you also help here? Thanks in advance! – W.S. Apr 29 '15 at 21:46
  • With the `-t csv` command line option I always get the structure you describe. What does it look like when you use that option? Maybe you can add also how the default output looks like for you? (Will check back tomorrow) – Frank Martin Apr 29 '15 at 22:02
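
For reference, the feed export Frank Martin suggests in these comments can also be wired into the project's settings.py instead of being passed on the command line. A minimal sketch, assuming the four fields used by the spider (note: FEED_EXPORT_FIELDS is only honoured by newer Scrapy releases than the contrib-era code in the answers below):

# settings.py -- equivalent of `scrapy crawl amazon -t csv -o Output_File.csv`
FEED_FORMAT = 'csv'
FEED_URI = 'Output_File.csv'
# optional, newer Scrapy only: pin the column order of the exported CSV
FEED_EXPORT_FIELDS = ['rating', 'date', 'review', 'link']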

2 Answers


I started from scratch and the following spider should be run with

scrapy crawl amazon -t csv -o Amazon.csv --loglevel=INFO

so that opening the CSV file with a spreadsheet shows, for me:

[Screenshot: the resulting CSV opened in a spreadsheet, one review per row]

Hope this helps :-)

import scrapy

class AmazonItem(scrapy.Item):
    rating = scrapy.Field()
    date = scrapy.Field()
    review = scrapy.Field()
    link = scrapy.Field()

class AmazonSpider(scrapy.Spider):

    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/product-reviews/B0042EU3A2/' ]

    def parse(self, response):

        # each review sits in its own div inside the productReviews table
        for sel in response.xpath('//table[@id="productReviews"]//tr/td/div'):

            item = AmazonItem()
            item['rating'] = sel.xpath('./div/span/span/span/text()').extract()
            item['date'] = sel.xpath('./div/span/nobr/text()').extract()
            item['review'] = sel.xpath('./div[@class="reviewText"]/text()').extract()
            item['link'] = sel.xpath('.//a[contains(.,"Permalink")]/@href').extract()
            yield item

        # follow the "Next" pagination link, if there is one, with the same callback
        xpath_Next_Page = './/table[@id="productReviews"]/following::*//span[@class="paging"]/a[contains(.,"Next")]/@href'
        if response.xpath(xpath_Next_Page):
            url_Next_Page = response.xpath(xpath_Next_Page).extract()[0]
            request = scrapy.Request(url_Next_Page, callback=self.parse)
            yield request
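
One caveat in the pagination step: the next request is built from the raw href, which works only while Amazon serves absolute URLs in the Next link. A defensive variant of the last block (the urljoin call is an addition, not part of the answer above; urlparse is the Python 2 module, consistent with the xrange in the question):

import urlparse  # Python 2 stdlib

# inside parse(), replacing the last four lines above
if response.xpath(xpath_Next_Page):
    href = response.xpath(xpath_Next_Page).extract()[0]
    # resolve a possibly-relative href against the current page URL
    yield scrapy.Request(urlparse.urljoin(response.url, href), callback=self.parse)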
Frank Martin
  • You are Great!!! thank you! it worked like a charm. occasionally I would miss a link/url here and there. But it is nothing major, I can continue my next step for post data processing now! – W.S. Apr 30 '15 at 18:37

If using `-t csv` (as proposed by Frank in the comments) does not work for you for some reason, you can always use the built-in CsvItemExporter directly in a custom pipeline, e.g.:

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class AmazonPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('output.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

which you need to add to ITEM_PIPELINES:

ITEM_PIPELINES = {
    'amazon.pipelines.AmazonPipeline': 300
}
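
If the project runs more than one spider, the hard-coded output.csv means they all write to the same file. Deriving the file name from spider.name is one way around that; a small variation on the spider_opened method above (the per-spider naming scheme is an assumption, not part of the original answer):

    def spider_opened(self, spider):
        # one CSV per spider, e.g. amazon.csv for this project
        self.file = open('%s.csv' % spider.name, 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()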

Also, I would use an Item Loader with input and output processors to join the review text and replace new lines with spaces. Create an ItemLoader class:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, Join, MapCompose


class AmazonItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    review_in = MapCompose(lambda x: x.replace("\n", " "))
    review_out = Join()

Then, use it to construct an Item:

def parse(self, response):
    for sel in response.xpath('//*[@id="productReviews"]//tr/td[1]'):
        loader = AmazonItemLoader(item=AmazonItem(), selector=sel)

        loader.add_xpath('rating', './/div/div[2]/span[1]/span/@title')
        loader.add_xpath('date', './/div/div[2]/span[2]/nobr/text()')
        loader.add_xpath('review', './/div/div[6]/text()')
        loader.add_xpath('link', './/div/div[7]/div[2]/div/div[1]/span[3]/a/@href')

        yield loader.load_item()
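
Since default_output_processor = TakeFirst() reduces every other field to its first extracted value, each field ends up as a single string rather than a one-element list, so the CSV export gets one clean cell per column and one row per review.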
alecxe
  • Thank you so much for showing me the direction! I think Loader is the way to go. I need to do some fine tuning to have the right layout to suit my needs. I may still come back to you if I am stuck. ;-) – W.S. Apr 30 '15 at 09:13
  • I am stuck again. I edited the original question to reflect the improvement based on your suggestion. still cannot resolve it to the way I like to have. Could you check the question again at above? – W.S. Apr 30 '15 at 12:55