21

I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seems random. How can I order the CSV fields in my output?

I use the following command line to get CSV data:

scrapy crawl somewhere -o items.csv -t csv

According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless about how to use it, as I have not found any simple example to follow.

Please note: this question is very similar to THIS one. However, that question is over two years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:

to address some previous issues that seem to have already been resolved...

Many thanks in advance.

not2qubit

3 Answers

29

To use such an exporter, you need to create your own item pipeline that processes your spider's output. Assuming the simple case where you want all spider output in one file, this is the pipeline you should use (pipelines.py):

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter  # in Scrapy >= 1.0: from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ['field1', 'field2', 'field3']  # replace with your field names; order is important
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

Of course, you need to remember to add this pipeline to your configuration file (settings.py):

ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300}
ErdraugPl
  • Yes, that's what I want, but in what files do I put this code? I assume that what you refer to as the "configuration" file should be *settings.py*? And the first one, would it be *pipelines.py* by any chance? – not2qubit Dec 25 '13 at 12:20
  • So I tried that. First I got: "*Error loading object 'myproject.pipeline.CSVPipeline': No module named pipeline*" then I changed to "pipelines" and got a new error: "*NameError: global name 'signals' is not defined*" – not2qubit Dec 25 '13 at 12:53
  • 1
    Thanks! After fixing your typo and adding `from scrapy import signals`, it works. It's surprising that we should need all this additional code, just to have scrapy output CSV items in normal order, as specified in **item.py**. Why is this not fixed? – not2qubit Dec 25 '13 at 16:45
  • also worth pointing out that this'll create a csv called `<spider name>_items.csv` and ignore whatever you name the csv when executing scrapy... apart from that, thanks! :) – Sam T Feb 07 '14 at 22:14
  • 9
    Since Scrapy v. 1.0 you can set FEED_EXPORT_FIELDS in your settings file the same way as the 'fields_to_export' defined. So, you don't need the custom pipeline in this case. – bubble Dec 26 '15 at 03:56
  • 1
    @bubble That was the answer i was looking for. – Krishh May 18 '16 at 10:57
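As the comment above notes, since Scrapy 1.0 the custom pipeline can be replaced by a single setting. A minimal sketch of what that looks like (the field names here are placeholders, not from the question):

```python
# settings.py
# Columns appear in the exported CSV in exactly this order.
FEED_EXPORT_FIELDS = ["field_one", "field_two", "field_three"]
```

This covers the common case; the pipeline approach is still useful if you need per-spider files or other custom export behavior.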
13

You can now specify settings in the spider itself. https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider

To set the field order for exported feeds, set FEED_EXPORT_FIELDS. https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields

The spider below dumps all links on a website (written against Scrapy 1.4.0):

import scrapy
from scrapy.http import HtmlResponse

class DumplinksSpider(scrapy.Spider):
  name = 'dumplinks'
  allowed_domains = ['www.example.com']
  start_urls = ['http://www.example.com/']
  custom_settings = {
    # specifies exported fields and order
    'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
  }

  def parse(self, response):
    if not isinstance(response, HtmlResponse):
      return

    a_selectors = response.xpath('//a')
    for i, a_selector in enumerate(a_selectors):
      text = a_selector.xpath('normalize-space(text())').extract_first()
      url = a_selector.xpath('@href').extract_first()
      yield {
        'page_ix': i + 1,
        'page': response.url,
        'text': text,
        'url': url,
      }
      if url:  # skip anchors without an href; response.follow raises on None
        yield response.follow(url, callback=self.parse)  # see allowed_domains

Run with this command:

scrapy crawl dumplinks --loglevel=INFO -o links.csv

Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.

Mat Gessel
-1

I found a pretty simple way to solve this issue. The answers above are still more correct, I would say, but this is a quick fix. It turns out Scrapy outputs the fields in alphabetical order, and capitalization matters: a field beginning with 'A' will be output first, then 'B', 'C', etc., followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:


    import scrapy

    class ItemInfo(scrapy.Item):
        item = scrapy.Field()
        price = scrapy.Field()
        A_UPC = scrapy.Field()
        ID = scrapy.Field()
        time = scrapy.Field()

My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time
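The ordering this answer relies on is plain case-sensitive (ASCII) lexicographic sorting of the field names, which can be checked with Python's built-in `sorted()`:

```python
# Field names from the ItemInfo class above
fields = ["item", "price", "A_UPC", "ID", "time"]

# Uppercase letters sort before lowercase in ASCII, so A_UPC and ID come first
print(sorted(fields))  # → ['A_UPC', 'ID', 'item', 'price', 'time']
```

This matches the header order reported above, and explains why prefixing a field name with a capital letter pushes it to the front.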

  • Hi Jack and welcome to SO! Unfortunately your answer is not addressing the question, as it was asking *How* to order the output, whereas you only accepted the alphabetical default order. If you need to programmatically use the resulting CSV, specifying the order is obviously very important. – not2qubit Mar 31 '21 at 10:45