
Here is my spider:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')),'parse_start_url',follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are unicode and I want to encode them to utf-8. I don't know the best way to do this; I have tried several approaches without success.

Thank you in advance!

mindcast

8 Answers


Since Scrapy 1.2.0, a new setting, FEED_EXPORT_ENCODING, has been available. By setting it to utf-8, the JSON output will not be escaped.

That is, add this to your settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
Lacek

Scrapy returns strings as unicode, not ascii. To encode all the strings to utf-8, you can write:

vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]
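As a standalone sketch of what that list comprehension does (the Greek strings here are invented for illustration):

```python
# -*- coding: utf-8 -*-
# Invented sample values of the kind hxs.select(...).extract() returns.
names = [u'Γιατρός', u'Κορδελιό']

# Encode every unicode string to UTF-8 bytes, as in the comprehension above.
encoded = [s.encode('utf-8') for s in names]

# Decoding restores the original unicode strings; each Greek letter is 2 bytes.
assert [b.decode('utf-8') for b in encoded] == names
assert len(encoded[0]) == 14  # 7 characters x 2 bytes
```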

But I suspect you expect a different result: your code returns a single item containing all the search results. To yield one item per result:

hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit

Update

The JSON exporter writes unicode symbols escaped (e.g. \u03a4) by default, because not all streams can handle unicode. There is an option to write them unescaped, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
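The difference between the two modes can be seen directly with json.dumps (the item value here is invented for the example):

```python
import json

item = {'eponimia': u'Γιατρός'}

# Default: every non-ASCII character is escaped, safe for any byte stream.
print(json.dumps(item))
# {"eponimia": "\u0393\u03b9\u03b1\u03c4\u03c1\u03cc\u03c2"}

# ensure_ascii=False: characters are written as-is, readable in a text editor.
print(json.dumps(item, ensure_ascii=False))
# {"eponimia": "Γιατρός"}
```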

So if you want the exported items to be written in utf-8 encoding, e.g. so you can read them in a text editor, you can write a custom item pipeline.

pipelines.py:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters readable in the output
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close_spider (not spider_closed) is the hook Scrapy calls on pipelines
        self.file.close()

Don't forget to add this pipeline to settings.py:

ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']

You can customize the pipeline to write data in a more human-readable format, e.g. generate a formatted report. JsonWithEncodingPipeline is just a basic example.
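Note that in newer Scrapy versions ITEM_PIPELINES is a dict mapping the pipeline's path to an order value, so the registration would look like this instead (module path taken from the answer above):

```python
# settings.py: newer Scrapy versions expect a dict, lower numbers run first
ITEM_PIPELINES = {
    'vrisko.pipelines.JsonWithEncodingPipeline': 300,
}
```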

reclosedev
  • I did what you've written but I still get the same results: unicode characters. The only way to get utf-8 is to use print vrisko['eponimia'] instead of yield or return. – mindcast Feb 08 '12 at 16:07
  • @mindcast, where do you see this? What do you do with the items (saving to a JSON feed, a CSV feed, or maybe a custom pipeline)? – reclosedev Feb 08 '12 at 17:03
  • scrapy crawl vrisko -o scraped_data.json -t json, or even scrapy crawl vrisko and see the results on my screen. I know I'm missing something but I can't figure it out. Thank you for your effort. – mindcast Feb 08 '12 at 19:24
  • I get this error: "line = json.dump(dict(item), ensure_ascii=False) exceptions.TypeError: dump() takes at least 2 arguments (2 given)" – mindcast Feb 09 '12 at 15:16
  • I think I found 2 mistakes, so I finally have my JSON file with utf-8 characters. First, it should be json.dumps, and second, codecs.open needs 'w' to have permission to write. Thank you very much. :) – mindcast Feb 09 '12 at 15:37
  • After the utf-8 characters in the file, the unicode characters are there too. How can I get rid of them? – mindcast Feb 09 '12 at 15:45
  • @mindcast, it looks like the data is written twice: by the custom pipeline and by the standard JSON exporter. Are you using the `scrapy crawl vrisko` command? There is no need to use the `-o` option. – reclosedev Feb 09 '12 at 16:01
  • Oh, thank you, great! I didn't know this. Thanks for opening my eyes. :) Can you suggest any book/tutorial for Scrapy? I think the documentation is poor. – mindcast Feb 09 '12 at 16:04
  • @mindcast, I think the documentation is very good. For spider examples check http://snippets.scrapy.org/ – reclosedev Feb 09 '12 at 16:12
  • One last question: how do I avoid duplicated eponimia/address labels in my JSON file for each crawled page? Is there a way to group them all together per label? – mindcast Feb 09 '12 at 16:19
  • @mindcast, there is an [example in the documentation](http://readthedocs.org/docs/scrapy/en/0.14/topics/item-pipeline.html#item-pipeline-example-with-resources-per-spider). – reclosedev Feb 09 '12 at 16:29

Try adding the following line to the config file for Scrapy (i.e. settings.py):

FEED_EXPORT_ENCODING = 'utf-8'
FreeCat

I had a lot of problems due to encoding with Python and Scrapy. To be sure to avoid encoding/decoding problems, the best thing to do is to write:

response.body.decode(response.encoding).encode('utf-8')
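A self-contained sketch of that round trip; ISO-8859-7 is an assumed stand-in for whatever response.encoding detects:

```python
# -*- coding: utf-8 -*-
# Simulate a response body served in Greek ISO-8859-7 (assumption for the example).
body = u'Γιατρός'.encode('iso-8859-7')

# Decode with the detected encoding, then re-encode as UTF-8,
# mirroring response.body.decode(response.encoding).encode('utf-8').
utf8_body = body.decode('iso-8859-7').encode('utf-8')

assert utf8_body.decode('utf-8') == u'Γιατρός'
```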
mikeulkeul

You should add the statement FEED_EXPORT_ENCODING = 'utf-8' to the settings.py file of your Scrapy project.

Niklaus

I found a simple way to do this. It saves the JSON data to '<spider name>.json' in utf-8:

from scrapy.exporters import JsonItemExporter

class JsonWithEncodingPipeline(object):

    def open_spider(self, spider):
        # the spider is not available in __init__, so open the file here
        self.file = open(spider.name + '.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

Now I can pass this setting as a command-line parameter:

scrapy runspider blah.py -o myjayson.json -s FEED_EXPORT_ENCODING=utf-8
dejjub-AIS

As mentioned earlier, the JSON exporter writes unicode symbols escaped, and it has an option to write them unescaped: ensure_ascii=False.

To export items in utf-8 encoding you can add this to your project's settings.py file:

from scrapy.exporters import JsonLinesItemExporter
class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'yourproject.settings.MyJsonLinesItemExporter',
    'jl': 'yourproject.settings.MyJsonLinesItemExporter',
}

Then run:

scrapy crawl spider_name -o output.jl
banzayats