
I'm having a problem with the JSON output of Scrapy. The crawler works fine, and the CLI output is correct. The XML item exporter also works without a problem: output is saved with the correct encoding and the text is not escaped.

  • Tried using pipelines and saving the items directly from there.
  • Tried using feed exporters and JSONEncoder from the json library.

These didn't work, as my data includes sub-branches.

Unicode text in json output file is escaped like this: "\u00d6\u011fretmen S\u00fcleyman Yurtta\u015f Cad."

But in the XML output file it is written correctly: "Öğretmen Süleyman Yurttaş Cad."
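The difference between the two outputs can be reproduced with the standard json module alone, which escapes non-ASCII characters by default (ensure_ascii=True):

```python
import json

addr = u"Öğretmen Süleyman Yurttaş Cad."

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(addr)

# With ensure_ascii=False the text is written as-is.
readable = json.dumps(addr, ensure_ascii=False)

print(escaped)   # "\u00d6\u011fretmen S\u00fcleyman Yurtta\u015f Cad."
print(readable)  # "Öğretmen Süleyman Yurttaş Cad."
```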

I even changed the Scrapy source code to pass ensure_ascii=False to ScrapyJSONEncoder, but to no avail.

So, is there any way to force ScrapyJSONEncoder not to escape Unicode while writing to the file?
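For reference, here is a hedged sketch of the pipeline route: bypass the feed exporter entirely and write the JSON yourself with ensure_ascii=False. The class name and output path are illustrative only, not Scrapy API; nested sub-branches are serialized fine by json.dumps as long as each item converts to a dict:

```python
# Sketch of an item pipeline that writes JSON Lines itself, so that
# ensure_ascii=False actually takes effect. JsonUnicodePipeline and
# 'output.json' are hypothetical names, not part of Scrapy.
import io
import json

class JsonUnicodePipeline(object):
    def open_spider(self, spider):
        self.file = io.open('output.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for Scrapy Items and plain dicts alike;
        # nested structures are serialized recursively by json.dumps.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + u'\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Such a pipeline would be enabled via ITEM_PIPELINES in settings.py; on Python 2, io.open provides the same encoding-aware file object that codecs.open does.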

Edit 1: By the way, I'm using Python 2.7.6, as Scrapy does not support Python 3.x.

This is a standard Scrapy crawler: a spider file, a settings file and an items file. First the page list is crawled starting from the base URL, then the content is scraped from those pages. Data pulled from each page is assigned to the fields defined in items.py of the Scrapy project, encoded in UTF-8. There's no problem with that, as everything works fine with the XML output. The JSON output is generated with this command:

scrapy crawl --nolog --output=output.json -t json spidername

XML output works without a problem with this command:

scrapy crawl --nolog --output=output.xml -t xml spidername

I have tried editing scrapy/contrib/exporter/__init__.py and scrapy/utils/serialize.py to insert the ensure_ascii=False parameter into json.JSONEncoder.

Edit2:

Tried debugging again. There's no problem up to the Python2.7/json/encoder.py code: data is intact and not escaped. After that, it gets hard to debug, as Scrapy works asynchronously and there are lots of callbacks.

Edit3:

A bit of a dirty hack, but after editing Python2.7.6/lib/json/encoder.py and changing the ensure_ascii parameter to False, the problem seems to be solved.

Ozcan
  • it would help to include some of your code, as well as the version of python you are using – nthall Jun 19 '15 at 23:56
  • did you try encoding before inserting to db ?? – Jithin Jun 20 '15 at 03:00
  • 3
    Setting `ensure_ascii=False` in the arguments to `ScrapyJSONEncoder` should work fine, post more about what you're doing? – bobince Jun 20 '15 at 09:03
  • @Jithin Data is not saved into a db. Output is just a plain json document. – Ozcan Jun 20 '15 at 10:35
  • @Jithin I'm not using a db. It is a plain text file with json data in it. It is not related to encoding of the text file. The problem is data is escaped. vi, nano, Sublime Text, etc. all show it the same way, as data saved is escaped. – Ozcan Jun 20 '15 at 12:48
  • @Jithin The content is the same. It has to be, as it's just escaping the Unicode characters. What you are doing there is unescaping it with the print command. – Ozcan Jun 20 '15 at 13:47
  • @bobince ensure_ascii=False worked when I've edited the core json library of the python distribution. It should have worked from scrapy libraries but I'm not sure why it hasn't worked that way. When I have more time, I'll try to debug it and find the real cause. – Ozcan Jun 21 '15 at 00:28
  • I think this is the same question: http://stackoverflow.com/questions/9181214/scrapy-text-encoding/41346276#41346276 – mPrinC Apr 17 '17 at 20:37

2 Answers


As I don't have your code to test, can you try using codecs?

import codecs
f = codecs.open('yourfilename', 'your_mode', 'utf-8')
f.write('whatever you want to write')
f.close()

Dev Pandu
  • As I wrote before, it wasn't about the encoding of the variables. There was a problem with the file output. I've solved it with a hackish method: somehow the parameter was not being passed to the underlying json library. – Ozcan Jun 26 '15 at 00:05
  • @OzcanEsnaf how'd you fix it? i'm experiencing a similar issue with the output of Scrapy JSON files. – oldboy Jun 06 '18 at 15:56
  • A lot of time has passed since that project. As written in Edit 3, there was a problem with Scrapy not passing the ensure_ascii parameter to the JSON encoder. I modified the Python json library as a temporary fix. It saved the day. – Ozcan Aug 25 '18 at 18:27

Add two parameters to your settings.py, as described in the documentation:

FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'
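The same settings can also be passed on the command line without touching settings.py. Note that FEED_EXPORT_ENCODING was added in Scrapy 1.2; earlier versions ignore it:

```shell
# -o infers the JSON format from the extension; -s overrides a setting
scrapy crawl spidername --nolog -o output.json -s FEED_EXPORT_ENCODING=utf-8
```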