
I need to scrape text data from sites in languages other than English (mostly Eastern European languages), using Scrapy. When Scrapy finishes, it needs to convert the scraped data to JSON for further use.

The thing is, if I just scrape the text like this:

i['title'] = response.xpath('//home/title//text()').extract_first()

without encoding it, Scrapy throws something like this:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 103: character maps to <undefined>

On the other hand, if I do encode it and then try to process the result with json.dumps(), I get a TypeError, since json can't serialize bytes. I've seen this explanation (How to encode bytes in JSON? json.dumps() throwing a TypeError), but it's of little use, since I need to use utf-8 or utf-16, not ascii.

Any idea how to solve this?

D_rock
  • Posting your entire log output would be much more useful than a single line of a traceback. – stranac Dec 08 '18 at 15:14
  • The issue might be that the data is not being written to the file as UTF-8. Without seeing your code I can only speculate, but if that is the case you need to open the file with an explicit encoding: with open('filename', 'w', encoding='utf-8') as f: – RedCrusador Dec 11 '18 at 10:38
  • @D_rock please share the site you are trying to scrape, or a piece of the troublesome HTML. – eLRuLL Dec 22 '18 at 13:39
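
RedCrusador's suggestion can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the item contents and the items.json filename are made up, and it assumes the scraped text is kept as str rather than encoded to bytes before serialization.

```python
import json
import os
import tempfile

# Hypothetical scraped item; "ć" is U+0107, the character named in the traceback.
item = {"title": "Kuća na osami"}

# json handles str directly; only bytes values raise TypeError.
escaped = json.dumps(item)                       # ASCII-only output using \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)  # keeps the non-ASCII characters as-is

# Write with an explicit encoding instead of relying on the platform default
# (a legacy code page on many Windows setups, hence the 'charmap' codec error).
path = os.path.join(tempfile.mkdtemp(), "items.json")  # hypothetical output file
with open(path, "w", encoding="utf-8") as f:
    json.dump(item, f, ensure_ascii=False)

# Reading it back with the same encoding restores the original text.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)
```

If the JSON is produced by Scrapy's own feed exports rather than by hand, the FEED_EXPORT_ENCODING setting (for example 'utf-8') plays a similar role; whether that applies here depends on code the question does not show.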

1 Answer


Have you taken a look at the response headers? Which encoding do they declare? I can imagine that the declared encoding differs from the one the page actually uses.

Python's decode function has an errors parameter ('strict', 'replace', 'ignore') which you can use to debug and find the problem.
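
As a rough illustration of that parameter, here is a small sketch. The byte string is a made-up stand-in for a page body, and the "wrongly declared" ASCII encoding is only an assumption for the demo, not something known about the site in the question.

```python
# Hypothetical page body: UTF-8 bytes containing "ć" (U+0107).
raw = "Kuća".encode("utf-8")

# Decoding with the codec the bytes were really written in round-trips cleanly.
good = raw.decode("utf-8")

# Suppose the headers wrongly declared ASCII. The default errors='strict' raises.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for undecodable bytes, making the bad spots visible.
replaced = raw.decode("ascii", errors="replace")

# 'ignore' silently drops undecodable bytes, which loses data.
ignored = raw.decode("ascii", errors="ignore")
```

Comparing the 'replace' output against the page as seen in a browser shows where the declared and actual encodings disagree.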

Sorry, this is more a comment than an answer, but I can't comment yet (not enough rep).

Raphael