I need to scrape text data from sites in languages other than English (mostly Eastern European languages) using Scrapy. When Scrapy finishes, the scraped data needs to be converted to JSON for further use.
The thing is, if I just scrape the text like this:
i['title'] = response.xpath('//home/title//text()').extract_first()
without encoding it, Scrapy throws something like this:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 103: character maps to <undefined>
On the other hand, if I do encode it and then pass the result to json.dumps(), I get a TypeError, since json can't serialize bytes. I've seen this explanation (How to encode bytes in JSON? json.dumps() throwing a TypeError), but it's of little use, since I need to use utf-8 or utf-16, not ascii.
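For reference, here is a minimal sketch that reproduces both failures outside Scrapy. The sample title is made up, and I am assuming the 'charmap' codec in the traceback is a Windows code page such as cp1252; that is a guess, not something the error message confirms.

```python
import json

# Hypothetical sample value standing in for a scraped title.
title = "Novak \u0110okovi\u0107"

# Failure 1: the text is pushed through a legacy single-byte code page
# (cp1252 is an assumption here) that has no mapping for these characters.
try:
    title.encode("cp1252")
    charmap_error = None
except UnicodeEncodeError as exc:
    charmap_error = exc

# Failure 2: encoding to UTF-8 first produces bytes, which json.dumps()
# refuses to serialize.
try:
    json.dumps({"title": title.encode("utf-8")})
    type_error = None
except TypeError as exc:
    type_error = exc

print(repr(charmap_error))
print(repr(type_error))
```

Both branches end up in the except clause, which matches what I see with the real spider.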
Any idea how to solve this?