I have lists of Unicode strings:

words
[u'\xd1', u'\xd0\xb0', u'\xd0\xb8', u'\u043e', u'\xd1\x81', u'-', u'\xd0\xb2', u'\u0438', u'\u0441', u'\xd0\xb8\xd1', u'\xd1\x83', u'\u0432', u'\u043a', u'\xd0\xba', u'\xd0\xbf\xd0\xbe', u'|', u'search', u'\xd0\xbd\xd0\xbe', u'25', u'in', u'\xd0\xbd\xd0\xb0', u'\u043d\u0430', u'\xd0\xbd\xd0\xb5', u'\xd0\xbe\xd0\xb1', u'\xd0\xbe\xd1\x82', u'\u043f\u043e', u'google', u'\xd0\x92', u'---', u'##']
[u'\u043e', u'\u0438', u'-', u'\u0441', u'\u0432', u'\u043a', u'\u0430', u'ebay', u'\u043d\u0430', u'\u0443', u'\u0442\u043e', u'"', u'33', u'**', u'ebay.', u'\u043f\u043e', u'jeans', u'at', u'\u0442\u043e\u0432\u0430\u0440', u'\u0434\u0436\u0438\u043d\u0441\u044b', u'\u0442\u043e\u0432\u0430\u0440\u043e\u0432', u'\u041a\u043e\u043b\u043b\u0435\u043a\u0446\u0438\u044f', u'\u043d\u0430\u0437\u0432\u0430\u043d\u0430', u'\u043e\u0442', u'tan', u'\u0432\u044b', u'altanbataev0', u'32', u'\u043d\u043e', u'&']
[u'\u043e', u'/', u'\u0430', u'-', u'\u0438', u'\u0441', u'\u0432', u'\u043a', u'\u0443', u'\u044f', u'\u043d\u043e', u'\u043f\u043e', u'\u0442\u043e', u'\u043d\u0430', u'\u043e\u0442', u'!', u'\u043d\u0435', u'"', u'\u043d\u0438', u'\u043a\u043e', u'\u0442\u0435\u0441\u0442', u'\u0437\u0430', u'\u043e\u043d']

I tried `[x.encode('latin-1') for x in lst]`, but it raises:

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u043e' in position 0: ordinal not in range(256)

I also tried cp1252 and utf8, but they raise an error as well.

martineau
Petr Petrov
  • You can't encode unicode using cp1252 or latin-1 but utf-8 should be OK, and according to my tests, it is actually OK. On my machine `print([x.encode('utf-8') for x in lst])` worked for each list. – Tryph Oct 12 '16 at 12:24
  • @Tryph but how can I next convert it to `latin-1`? – Petr Petrov Oct 12 '16 at 13:24
  • According to your question title, I assume those lists contain Cyrillic characters. The Latin-1 encoding has no Cyrillic characters, so you will not be able to encode Cyrillic text with it. – Tryph Oct 12 '16 at 14:21
  • @Tryph I need to get russian text – Petr Petrov Oct 12 '16 at 14:24
  • you can try cp1251: https://en.wikipedia.org/wiki/Windows-1251. – Tryph Oct 12 '16 at 14:29
  • `print ([[xx for xx in word] for word in words]);` works in _Python 3.5.1_ /windows 8.1/. – JosefZ Oct 12 '16 at 15:40
  • @JosefZ It might ”work” but doesn't make much sense. – BlackJack Oct 13 '16 at 14:04
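
The comments above come down to which characters each codec can represent. As a minimal sketch (the helper name `can_encode` is hypothetical and not from the thread), the following checks one Cyrillic word from the question's lists against the codecs that were mentioned:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # lets the same code run on Python 2.7 and 3


def can_encode(text, codec):
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Cyrillic sample taken from the second list in the question.
word = u'\u0442\u043e\u0432\u0430\u0440'

for codec in ('latin-1', 'cp1252', 'cp1251', 'utf-8'):
    # Prints the codec name followed by True or False.
    print(codec, can_encode(word, codec))
```

Under these assumptions, `latin-1` and `cp1252` report `False`, while `cp1251` and `utf-8` report `True`, which matches what Tryph describes.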

1 Answer

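You already have Russian text (at least some of it); you just need to print the individual strings, not the list, in an IDE or terminal that supports Russian characters. Printing a list shows the `repr` of each element, which is why you see escape sequences instead of letters. Here's an excerpt, printed with Python 2.7 on a UTF-8 terminal: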

L = [u'\u0442\u043e\u0432\u0430\u0440', u'\u0434\u0436\u0438\u043d\u0441\u044b']

print L

for s in L:
    print s

Output:

[u'\u0442\u043e\u0432\u0430\u0440', u'\u0434\u0436\u0438\u043d\u0441\u044b']
товар
джинсы
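
The "at least some of it" caveat matters for the first list: strings such as `u'\xd0\xb0'` look like UTF-8 bytes that were decoded as Latin-1 somewhere upstream. That is an assumption about how the data was produced, not something the question confirms. Under that assumption, a minimal sketch of a repair step (the helper name `repair` is hypothetical) reverses the mix-up and leaves everything else untouched:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # lets the same code run on Python 2.7 and 3


def repair(s):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return `s` unchanged if that does not apply."""
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes,
        # which are then decoded as the UTF-8 they were assumed to be.
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string is already proper Cyrillic (not encodable as Latin-1)
        # or the bytes are a truncated UTF-8 sequence that cannot be recovered.
        return s


# A few entries taken from the first list in the question.
sample = [u'\xd0\xb0', u'\u043e', u'\xd0\xbf\xd0\xbe', u'search', u'\xd1']
fixed = [repair(s) for s in sample]

if __name__ == '__main__':
    for s in fixed:
        print(s)  # print each string, not the list, as in the answer above
```

Here `u'\xd0\xb0'` becomes `u'\u0430'` and `u'\xd0\xbf\xd0\xbe'` becomes `u'\u043f\u043e'`, while `u'\u043e'` and `u'search'` pass through unchanged. A lone `u'\xd1'` is half of a two-byte sequence, so it is left as-is; the missing byte cannot be reconstructed.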
Mark Tolonen