python: unicode problem

Question

I am trying to decode a string I read from a file:

file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding "ignore" or "replace" does not really help:

In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'

In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
Traceback (most recent call last)

/Users/oleg/ in ()

/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

In [71]:

My answer works without the error. But it depends on whether you want to ignore or replace the undecodable characters. – orlp, Jan 19 '11 at 13:21

Sven Marnach · Accepted Answer · 2011-01-19T13:30:58.933

20

This looks like UTF-16 data. So try

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try to decode the whole file at once, that is

data = open(...).read()
data.decode("utf-16")

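The problem is that a line break in little-endian UTF-16 is the two bytes "\n\x00", but readlines() on a file opened in byte mode splits right after the "\n", leaving the "\x00" byte at the start of the next line. Every following line is then shifted by one byte and no longer decodes correctly.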
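
To see the misalignment concretely, here is a minimal sketch. It is written for Python 3 (the question uses Python 2, where a plain str plays the role of bytes), and the two-line sample text is made up for illustration, modelled on the header and data shown in the question. It is not the asker's real file.

```python
import io

# Hypothetical two-line sample, not the asker's real file.
text = "Keyword\tCompetition\ngarden lamp post\t0.94\n"

# BOM (FF FE) followed by little-endian UTF-16, like the data in the question.
raw = b"\xff\xfe" + text.encode("utf-16-le")

# A byte-oriented readlines() splits after every 0x0A byte, so the 0x00 that
# completes each UTF-16 newline ends up at the start of the following line.
byte_lines = io.BytesIO(raw).readlines()
for line in byte_lines:
    print(len(line), repr(line[:6]))

# The second "line" is shifted by one byte, so decoding pairs the bytes wrongly.
shifted = byte_lines[1].decode("utf-16-le")
print(repr(shifted[:3]))

# Decoding the whole byte string first and splitting afterwards keeps the
# code units aligned. The "utf-16" codec consumes the BOM.
good_lines = raw.decode("utf-16").splitlines()
print(good_lines)
```

The shifted line starts with \u6700\u6100\u7200, the same pattern as data[2] in the question, which suggests that output is simply "garden ..." read one byte out of step.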

edited Jan 19 '11 at 13:30

answered Jan 19 '11 at 13:10

Sven Marnach


Strange, it fails for the next line: – Oleg Tarasenko Jan 19 '11 at 13:16

tzot · Answer 2 · 2011-02-13T14:15:08.820

11

This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
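
A similar approach, sketched here under the assumption of Python 2.6 or later (the io module does not exist on 2.5), is io.open with an encoding argument. On Python 3 this is the same function as the built-in open. The helper name read_utf16_lines and the file name sample.csv are made up for the example.

```python
import io
import os
import tempfile


def read_utf16_lines(path):
    """Hypothetical helper: return the lines of a UTF-16 file as text."""
    # Text mode with an explicit encoding decodes first and splits second,
    # so line boundaries fall on whole characters, not on single bytes.
    with io.open(path, "r", encoding="utf-16") as fp:
        return fp.readlines()


# Build a small sample file (BOM + little-endian UTF-16) to read back.
sample = u"Keyword\tCompetition\ngarden lamp post\t0.94\n"
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv")
with open(path, "wb") as out:
    out.write(b"\xff\xfe" + sample.encode("utf-16-le"))

lines = read_utf16_lines(path)
print(lines)

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Because the "utf-16" codec reads the byte order from the BOM and then drops it, the first line comes back as plain "Keyword..." with no leading U+FEFF.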

edited Feb 13 '11 at 14:15

answered Feb 13 '11 at 11:42

tzot


-1 balderdash. >>> raw = '\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00' >>> raw.decode('utf_16le') u'\ufeffKeyword' >>> raw.decode('utf_16') u'Keyword' >>> – John Machin Feb 13 '11 at 12:17

orlp · Answer 3 · 2011-01-19T13:20:19.427

3

EDIT

Since you posted 2.7 this is the 2.7 solution:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

Ignoring undecodable characters:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]
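
For comparison, the following sketch (Python 3, with a made-up sample and not the asker's file) shows what the two error handlers do when the lines have already been split at the byte level. It pins "utf-16-le" for the second line because, as far as I know, plain "utf-16" without a BOM falls back to the machine's native byte order, which would make the output platform-dependent.

```python
import io

# Hypothetical sample: BOM + little-endian UTF-16, as in the question.
text = "Keyword\tCompetition\ngarden lamp post\t0.94\n"
raw = b"\xff\xfe" + text.encode("utf-16-le")

# Emulate iterating over a file opened in byte mode.
byte_lines = list(io.BytesIO(raw))
first, second = byte_lines[0], byte_lines[1]

# First line: the byte count is odd because of the stray 0x0A at the end.
# "ignore" drops that byte, "replace" turns it into U+FFFD.
first_ignored = first.decode("utf-16", "ignore")
first_replaced = first.decode("utf-16", "replace")
print(repr(first_ignored))
print(repr(first_replaced))

# Second line: an even number of bytes that all form valid code units, so
# no error handler is triggered, yet every character is wrong.
second_strict = second.decode("utf-16-le")
second_replaced = second.decode("utf-16-le", "replace")
print(second_strict == second_replaced)
print(repr(second_replaced[:6]))
```

In this sketch the handlers only hide the symptom on the first line. From the second line on there is nothing to ignore or replace, which would explain why the data "do not seem to be changed".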

edited Jan 19 '11 at 13:20

answered Jan 19 '11 at 13:08

orlp


In [21]: file = open ("./Downloads/lamp-post.csv", 'r') In [22]: data = [line.decode() for line in file] --------------------------------------------------------------------------- Traceback (most recent call last) /Users/oleg/ in () : 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128) In [23]: data = [line.decode() for line in file] – Oleg Tarasenko Jan 19 '11 at 13:10
Ohh, do you want to ignore those invalid characters or replace them? Edited my answer assuming replacement. – orlp Jan 19 '11 at 13:11
In Python 3, files are opened in unicode mode by default. So they will not have a decode method. – Thomas K Jan 19 '11 at 13:13

I undid the downvote. But there's still a better way in Python 3: use the encoding argument for open. `open("Downloads/lamp-post.csv", encoding="utf-16")`. – Thomas K Jan 19 '11 at 13:17
Strange, the data does not seem to have changed... e.g. I still see the same array of UTF-16 escapes when I evaluate data – Oleg Tarasenko Jan 19 '11 at 13:22
@Oleg: Are you sure it's 2.7? `/opt/local/lib/python2.5/` ? – Thomas K Jan 19 '11 at 13:26
See the update to the question... Ouch, it's 2.5: silver:~ oleg$ ipython Python 2.5.1 (r251:54863, Feb 22 2008, 16:52:17) Type "copyright", "credits" or "license" for more information. – Oleg Tarasenko Jan 19 '11 at 13:31
