6

I have a text that contains characters such as "\xaf" and "\xbe", which, as I understand it from this question, are ASCII encoded characters.

I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there a better way, e.g., with the codecs standard library?

Sample 200 characters here.

Jindřich Mynarz
  • Your sample doesn't include any `\xaf` or the like. Do you have any samples with such characters? – dkarp Jan 19 '11 at 16:15
  • Your sample data *is* valid UTF-8. With the "record separator" and "unit separator" control characters. – dan04 Jan 20 '11 at 02:05
  • According to `enca` (http://linux.die.net/man/1/enca) it is UTF-8 "surrounded by/intermixed with non-text data". – Jindřich Mynarz Jan 21 '11 at 09:10

3 Answers

3

.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).
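
For instance, this is exactly what produces the error from the question (Python 2):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xaf \xbe'.encode('utf-8')  # implicitly does .decode('ascii') first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaf in position 0: ordinal not in range(128)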

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
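
In 3.x, where bytes objects are never implicitly decoded, the same conversion has to spell out both steps:

>>> b'\xaf \xbe'.decode('iso-8859-1').encode('utf-8')
b'\xc2\xaf \xc2\xbe'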
dan04
3

Your file is already a UTF-8 encoded file.

# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp = codecs.open("/tmp/encoding-sample", "r", "utf-8")
data = fp.read()
fp.close()

# list every distinct character along with its Unicode name
chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE

tzot
  • Thanks, you're right: the short sample I've provided is UTF-8. However (unfortunately), in the whole file there are parts encoded in various other encodings (mostly windows-1250). I have solved this by `try`ing to `"string".decode()` with the most common encodings and, if everything failed, guessing the encoding with the `chardet` library (sketched below). – Jindřich Mynarz Feb 15 '11 at 06:00
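
A minimal sketch of that fallback approach (the list of encodings to try is an assumption based on the comment, and `chardet` must be installed):

import chardet

def to_unicode(raw):
    # try the most common encodings first (assumed list, per the comment above)
    for enc in ('utf-8', 'windows-1250'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    # if everything failed, guess the encoding with chardet
    guess = chardet.detect(raw)['encoding']
    return raw.decode(guess, 'replace')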
2

It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.
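
A quick Python 2 sketch to see which bytes fall outside the ASCII range (reusing the sample path from the answer above):

# show the distinct non-ASCII bytes present in the raw data
data = open('/tmp/encoding-sample', 'rb').read()
print sorted(set(hex(ord(c)) for c in data if ord(c) > 127))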

Could you provide an actual string sample? Then we can probably guess the current encoding.

Tim Pietzcker
  • That sample doesn't look like an encoded text to me, more like a proprietary format. – Tim Pietzcker Jan 19 '11 at 15:01
  • It should be in the MARC format (http://www.loc.gov/marc/). When I tried to detect its encoding with `enca` I got response saying that it's mostly UTF-8 interspersed with non-text characters. – Jindřich Mynarz Jan 19 '11 at 15:12
  • So it definitely is not a text format/encoding. This is not a problem you can solve with a correct encoding; you need a library that can read this "database". Something [like this](http://www.oss4lib.org/taxonomy/term/67) perhaps. – Tim Pietzcker Jan 19 '11 at 15:16
  • Yes, I'm already using the `pymarc` library to parse the file. The problem is that it can't parse it correctly because of these characters (\xaf...). So I'm trying to repair the file before passing it to the parser (see the pymarc sketch below). – Jindřich Mynarz Jan 19 '11 at 15:24
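
For context, a minimal sketch of such a pymarc pipeline (the file name is hypothetical, and the `to_unicode`/`force_utf8` flags may differ between pymarc versions; check yours):

from pymarc import MARCReader

fh = open('records.mrc', 'rb')  # hypothetical file name
# ask pymarc to decode records to Unicode, treating the data as UTF-8
reader = MARCReader(fh, to_unicode=True, force_utf8=True)
for record in reader:
    print record.title()
fh.close()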