6

I have a text that contains characters such as "\xaf" and "\xbe", which, as I understand it from this question, are ASCII encoded characters.

I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there a better way, e.g., with the codecs standard library?

Sample 200 characters here.

Jindřich Mynarz
  • Your sample doesn't include any `\xaf` or the like. Do you have any samples with such characters? – dkarp Jan 19 '11 at 16:15
  • Your sample data *is* valid UTF-8. With the "record separator" and "unit separator" control characters. – dan04 Jan 20 '11 at 02:05
  • According to `enca` (http://linux.die.net/man/1/enca) it is UTF-8 "surrounded by/intermixed with non-text data". – Jindřich Mynarz Jan 21 '11 at 09:10

3 Answers

3

.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).
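
For instance, this is exactly what produces the error from the question (Python 2):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xaf \xbe'.encode('utf-8')  # implicitly does .decode('ascii') first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaf in position 0: ordinal not in range(128)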

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
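
In 3.x, where bytes objects are never implicitly decoded, the same conversion has to spell out both steps:

>>> b'\xaf \xbe'.decode('iso-8859-1').encode('utf-8')
b'\xc2\xaf \xc2\xbe'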
dan04
3

Your file is already a UTF-8 encoded file.

# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp = codecs.open("/tmp/encoding-sample", "r", "utf-8")
data = fp.read()
fp.close()

# list every distinct character along with its Unicode name
chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE

tzot
  • Thanks, you're right: the short sample I've provided is UTF-8. However (unfortunately), in the whole file there are parts encoded in various other encodings (mostly windows-1250). I have solved this by `try`ing to `"string".decode()` with the most common encodings and, if everything failed, guessing the encoding with the `chardet` library (sketched below). – Jindřich Mynarz Feb 15 '11 at 06:00
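
A minimal sketch of that fallback approach (the list of encodings to try is an assumption based on the comment, and `chardet` must be installed):

import chardet

def to_unicode(raw):
    # try the most common encodings first (assumed list, per the comment above)
    for enc in ('utf-8', 'windows-1250'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    # if everything failed, guess the encoding with chardet
    guess = chardet.detect(raw)['encoding']
    return raw.decode(guess, 'replace')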
2

It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.
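
A quick Python 2 sketch to see which bytes fall outside the ASCII range (reusing the sample path from the answer above):

# show the distinct non-ASCII bytes present in the raw data
data = open('/tmp/encoding-sample', 'rb').read()
print sorted(set(hex(ord(c)) for c in data if ord(c) > 127))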

Could you provide an actual string sample? Then we can probably guess the current encoding.

Tim Pietzcker
  • That sample doesn't look like an encoded text to me, more like a proprietary format. – Tim Pietzcker Jan 19 '11 at 15:01
  • It should be in the MARC format (http://www.loc.gov/marc/). When I tried to detect its encoding with `enca` I got response saying that it's mostly UTF-8 interspersed with non-text characters. – Jindřich Mynarz Jan 19 '11 at 15:12
  • So it definitely is not a text format/encoding. This is not a problem you can solve with a correct encoding; you need a library that can read this "database". Something [like this](http://www.oss4lib.org/taxonomy/term/67) perhaps. – Tim Pietzcker Jan 19 '11 at 15:16
  • Yes, I'm already using the `pymarc` library to parse the file. The problem is that it can't parse it correctly because of these characters (\xaf...). So I'm trying to repair the file before passing it to the parser (see the pymarc sketch below). – Jindřich Mynarz Jan 19 '11 at 15:24
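
For context, a minimal sketch of such a pymarc pipeline (the file name is hypothetical, and the `to_unicode`/`force_utf8` flags may differ between pymarc versions; check yours):

from pymarc import MARCReader

fh = open('records.mrc', 'rb')  # hypothetical file name
# ask pymarc to decode records to Unicode, treating the data as UTF-8
reader = MARCReader(fh, to_unicode=True, force_utf8=True)
for record in reader:
    print record.title()
fh.close()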