☺:

>>> bytes('☺','ibm437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u263a' in position 0: character maps to <undefined>

As opposed to é, which works:

>>> bytes('é','ibm437')
b'\x82'

I expected ☺ to encode to b'\x01'. How can I make that happen?

An image of Code Page 437.

Fredrick Brennan
  • Does `u'☺'.encode('ibm437')` work in Python 3? (It does in Python 2.) – Cameron Jan 27 '13 at 22:45
  • @Cameron: that is what the OP is doing.. – Martijn Pieters Jan 27 '13 at 22:46
  • Hmm. Seems like `'☺'` is interpreted as `'\x01'` in Python 2 and `'\u263a'` in Python 3. – Cameron Jan 27 '13 at 22:52
  • The IBM 437 codepage codepoints 1-31 are normally control codes (they map one-on-one to ASCII in many cases); only in a video context does that map to the smiley. – Martijn Pieters Jan 27 '13 at 22:53
  • @Cameron For me it maps to `b'\x01'` both in Python2 and Python3. :( – Fredrick Brennan Jan 27 '13 at 22:54
  • @Cameron: `b'\x01'.decode('ibm437')` maps to `'\x01'` in Python 3 too. – Martijn Pieters Jan 27 '13 at 22:54
  • @MartijnPieters Some more context might help: I have many files in this encoding that I'd like to "update" to Unicode so they can be more easily read on modern machines. I was hoping `\x01` would become ☺, but it does not. Is there an alternate way to convert these files (preferably using Python?) – Fredrick Brennan Jan 27 '13 at 22:56
  • From the WP page: *Implementers of translation to Unicode should note that these codes do not have a unique, single Unicode equivalent and the correct choice depends upon context*. I don't think you can provide that context for Python. – Martijn Pieters Jan 27 '13 at 22:56
  • @Martijn: Yes, but the actual character literal `'☺'` evaluates to `'\x01'` in my Python 2 interpreter, but produces the above error (about `\u263a`) in a Python 3 interpreter. Possibly a different default source code encoding somewhere... – Cameron Jan 27 '13 at 23:00
  • @Cameron What version of Python are you using, and on what OS? `u'☺'.encode('ibm437')` raises `UnicodeEncodeError` for me on Python 2.7.3 on Arch Linux. – Fredrick Brennan Jan 27 '13 at 23:02
  • @Cameron: are you perhaps running Python on Windows? – Martijn Pieters Jan 27 '13 at 23:08
  • Try recoding your file using the [`recode` commandline utility](http://recode.progiciels-bpi.ca/); it has an IBM437 codec too. – Martijn Pieters Jan 27 '13 at 23:09
  • The alternative is to use [`str.translate()`](http://docs.python.org/3/library/stdtypes.html#str.translate) to map each codepoint to a Unicode code point. The Wikipedia article does have a full table for those control codes interpreted in a graphical context including Unicode codepoints. – Martijn Pieters Jan 27 '13 at 23:16
  • @Martijn: Ah, yes I am running on Windows (and the Python 3 prompt I used was online, probably on Linux). That's probably why :-) – Cameron Jan 27 '13 at 23:36

1 Answer

IBM-437 is somewhat special: it is not only a code page (i.e. it defines what byte values 128-255 mean), it also assigns visible glyphs to the ASCII control characters (0x01-0x1f and 0x7f), but only in some contexts, such as text drawn on screen. Python's codec maps those bytes to the control characters, not to the glyphs they were displayed as.
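
A quick check of that behaviour, as a minimal sketch (the sample bytes are arbitrary, and `cp437` is an alias Python accepts for the same codec as `ibm437`):

```python
# Decode three CP437 bytes: a control code, a line feed, and a high-half byte.
data = bytes([0x01, 0x0a, 0x82])
text = data.decode('cp437')

# The two low bytes keep their code points; only 0x82 is remapped (to 'é').
print([hex(ord(ch)) for ch in text])  # ['0x1', '0xa', '0xe9']
```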

To convert, you can use the following helper function:

# Glyphs that code page 437 displays for bytes 0x01-0x1f and 0x7f.
# Note that this also covers tab (0x09), line feed (0x0a) and carriage return (0x0d).
IBM437_VISIBLE = {
    0x01: "\u263A", 0x02: "\u263B", 0x03: "\u2665", 0x04: "\u2666",
    0x05: "\u2663", 0x06: "\u2660", 0x07: "\u2022", 0x08: "\u25D8",
    0x09: "\u25CB", 0x0a: "\u25D9", 0x0b: "\u2642", 0x0c: "\u2640",
    0x0d: "\u266A", 0x0e: "\u266B", 0x0f: "\u263C", 0x10: "\u25BA",
    0x11: "\u25C4", 0x12: "\u2195", 0x13: "\u203C", 0x14: "\u00B6",
    0x15: "\u00A7", 0x16: "\u25AC", 0x17: "\u21A8", 0x18: "\u2191",
    0x19: "\u2193", 0x1a: "\u2192", 0x1b: "\u2190", 0x1c: "\u221F",
    0x1d: "\u2194", 0x1e: "\u25B2", 0x1f: "\u25BC", 0x7f: "\u2302",
}

def ibm437_visible(byt):
    """Decode CP437 bytes, showing control codes as their visible glyphs."""
    return byt.decode('ibm437').translate(IBM437_VISIBLE)

assert ibm437_visible(b'\x01') == '☺'
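
For the direction the question actually asks about (`'☺'` to `b'\x01'`), the same mapping can be inverted before encoding. This is only a sketch: `encode_ibm437_visible`, `GLYPHS` and `GLYPH_TO_BYTE` are names made up here, not part of Python's codec machinery, and the glyph string simply restates the table above in byte order.

```python
# Glyphs for bytes 0x01-0x1f in byte order (same mapping as the table above).
GLYPHS = (
    "\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8"
    "\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba"
    "\u25c4\u2195\u203c\u00b6\u00a7\u25ac\u21a8\u2191"
    "\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc"
)

# Map each glyph's code point back to its byte value; 0x7f (house) is added separately.
GLYPH_TO_BYTE = {ord(glyph): code for code, glyph in enumerate(GLYPHS, start=1)}
GLYPH_TO_BYTE[0x2302] = 0x7f

def encode_ibm437_visible(text):
    """Hypothetical helper: encode text to CP437, turning glyphs into control bytes."""
    # Replace the glyphs with control code points, then let the stock codec do the rest.
    return text.translate(GLYPH_TO_BYTE).encode('ibm437')

assert encode_ibm437_visible('\u263a') == b'\x01'
```

Keep in mind that the round trip is ambiguous: a real line feed and `◙` both encode to `b'\x0a'`, so when converting text files you may want to leave 0x09, 0x0a and 0x0d out of the decode table.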
phihag
  • Thanks for the translate table. I have a similar issue to the OP's: I want to transcode from 437 to UTF-8 in Java, and it seems Java doesn't translate 0x01 to `☺` either. – LiuYan 刘研 Aug 06 '14 at 04:39
  • It appears that `decode` maps the byte values 0x00 through 0x1f directly to the identical code point. What I find fascinating is that about half of those code points are very similar to the correct symbols, even if they aren't exact. – Mark Ransom Oct 03 '21 at 06:35