Python converting code page character number to unicode

Question

By default, print(chr(195)) displays the Unicode character at code point 195 ("Ã"). How do I print the character that byte value 195 represents in code page 1251, i.e. "Г"? I tried print(chr(195).decode('cp1252')) and various .encode methods.

Thanks to everyone's help, I now have my program to print code pages:

# Print selected Code Pages named at https://docs.python.org/3.6/library/codecs.html#standard-encodings
# Ian Tresman. 10 November 2018.

codepages=['cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856',
           'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875',
           'cp932', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
           'cp1256', 'cp1257', 'cp1258',  'latin_1', 'iso8859_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
           'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14',
           'iso8859_15', 'iso8859_16', 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland',
           'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154']

for codepage in codepages:                             #Select each code page in turn
    print(" "*12 + "Codepage: ", codepage)             #Indented code page name
    print("   | 0 1 2 3 4 5 6 7 8 9 A B C D E F")      #Code page columns, A=10, B=11 etc
    print("   " + "-"*33)                              #Horizontal rule
    for row in range(32,255,16):                       #For each row (ignore control characters < 32)
        print(f"{row:3}:",end= ' ')                    #Print row code
        for col in range(0,16):                        #For each column
            char=row+col                               #Calculate character number (similar to ascii code)
            try:                                       #Try to get a unicode equivalent of a specific byte value:
                unichar=bytes([char]).decode(codepage) #Fails with non-mappable characters, and some control characters
            except UnicodeDecodeError:                 #Byte has no mapping in this code page
                unichar=" "                            #If there was no unicode, use a space

            if not unichar.isprintable(): unichar=" "  #If the unicode is not printable, use a space
            print(unichar, end=' ')
        print()                       #End of row break
    print()                           #End of code page spacing
    input("")                         #Pause after each code page, press Enter to continue
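
The same table logic can also be packaged as a function that returns text instead of printing and pausing, which makes it easier to reuse or check. This is only a sketch of the program above; the name render_codepage is hypothetical and it assumes a single-byte code page name that Python's codecs module knows.

```python
def render_codepage(codepage):
    """Return the table for byte values 32..255 of one code page as a string."""
    lines = [
        " " * 12 + "Codepage: " + codepage,
        "   | 0 1 2 3 4 5 6 7 8 9 A B C D E F",
        "   " + "-" * 33,
    ]
    for row in range(32, 256, 16):          # skip control characters below 32
        cells = []
        for col in range(16):
            try:
                ch = bytes([row + col]).decode(codepage)
            except UnicodeDecodeError:      # byte is unmapped in this code page
                ch = " "
            if not ch.isprintable():        # e.g. no-break space, soft hyphen
                ch = " "
            cells.append(ch)
        lines.append(f"{row:3}: " + " ".join(cells))
    return "\n".join(lines)


table = render_codepage("cp1251")
print(table)
```

Row 192 of the cp1251 table starts with А Б В Г, so the character asked about sits in column 3 of that row.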

Thanks for everyone's help, I now have my program to print code pages: https://trinket.io/python3/f269e4371b — iantresman, Nov 10 '18 at 16:57
Nice! Visually confirmed this looks VERY similar to [The Wikipedia version for CP1252](https://en.wikipedia.org/wiki/Windows-1252). — Josiah Yoder, Jul 25 '23 at 16:10
Perhaps you make the code from trinket.io (which I pasted into your question) your own answer to this question? — Josiah Yoder, Jul 25 '23 at 16:14

score 2 · Answer 1 · answered Nov 04 '18 at 23:41

You cannot store a 'raw' value 0xC3 in a string (and if you did, you should not have: raw, unparsed binary data belongs in a bytes object). The proper way to convert from raw bytes is indeed .decode('cp1251'):

>>> print (b'\xc3'.decode('cp1251'))
Г

However, if you already have it in a string, the easiest way is to first convert the string to a bytes object using the one-to-one "encoding" Latin-1, which maps code points 0 to 255 to the same byte values:

>>> s = 'Ãamma'
>>> print(s.encode('latin1').decode('cp1251'))
Гamma
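
The Latin-1 step works because Latin-1 round-trips every byte value unchanged. A minimal sketch of that idea, assuming a hypothetical helper name reinterpret (not part of any library):

```python
def reinterpret(text, codepage):
    """Treat each character (code point 0..255) as a raw byte and decode it with codepage."""
    raw = text.encode("latin-1")   # raises UnicodeEncodeError for code points above 255
    return raw.decode(codepage)


# Latin-1 maps all 256 byte values to the code points with the same numbers.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

print(reinterpret("Ãamma", "cp1251"))    # Гamma
print(reinterpret(chr(195), "cp1251"))   # Г
```

If the string contains characters above code point 255, the encode step fails, which is usually a sign that the text was never raw byte data in the first place.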

score 2 · Answer 2 · answered Nov 05 '18 at 13:46

In Python 3, chr(n) returns a Unicode string, which can only be encoded. Use bytes to create byte strings that can be decoded:

>>> bytes([195])
b'\xc3'
>>> bytes([195]).decode('cp1251')
'Г'
>>> bytes([195,196,197])
b'\xc3\xc4\xc5'
>>> bytes([195,196,197]).decode('cp1251')
'ГДЕ'
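
Wrapped in a small function, the same call also shows why the attempt in the question could not work: a Python 3 str has no decode method, and byte 195 in cp1252 is "Ã" anyway. This is a sketch only; the name codepage_char and the "?" fallback are assumptions made for illustration.

```python
def codepage_char(n, codepage, fallback="?"):
    """Return the character that byte value n (0..255) represents in codepage."""
    try:
        return bytes([n]).decode(codepage)   # bytes([n]) raises ValueError outside 0..255
    except UnicodeDecodeError:               # byte is unmapped in this code page
        return fallback


print(chr(195))                       # Ã  (Unicode code point 195)
print(hasattr(chr(195), "decode"))    # False: str can only be encoded
print(codepage_char(195, "cp1251"))   # Г
print(codepage_char(195, "cp1252"))   # Ã
print(codepage_char(0x98, "cp1251"))  # ?  (0x98 is undefined in cp1251)
```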

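score 1 · Answer 3 · answered Nov 04 '18 at 23:33 (edited Nov 06 '18 at 19:05) · Abhishek Patel

You can use urllib to percent-encode the cp1251 bytes of a string s. In Python 3 the function lives in urllib.parse (in Python 2 it was urllib.quote_plus):

import urllib.parse
print(urllib.parse.quote_plus(s.encode('cp1251')))

Also remember, if you are using international strings in Python 2, make sure to include the u prefix in the string you are parsing. In Python 3 every str is already Unicode, so the prefix is optional.

s = u"whateverhere"

changed to remove downvote??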
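
For reference, here is what the urllib approach produces in Python 3. Note that it goes from text to percent-encoded code-page bytes, the opposite direction from the question, so this sketch also decodes the result back. The variable name encoded is chosen only for illustration.

```python
from urllib.parse import quote_plus, unquote_to_bytes

# Encode the character to its cp1251 byte, then percent-encode that byte.
encoded = quote_plus("Г".encode("cp1251"))
print(encoded)                                      # %C3

# Reverse the process: percent-decoding gives the raw byte, which cp1251 decodes.
print(unquote_to_bytes(encoded).decode("cp1251"))   # Г
```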

Sorry, didn't mean to downvote your answer, and it won't let me change it. – iantresman Nov 05 '18 at 09:46
reclick the downvote button to remove the vote, or click the upvote to change your vote @iantresman – Abhishek Patel Nov 05 '18 at 21:29
Yes I tried both of those. It indicates that if you edit your answer, I might be able to change the vote. – iantresman Nov 06 '18 at 07:43
@iantresman just edited, this is the first time I've heard of this though – Abhishek Patel Nov 06 '18 at 19:05
