
(I am working in Python.)

Suppose I have this list of integers

a = [170, 140, 139, 180, 225, 200]

and I want to find the raw byte representation of the ASCII character each integer is mapped to. Since these are all greater than 127, they fall in the Extended ASCII set. I was originally using Python's chr() function to get the character and then encode() to get the raw byte representation.

a_bytes = [chr(decimal).encode() for decimal in a]

Using this method, I saw that for numbers greater than 127, the corresponding ASCII character is represented by 2 bytes.

[b'\xc2\xaa', b'\xc2\x8c', b'\xc2\x8b', b'\xc2\xb4', b'\xc3\xa1', b'\xc3\x88']

But when I used the bytes() constructor, it appears that each character is represented by one byte.

>>> a_bytes2 = bytes(a)
>>> a_bytes2
b'\xaa\x8c\x8b\xb4\xe1\xc8'

So why is it different when I use chr().encode() versus bytes()?

nullb12
  • If you don't specify an encoding, `encode()` will use UTF-8, which will use two bytes for these characters. Try `encode('latin-1')`. – snakecharmerb Oct 22 '21 at 15:31
  • Python doesn't use any of the several dozen "extended ASCII" character sets. It uses Unicode. The `chr()` function gives you the character corresponding to a Unicode codepoint. If you call `encode()` on a character with a codepoint above 127, you will get the default encoding, which is UTF-8. Codepoints in the range 128-255 can have a representation of more than one byte in UTF-8. If you want to see a one-byte representation, encode the codepoint using a charmap encoding such as Windows-1252, not UTF-8. – BoarGules Oct 22 '21 at 17:58
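A quick check of what the comments suggest: Latin-1 (ISO-8859-1) maps code points 0-255 directly to single bytes, so encoding with it reproduces exactly what `bytes()` produces.

```python
a = [170, 140, 139, 180, 225, 200]

# encode('latin-1') maps each code point 0-255 to the identical single byte
a_bytes = [chr(n).encode('latin-1') for n in a]
print(a_bytes)                        # [b'\xaa', b'\x8c', b'\x8b', b'\xb4', b'\xe1', b'\xc8']

# joined together, this is byte-for-byte what bytes(a) returns
print(b''.join(a_bytes) == bytes(a))  # True
```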

1 Answer


There is no such thing as "Extended ASCII". ASCII is defined as bytes (and code points) in the range 0-127. Most standard single-byte code pages (which are used to convert from bytes to code points) use ASCII for bytes 0-127 and then map 128-255 to whatever is convenient for the code page. Russian code pages map those bytes to Cyrillic code points for example.

In your example, .encode() defaults to UTF-8, a multi-byte encoding that maps code points 0-127 to single ASCII bytes and follows multibyte encoding rules for any code point of 128 and above. chr() simply converts an integer to the character at that Unicode code point; that mapping is fixed and independent of any encoding.
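To make the two-byte results concrete: UTF-8 encodes code points in the range 0x80-0x7FF as two bytes with the bit pattern `110xxxxx 10xxxxxx`. A small sketch building those bytes by hand for 170 (U+00AA) reproduces the `b'\xc2\xaa'` you saw:

```python
cp = 170                       # U+00AA, 'ª'

# two-byte UTF-8: 110xxxxx carries the top bits, 10xxxxxx the low six bits
b1 = 0xC0 | (cp >> 6)          # 0xC2
b2 = 0x80 | (cp & 0x3F)        # 0xAA
print(bytes([b1, b2]))         # b'\xc2\xaa'

# matches what str.encode() produces
print(chr(cp).encode('utf-8') == bytes([b1, b2]))  # True
```

This is why every value in your list picked up a 0xC2 or 0xC3 lead byte: those are the `110xxxxx` prefix bytes for code points 128-255.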

So you have to choose an appropriate encoding to see what a byte in that encoding represents as a character. As you can see below, it varies:

>>> a = [170, 140, 139, 180, 225, 200]
>>> ''.join(chr(x) for x in a)  # Unicode code points
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('latin1')   # ISO-8859-1, also matches first 256 Unicode code points.
'ª\x8c\x8b´áÈ'
>>> bytes(a).decode('cp1252')   # USA and Western Europe
'ªŒ‹´áÈ'
>>> bytes(a).decode('cp1251')   # Russian, Serbian, Bulgarian, ...
'ЄЊ‹ґбИ'
>>> bytes(a).decode('cp1250')   # Central and Eastern Europe
'ŞŚ‹´áČ'
>>> bytes(a).decode('ascii')  # these bytes aren't defined for ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 0: ordinal not in range(128)

Also, when displaying bytes, Python's default is to show printable ASCII characters as characters and everything else (unprintable control characters and values above 127) as hex escape codes:

>>> bytes([0,1,2,97,98,99,49,50,51,170,140,139,180,225,200])
b'\x00\x01\x02abc123\xaa\x8c\x8b\xb4\xe1\xc8'
Mark Tolonen