
The following question arose because I was trying to use bytes strings as dictionary keys and bytes values that I understood to be equal weren't being treated as equal.

Why doesn't the following Python code compare equal - aren't these two equivalent representations of the same binary data (the example is knowingly chosen to avoid endianness)?

b'0b11111111' == b'0xff'

I know the following evaluates true, demonstrating the equivalence:

int(b'0b11111111', 2) == int(b'0xff', 16)

But why does Python force me to know the representation? Is it related to endianness? Is there some easy way to force these to compare equivalent other than converting them all to, e.g., hexadecimal literals? Is there a transparent and clear method to move between all representations in a (somewhat) platform independent way (or am I asking too much)?

Say I want to actually index a dictionary using 8 bits in the form b'0b11111111', then why does Python expand it to ten bytes and how do I prevent that?

This is a small piece of a large tree data structure, and expanding each index from 8 bits to 80 bits seems like a huge waste of memory.

Peter Mortensen
Matthew Hemke

3 Answers


Bytes can represent any number of things. Python cannot and will not guess at what your bytes might encode.

For example, int(b'0b11111111', 34) is also a valid interpretation, but that interpretation is not equal to hex FF.

The number of interpretations, in fact, is endless. The bytes could represent a series of ASCII codepoints, or image colors, or musical notes.
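As a rough illustration, the same ten bytes give different results depending on which interpretation is applied. This is only a sketch; the variable names are made up for the example.

```python
raw = b'0b11111111'

# Interpreted as a base-2 integer literal (the '0b' prefix is accepted for base 2).
as_binary = int(raw, 2)

# Interpreted as base-34 digits, where 'b' is simply the digit 11.
as_base34 = int(raw, 34)

# Interpreted as ASCII text.
as_text = raw.decode('ascii')

# Interpreted as one big-endian unsigned integer built from the raw byte values.
as_big_int = int.from_bytes(raw, 'big')

print(as_binary)               # 255
print(as_text)                 # 0b11111111
print(as_binary == as_base34)  # False
```

None of these readings is more "correct" than the others; the bytes object itself carries no hint about which one was intended.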

Until you explicitly apply an interpretation, a bytes object is just a sequence of values in the range 0-255. Its textual representation shows a byte as an ASCII character wherever that byte is printable ASCII:

>>> list(bytes(b'0b11111111'))
[48, 98, 49, 49, 49, 49, 49, 49, 49, 49]
>>> list(bytes(b'0xff'))
[48, 120, 102, 102]

Those byte sequences are not equal.

If you want to interpret these sequences explicitly as integer literals, then use ast.literal_eval() to interpret decoded text values; always normalise first before comparison:

>>> import ast
>>> ast.literal_eval(b'0b11111111'.decode('utf8'))
255
>>> ast.literal_eval(b'0xff'.decode('utf8'))
255
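For the dictionary use case in the question, one possible approach is to normalise every key to an int before it touches the dictionary. The sketch below assumes the incoming bytes always hold ASCII text of a Python integer literal; the helper name normalise and the table dictionary are hypothetical.

```python
import ast


def normalise(literal: bytes) -> int:
    """Decode bytes holding an integer literal (e.g. b'0xff') and return the int."""
    value = ast.literal_eval(literal.decode('ascii'))
    # literal_eval also accepts strings, lists, booleans, etc.; reject those.
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError(f'not an integer literal: {literal!r}')
    return value


table = {}
table[normalise(b'0b11111111')] = 'leaf'

# Any spelling of the same number now reaches the same entry.
print(table[normalise(b'0xff')])  # leaf
print(table[normalise(b'255')])   # leaf
```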
Martijn Pieters
  • But doesn't the `0b` indicate that the `bytes` literal is intended to be a binary representation regardless of how you interpret it? – Matthew Hemke Jul 19 '14 at 16:58
  • @MatthewHemke: It just means you have a byte value 48 followed by a byte value 98. These *happen* to be interpretable as the ASCII characters `0` and `b`. – Martijn Pieters Jul 19 '14 at 17:01
  • If that's the case, how do I make it so that the bytes string is exactly the 1 byte that I mean? – Matthew Hemke Jul 19 '14 at 17:04

b'0b11111111' consists of 10 bytes:

In [44]: list(b'0b11111111')
Out[44]: [48, 98, 49, 49, 49, 49, 49, 49, 49, 49]

whereas b'0xff' consists of 4 bytes:

In [45]: list(b'0xff')
Out[45]: [48, 120, 102, 102]

Clearly, the two byte sequences are not equal.

Python values explicitness. (Explicit is better than implicit.) It does not assume that b'0b11111111' is necessarily the binary representation of an integer. It's just a string of bytes. How you choose to interpret it must be explicitly stated.

unutbu
  • The actual byte strings yes, but the data they represent is the same right? Or is it that the byte strings themselves become the value when they are interpreted by e.g. an `int('', base)` evaluation. – Matthew Hemke Jul 19 '14 at 17:01
  • Yes, the byte string is not the same as an integer value. The `int` function converts the bytes to an `int` (and the base must be specified). – unutbu Jul 19 '14 at 17:03

It seems that what you were trying to do is get a byte string representing the value 0b11111111 (or 255). This is not what b'0b11111111' does: that literal is a byte string holding the ASCII encoding of the ten-character text '0b11111111'.

What you want would be written as b'\xff'. You can check that it is actually one byte: len(b'\xff') == 1.
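A short sketch of how this applies to the dictionary keys from the question: integer literals written in any base denote the same int, so building a one-byte bytes object from them gives equal keys. The variable names are made up for the example.

```python
# 0b11111111 and 0xff are the same int (255), so each line below
# builds the same one-byte bytes object.
key_from_binary = bytes([0b11111111])
key_from_hex = bytes([0xff])
key_escape = b'\xff'

assert key_from_binary == key_from_hex == key_escape
print(len(key_from_binary))  # 1

# Equal bytes objects hash equally, so they address the same dictionary entry.
tree = {key_from_binary: 'child node'}
print(tree[b'\xff'])  # child node
```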

To convert a Python int to a binary representation, you can use the ctypes library. You need to choose one of the C integer types, e.g.:

>>> import ctypes
>>> bytes(ctypes.c_ubyte(255))
b'\xff'

>>> bytes(ctypes.c_ubyte(0xff))
b'\xff'

>>> bytes(ctypes.c_long(255))
b'\xff\x00\x00\x00\x00\x00\x00\x00'

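Note: c_uint8 is an alias for c_ubyte (an 8-bit unsigned C integer). The size of c_long depends on the platform: the 8-byte output above comes from a platform where a C long is 64 bits, and the byte order shown is the platform's native (little-endian) order. Use c_int64 when you need exactly 64 bits.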

To convert back:

>>> ctypes.c_ubyte.from_buffer_copy(b'\xff').value
255

Be careful about overflow; ctypes does no range checking, so out-of-range values silently wrap:

>>> ctypes.c_ubyte(256)
c_ubyte(0)
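If platform independence matters, one alternative is the built-in int.to_bytes and int.from_bytes methods, which take the length and byte order as explicit arguments instead of inheriting them from C types. A minimal sketch:

```python
value = 255

# Length and byte order are stated explicitly, so the result is the same everywhere.
one_byte = value.to_bytes(1, 'big')
eight_le = value.to_bytes(8, 'little')
print(one_byte)  # b'\xff'
print(eight_le)  # b'\xff\x00\x00\x00\x00\x00\x00\x00'

# Round trip back to an int.
restored = int.from_bytes(one_byte, 'big')
print(restored)  # 255

# Unlike ctypes.c_ubyte(256), which wraps to 0, an out-of-range value raises.
try:
    (256).to_bytes(1, 'big')
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```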
ondra.cifka