Python read() automatically converts hex to char?

Question

I'm trying to convert a 4x4, 5.6.5.0.0, .bmp file into a list of rgb values to plug into another program that needs a specific format, and I'm getting stuck because I think the read() method in Python is converting some of the data before I can use it, even when I open it in "rb" mode.

For example, when I use:

f = open("imgFile.bmp", "rb")
imgData=f.read()
f.close()

print imgData

I get:

BMh\x00\x00\x00\x00\x00\x00\x006\x00\x00\x00(\x00\x00\x00\x04\x00\x00\x00\xfc\xff\xff\xff\x01\x00\x18\x00\x00\x00\x00\x002\x00\x00\x00\x12\x0b\x00\x00\x12\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xcc\xbb\xaa\xff\xee\xdd\x00\x00\x00\xff\xff\xff\xdd\xcc\xbb\x00\x00\x00\xff\xff\xff\x00\x00\x00\x00\x00\x00\xff\xff\xff\x00\x00\x00\xff\xff\xff\xff\xff\xff\x00\x00\x00\xff\xff\xff3"\x11\x00\x00

Which is fine for the most part (I can grab the hex values I need after the bmp header; those values start at "\xcc\xbb\xaa . . ."). But it looks like some hex values are being interpreted as other characters and symbols, which at best makes them harder to translate and at worst creates ambiguity that makes it impossible to recover the original data with certainty.

For instance, you'll find this sequence near the end of the string:

\xff3"\x11

which should appear as:

\xff\x33\x22\x11

(This table shows that '33' can be interpreted as '3', '22' as '"', and I'm certain that it should be that way—see how the data appears in the text editor below).

Now, it would be easy to translate all the symbols back into the hex format if there were no ambiguities, but there are many possibilities in more complex files. For instance, if I have the sequence '6666' it would just be changed into 'ff', which I would be unable to tell apart from instances of 'ff' that I might already have in my data.

My question is: how do I keep the data untranslated and unambiguous for further parsing and formatting in Python?

To confirm that what I've described is happening, I've opened the file in SublimeText, where it appears as this:

424d 6800 0000 0000 0000 3600 0000 2800 0000 0400 0000 fcff ffff 0100 1800 0000 0000 3200 0000 120b 0000 120b 0000 0000 0000 0000 0000 ccbb aaff eedd 0000 00ff ffff ddcc bb00 0000 ffff ff00 0000 0000 00ff ffff 0000 00ff ffff ffff ff00 0000 ffff ff33 2211 0000

This is correct and usable (though having to open a text editor every time is not efficient for my purposes), so I would like to automate the process with Python.

Incidentally, I think this may be what was happening for this person, too.

You *never* used `print imgData`, btw. You used `imgData` in the python interpreter prompt, perhaps, or `print repr(imgData)`. If you did do `print imgData` you would have gotten the raw data instead. — Martijn Pieters, Feb 23 '13 at 14:30
you could use [pillow](https://pypi.python.org/pypi/Pillow/) to read a bmp image, [example](http://stackoverflow.com/a/8678604/4279) — jfs, Feb 23 '13 at 14:54
@MartijnPieters whoops—that's right, re: using imgData in the python interpreter prompt and not print imgData. — user2102427, Feb 23 '13 at 21:43
@J.F.Sebastian Thanks—though I ended up just going with the regular PIL library, loading the image and getting the pixels by their x,y position. — user2102427, Feb 23 '13 at 21:57

Martijn Pieters · Accepted Answer · 2013-02-23T14:34:49.817

Python shows you a literal string value, and uses escape codes to prevent your terminal from going haywire. Anything that is not a printable ASCII character is shown as an escape code instead.

The value itself is still fully binary.

>>> '\x00'
'\x00'
>>> len('\x00')
1
>>> '\x65'
'e'

In the above example, the null byte is displayed as a \x00 escape code, but it is still only one byte (length 1). A byte with hex value 65 is displayed as an e because it is a printable ASCII character.
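If you want a view of the data where every byte is written the same way, one option is to convert it yourself rather than rely on the interpreter's display. The following is a minimal sketch, assuming the standard-library `binascii` module; `tail` is a hypothetical name holding the last eight bytes from the question's output, typed in by hand here instead of being read from the file.

```python
import binascii

# Last eight bytes of the question's file, as read() in "rb" mode returns them.
# The 3 and " are ordinary bytes (0x33 and 0x22) that happen to be printable.
tail = b'\xff\xff\xff3"\x11\x00\x00'

# The length counts raw bytes, not displayed characters.
print(len(tail))

# hexlify writes every byte as exactly two hex digits, printable or not.
hex_view = binascii.hexlify(tail).decode('ascii')
print(hex_view)

# bytearray exposes each byte as an integer (works in Python 2 and 3).
values = list(bytearray(tail))
print(values)

# The '6666' worry from the question: two 0x66 bytes display as 'ff',
# but they remain distinct from a single 0xff byte.
two_f_chars = binascii.hexlify(b'ff')
one_ff_byte = binascii.hexlify(b'\xff')
print(two_f_chars != one_ff_byte)
```

The stored data is the same in every case; only the way it is displayed changes, so nothing has to be translated back.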
