I have a binary file with text in it. I need to grab the hex bytes of the string, and convert them to readable text. I'm using Python 3.
The encoding appears to be UTF-8, but I've been having trouble decoding some specific strings. Some strings seem to store each character as its raw Unicode code point, while others store the multi-byte UTF-8 encoding of the character. Here's an example:
4D 6F 6A 65 20 53 70 6F 72 65 20 76 FD 74 76 6F 72 79 -> Moje Spore výtvory
The FD byte in that string represents the ý character. However, FD only corresponds to ý if you look it up by its Unicode code point (U+00FD) in the character table.
To actually encode this character in UTF-8, you need two bytes (C3 BD). This wouldn't be a problem if all strings behaved like this, but some others really do use the UTF-8 byte sequences to represent characters, as seen here:
4C 45 47 4F C2 AE -> LEGO®
In this example, the two bytes C2 AE represent the ® character. That is the character's UTF-8 encoding, not its Unicode code point (which is U+00AE, a single byte).
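The same kind of snippet for this second string shows the opposite behavior: here the UTF-8 decode succeeds, because the bytes are a real UTF-8 sequence:

```python
data = bytes.fromhex("4C 45 47 4F C2 AE")

# C2 AE is the two-byte UTF-8 encoding of '®', so this works
print(data.decode("utf-8"))  # LEGO®

# The code point itself would fit in a single byte
print(hex(ord("®")))  # 0xae
```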
Now here's the problem: I have no way to tell when a string is using raw Unicode code points and when it's using UTF-8 byte sequences, and I need this to be parsed perfectly. Any ideas as to why this might be the case?

Decoding with UTF-8 raises a UnicodeDecodeError when it reaches a byte like FD, since it isn't valid UTF-8. I tried decoding byte by byte with the ord() and chr() functions, and while that avoids the exception, it mangles multi-byte characters (for instance, the LEGO example comes out as LEGOÂ®). The data needs to be parsed perfectly because it's used to produce a checksum, so even an invisible difference would change the outcome. Thanks.
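Here is the byte-by-byte approach I described, showing the mangling on the LEGO string:

```python
data = bytes.fromhex("4C 45 47 4F C2 AE")

# Treating each byte as its own code point splits the
# two-byte UTF-8 sequence C2 AE into two characters
naive = "".join(chr(b) for b in data)
print(naive)  # LEGOÂ®
```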