How can I parse a binary file that seems to use a mix of Unicode Code Points and Hex values for strings?

Question

I have a binary file with text in it. I need to grab the hex bytes of the string, and convert them to readable text. I'm using Python 3.

The encoding appears to be UTF-8, but I've been having some trouble decoding some specific strings. You see, some strings appear to have unicode code points to represent characters, where others use their hex values corresponding to their entry in the UTF-8 character table. Here's an example:

4D 6F 6A 65 20 53 70 6F 72 65 20 76 FD 74 76 6F 72 79 -> Moje Spore výtvory

The FD byte in that string represents the ý character. However, FD only corresponds to this character if we look up its Unicode code point (U+00FD) in the character table.

To represent this character in UTF-8 hex, you need two bytes (C3 BD). This wouldn't be a problem if all strings behaved like this, but some others actually use the hex values to represent characters:

4C 45 47 4F C2 AE -> LEGO®

In this example, the two bytes C2 AE represent the ® character. However, this is the character's UTF-8 hex representation (C2 AE), not its Unicode code point (U+00AE).

Now here's the problem. I have no way to tell when a string is using Unicode Code Points and when it's using Hex values, and I need this to be parsed perfectly. Any ideas as to why this might be the case? Python crashes if I try to decode this with UTF-8, as when it reaches a value like FD, it doesn't know what to do. I tried to decode byte by byte with the ord() and chr() functions, but while this prevents crashing, it makes multi-byte characters have extra stuff that doesn't belong (for instance, the LEGO example would look like this: LEGOÂ®). The data needs to be perfectly parsed because it must be used to produce a checksum, so even invisible stuff would change the outcome. Thanks.

No idea, if you want more specifics, the file is "appinfo.vdf". It's from Steam (the videogame storefront). Located in the "appcache" folder inside the installation directory. It has a bunch of dictionaries, and some of the keys have strings, these are the strings I'm trying to decode. I have myself verified that the strings are indeed being correctly detected and are not grabbing garbage, so I'm not parsing them wrong, there just seems to be something funky going on. — tralph3, Jan 17 '21 at 23:40

score 1 · Accepted Answer · answered Jan 18 '21 at 01:40

The first string (with FD) is not UTF-8-encoded. It is likely ISO-8859-1 or Windows-1252. The byte representing ý happens to match the Unicode code point value, but it is not using "[U]nicode code points to represent characters".

The LEGO string is UTF-8-encoded. If you are hacking strings from files and don't have a specification, you just have to guess. UTF-8 has to follow specific rules for its multi-byte encoding, so decoding is likely to fail if you try UTF-8 first and the data isn't UTF-8. You could then fall back to ISO-8859-1. The latter will decode any byte sequence, even if the data isn't actually in that encoding, so you may end up with garbage.

Example for UTF-8 encoding:

>>> s='Moje Spore výtvory'.encode('utf8')
>>> s
b'Moje Spore v\xc3\xbdtvory'
>>> s.hex()
'4d6f6a652053706f72652076c3bd74766f7279'
>>> s.decode('utf8')
'Moje Spore výtvory'
>>> s.decode('iso-8859-1')  # decodes without error, but the result is garbage
'Moje Spore vÃ½tvory'

If the string is ISO-8859-1-encoded:

>>> s='Moje Spore výtvory'.encode('latin1') # an alias for ISO-8859-1
>>> s.hex()
'4d6f6a652053706f72652076fd74766f7279'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 12: invalid start byte
>>> s.decode('latin1')
'Moje Spore výtvory'
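
The try-UTF-8-then-fall-back idea described above could be sketched roughly as follows. This is only a minimal illustration, not a statement about how the file format actually works: `decode_with_fallback` is a hypothetical helper name, and the choice of ISO-8859-1 (rather than Windows-1252) as the fallback is an assumption. It also returns the name of the codec that succeeded, so the text can later be re-encoded to the same bytes.

```python
from typing import Tuple


def decode_with_fallback(raw: bytes) -> Tuple[str, str]:
    """Return (text, codec_name). Hypothetical helper, not part of any library."""
    try:
        # Strict UTF-8 decoding rejects invalid sequences such as a lone 0xFD.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this never fails.
        # Assumption: the non-UTF-8 strings really are ISO-8859-1.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# The two hex dumps from the question.
samples = [
    "4C 45 47 4F C2 AE",
    "4D 6F 6A 65 20 53 70 6F 72 65 20 76 FD 74 76 6F 72 79",
]
for hex_dump in samples:
    raw = bytes.fromhex(hex_dump)  # spaces between byte pairs are allowed
    text, codec = decode_with_fallback(raw)
    print(codec, text)
    # Re-encoding with the codec that succeeded gives back the original bytes.
    assert text.encode(codec) == raw
```

With the two samples from the question, the LEGO string decodes as UTF-8 and the "Moje Spore" string falls back to ISO-8859-1. Keeping track of the codec matters if the text is later turned back into bytes: re-encoding an ISO-8859-1 string as UTF-8 would produce different bytes than the ones in the file.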

I see... so you suggest trying UTF-8 and falling back on ISO-8859-1. While this would prevent crashing, I'll have to see if it lets me generate valid checksums. I'll make a quick test and come back. — tralph3, Jan 18 '21 at 01:56
@tralph3 If you're doing checksums, that is generally done on bytes, so there is no need to decode. Just do the checksum on the bytes. — Mark Tolonen, Jan 18 '21 at 06:54
Yeah, the thing is I need to parse the dictionaries, convert them to the VDF format, and then encode that into bytes to get a checksum. — tralph3, Jan 18 '21 at 10:05
Well, although this didn't solve my general problem, it does seem to be the correct solution for decoding the strings, as now everything looks as it should. I still can't produce correct checksums, so the problem must be somewhere else, but for the scope of this question, this solution is fine. — tralph3, Jan 18 '21 at 20:38
