
I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.

The documentation only shows the end user how to parse values from the encoded file, not how to write anything back to it. Even so, I have been able to create a corresponding write function (a write_int() for every documented read_int(), and so on) for every file type, except for the read_string() below.

I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.

I fully understand that this is a masking problem, and that the loop condition `while partial_length & 0x80 > 0:` is a simple bitwise test that keeps us in the loop only while the byte we examine has its high bit set (i.e., is 128 or greater). Where I begin to lose my head is in trying to assign or extract meaning from the body of that loop. I follow the mathematical machinery behind the operations, but I can't see why they would be doing things this way.

I have included the read_byte() function for context, as it is called in the read_string() function.

import struct

def read_byte(handle):
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = ord(struct.unpack("c", handle.read(1))[0])
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    result = result.decode("utf-8")
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    else:
        return result

Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string function?

I would greatly appreciate any information, insights (however obvious you may think they are), help, or pointers, even if it's just a link to a page that you think might prove useful.

Cheers!

bbpsword

1 Answer


It's just reading a length, which then tells it how many characters to read. (I don't get the check at the end but that's a different issue.)

In order to avoid a fixed length for the length, the length is divided into seven-bit units, which are sent low-order chunk first. Each seven-bit unit is sent in a single 8-bit byte with the high-order bit set, except the last unit which is sent as is. Thus, the reader knows when it gets to the end of the length, because it reads a byte whose high-order bit is 0 (in other words, a byte less than 0x80).
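
To make that concrete, here is a quick trace (a sketch, assuming the read_string from the question is in scope, with io.BytesIO standing in for the file handle):

import io

# 0xAC = 0b10101100: high bit set, so it contributes 0x2C as the low 7-bit chunk.
# 0x02: high bit clear, so it ends the length: 0x2C + (0x02 << 7) = 300.
# The next 300 bytes are the UTF-8 payload (here, 300 copies of "a").
encoded = bytes([0xAC, 0x02]) + b"a" * 300
print(read_string(io.BytesIO(encoded)))  # prints a 300-character string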

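As for the inverse: no information is lost, so a write_string just has to emit the byte length of the UTF-8 data in the same low-order-first seven-bit chunks, followed by the bytes themselves. A minimal sketch under that assumption (write_byte is a hypothetical helper mirroring read_byte):

import struct

def write_byte(handle, value):
    # Hypothetical counterpart to read_byte: pack one unsigned byte.
    handle.write(struct.pack("<B", value))

def write_string(handle, text):
    data = text.encode("utf-8")
    length = len(data)  # byte count, matching what read_string passes to handle.read()
    # Emit 7-bit chunks, low-order first; all but the last get the high bit set.
    while length >= 0x80:
        write_byte(handle, (length & 0x7F) | 0x80)
        length >>= 7
    write_byte(handle, length)  # final chunk, high bit clear, ends the length
    handle.write(data)

A 300-byte string, for instance, yields the length bytes 0xAC 0x02, exactly what the trace above decodes. Note that the length written is the UTF-8 byte count, not the character count, which is why the final check in read_string can only pass for pure-ASCII strings.
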
rici
  • I really don't get the check at the end either; it seems obvious to me that the length of the string being read would match the length of the encoded representation – bbpsword Feb 23 '22 at 05:15
  • @bbpsword: UTF-8 is a multibyte encoding, so I would expect the result of decoding `k` bytes of UTF-8 to be fewer than `k` characters, unless the entire sequence is ASCII. But maybe there's something I don't see. – rici Feb 23 '22 at 05:49
  • Maybe that's the point: it inherently limits all characters used to ASCII, since otherwise the length of the total_length segment won't match the length of the decoded string. – bbpsword Feb 23 '22 at 17:16