
I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.

The documentation only shows the end user how to parse values from the encoded file, not how to write anything back to it. Even so, I have been able to create a corresponding write function (a write_int() for every documented read_int(), and so on) for every file type, except for the read_string() below.

I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.

I fully understand that this is a masking problem, and that the loop condition `while partial_length & 0x80 > 0:` is a simple bitwise test that keeps us in the loop only while the byte we examine has its high bit set (i.e., is 128 or greater). Where I begin to lose my head is in trying to assign or extract meaning from the body of that loop. I follow the mathematical machinery behind the operations, but I can't see why they would be doing things this way.

I have included the read_byte() function for context, as it is called in the read_string() function.

import struct

def read_byte(handle):
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = ord(struct.unpack("c", handle.read(1))[0])
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    result = result.decode("utf-8")
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    else:
        return result

Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string function?

I would greatly appreciate any information, insights (however obvious you may think they are), help, or pointers, even if it's just a link to a page that you think might prove useful.

Cheers!

bbpsword

1 Answer


It's just reading a length, which then tells it how many characters to read. (I don't get the check at the end but that's a different issue.)

In order to avoid a fixed length for the length, the length is divided into seven-bit units, which are sent low-order chunk first. Each seven-bit unit is sent in a single 8-bit byte with the high-order bit set, except the last unit which is sent as is. Thus, the reader knows when it gets to the end of the length, because it reads a byte whose high-order bit is 0 (in other words, a byte less than 0x80).
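
To make that concrete, here is a quick trace (a sketch, assuming the read_string from the question is in scope, with io.BytesIO standing in for the file handle):

import io

# 0xAC = 0b10101100: high bit set, so it contributes 0x2C as the low 7-bit chunk.
# 0x02: high bit clear, so it ends the length: 0x2C + (0x02 << 7) = 300.
# The next 300 bytes are the UTF-8 payload (here, 300 copies of "a").
encoded = bytes([0xAC, 0x02]) + b"a" * 300
print(read_string(io.BytesIO(encoded)))  # prints a 300-character string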

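As for the inverse: no information is lost, so a write_string just has to emit the byte length of the UTF-8 data in the same low-order-first seven-bit chunks, followed by the bytes themselves. A minimal sketch under that assumption (write_byte is a hypothetical helper mirroring read_byte):

import struct

def write_byte(handle, value):
    # Hypothetical counterpart to read_byte: pack one unsigned byte.
    handle.write(struct.pack("<B", value))

def write_string(handle, text):
    data = text.encode("utf-8")
    length = len(data)  # byte count, matching what read_string passes to handle.read()
    # Emit 7-bit chunks, low-order first; all but the last get the high bit set.
    while length >= 0x80:
        write_byte(handle, (length & 0x7F) | 0x80)
        length >>= 7
    write_byte(handle, length)  # final chunk, high bit clear, ends the length
    handle.write(data)

A 300-byte string, for instance, yields the length bytes 0xAC 0x02, exactly what the trace above decodes. Note that the length written is the UTF-8 byte count, not the character count, which is why the final check in read_string can only pass for pure-ASCII strings.
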
rici
  • I really don't get the check at the end either; it seems obvious to me that the length of the string being read would match the length of the encoded representation – bbpsword Feb 23 '22 at 05:15
  • @bbpsword: UTF-8 is a multibyte encoding, so I would expect the result of decoding `k` bytes of UTF-8 to be fewer than `k` characters, unless the entire sequence is ASCII. But maybe there's something I don't see. – rici Feb 23 '22 at 05:49
  • Maybe that's the point: it inherently limits all characters used to ASCII, since otherwise the length of the total_length segment won't match the length of the decoded string. – bbpsword Feb 23 '22 at 17:16