
I'm writing a client-side Python bytecode interpreter in JavaScript (specifically TypeScript) for a class project. Parsing the bytecode was going fine until I tried a negative number.

In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense: marshal writes an int that fits in 32 bits as the type byte 'i' followed by the value as a 4-byte little-endian two's-complement integer.

In my TypeScript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, rather than raw ArrayBuffers) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get 1, but for -1 I get something like -272777233.

I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:

var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233 
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2

I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.

Some more specific questions might be:

  • Why would buffer.readInt32LE work for positive ints but not negative?
  • Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
  • Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?
k8si
  • A really reduced version of the BrowserFS code works on negative values. Can you post the output of calling `readUInt8` four times to verify you're reading the expected sequence `FE FF FF FF` ? – Ryan Cavanaugh Oct 26 '14 at 18:50
  • That's a big part of the problem -- I don't get the expected sequence. Instead, for -2 I get: ef bf bd ef – k8si Oct 26 '14 at 18:55
  • Seems like you're just reading the wrong part of the stream then, or Python is not using the number format you think it is. I don't think there's any rational sequence of bitwise operations that would turn `EF BF BD EF` into `-2`. – Ryan Cavanaugh Oct 26 '14 at 18:59
  • The thing is, I'm testing with the compiled bytecode for `a = 2, b = -2` and so on for several numbers. The positive numbers are correct, as well as the whole "code object" structure in general. The only things that are not correct are the negative number values. So I don't know how I would change where I'm reading in the stream without messing everything else up. – k8si Oct 26 '14 at 19:04

1 Answer


The sequence ef bf bd is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which decoders substitute for input bytes that are not valid in the encoding being decoded.
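As a sanity check, the bytes quoted in the comments reproduce the exact number from the question. This is a minimal sketch using a standard DataView; the helper name `int32LE` is made up for illustration and is not part of BrowserFS.

```typescript
// Hypothetical helper: interpret four bytes as a little-endian signed 32-bit int.
function int32LE(bytes: number[]): number {
  const view = new DataView(new Uint8Array(bytes).buffer);
  return view.getInt32(0, true); // true = little-endian
}

// What marshal.dumps(-2) contains after the 'i' type byte.
const intended = int32LE([0xfe, 0xff, 0xff, 0xff]); // -2

// What was actually read: one U+FFFD (EF BF BD) plus the first byte of the next one.
const corrupted = int32LE([0xef, 0xbf, 0xbd, 0xef]); // -272777233

console.log(intended, corrupted);
```

So readInt32LE is behaving correctly; it is being handed the wrong bytes.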

It sounds like the data is accidentally being run through a UTF-8 decoder somewhere in whatever method you use to download it, which corrupts the raw byte stream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
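Here is a minimal sketch of keeping the data binary end to end, assuming the bytecode is fetched over HTTP. The names `BinaryResponse`, `readMarshalInt`, and `loadFirstInt` are hypothetical and not part of BrowserFS. The interface only mirrors the `arrayBuffer()` method of a `fetch` response. With XMLHttpRequest, the equivalent is setting `responseType` to `'arraybuffer'` before sending.

```typescript
// Minimal shape of a fetch() Response, declared locally so the sketch is self-contained.
interface BinaryResponse {
  arrayBuffer(): Promise<ArrayBuffer>;
}

const TYPE_INT = 0x69; // ASCII 'i', the type tag seen in the question's marshal output

// Read a marshalled 32-bit int whose type byte sits at `offset`.
function readMarshalInt(bytes: Uint8Array, offset: number): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint8(offset) !== TYPE_INT) {
    throw new Error("expected 'i' type tag at offset " + offset);
  }
  return view.getInt32(offset + 1, true); // little-endian, signed
}

// Ask for raw bytes; calling .text() instead is where UTF-8 decoding would creep in.
async function loadFirstInt(res: BinaryResponse): Promise<number> {
  return readMarshalInt(new Uint8Array(await res.arrayBuffer()), 0);
}

// Bytes of marshal.dumps(-2), taken from the question.
const sample = new Uint8Array([0x69, 0xfe, 0xff, 0xff, 0xff]);
console.log(readMarshalInt(sample, 0)); // -2
```

Something like `fetch(url).then(loadFirstInt)` would be the intended usage, where `url` is wherever the compiled bytecode is served from.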

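This only affected negative values because the bytes of your small positive values all fall in the ASCII range (0x00 to 0x7F), which UTF-8 maps 1:1 to the original bytes. The bytes 0xFE and 0xFF in a negative number are never valid in UTF-8, so each one gets replaced. A positive value containing a byte of 0x80 or above (128, for example) would be at risk in the same way.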
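The effect can be reproduced directly. The sketch below assumes an environment with the standard `TextDecoder` and `TextEncoder` globals (modern browsers and Node.js); `roundTripThroughUtf8` and `toHex` are hypothetical helper names. It simulates binary data being decoded as UTF-8 text and then turned back into bytes.

```typescript
// Hypothetical helper: decode bytes as UTF-8 text, then re-encode the text as UTF-8.
function roundTripThroughUtf8(bytes: number[]): number[] {
  const text = new TextDecoder("utf-8").decode(new Uint8Array(bytes));
  return Array.from(new TextEncoder().encode(text));
}

// Format bytes as two-digit lowercase hex, separated by spaces.
function toHex(bytes: number[]): string {
  return bytes.map((b) => ("0" + b.toString(16)).slice(-2)).join(" ");
}

// Payload bytes of marshal.dumps(2) and marshal.dumps(-2), without the 'i' tag.
const positive = toHex(roundTripThroughUtf8([0x02, 0x00, 0x00, 0x00]));
const negative = toHex(roundTripThroughUtf8([0xfe, 0xff, 0xff, 0xff]));

console.log(positive); // 02 00 00 00
console.log(negative); // ef bf bd ef bf bd ef bf bd ef bf bd
```

The first four bytes of the second result are `ef bf bd ef`, which is the sequence reported in the comments.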

Ryan Cavanaugh
  • Wow, I definitely should have checked this first, before the above rigmarole. Thank you. – k8si Oct 26 '14 at 19:36
  • If you happen to post your code on GitHub or elsewhere, can you drop a comment here? I'm on the TypeScript team and we're always on the lookout for larger/interesting codebases to use for analysis and regression testing. – Ryan Cavanaugh Oct 27 '14 at 03:40
  • I'll ask my professor if I can make the repository public and let you know – k8si Oct 28 '14 at 04:05