Use Windows compression on a file from Python

Question

I've written a basic serialisation scheme (mainly just to see if I could get one working). Long story short, I had to remove what little compression it had so that I could read a huge file without running out of memory (I'm using seek and read together to load only the part I need at the time). As you can see below, what was originally a much shorter base64-encoded string is now a very long string of 1s and 0s.

input = (None, False, 'one', {2: [3, 4, 5]})

#encoded = 'WCVBFA29uZUggwEBDAQGMBAgwECg'
#not encoded = '01011000001001010100000100010100000011011011110110111001100101001110000000100011000000010000000100100100000011000110000000100000001100011000000010000001000001100000001000000101'

I had a look into it, but there doesn't seem to be a way to read the file as bits, so I'm kind of stuck with the current approach. I'm guessing there isn't an easy way to compress the file and still read it in small segments, but would I be able to specify at file creation that I want Windows itself to compress it?

As it's literally only storing the characters 0 and 1, even the default compression should do wonders for it; I just don't want to have to load it all into memory up front to decompress the entire thing.
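For reference, the right-click "compress" option corresponds to NTFS file compression, which is transparent to programs reading the file, so seek and read should keep working as before. A minimal sketch of switching it on from Python, assuming Python 3, an NTFS volume, and the built-in `compact` command-line tool being on the PATH; the helper names `compact_command` and `enable_ntfs_compression` are made up for illustration:

```python
import subprocess
import sys


def compact_command(path):
    """Build the command line for Windows' built-in `compact` tool.

    /C asks Windows to compress the given file.
    """
    return ['compact', '/C', path]


def enable_ntfs_compression(path):
    """Ask Windows to mark `path` as compressed (Windows and NTFS only)."""
    if sys.platform != 'win32':
        raise OSError('NTFS compression is only available on Windows')
    # check=True raises CalledProcessError if compact reports a failure.
    subprocess.run(compact_command(path), check=True)
```

The file would be created and written as usual, then passed to `enable_ntfs_compression`. How much space this saves on a file of `'0'` and `'1'` characters is untested here.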

I'm not quite sure what you're trying to do, but it's certainly possible to write and read binary files in Python. You can't do it bit by bit, but you can do it byte by byte if you want, though it's generally better to work with larger chunks. — PM 2Ring, Apr 16 '16 at 17:48
I did try it byte by byte, but with the way I did the code, it's not necessarily on multiples of 8, so I'd have to read a full byte to get the first 2 bits, then the same byte again with a few more to get the next 20 bits, etc, which ended up very slow even for `range(1000)`. When you right click on a file in windows and choose you want it to be compressed, that's what I'm hoping to do automatically :) — Peter, Apr 16 '16 at 20:35
FWIW, [here's](http://stackoverflow.com/a/31700898/4014959) some Python 2 code I posted last year that you may find helpful. It does bitwise manipulation of binary data. — PM 2Ring, Apr 17 '16 at 06:46
Thanks, not sure if you noticed but that's also my question haha. I've gotta say I'm not sure why I didn't reply (unless it got deleted for not being useful), but I think I kinda did your idea using basic Python instead of the bit operations, so I'll check out the complicated bit now :) — Peter, Apr 17 '16 at 12:24
Just as a general idea before I look into it, could I efficiently say I want bits 1 and 2 from byte 1 for the first variable, then like bits 3-8 from byte 1 and 1-4 from byte 2 for the second (basically first 2 bits then the next 9 bits for example)? — Peter, Apr 17 '16 at 12:27
No, I didn't notice that that was your question too. :) That code reads the entire file into memory, since it's usual practice to read and write image file bytes in one hit, and also because it's using PIL to read & write the image bytes, and PIL does that stuff in one hit. But if you really want to it's possible to create a class that "wraps" a binary file that lets you read bits from it in the form of strings (or write such strings to it). The class handles the messy details internally so the rest of your code can just work with strings like `'0110110'`, etc. — PM 2Ring, Apr 17 '16 at 12:51
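A rough sketch of the kind of wrapper class described in the comment above, assuming Python 3 and a seekable file opened in binary mode. It is not PM 2Ring's code; the names `BitReader` and `read_bits` are hypothetical. Each call seeks to the first byte that holds the requested bits, reads only the bytes it needs, and slices the result, so a range like "2 bits, then the next 9 bits" costs one read per call rather than one call per bit:

```python
import io


class BitReader:
    """Read arbitrary bit ranges from a seekable binary file as '0'/'1' strings."""

    def __init__(self, fileobj):
        self.f = fileobj

    def read_bits(self, start, count):
        """Return `count` bits beginning at absolute bit offset `start`."""
        if count <= 0:
            return ''
        first_byte = start // 8
        last_byte = (start + count - 1) // 8
        wanted = last_byte - first_byte + 1
        self.f.seek(first_byte)
        chunk = self.f.read(wanted)
        if len(chunk) < wanted:
            raise EOFError('requested bits extend past the end of the file')
        # Convert the whole chunk to a bit string in one go, keeping leading zeros.
        bits = format(int.from_bytes(chunk, 'big'), '0{}b'.format(len(chunk) * 8))
        offset = start % 8
        return bits[offset:offset + count]


# The first two bytes of the bit string in the question: 01011000 00100101
reader = BitReader(io.BytesIO(bytes([0b01011000, 0b00100101])))
first = reader.read_bits(0, 2)    # first 2 bits
second = reader.read_bits(2, 9)   # next 9 bits, spanning both bytes
```

With a real file, `io.BytesIO(...)` would be replaced by `open(path, 'rb')`.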
I clicked the link and was thinking it seemed awfully similar to something I'd worked on before haha. Anyway, I'll try figure out how to read bits from a file as strings then I guess (would it use similar stuff to your PIL code?), that'd hopefully be useful since my function only needs the resulting 1's and 0's anyway :) — Peter, Apr 17 '16 at 17:54
Damn, I came across [this answer](http://stackoverflow.com/questions/2576712/using-python-how-can-i-read-the-bits-in-a-byte/2576841#2576841) which seemed really nice, it runs over 5x slower with it though, I'm not so sure it's a feasible thing to write as bytes and read the bits :P — Peter, Apr 18 '16 at 11:09
Yes, getting 1 bit per function call is _really_ slow, since there's a fair bit of overhead in Python function calls. And unless your file is so large you can't load the whole thing into RAM I advise you to do what my PIL_Stegano program does, and read all the bytes at once and do the conversion to bits in one go. FWIW, here's a slightly cleaner way to build the conversion dictionary: `bit_dict = dict((tuple(int(c) for c in format(i, '08b')), i) for i in xrange(256))` or in Python 3: `bit_dict = {tuple(int(c) for c in format(i, '08b')) : i for i in range(256)}` — PM 2Ring, Apr 18 '16 at 14:17
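To illustrate converting in bulk rather than one bit per call, here is a small sketch, assuming Python 3, of packing a string of `'0'` and `'1'` characters into bytes and turning it back. The function names `pack_bits` and `unpack_bits` are hypothetical, and returning the padding count alongside the data is just one possible way to handle lengths that are not a multiple of 8:

```python
def pack_bits(bits):
    """Pack a '0'/'1' string into bytes.

    The final byte is zero-padded on the right; the number of padding
    bits is returned so the original length can be restored later.
    """
    padding = -len(bits) % 8
    padded = bits + '0' * padding
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, padding


def unpack_bits(data, padding=0):
    """Convert bytes back to a '0'/'1' string, dropping `padding` trailing bits."""
    bits = ''.join(format(byte, '08b') for byte in data)
    return bits[:len(bits) - padding] if padding else bits


packed, pad = pack_bits('0101100000100')   # 13 bits, so 3 bits of padding
restored = unpack_bits(packed, pad)
```

Stored this way the file takes one eighth of the space of the text form, at the cost of the conversion step on each read.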
Thanks, though the original code could handle it all in RAM (I used `''.join('{0:08b}'.format(i) for i in bytearray(x))`), it's just that I did a script to search all files on the computer a while back, and the resulting file was over 3gb. Searching was very fast, but you wouldn't want nearly 4gb of RAM reserved just to search the PC. That gave me the idea for only reading the required parts to link my serialization code with it, but now there's a choice between 5x slower or an 8x larger file haha. — Peter, Apr 18 '16 at 18:41
