
I am trying to compress a data stream using Python's bz2.BZ2Compressor class.

The documentation says BZ2Compressor.compress() should return chunks of compressed data "whenever possible", but I don't get anything.

I get ALL my compressed data only when I call flush() (I have tried with files of 2 GB+); until then, nothing.

Is there a way I can set an internal buffer limit so that it returns data to me sooner?

Thanks!

mac
  • "I get ALL my compressed data when I flush() (I have tried with files 2GB+) still nothing."? What does this mean? Do you get the data with flush? If so, then what's your question? Are you wondering why it doesn't seem to actually return chunks? – S.Lott Nov 23 '11 at 18:03
  • Hi, yes I want small chunks from BZ2Compressor.compress() (the documentation says this function is supposed to return chunks) – Utkarsh Gaur Dec 07 '11 at 01:53
  • It's not *required* to return chunks. The implementation, it appears, doesn't need to. It appears you have way, way too much memory in your computer. – S.Lott Dec 07 '11 at 11:23
  • I know it's not _required_, hence the question: is there a way I can force it to return chunks - maybe flush the internal buffer, something like that – Utkarsh Gaur Dec 07 '11 at 17:48

1 Answer


I get ALL my compressed data when I flush() (I have tried with files 2GB+) still nothing.

There's a trick to working with compressors.

I'll bet that your 2GB+ file was not very random. Random data doesn't compress well. Orderly data compresses to a very small size.

For example:

>>> import bz2
>>> c=bz2.BZ2Compressor()
>>> import string
>>> data = string.printable*1024
>>> len(data)
102400
>>> c.compress(data)
''
>>> result = c.flush()
>>> len(result)
361
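For comparison, here is the same patterned buffer run through the one-shot bz2.compress() function (a Python 3 sketch, so the string is encoded to bytes first). Since it compresses and flushes in a single call, it always returns the complete result immediately, which makes the compression ratio easy to see:

```python
import bz2
import string

# Same patterned input as above, as bytes for Python 3's bz2 API.
data = (string.printable * 1024).encode("ascii")

# bz2.compress() handles the whole buffer at once (compress + flush),
# so it returns the complete compressed stream in one call.
compressed = bz2.compress(data)
print(len(data), len(compressed))  # the patterned data shrinks dramatically
```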

The data being supplied had a pattern, which made it compress well.

You need random data.

>>> import random
>>> c=bz2.BZ2Compressor()
>>> size= 0
>>> result= ''
>>> while result == '':
...     data = ''.join( random.choice(string.printable) for i in xrange(1024*8) )
...     size += len(data)
...     result = c.compress(data)
... 
>>> len(result)
754809
>>> size
901120

I get chunks when I use really random data.
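As for the original question ("can I set an internal buffer limit?"): there is no explicit buffer-size parameter, but the compresslevel argument to BZ2Compressor controls bzip2's block size (100 KB per level, per the bzip2 documentation), so level 1 uses the smallest internal block and should emit compressed chunks after the least input. A Python 3 sketch under that assumption:

```python
import bz2
import os

# compresslevel sets bzip2's block size (100 KB * level), so level 1
# uses the smallest internal buffer and emits compressed blocks soonest.
c = bz2.BZ2Compressor(1)

fed = 0
chunks = []
# Feed incompressible bytes; once a full block accumulates internally,
# compress() can return a compressed chunk instead of buffering it all.
while not chunks and fed < 1024 * 1024:
    block = os.urandom(8 * 1024)
    fed += len(block)
    out = c.compress(block)
    if out:
        chunks.append(out)

print(fed)              # bytes fed before the first chunk appeared
chunks.append(c.flush())  # always collect the final flushed data
```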

S.Lott