
Is it possible to append to a gzipped text file on the fly using Python?

Basically I am doing this:

import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content)
f.close()

A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1MB when done).

Explicitly specifying the compression level does not seem to make a difference either.

If I gzip an existing uncompressed file afterwards, its size comes down to roughly 80 KB.

I'm guessing it's not possible to "append" to a gzip file on the fly and have it compress?

Is this a case of writing to a StringIO buffer and then flushing to a gzip file when done?
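
Something like this, maybe (just a rough sketch of the buffering idea, using io.StringIO rather than "String.IO"):

import gzip
import io

# Rough sketch: collect everything in memory first, then gzip it in one go.
buf = io.StringIO()
for i in range(1000):  # stand-in for my write-every-6-seconds loop
    buf.write("Lots of content here\n")

with gzip.open('file.txt.gz', 'wt', compresslevel=9) as f:
    f.write(buf.getvalue())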

general exception
  • For the gzip algorithm to work efficiently, it has to get its hands on the entire content to be compressed. Otherwise, you're just appending chunks of gzipped content that have nothing to do with each other. – Nadh Aug 07 '13 at 07:34
  • @Nadh so I guess my last line is correct? Write to a StringIO and flush to gzip? – general exception Aug 07 '13 at 07:38
  • 1
    Yes, that should work. You just have to make sure that all content is gzipped together at any instant. – Nadh Aug 07 '13 at 07:40
  • I vaguely remember that zlib can be used to perform streaming compression, i.e. without seeing all the data in advance (see the sketch after these comments). – Hans Then Aug 07 '13 at 08:09
  • 3
    The problem is appending only one line of data at once. For gzip to work efficiently, it needs at least *some* amount of data at once --- not necessarily the whole file, but certainly more than one line. If sending the whole file at once is too much, you can also send it pieces of 16KB or something. – Armin Rigo Aug 07 '13 at 08:26
  • Assuming this is pre-processing of data, can you append that line right before processing the data? That is, instead of open gzip -> write -> close -> open gzip -> process, do open gzip -> read -> add one line -> process. – Patrick the Cat Aug 07 '13 at 18:35
  • Note that your snippet doesn't work in Python 3 unless you add the `t` flag to the mode (text mode). – Jean-François Fabre Oct 12 '17 at 12:03
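
A minimal sketch of the streaming compression Hans Then mentions, using zlib.compressobj. This assumes a single long-running process keeps the compressor object alive, and the file only becomes a complete, valid gzip stream once flush() is called at the end:

import zlib

# wbits = 31 (16 + 15) tells zlib to wrap the deflate output in a gzip container.
comp = zlib.compressobj(9, zlib.DEFLATED, 31)

with open('file.txt.gz', 'wb') as f:
    for i in range(1000):  # stand-in for the line-every-6-seconds loop
        chunk = comp.compress(b"Lots of content here\n")
        if chunk:  # the compressor buffers internally and may return nothing yet
            f.write(chunk)
    f.write(comp.flush())  # write whatever is left plus the gzip trailer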

1 Answer


That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.

However, it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage of the history of previous data, but here gzip has been given essentially none.

You could either a) accumulate at least a few KB of data (many of your lines) before invoking gzip to add another gzip stream to the file, or b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data.
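
For example, a) could look roughly like this (a sketch only; the class name and the 64 KB threshold are arbitrary choices, not part of any library):

import gzip

class GzipLineBuffer:
    """Buffer lines and append a new gzip stream to the file only once
    there is enough data for the compressor to work with."""

    def __init__(self, path, flush_bytes=64 * 1024):
        self.path = path
        self.flush_bytes = flush_bytes  # a few tens of KB; tune to taste
        self.lines = []
        self.size = 0

    def add(self, line):
        self.lines.append(line)
        self.size += len(line)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        if not self.lines:
            return
        # Each call appends one more gzip member; gzip and zcat read the
        # concatenated members back as a single file.
        with gzip.open(self.path, 'at', compresslevel=9) as f:
            f.write(''.join(self.lines))
        self.lines = []
        self.size = 0

You would call add() every six seconds as the lines arrive, and call flush() one last time at shutdown so nothing buffered is lost.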

You can find an example of b), in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.
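
If you do interface to the C code, a ctypes sketch might look roughly like the following. The shared-library name and the prototypes here are assumptions to check against your copy of gzlog.h (gzlog also keeps auxiliary files of its own next to the .gz):

import ctypes

# Assumes gzlog.c from zlib's examples/ has been built as a shared library,
# e.g.:  gcc -shared -fPIC -o libgzlog.so gzlog.c -lz
lib = ctypes.CDLL('./libgzlog.so')

lib.gzlog_open.argtypes = [ctypes.c_char_p]
lib.gzlog_open.restype = ctypes.c_void_p
lib.gzlog_write.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_size_t]
lib.gzlog_write.restype = ctypes.c_int
lib.gzlog_close.argtypes = [ctypes.c_void_p]
lib.gzlog_close.restype = ctypes.c_int

log = lib.gzlog_open(b'file.txt')        # gzlog derives file.txt.gz from this
line = b'Lots of content here\n'
lib.gzlog_write(log, line, len(line))    # append, leaving a valid gzip file behind
lib.gzlog_close(log)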

Mark Adler