2

I've got a script to decompress and parse data contained in a bunch of very large bzip2-compressed files. Since it can take a while, I'd like some way to monitor the progress. I know I can get the file size with os.path.getsize(), but bz2.BZ2File.tell() returns the position within the uncompressed data. Is there any way to get the current position within the compressed file so I can monitor the progress?
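
To illustrate the mismatch (data.bz2 is just a placeholder name for one of my files):

import os
import bz2

path = 'data.bz2'                 # placeholder for one of the large files
size = os.path.getsize(path)      # size of the *compressed* file on disk
with bz2.BZ2File(path) as f:
    f.readline()
    # tell() counts *uncompressed* bytes, so comparing it to the
    # compressed size says nothing useful about progress
    print(f.tell(), 'of', size)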

Bonus points if there's a Python equivalent to Java's ProgressMonitorInputStream.

job
  • 9,003
  • 7
  • 41
  • 50
  • Does this answer your question? [How to get the time needed for decompressing large bz2 files?](https://stackoverflow.com/questions/54596919/how-to-get-the-time-needed-for-decompressing-large-bz2-files) – Mitar Mar 23 '21 at 05:03
  • I think this is a duplicate of another question where I posted an answer which shows how to access the internal position: https://stackoverflow.com/a/66757519/252025 – Mitar Mar 23 '21 at 05:04

2 Answers

0

If you only need to parse the data in the bzipped file, I think it should be possible to avoid unzipping the file before reading it. I have not tested this on bzip2, only on gzipped files, but I hope the same approach works for bzip2 as well.

See for instance: How to write csv in python efficiently?
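
Something along these lines should work (untested on bzip2, and data.bz2 is a placeholder name):

import bz2

# Stream and parse lines straight from the compressed file;
# nothing is decompressed to disk first.
with bz2.open('data.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        pass  # parse the line and keep only the parts of interest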

Community
  • 1
  • 1
Dvx
  • 289
  • 2
  • 10
  • I'm only interested in a subset of the data in the files, so I don't want to unzip them completely. I'm parsing the lines as I read them and only outputting the parts I care about. – job Mar 23 '13 at 02:21
  • OK, I thought you were first unzipping your file and then parsing it. Seems you were already doing it the right way. – Dvx Mar 23 '13 at 09:58
0

This is the solution I came up with that seems to work.

import bz2

class SimpleBZ2File(object):
    """Iterate over the lines of a bzip2 file while exposing the offset
    within the raw, compressed file so progress can be reported against
    os.path.getsize(). Lines are yielded as bytes."""

    def __init__(self, path, readsize=1024):
        self.decomp = bz2.BZ2Decompressor()
        self.rawinput = open(path, 'rb')
        self.eof = False
        self.readsize = readsize
        self.leftover = b''

    def tell(self):
        # Position within the compressed file, not the decompressed stream.
        return self.rawinput.tell()

    def __iter__(self):
        while not self.eof:
            rawdata = self.rawinput.read(self.readsize)
            if not rawdata:
                self.eof = True
            else:
                data = self.decomp.decompress(rawdata)
                if not data:
                    continue  # we need to supply more raw bytes to the decompressor
                lines = (self.leftover + data).splitlines(True)
                # The last element may be an incomplete line; hold it back
                # until the next chunk (or EOF) completes it.
                if lines[-1].endswith(b'\n'):
                    self.leftover = b''
                else:
                    self.leftover = lines.pop()
                for line in lines:
                    yield line
        if self.leftover:
            yield self.leftover
        self.rawinput.close()
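
A minimal usage sketch (data.bz2 is a placeholder path), reporting progress against the compressed file size from os.path.getsize():

import os

path = 'data.bz2'                 # placeholder path
total = os.path.getsize(path)
reader = SimpleBZ2File(path)
for n, line in enumerate(reader, 1):
    # ... parse the line here ...
    if n % 100000 == 0:
        print('%.1f%% of the compressed file read' % (100.0 * reader.tell() / total))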
job
  • 9,003
  • 7
  • 41
  • 50