2

I've got a script to decompress and parse data contained in a bunch of very large bzip2-compressed files. Since it can take a while, I'd like some way to monitor the progress. I know I can get the file size with os.path.getsize(), but bz2.BZ2File.tell() returns the position within the uncompressed data. Is there any way to get the current position within the compressed file so I can monitor the progress?
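
To illustrate the mismatch (data.bz2 is just a placeholder name for one of my files):

import os
import bz2

path = 'data.bz2'                 # placeholder for one of the large files
size = os.path.getsize(path)      # size of the *compressed* file on disk
with bz2.BZ2File(path) as f:
    f.readline()
    # tell() counts *uncompressed* bytes, so comparing it to the
    # compressed size says nothing useful about progress
    print(f.tell(), 'of', size)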

Bonus points if there's a Python equivalent to Java's ProgressMonitorInputStream.

job
  • 9,003
  • 7
  • 41
  • 50
  • Does this answer your question? [How to get the time needed for decompressing large bz2 files?](https://stackoverflow.com/questions/54596919/how-to-get-the-time-needed-for-decompressing-large-bz2-files) – Mitar Mar 23 '21 at 05:03
  • I think this is a duplicate of another question where I posted an answer which shows how to access the internal position: https://stackoverflow.com/a/66757519/252025 – Mitar Mar 23 '21 at 05:04

2 Answers

0

If you only need to parse the data in the bzipped file, I think it should be possible to avoid unzipping the file before reading it. I have not tested this on bzip2, only on gzipped files, but I hope the same approach works for bzip2 as well.

See for instance: How to write csv in python efficiently?
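
Something along these lines should work (untested on bzip2, and data.bz2 is a placeholder name):

import bz2

# Stream and parse lines straight from the compressed file;
# nothing is decompressed to disk first.
with bz2.open('data.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        pass  # parse the line and keep only the parts of interest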

Community
  • 1
  • 1
Dvx
  • 289
  • 2
  • 10
  • I'm only interested in a subset of the data in the files, so I don't want to unzip them completely. I'm parsing the lines as I read them and only outputting the parts I care about. – job Mar 23 '13 at 02:21
  • OK, I thought you were first unzipping your file and then parsing it. Seems you were already doing it the right way. – Dvx Mar 23 '13 at 09:58
0

This is the solution I came up with that seems to work.

import bz2

class SimpleBZ2File(object):
    """Iterate over the lines of a bzip2 file while exposing the offset
    within the raw, compressed file so progress can be reported against
    os.path.getsize(). Lines are yielded as bytes."""

    def __init__(self, path, readsize=1024):
        self.decomp = bz2.BZ2Decompressor()
        self.rawinput = open(path, 'rb')
        self.eof = False
        self.readsize = readsize
        self.leftover = b''

    def tell(self):
        # Position within the compressed file, not the decompressed stream.
        return self.rawinput.tell()

    def __iter__(self):
        while not self.eof:
            rawdata = self.rawinput.read(self.readsize)
            if not rawdata:
                self.eof = True
            else:
                data = self.decomp.decompress(rawdata)
                if not data:
                    continue  # we need to supply more raw bytes to the decompressor
                lines = (self.leftover + data).splitlines(True)
                # The last element may be an incomplete line; hold it back
                # until the next chunk (or EOF) completes it.
                if lines[-1].endswith(b'\n'):
                    self.leftover = b''
                else:
                    self.leftover = lines.pop()
                for line in lines:
                    yield line
        if self.leftover:
            yield self.leftover
        self.rawinput.close()
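
A minimal usage sketch (data.bz2 is a placeholder path), reporting progress against the compressed file size from os.path.getsize():

import os

path = 'data.bz2'                 # placeholder path
total = os.path.getsize(path)
reader = SimpleBZ2File(path)
for n, line in enumerate(reader, 1):
    # ... parse the line here ...
    if n % 100000 == 0:
        print('%.1f%% of the compressed file read' % (100.0 * reader.tell() / total))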
job
  • 9,003
  • 7
  • 41
  • 50