
I have a bz2 compressed binary (big endian) file containing an array of data. Uncompressing it with external tools and then reading the file in to Numpy works:

import numpy as np
dim = 3
rows = 1000
cols = 2000
mydata = np.fromfile('myfile.bin').reshape(dim,rows,cols)

However, since there are plenty of other files like this I cannot extract each one individually beforehand. Thus, I found the bz2 module in Python which might be able to directly decompress it in Python. However I get an error message:

import bz2
dfile = bz2.BZ2File('myfile.bz2').read()
mydata = np.fromfile(dfile).reshape(dim,rows,cols)

>>IOError: first argument must be an open file

Apparently, the BZ2File function does not return a file object. What is the correct way to read the compressed file?

HyperCube

1 Answer


BZ2File does return a file-like object (although not an actual file). The problem is that you're calling read() on it:

dfile = bz2.BZ2File('myfile.bz2').read()

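This reads the entire decompressed file into memory as one big string, so what you then pass to fromfile is that string rather than a file object. That is why fromfile complains that its first argument must be an open file.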

Depending on your versions of NumPy and Python and your platform, reading from a file-like object that isn't an actual file may not work with fromfile. In that case, you can pass the buffer you read in to frombuffer instead.

So, either this:

dfile = bz2.BZ2File('myfile.bz2')
mydata = np.fromfile(dfile).reshape(dim,rows,cols)

… or this:

dbuf = bz2.BZ2File('myfile.bz2').read()
mydata = np.frombuffer(dbuf).reshape(dim,rows,cols)

(Needless to say, there are a slew of other alternatives that might be better than reading the whole buffer into memory. But if your file isn't too huge, this will work.)
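As a minimal sketch of the frombuffer route, the snippet below does a round trip on a small temporary file. It assumes the data are 64-bit floats stored big endian (dtype '>f8'), which the question does not actually state. The helper name load_bz2_array and the file name demo.bz2 are made up for illustration.

```python
import bz2
import os
import tempfile

import numpy as np


def load_bz2_array(path, shape, dtype=">f8"):
    """Decompress a .bz2 file into memory and view the bytes as an array.

    dtype=">f8" (big-endian float64) is an assumption; use whatever
    matches how the file was actually written.
    """
    with bz2.BZ2File(path) as f:
        raw = f.read()  # entire decompressed payload as bytes
    return np.frombuffer(raw, dtype=dtype).reshape(shape)


# Round trip on a small synthetic file (hypothetical name and shape).
shape = (3, 4, 5)
original = np.arange(60, dtype=">f8").reshape(shape)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
with bz2.BZ2File(demo_path, "w") as f:
    f.write(original.tobytes())

loaded = load_bz2_array(demo_path, shape)
```

Note that an array built by frombuffer from a bytes object is read-only; call loaded.copy() if you need to modify it.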

abarnert
  • `frombuffer()` doesn't seem to work in Python 2.7. It fails with `AttributeError: 'ExFileObject' object has no attribute '__buffer__'`. Any idea why? – con-f-use Jun 12 '16 at 20:58
  • Never mind, I was using `zipfl = bz2.BZ2File('myfile.bz2').open('file_memver_in_archive')` because I wanted to get `dbuf.name` and other attributes of the member. However, if one actually needs `zipfl` to be an `ExFileObject`, like I do, one can simply do `mydata = np.frombuffer(zipfl.read())` and have the best of both worlds. – con-f-use Jun 12 '16 at 21:42
  • BTW, using np.fromfile directly to the bz2 file, does not work for me, but np.frombuffer works fine. – Pablo Reyes Feb 27 '19 at 06:19