Extracting bz2 file with single file in memory

Question

I have a CSV file compressed as a bz2 file. I'm trying to download it from a website, decompress it, and write it to a local CSV file with the following code:

# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)

On the decompress call, I get IOError: invalid data stream. As a note, the CSV file contained in the archive has quite a few unusual characters that could be causing issues. In particular, if I try converting the file contents to unicode, I get an error about not being able to decode 0xfd. The archive contains only a single file, but I'm wondering whether the problem could also come from not extracting a specific file.

Any ideas?

init_js · Accepted Answer · 2015-11-20T00:37:43.407

I suspect you are getting this error because the stream you are feeding the decompress() function is not a valid bz2 stream.

You must also "rewind" your StringIO buffer after writing to it; see the comments in the code below. The following code (the same as yours apart from the imports and the seek() fix) works if the URL points to a valid bz2 file.

from StringIO import StringIO
import urllib2
import bz2

# Get the bz2 file from the website
url = "http://www.7-zip.org/a/7z920.tar.bz2"  # just an example bz2 file

archive = StringIO()

# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)

archive.write(url_data.read())

# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()

# important! reset the buffer's read position to the start
# before reading, otherwise read() returns nothing.
archive.seek(0)

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv (binary mode, so the decompressed bytes are written unchanged)
with open('output_file', 'wb') as output_file:
    output_file.write(data)
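To see why the seek() matters, here is a minimal offline sketch. It assumes io.BytesIO as a stand-in for StringIO (it is available on both Python 2.6+ and Python 3), and the sample CSV bytes are made up in place of a real download.

```python
import bz2
import io

# Stand-in for the bytes returned by url_data.read(); the sample CSV is made up.
original = b"id,name\n1,alpha\n2,beta\n"
compressed = bz2.compress(original)

archive = io.BytesIO()
archive.write(compressed)

# After write(), the position sits at the end of the buffer,
# so read() has nothing left to return.
leftover = archive.read()

# Rewinding makes the whole payload readable again.
archive.seek(0)
rewound = archive.read()

# getvalue() returns the full contents regardless of the position.
whole = archive.getvalue()

print(len(leftover), rewound == compressed, whole == compressed)
print(bz2.decompress(rewound) == original)
```

If the buffer is only ever written once and read once, passing url_data.read() straight to bz2.decompress() avoids the intermediate buffer and the position problem altogether.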

re: encoding issues

Generally, character encoding errors raise UnicodeError (or one of its subclasses), not IOError. An IOError suggests something is wrong with the input itself, such as truncation or some other corruption that prevents the decompressor from completing its work.
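The difference between the two failure modes can be sketched as follows. The helper looks_like_bz2 is hypothetical (it only checks for the "BZh" magic bytes that begin a bz2 stream), and the sample inputs, including the HTML error page, are made up for illustration.

```python
import bz2


def looks_like_bz2(data):
    """Hypothetical helper: True if data starts with the bz2 magic bytes 'BZh'."""
    return data[:3] == b"BZh"


good = bz2.compress(b"a,b\n1,2\n")
# Made-up example of what a server might return instead of the archive.
bad = b"<html>404 Not Found</html>"

# Invalid input makes the decompressor itself fail.
try:
    bz2.decompress(bad)
    decompress_error = None
except IOError as exc:  # IOError is an alias of OSError on Python 3
    decompress_error = exc

# An encoding problem is a separate failure that only appears when decoding.
try:
    b"caf\xfd".decode("utf-8")
    decode_error = None
except UnicodeError as exc:
    decode_error = exc

print(looks_like_bz2(good), looks_like_bz2(bad))
print(type(decompress_error).__name__, type(decode_error).__name__)
```

Checking the first few bytes before decompressing is a cheap way to notice that the download is not a bz2 stream at all, though it does not detect truncation or corruption further into the file.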

You have omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot work with unicode strings that cannot be converted to ASCII. That no longer seems to hold (in my tests at least), but it may be at play. The docs say:

Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

You are correct about the `seek` call; I was doing that at the time of the post, but I must have copied the code to my clipboard before I made the change. As a note, `archive.getvalue()` will work regardless of the position in the file. You were also correct about the file being invalid. I recompressed the file on the web server and the issue was fixed! - Daniel Underwood, Nov 20 '15 at 00:34
