I suspect you are getting this error because the stream you are feeding decompress() is not a valid bz2 stream.

You must also "rewind" your StringIO buffer after writing to it; the write leaves the read position at the end of the buffer, so a subsequent read() returns an empty string. See the notes in the comments below. The following code (the same as yours apart from the imports and the seek() fix) works if the URL points to a valid bz2 file.
from StringIO import StringIO
import urllib2
import bz2
# Get the bz2 file from the website
url = "http://www.7-zip.org/a/7z920.tar.bz2" # just an example bz2 file
archive = StringIO()
# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()
# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)
# Extract the training data
data = bz2.decompress(archive.read())
# Write the decompressed data out (your csv).
# Binary mode keeps the bytes unchanged; close() flushes them to disk.
output_file = open('output_file', 'wb')
output_file.write(data)
output_file.close()
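If you want to confirm that diagnosis before calling decompress(), you can peek at the first three bytes of the buffer: every bzip2 stream starts with the magic header "BZh". A quick sketch, reusing the archive buffer from the code above:

# Peek at the start of the buffer without consuming it.
archive.seek(0)
magic = archive.read(3)
archive.seek(0)  # rewind again so later reads start from the top
if magic != "BZh":
    # The server returned something that is not bzip2 -- often an HTML
    # error page or a redirect body rather than the file you expected.
    print "Not a bz2 stream; first bytes are:", repr(magic)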
re: encoding issues
Generally, character encoding errors will raise UnicodeError (or one of its cousins), not IOError. An IOError from decompress() suggests something is wrong with the input itself, such as truncation or corruption, i.e. anything that prevents the decompressor from doing its work completely.
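As a rough illustration (this assumes Python 2's bz2 module, which reports a corrupt or non-bz2 stream as an IOError), feeding decompress() something that is not bzip2 data produces exactly that error class:

import bz2

try:
    bz2.decompress("this is not bzip2 data")
except IOError as e:
    # Typically surfaces as "IOError: invalid data stream".
    print "decompress failed:", e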
You have omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot work with unicode strings that cannot be converted to ASCII. That no longer seems to hold (in my tests, at least), but it may be at play:
Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
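If you want to check whether this is what is biting you, a quick experiment on your interpreter settles it (a sketch only; as noted, the observed behaviour may vary between Python 2 builds):

import StringIO
import cStringIO

for module in (StringIO, cStringIO):
    buf = module.StringIO()
    try:
        # a unicode string that cannot be encoded as plain ASCII
        buf.write(u"non-ascii: \u00e9")
        print module.__name__, "accepted the unicode string"
    except UnicodeError as e:
        print module.__name__, "rejected it:", e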