Python bz2 - text vs. interactive console (data stream)

Question

I was using bz2 earlier to try to decompress an input. The input that I wanted to decode was already in compressed format, so I pasted it directly into the interactive Python console:

>>> import bz2
>>> bz2.decompress(input)

This worked just fine without any errors. However, I got different results when I tried to extract the text from a html file and then decompress it:

import bz2

with open("example.html", "r") as f:
    contents = f.read()
# Insert code to pull out the text, which is of type 'str'
result = bz2.decompress(parsedString)

I've checked the string I parsed against the original one, and it looks identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening the file with "rb" in the hope that it would be read as binary, but that failed as well.

My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html in such a way that it will not be treated as an incorrect datastream? Any help is appreciated. Thanks!

I don't think we can help you without seeing the actual code. If your resulting strings are equal, the results should be the same. Start by simply extracting both strings and comparing them. — kirelagin, Jun 09 '13 at 07:12
I think korylprince answered it below. :X Thanks for the advice though. I just tried comparing them and it returned false, but when I did "type(parsedString)", both returned 'str'. — Zhouster, Jun 09 '13 at 23:04

score 2 · Accepted Answer · answered Jun 09 '13 at 07:50

My guess is that the html file has the text representation of the data instead of the actual binary data in the file itself.

For instance take a look at the following code:

>>> t = '\x80'
>>> t
'\x80'

But say I create a text file whose contents are the four characters \x80 and do:

with open('file') as f:
    t = f.read()
print repr(t)

I would get back:

'\\x80'
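To make the difference concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the snippets above are Python 2), where byte strings carry a b prefix, and the variable names are only illustrative. It compares the single byte with its spelled-out form:

```python
real = b"\x80"      # one byte with value 0x80
escaped = rb"\x80"  # four bytes: backslash, 'x', '8', '0'

print(len(real), len(escaped))  # 1 4
print(real == escaped)          # False
print(repr(escaped))            # b'\\x80'
```

Both values have the same type, which is why checking type() alone does not reveal the problem; comparing repr() or len() does.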

If this is the case, you could use eval to get the desired result:

result = bz2.decompress(eval('"' + parsedString + '"'))

Just make sure that you only do this for trusted data.
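If the data is not fully trusted, one possible alternative to eval is to let the unicode_escape codec interpret the escapes. The sketch below assumes Python 3, and that the parsed text contains only Latin-1 characters and valid backslash escapes; unescape_to_bytes is a hypothetical helper name, not part of bz2.

```python
import bz2

def unescape_to_bytes(escaped_text):
    """Turn text such as 'BZh9\\x94...' (escapes spelled out) into real bytes."""
    # Latin-1 maps code points 0-255 one-to-one onto byte values, so it
    # round-trips the data on both sides of the escape decoding.
    as_bytes = escaped_text.encode("latin-1")
    decoded = as_bytes.decode("unicode_escape")  # interprets \xNN, \n, \\ ...
    return decoded.encode("latin-1")

# Round trip: simulate a file that stores the escaped spelling of bz2 data.
compressed = bz2.compress(b"hello, bz2")
escaped_text = repr(compressed)[2:-1]  # strip the leading b' and trailing '
print(bz2.decompress(unescape_to_bytes(escaped_text)))  # b'hello, bz2'
```

Unlike eval, this only interprets escape sequences and never executes code found in the file.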

korylprince

Thank you korylprince for this response! The eval() call worked perfectly. As a follow-up question, could I ask how you can tell when a file is using the text representation versus the actual binary data? – Zhouster Jun 09 '13 at 18:54
Open the file in a text editor or Firefox. If you see things like \x80 then it's probably stored as plain text. If you see lots of weird symbols then it's probably binary. The most exact way would be to use a hex editor that shows ASCII, and see whether the file contains the raw bytes or their spelled-out text. – korylprince Jun 09 '13 at 21:02
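The same check can be roughly automated. The sketch below assumes Python 3; looks_escaped is a hypothetical helper, and it is only a heuristic, since it just searches the raw bytes for spelled-out \xNN sequences:

```python
import re

# A literal backslash, 'x', and two hex digits, e.g. the four bytes \x80
ESCAPE_PATTERN = re.compile(rb"\\x[0-9a-fA-F]{2}")

def looks_escaped(raw):
    """Return True if the raw bytes contain spelled-out \\xNN escapes."""
    return ESCAPE_PATTERN.search(raw) is not None

print(looks_escaped(rb"BZh91AY&SY\x94$"))  # True: escapes stored as text
print(looks_escaped(b"BZh91AY&SY\x94$"))   # False: a real 0x94 byte
```

To apply it to a file, read the contents with open(path, "rb") so that nothing is decoded before the check.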
