I was using bz2 earlier to try to decompress some input. The data I wanted to decode was already compressed, so I pasted it directly into the interactive Python console:

>>> import bz2
>>> bz2.decompress(input)

This worked just fine without any errors. However, I got different results when I tried to extract the text from an HTML file and then decompress it:

file = open("example.html", "r")
contents = file.read()
# Insert code to pull out the text, which is of type 'str'
result = bz2.decompress(parsedString)

I've checked the string I parsed against the original one, and it looks identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening the file with "rb" in the hope that the .html file would be read as binary, but that did not work either.

My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html in such a way that it will not be treated as an incorrect datastream? Any help is appreciated. Thanks!

kirelagin
Zhouster
  • I don't think we can help you without seeing the actual code. If your resulting strings are equal, the results should be the same. Start by simply extracting both strings and comparing them. – kirelagin Jun 09 '13 at 07:12
  • I think korylprince answered it below. :X Thanks for the advice though. I just tried comparing them and it returned false, but when I did "type(parsedString)", both returned 'str'. – Zhouster Jun 09 '13 at 23:04

1 Answer

My guess is that the html file has the text representation of the data instead of the actual binary data in the file itself.

For instance take a look at the following code:

>>> t = '\x80'
>>> t
'\x80'

But say I create a text file with the contents \x80 and do:

with open('file') as f:
    t = f.read()
# Show the repr so the literal backslash is visible
print repr(t)

I would get back:

'\\x80'
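
To make the difference concrete, here is a minimal sketch comparing the two values side by side. It is written for Python 3 (the snippets above use Python 2 syntax), and the variable names are only illustrative:

```python
# One character whose code point is 0x80
real_byte = '\x80'
# Four characters: backslash, 'x', '8', '0' (what a text file containing \x80 holds)
escaped_text = r'\x80'

# Both are of type str, yet their lengths and contents differ
print(type(real_byte) is type(escaped_text))
print(len(real_byte), len(escaped_text))
print(repr(escaped_text))
print(real_byte == escaped_text)
```

Comparing `len()` and `repr()` of the parsed string and the pasted string is usually enough to spot this kind of mismatch.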

If this is the case, you could use eval to get the desired result:

result = bz2.decompress(eval('"' + parsedString + '"'))

Just make sure that you only do this for trusted data.
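
If the data is not fully trusted, one possible alternative to eval is `ast.literal_eval`, which only parses literals and does not execute arbitrary code. The following is a sketch under a few assumptions: it targets Python 3, the helper name `unescape_to_bytes` is hypothetical, and the escaped text is assumed to contain no unescaped double quotes.

```python
import ast
import bz2


def unescape_to_bytes(escaped_text):
    """Convert text such as '\\x42\\x5a' into the bytes it spells out."""
    # Wrapping the text in b"..." turns it into a bytes literal.
    # Assumption: escaped_text contains no unescaped double quotes.
    return ast.literal_eval('b"' + escaped_text + '"')


# Round trip: compress, write every byte as a \xNN escape, then recover it
original = bz2.compress(b"hello world")
escaped = ''.join('\\x%02x' % byte for byte in original)

recovered = unescape_to_bytes(escaped)
print(bz2.decompress(recovered))
```

Here `escaped` stands in for the string pulled out of the HTML file; the round trip is only there to show that the recovered bytes decompress correctly.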

korylprince
  • Thank you korylprince for this response! The eval() call worked perfectly. As a follow-up question, could I ask how you can tell when a file is using the text representation versus the actual binary data? – Zhouster Jun 09 '13 at 18:54
  • Open the file in a text editor or Firefox. If you see things like \x80 then it's probably in plain text. If you see lots of weird symbols then it's probably in binary. The most exact way would be to use a hex editor that shows ASCII, and see whether it's using the symbols or the text. – korylprince Jun 09 '13 at 21:02
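
The same check can be roughed out in code. This is only a heuristic sketch for Python 3, not a definitive test: the function name `describe_payload` and its labels are made up for illustration, and raw binary data could by chance contain bytes that look like an escape sequence.

```python
import re

# Literal backslash, 'x', then two hex digits, e.g. the four bytes \x80
ESCAPE_PATTERN = re.compile(rb'\\x[0-9a-fA-F]{2}')


def describe_payload(raw):
    """Guess whether bytes read with open(path, 'rb') hold escapes or raw data."""
    if ESCAPE_PATTERN.search(raw):
        return "escaped text"
    if any(byte >= 0x80 for byte in raw):
        return "raw binary"
    return "plain ASCII"


print(describe_payload(b'\\x80'))  # the text representation
print(describe_payload(b'\x80'))   # the actual byte
```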