How do I remove bytestrings left over from decompression from a string?

Question

I have a bunch of strings which are sentences that look something like this:

Having two illnesses at the same time is known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and it can make treating each disorder more difficult.

I encoded the original string with .encode(), then compressed it with Python's bz2 library.

I then decompressed with bz2.decompress() and used .decode() to get it back.

Any ideas how I can conveniently remove these bytestrings from the text or avoid characters like quotes not getting decoded properly?

Thanks!

What makes you think these characters are not getting encoded properly? — Konrad Rudolph, Jan 28 '20 at 13:34
Sorry, I guess that wasn't what I meant; I'll edit. They're not getting decoded. — Peter Charland, Jan 28 '20 at 13:36
Can you be more specific? I just tried encoding a string using `.encode()`, then compressing it using `bz2.compress()`, then proceeded to `bz2.decompress()` which already gave a good output. Even after `.decode()` the output was correct. Using Python 3.8.1 — small_cat_destroyer, Jan 28 '20 at 13:39
Hey guys, I got an answer. I did it correctly in most of my code, but apparently in a couple of instances I accidentally just used str() on a bytestring. Thanks for the help! — Peter Charland, Jan 29 '20 at 04:06

score 1 · Answer 1 · answered Jan 28 '20 at 13:40

Looks to me like you didn't actually decode the data properly, as interpreting \xe2\x80\x9ccomorbidity\xe2\x80\x9d as bytes and decoding it yields a very sensible string:

>>> b"\xe2\x80\x9ccomorbidity\xe2\x80\x9d"
b'\xe2\x80\x9ccomorbidity\xe2\x80\x9d'
>>> _.decode()
'“comorbidity”'

Either that, or the original data was improperly generated / decoded in the first place (before it was encoded to UTF-8 and compressed), e.g. a UTF-8 data source was read as ISO-8859-1 (which is essentially a passthrough: it maps every byte to a character, so the misread never raises an error).
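To illustrate that second possibility, here is a minimal sketch of how UTF-8 bytes misread as ISO-8859-1 turn into mojibake, and how that misread can be reversed. The variable names are made up for the example, and it assumes the text was misread exactly once.

```python
original = "“comorbidity”"
utf8_bytes = original.encode("utf-8")

# ISO-8859-1 assigns a character to every byte value, so this never fails;
# each byte of a multi-byte UTF-8 sequence simply becomes its own character.
misread = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong decode gives back the original bytes, which can then
# be decoded with the correct codec.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(len(original), len(misread))  # the misread string is longer
print(repaired)
```

If the repaired text still looks wrong, the data was probably damaged somewhere else and this round trip is not the right fix.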

So these are the bits I'd look at:

Do you actually decode properly after decompressing?
Is the original data correct?

Thanks! The other answer was first but this would have done it too. — Peter Charland, Jan 29 '20 at 04:09

score 1 · Accepted Answer · answered Jan 28 '20 at 13:42

I am guessing that you mistakenly assigned the above byte string “sentence” to an object of type str. Instead, it needs to be assigned to a byte string object and interpreted as a sequence of UTF-8 bytes. Compare:

b = b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
s = b.decode('utf-8')
print(b)
# b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
print(s)
# ... known as “comorbidity” and ...

Either way, the issue is unrelated to compression: a lossless compression (such as bzip2) roundtrip never changes the data:

import bz2
print(bz2.decompress(bz2.compress(b)).decode('utf-8'))
# ... known as “comorbidity” and ...
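For the case where str() was called on the bytes object instead of .decode() (which the asker's comments say was the actual cause), here is a rough sketch of what happens and how already-mangled text might be recovered. The helper name unescape_bytes_text is hypothetical, and it assumes the text contains only Latin-1 characters and no backslashes other than the \xNN escapes.

```python
import ast

raw = "“comorbidity”".encode("utf-8")  # stands in for the output of bz2.decompress()

wrong = str(raw)             # repr of the bytes object, escapes and all
right = raw.decode("utf-8")  # the actual text

print(wrong)  # b'\xe2\x80\x9ccomorbidity\xe2\x80\x9d'
print(right)  # “comorbidity”

# If the mangled string still has its b'...' wrapper, parse it back into bytes.
recovered = ast.literal_eval(wrong).decode("utf-8")

def unescape_bytes_text(text):
    """Hypothetical helper: turn literal \\xNN escapes back into UTF-8 text."""
    raw_bytes = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return raw_bytes.decode("utf-8")

stripped = wrong[2:-1]  # same escapes without the b'...' wrapper
print(unescape_bytes_text(stripped))  # “comorbidity”
```

Fixing the stray str() call at the source is still the better option; the recovery paths are only for text that has already been stored in the mangled form.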

Thanks! You were exactly right. I missed one line where I used str() on it without decoding. >.< Much appreciated! — Peter Charland, Jan 29 '20 at 04:08
