I have a bunch of strings which are sentences that look something like this:

Having two illnesses at the same time is known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and it can make treating each disorder more difficult.

I encoded the original string with .encode(), then compressed it with Python's bz2 library.

I then decompressed with bz2.decompress() and used .decode() to get it back.

Any ideas how I can conveniently remove these byte escape sequences from the text, or keep characters like quotes from being decoded incorrectly?

Thanks!

Peter Charland
  • What makes you think these characters are not getting encoded properly? – Konrad Rudolph Jan 28 '20 at 13:34
  • Sorry, I guess that wasn't what I meant; I'll edit. They're not getting decoded. – Peter Charland Jan 28 '20 at 13:36
  • Can you be more specific? I just tried encoding a string using `.encode()`, then compressing it using `bz2.compress()`, then proceeded to `bz2.decompress()` which already gave a good output. Even after `.decode()` the output was correct. Using Python 3.8.1 – small_cat_destroyer Jan 28 '20 at 13:39
  • Hey guys, I got an answer. I did it correctly in most of my code, but apparently in a couple of instances I accidentally just used str() on a bytestring. Thanks for the help! – Peter Charland Jan 29 '20 at 04:06

2 Answers

Looks to me like you didn't actually decode the data properly, since interpreting \xe2\x80\x9ccomorbidity\xe2\x80\x9d as bytes and decoding them yields a very sensible string:

>>> b"\xe2\x80\x9ccomorbidity\xe2\x80\x9d"
b'\xe2\x80\x9ccomorbidity\xe2\x80\x9d'
>>> _.decode()
'“comorbidity”'

Either that or the original data was improperly generated / decoded in the first place (before it was encoded to UTF-8 and compressed), e.g. a UTF-8 data source was read as ISO-8859-1 (which is essentially a passthrough: every byte maps to a code point, so decoding never fails, even on the wrong input).
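
A minimal sketch of that second scenario, assuming the source text really was UTF-8 and was read as ISO-8859-1 somewhere upstream. The variable names are made up for illustration; only `str.encode` and `bytes.decode` are used.

```python
# Hypothetical illustration: UTF-8 bytes misread as ISO-8859-1.
original = "“comorbidity”"

utf8_bytes = original.encode("utf-8")

# ISO-8859-1 maps each of the 256 byte values to one code point,
# so this wrong decode succeeds silently instead of raising.
misread = utf8_bytes.decode("iso-8859-1")

# Because the mapping is one-to-one, the mistake is reversible:
# re-encode with the wrong codec, then decode with the right one.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(len(original), len(misread))  # the misread text has one character per byte
print(repaired == original)
```

The repair step only works if nothing else touched the misread text in between, so it is a diagnostic aid rather than a general fix.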

So these are the bits I'd look at:

  • Do you actually decode properly after decompressing?
  • Is the original data correct?
Masklinn

I am guessing that you mistakenly assigned the above byte string “sentence” to an object of type str. Instead, it needs to be assigned to a bytes object and interpreted as a sequence of UTF-8 bytes. Compare:

b = b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
s = b.decode('utf-8')
print(b)
# b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
print(s)
# ... known as “comorbidity” and ...

Either way, the issue is unrelated to compression: a lossless compression (such as bzip2) roundtrip never changes the data:

import bz2

# b is the bytes object defined above
print(bz2.decompress(bz2.compress(b)).decode('utf-8'))
# ... known as “comorbidity” and ...
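
For comparison, here is a sketch of the whole pipeline next to the slip reported in the comments (calling str() on the decompressed bytes). The sample text and variable names are assumptions, and the `ast.literal_eval` repair is only one possible way to recover text that was already saved in that form; it assumes the stored string is exactly the repr of a bytes object, `b'...'` prefix included.

```python
import ast
import bz2

text = "known as “comorbidity” and"

# Round trip: encode -> compress -> decompress -> decode.
compressed = bz2.compress(text.encode("utf-8"))
raw = bz2.decompress(compressed)   # this is bytes, not str
right = raw.decode("utf-8")        # turns the UTF-8 bytes back into text

# The slip: str() on bytes returns the repr, so the \x.. escapes
# end up as literal characters inside the resulting str.
wrong = str(raw)

# Hypothetical repair for strings already stored this way:
# parse the repr back into bytes, then decode those bytes.
recovered = ast.literal_eval(wrong).decode("utf-8")

print(right)
print(wrong)
print(recovered)
```

Running Python with the `-b` flag makes str() on a bytes object emit a BytesWarning, which can help locate the remaining call sites.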
Konrad Rudolph