I'm using Beautiful Soup to parse email invoices and I'm running into a consistent problem involving special characters.
The text I am trying to parse is shown in the image.
But what I get from Beautiful Soup after finding the element and calling elem.text is this:
'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'
As you can see the apostrophe is now represented by "=E2=80=99", double quotes are "=E2=80=9C" and "=E2=80=9D" and there are seemingly random newlines in the text, for example "product=\r\ns". The newlines don't seem to appear in the image.
Apparently "E2 80 99" is the UTF-8 byte sequence (in hex) for the right single quotation mark ’ (U+2019), but I don't understand why I can still see it in this form after having done email.decode('utf-8') before sending it to Beautiful Soup.
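A quick sanity check of that byte sequence, as a minimal sketch (the variable names are only for illustration):

```python
# Interpret the three hex bytes E2 80 99 as UTF-8 and see which character they form.
raw_bytes = bytes.fromhex("E28099")
char = raw_bytes.decode("utf-8")

# Show the character together with its Unicode code point.
print(char, hex(ord(char)))
```

So the three bytes together encode a single character; they are not three separate characters.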
This is the element:
<td border:="" class='3D"td"' left="" middle="" padding:="" solid="" style='3D"color:' text-align:="" v="ertical-align:">Hi Mike, It=E2=80=
=99s probably not a big drama if you are having problems separating product=
s from classes. It is not uncommon to receive an order for pole classes and=
a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your =
system will not be able to place into a class list, hence having the extra =
sheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.</td>
I can post my code if required but I figure I must be making a simple mistake.
I checked out the answer to the question "Decode Hex String in Python 3", but I think that expects the entire string to be hex rather than just having scattered hex parts. I'm honestly not even sure how to search for "decode partial hex strings".
My final questions are:
Q1 How do I convert
'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'
into
'Hi Mike, It's probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any "erroneous" orders will be handy.'
using Python 3, without manually fixing each string and writing a replace method for each possible character.
Q2 Why does this "=\r\n" appear everywhere in my string but not in the rendered HTML?
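For reference, the pattern above (a "=XX" hex escape for each non-ASCII byte, and a bare "=" at the end of a line) matches the quoted-printable transfer encoding used in MIME email, where a trailing "=" is a soft line break that is not part of the text. A minimal sketch, assuming that is what is happening here, using the standard library quopri module on a shortened excerpt of the string (the names sample and decoded are hypothetical):

```python
import quopri

# Shortened excerpt of the string from the question, still quoted-printable encoded.
sample = (
    "Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having "
    "problems separating product=\r\ns from classes. "
    "hence having the extra =\r\nsheet for any "
    "=E2=80=9Cerroneous=E2=80=9D orders will be handy."
)

# quopri works on bytes: undo the quoted-printable layer first,
# then decode the resulting UTF-8 bytes into text.
decoded = quopri.decodestring(sample.encode("ascii")).decode("utf-8")
print(decoded)
```

If this assumption holds, it would probably be cleaner to undo the encoding before the HTML reaches Beautiful Soup, for example by parsing the message with the standard email package and calling get_payload(decode=True) on the HTML part. That would likely also explain the odd attributes such as class='3D"td"', since "=3D" is the quoted-printable escape for "=".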