Getting the right encoding for e-mail from the Gmail API

Question

I am struggling to get special characters from an email to display correctly.

I get the message using the Gmail API like this:

msg_id = '169a8fac44fd8115'
service = build('gmail', 'v1', credentials=creds)
message = service.users().messages().get(userId='me', id=msg_id).execute()
htmlpart = message['payload']['parts'][0]['parts'][1]['body']['data']

I've then tried the following:

file_data = quopri.decodestring(base64.urlsafe_b64decode(htmlpart)).decode('iso-8859-1')
file_data = base64.urlsafe_b64decode(htmlpart.encode('UTF-8')).decode('iso-8859-1')
file_data = base64.urlsafe_b64decode(htmlpart.encode('iso-8859-1')).decode('utf-8')
file_data = base64.urlsafe_b64decode(htmlpart.encode('UTF-8')).decode('utf-8')

None of them gives me the right output. I get things like â‚¬2 where the original has €.

For reference, the headers of this message are as follows:

'headers': [{'name': 'Content-Type', 'value': 'text/html; charset="UTF-8"'},
  {'name': 'Content-Transfer-Encoding', 'value': 'quoted-printable'}]

Edit: added sample data below. I am trying to get the HTML of an e-mail; I am copying below just a part of it, which highlights the encoding problem (see the sentence starting "You'll get").

</tr><tr><td class="m_4364729876101169671Uber18_text_p1" align="left" style="color:rgb(0,0,0);font-family:&#39;Uber18-text-Regular&#39;,&#39;HelveticaNeue-Light&#39;,&#39;Helvetica Neue Light&#39;,Helvetica,Arial,sans-serif;font-size:16px;line-height:28px;direction:ltr;text-align:left"> Give friends free ride credit to try Uber. You&#39;ll get CN¥10 off each of your next 3 rides when they start riding. <span class="m_4364729876101169671Uber18_text_p1" style="color:#000000;font-family:&#39;Uber18-text-Regular&#39;,&#39;HelveticaNeue-Light&#39;,&#39;Helvetica Neue Light&#39;,Helvetica,Arial,sans-serif;font-size:16px;line-height:28px">Share code: 20ccv</span></td>

snakecharmerb · Accepted Answer · 2019-03-24T07:54:24.333

The headers

'headers': [{'name': 'Content-Type', 'value': 'text/html; charset="UTF-8"'},
  {'name': 'Content-Transfer-Encoding', 'value': 'quoted-printable'}]

are telling you that the message consists of text encoded as UTF-8, then quoted-printable encoded so that it can be processed by systems that only support 7-bit characters.
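As a rough sketch, the two values can be read from a header list shaped like the one above by loading it into a standard-library email.message.Message. The helper name charset_and_encoding is hypothetical, and the list literal simply repeats the headers from the question.

```python
from email.message import Message

# Header list copied from the question (Gmail API "headers" shape).
headers = [
    {'name': 'Content-Type', 'value': 'text/html; charset="UTF-8"'},
    {'name': 'Content-Transfer-Encoding', 'value': 'quoted-printable'},
]


def charset_and_encoding(header_list):
    """Return (charset, transfer encoding) from a list of name/value dicts.

    Hypothetical helper for illustration; either value may be None
    if the corresponding header is missing.
    """
    msg = Message()
    for header in header_list:
        msg[header['name']] = header['value']
    # get_content_charset() strips the quotes and lower-cases the value.
    charset = msg.get_content_charset()
    encoding = msg.get('Content-Transfer-Encoding')
    if encoding is not None:
        encoding = encoding.strip().lower()
    return charset, encoding


print(charset_and_encoding(headers))
```

Reading the values this way avoids hard-coding 'utf-8' for messages that declare a different charset.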

To decode, you need to decode from quoted-printable first, and then decode the resulting bytes from UTF-8.

Something like this ought to work:

import quopri

utf8 = quopri.decodestring(htmlpart)
text = utf8.decode('utf-8')
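For a self-contained illustration of the two steps, here is a minimal sketch that runs on a small hand-written quoted-printable sample rather than on real Gmail data. The function name decode_body and its base64url flag are assumptions made for this example: the flag covers the case where the payload is additionally wrapped in URL-safe base64, which would have to be removed before the quoted-printable step.

```python
import base64
import quopri


def decode_body(data, charset="utf-8", base64url=False):
    """Undo the transfer encoding(s), then decode the bytes to text.

    Hypothetical helper; `base64url` is an assumption for payloads that
    arrive wrapped in URL-safe base64.
    """
    raw = data.encode("ascii") if isinstance(data, str) else data
    if base64url:
        raw = base64.urlsafe_b64decode(raw)
    # Step 1: quoted-printable -> raw bytes in the declared charset.
    raw = quopri.decodestring(raw)
    # Step 2: bytes -> str, using the charset from Content-Type.
    return raw.decode(charset)


# Hand-written sample: "€2 and CN¥10" as UTF-8, quoted-printable encoded.
sample = b"=E2=82=AC2 and CN=C2=A510"
print(decode_body(sample))

# The same sample with an extra base64url layer around it.
wrapped = base64.urlsafe_b64encode(sample).decode("ascii")
print(decode_body(wrapped, base64url=True))

# Decoding the UTF-8 bytes with the wrong codec gives garbled text
# of the kind shown in the question.
print(quopri.decodestring(sample).decode("cp1252"))
```

The last line shows why the order and the codec matter: the bytes are the same, but reading them as a single-byte encoding turns each multi-byte UTF-8 character into two or three unrelated characters.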

HTML email bodies may contain character entities. These can be converted to individual characters using html.unescape (available in Python 3.4+).

>>> import html 
>>> h = """</tr><tr><td class="m_4364729876101169671Uber18_text_p1" align="left" style="color:rgb(0,0,0);font-family:&#39;Uber18-text-Regular&#39;,&#39;HelveticaNeue-Light&#39;,&#39;Helvetica Neue Light&#39;,Helvetica,Arial,sans-serif;font-size:16px;line-height:28px;direction:ltr;text-align:left"> Give friends free ride credit to try Uber. You&#39;ll get CN¥10 off each of your next 3 rides when they start riding. <span class="m_4364729876101169671Uber18_text_p1" style="color:#000000;font-family:&#39;Uber18-text-Regular&#39;,&#39;HelveticaNeue-Light&#39;,&#39;Helvetica Neue Light&#39;,Helvetica,Arial,sans-serif;font-size:16px;line-height:28px">Share code: 20ccv</span></td>"""


>>> print(html.unescape(h))
</tr><tr><td class="m_4364729876101169671Uber18_text_p1" align="left" style="color:rgb(0,0,0);font-family:'Uber18-text-Regular','HelveticaNeue-Light','Helvetica Neue Light',Helvetica,Arial,sans-serif;font-size:16px;line-height:28px;direction:ltr;text-align:left"> Give friends free ride credit to try Uber. You'll get CN¥10 off each of your next 3 rides when they start riding. <span class="m_4364729876101169671Uber18_text_p1" style="color:#000000;font-family:'Uber18-text-Regular','HelveticaNeue-Light','Helvetica Neue Light',Helvetica,Arial,sans-serif;font-size:16px;line-height:28px">Share code: 20ccv</span></td>

Thanks. If I use this code it outputs a huge list of seemingly random characters (I guess it's base64). If I add base64 decoding like this: utf8 = quopri.decodestring(base64.urlsafe_b64decode(htmlpart)) file_data = utf8.decode('utf-8') then it tells me UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 30992: invalid continuation byte — Alexis Eggermont, Mar 23 '19 at 08:11
Can you show a sample of the data that you're trying to decode? — snakecharmerb, Mar 23 '19 at 09:37
