I'm not sure if this is specific to Google or applies to email in general, but I'm seeing some encoding that I'm not sure how to handle. Here is a snippet from the Washington Post mailer, taken from my Google account.
The subject
b'=?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?='
actually reads
The Morning: Peru’s deadly protests
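The `=?UTF-8?Q?...?=` wrapper looks like an RFC 2047 "encoded-word", which is how non-ASCII text is carried in mail headers. A minimal sketch of decoding it with the standard library's `email.header` module follows; `raw_subject` is just the value shown above pasted in as a bytes literal, not something fetched from the server.

```python
from email.header import decode_header, make_header

# The raw subject exactly as it came back from the server (illustrative literal).
raw_subject = b'=?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?='

# decode_header() splits the value into (bytes, charset) chunks;
# make_header() joins them back into a Header that converts to str.
subject = str(make_header(decode_header(raw_subject.decode('ascii'))))
print(subject)
```

In the Q encoding an underscore stands for a space and `=E2=80=99` is the UTF-8 byte sequence for the right single quotation mark.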
Part of the body:
<!DOCTYPE html><html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:=
schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-microsoft-com:office:offi=
ce"><head> <title>The Morning: Peru=E2=80=99s deadly protests</title> <!--[=
if !mso]><!-- --> <meta http-equiv=3D"X-UA-Compatible" content=3D"IE=3Dedge=
"> <!--<![endif]--> <meta http-equiv=3D"Content-Type" content=3D"text/html;=
charset=3DUTF-8"> <meta name=3D"viewport" content=3D"width=3Ddevice-width,=
initial-scale=3D1"> <style type=3D"text/css">#outlook a{padding:0}body{marg=
in:0;padding:0;-webkit-text-size-adjust:100%;-ms-text-size-adjust:100%}tabl=
e,td{border-collapse:collapse;mso-table-lspace:0;mso-table-rspace:0}img{bor=
der:0;height:auto;line-height:100%;outline:0;text-decoration:none;-ms-inter=
polation-mode:bicubic}p{display:block;margin:13px 0}p,ul{margin-top:0}@medi=
a (max-width:600px){body{padding:0 15px!important}}</style> <!--[if mso]>=
=0A <xml>=0A <o:OfficeDocumentSettings>=0A
Every line is wrapped with a trailing '=\r\n',
which is easy enough to remove with
bodytext = ''.join( bodytext.split('=\r\n') )
But there are other escapes in there, such as
=3D
=0D=0A
Yes, they are hex-encoded ASCII codes, but which decoder library should I use to decode this?
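The `=3D`, `=0A` and trailing `=` patterns match the quoted-printable transfer encoding (RFC 2045). A minimal sketch, assuming the body really is quoted-printable with a UTF-8 charset, uses the standard library's `quopri` module; `raw_body` is a shortened, hand-made sample modelled on the snippet above.

```python
import quopri

# Shortened sample in the same style as the body above (illustrative only).
raw_body = (
    b'<html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:=\r\n'
    b'schemas-microsoft-com:vml"><title>Peru=E2=80=99s deadly protests</title>'
    b'<!--[if mso]>=0A<xml>=0A'
)

# decodestring() removes the "=\r\n" soft line breaks and turns every
# =XX escape back into the byte it stands for; the result is still bytes.
decoded_bytes = quopri.decodestring(raw_body)

# Assumption: the charset is UTF-8, as the <meta> tag in the body suggests.
decoded = decoded_bytes.decode('utf-8')
print(decoded)
```

Since `quopri` already drops the soft line breaks, the manual `split`/`join` step should not be needed when decoding this way.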
For reference, or to try it yourself, here is the Python code:
import datetime
import email

from imapclient import IMAPClient

with IMAPClient('imap.gmail.com', port=None, use_uid=True, ssl=True, stream=False, ssl_context=None, timeout=None) as client:
    client.login("#####@gmail.com", "######")
    client.select_folder('INBOX')
    # Only look at messages from the last 4 days.
    SEARCH_SINCE = (datetime.datetime.now() - datetime.timedelta(4)).date()
    search_for = "Morning"
    # With use_uid=True, search() returns UIDs rather than sequence numbers.
    seq_nums = client.search(['SUBJECT', search_for, 'SINCE', SEARCH_SINCE])
    print('seq nums', seq_nums)
    for seqid, objs in client.fetch(seq_nums, [b'ENVELOPE', b'BODY[TEXT]']).items():
        # BODY[TEXT] is the raw body only, without the message headers.
        msg_body = email.message_from_bytes(objs[b'BODY[TEXT]'])
        envelope = objs[b'ENVELOPE']
        print('subject', envelope.subject)
        print('body', msg_body.get_payload()[:1000])
I also use Red Box and imap_tools for this kind of thing, but they are an order of magnitude slower than this IMAPClient approach.
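One possible reason `get_payload()` returns the text still encoded is that `BODY[TEXT]` leaves out the headers, so the `email` package never sees a `Content-Transfer-Encoding` line. A hedged sketch of letting `email` do the decoding on a complete message follows; `raw_message` is a hand-built stand-in for what fetching `b'RFC822'` instead of `b'BODY[TEXT]'` might return, and its headers are assumptions rather than the real mail.

```python
import email
from email import policy

# Hand-built stand-in for a full raw message (headers + body); illustrative only.
raw_message = (
    b'Subject: =?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?=\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: text/html; charset=UTF-8\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<html><head><title>The Morning: Peru=E2=80=99s deadly protests</title=\r\n'
    b'></head><body><p class=3D"lead">Hello</p></body></html>\r\n'
)

# policy.default gives the modern EmailMessage API, which decodes
# RFC 2047 headers and the transfer encoding automatically.
msg = email.message_from_bytes(raw_message, policy=policy.default)

subject = str(msg['Subject'])

# Prefer the HTML part, fall back to plain text if there is no HTML.
part = msg.get_body(preferencelist=('html', 'plain'))
html = part.get_content()

print(subject)
print(html)
```

With a real multipart newsletter, `get_body()` picks the matching part out of the MIME tree, so the same calls should apply unchanged.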