I'm not sure whether this is specific to Google or applies to email in general, but I'm seeing some encoding that I'm not sure how to handle. Here is a snippet from the Washington Post mailer in my Google account.

The subject

  b'=?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?='

actually reads

The Morning: Peru’s deadly protests

Part of the body:

<!DOCTYPE html><html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:=
schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-microsoft-com:office:offi=
ce"><head> <title>The Morning: Peru=E2=80=99s deadly protests</title> <!--[=
if !mso]><!-- --> <meta http-equiv=3D"X-UA-Compatible" content=3D"IE=3Dedge=
"> <!--<![endif]--> <meta http-equiv=3D"Content-Type" content=3D"text/html;=
 charset=3DUTF-8"> <meta name=3D"viewport" content=3D"width=3Ddevice-width,=
initial-scale=3D1"> <style type=3D"text/css">#outlook a{padding:0}body{marg=
in:0;padding:0;-webkit-text-size-adjust:100%;-ms-text-size-adjust:100%}tabl=
e,td{border-collapse:collapse;mso-table-lspace:0;mso-table-rspace:0}img{bor=
der:0;height:auto;line-height:100%;outline:0;text-decoration:none;-ms-inter=
polation-mode:bicubic}p{display:block;margin:13px 0}p,ul{margin-top:0}@medi=
a (max-width:600px){body{padding:0 15px!important}}</style> <!--[if mso]>=
=0A        <xml>=0A        <o:OfficeDocumentSettings>=0A   

 

Every line is soft-wrapped with '=\r\n', which is no problem to remove with

bodytext = ''.join( bodytext.split('=\r\n') )

But there is other stuff in there, like =3D and =0D=0A. Yes, they are ASCII escape sequences, but which decoder library should we use to decode them?

For reference, or to try it yourself, here is the Python code:

import email, datetime
from imapclient import IMAPClient


with IMAPClient('imap.gmail.com', port=None, use_uid=True, ssl=True, stream=False, ssl_context=None, timeout=None) as client:
    
    client.login("#####@gmail.com", "######")
    client.select_folder('INBOX')

    SEARCH_SINCE = (datetime.datetime.now() - datetime.timedelta( 4 )).date()
    search_for = "Morning"

    seq_nums = client.search([u'SUBJECT', f'{search_for}', u'SINCE', SEARCH_SINCE ])
    print('seq nums', seq_nums)
    for seqid, objs in client.fetch( seq_nums, [b'ENVELOPE', b'BODY[TEXT]']).items():
        
        msg_body = email.message_from_bytes( objs[b'BODY[TEXT]'] )
        envelope = objs[b'ENVELOPE']
        print('subject', envelope.subject)
        
        print('body', msg_body.get_payload()[:1000])

I also use Red Box and imap_tools for this, but they are an order of magnitude slower than this IMAPClient method.

Peter Moore
    The subject uses [encoded word](https://en.wikipedia.org/wiki/MIME#Encoded-Word), which you need for non-ascii characters. The body is [quoted-printable](https://en.wikipedia.org/wiki/Quoted-printable). There should be libraries that handle these details for you. – Robert Jan 26 '23 at 19:25
  • @Robert Thank you Robert! With those hints I found the solution. – Peter Moore Jan 26 '23 at 20:04
  • @PeterMoore, create an issue at imap_tools about lib speed, let's see. – Vladimir Jan 27 '23 at 04:04
  • Hi @Vladimir, I ran mailbox.fetch(ids) and client.fetch(ids) with the same 50 ids: the code below took 2 seconds, while imap_tools took roughly 0.7 seconds per email. If you can post a faster example, go for it. – Peter Moore Jan 27 '23 at 14:57
  • @PeterMoore, fetch has bulk arg – Vladimir Jan 30 '23 at 03:21
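
Robert's comment mentions libraries that handle both encodings for you. One candidate is the standard-library email package with policy.default, which decodes encoded-word headers and quoted-printable bodies when it parses a complete message. The sketch below is only an illustration: raw_message is a hand-built stand-in for a full message, not real Gmail output. With IMAPClient it would presumably mean fetching the whole message (for example b'RFC822') rather than b'BODY[TEXT]', because BODY[TEXT] omits the headers that announce the transfer encoding.

```python
from email import message_from_bytes, policy

# Hypothetical stand-in for a complete fetched message (headers + body).
raw_message = (
    b"Subject: =?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<title>The Morning: Peru=E2=80=99s deadly protests</title> <meta name=3D=\r\n"
    b'"viewport">\r\n'
)

# policy.default selects the modern API, which decodes headers on access.
msg = message_from_bytes(raw_message, policy=policy.default)

# Encoded-word subject comes back as readable text.
subject = str(msg["Subject"])

# get_body picks the preferred text part; get_content undoes the
# quoted-printable transfer encoding and the UTF-8 charset.
body = msg.get_body(preferencelist=("html", "plain")).get_content()

print(subject)
print(body)
```

The trade-off is that the whole message has to be downloaded, which may be slower than fetching only ENVELOPE and BODY[TEXT].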

1 Answer

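OK, thank you Robert for the two links that describe the encodings. It takes no fewer than two different decoding methods: email.header.decode_header for the encoded-word subject and quopri.decodestring for the quoted-printable body. The working code follows: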

import email, datetime
from imapclient import IMAPClient
from email.header import decode_header, make_header
import quopri


with IMAPClient('imap.gmail.com', port=None, use_uid=True, ssl=True, stream=False, ssl_context=None, timeout=None) as client:
    client.login("#####@gmail.com", "######")
    client.select_folder('INBOX')

    SEARCH_SINCE = (datetime.datetime.now() - datetime.timedelta( 4 )).date()
    search_for = "Morning"

    seq_nums = client.search([u'SUBJECT', f'{search_for}', u'SINCE', SEARCH_SINCE ])
    print('seq nums', seq_nums)
    # With use_uid=True the search results (and the fetch keys) are UIDs.
    for seqid, objs in client.fetch( seq_nums, [b'ENVELOPE', b'BODY[TEXT]']).items():
        envelope = objs[b'ENVELOPE']
        # The subject is a MIME encoded word; decode_header unpacks it.
        print('subject', make_header(decode_header(envelope.subject.decode('utf-8'))))
        # The body is quoted-printable; quopri undoes the =XX escapes and soft line breaks.
        x = quopri.decodestring(objs[b'BODY[TEXT]'])
        print(x.decode('utf-8')[:1000])
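
To check the two decoders without logging in to a mailbox, here is a minimal offline sketch that runs them on the samples from the question. The names raw_subject and raw_body are illustrative, and raw_body is a shortened version of the quoted HTML, so treat this as a demonstration rather than a full parser.

```python
import quopri
from email.header import decode_header, make_header

# Encoded-word subject, copied from the question.
raw_subject = b"=?UTF-8?Q?The_Morning:_Peru=E2=80=99s_deadly_protests?="

# decode_header splits the header into (bytes, charset) chunks;
# make_header joins them into a Header that str() renders as text.
subject = str(make_header(decode_header(raw_subject.decode("ascii"))))

# Shortened quoted-printable body, including one soft line break ("=\r\n").
raw_body = (
    b'<html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:=\r\n'
    b'schemas-microsoft-com:vml"><title>Peru=E2=80=99s deadly protests</title>'
)

# quopri removes soft line breaks and turns =XX escapes back into bytes;
# the result is UTF-8, as declared in the HTML meta tag.
body = quopri.decodestring(raw_body).decode("utf-8")

print(subject)
print(body)
```

Because quopri already removes the soft line breaks, the manual split on '=\r\n' from the question is not needed.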
       
Peter Moore