I need to migrate an email database to a CRM and have two problems:

I can access the mbox file, but the headers and content are not properly decoded.

I want to create a dataframe-like structure with the following columns: "date, from, to, subject, body"

I have tried the following:

for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

and get the following output:

from   : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <gonzalo.gasset@baud.es>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from   : Mailtrack Reminder <reminders@mailtrack.io>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
 para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width">\r\n    <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n    body {\r\n        font-family: Helvetica;\r\n    }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....
marc_s
Lucas

1 Answer

The concrete implementations of mailbox.Mailbox accept a factory argument that can be used to build messages. By passing the parse method of a BytesParser initialised with the default policy we can generate EmailMessages which will decode headers and body text automatically.

Selecting the actual body is trickier, and perhaps depends on your particular requirements. In the code sample below, any "text" type parts are joined together, while non-text parts are rejected. You might wish to apply your own selection criteria.

from email.parser import BytesParser
from email.policy import default
import mailbox

mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)

for message in mbox:
    print("date   :", message['date'])
    print("to     :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        contents = []
        for part in message.walk():
            maintype = part.get_content_maintype()
            if maintype != 'text':
                # Reject containers (multipart/*) and non-text types
                continue
            contents.append(part.get_content())
        content = '\n\n'.join(contents)
    else:
        content = message.get_content()
    print("content:", content)
    print("**************************************")
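
To get from the printing loop above to the "date, from, to, subject, body" structure asked about in the question, one option is to collect a dict per message. This is only a sketch: the helper names `text_body` and `mbox_to_rows` are made up for illustration, and the body selection is the same "join every text part" rule as above, which may not suit every mailbox.

```python
import mailbox
from email.parser import BytesParser
from email.policy import default

# Header columns requested in the question; "body" is added separately.
HEADER_COLUMNS = ("date", "from", "to", "subject")


def text_body(message):
    """Join the content of every text/* part; skip containers and non-text parts."""
    parts = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        parts.append(part.get_content())
    return "\n\n".join(parts)


def mbox_to_rows(path):
    """Return one dict per message (hypothetical helper, not part of mailbox)."""
    box = mailbox.mbox(path, factory=BytesParser(policy=default).parse)
    rows = []
    try:
        for message in box:
            row = {}
            for name in HEADER_COLUMNS:
                value = message[name]
                # Headers are already decoded by the default policy.
                row[name] = str(value) if value is not None else None
            row["body"] = text_body(message)
            rows.append(row)
    finally:
        box.close()
    return rows
```

If pandas is installed, `pandas.DataFrame(mbox_to_rows(path_to_mailbox))` should then give one row per message with those five columns.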
snakecharmerb
  • I managed to read most of the '.mbox' files, but a 31 GB file gave me the following error: – Lucas Aug 11 '20 at 20:45
  • LookupError: unknown encoding: – Lucas Aug 11 '20 at 20:45
  • I'd suggest skipping such messages using `try/except` for now. If you can dump the bytes of such a message to a file you could ask a new question, including the bytes as evidence. – snakecharmerb Aug 12 '20 at 07:34
  • If the encoding is actually a valid IANA encoding but Python calls it something else (there are a couple of cases of this), you may have to hack the email character set aliases in unobvious ways. IIRC there is a duplicate about that, or a bug in the Python bug tracker. – tripleee Jan 23 '22 at 19:03
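
The `try/except` suggestion from the comments could look roughly like the sketch below. The function name `safe_text` and the UTF-8 fallback are assumptions made for illustration, not something the commenters specified; returning an empty string or skipping the message entirely would be equally valid choices.

```python
from email import message_from_bytes
from email.policy import default


def safe_text(part, fallback="utf-8"):
    """Return the decoded text of a text/* part.

    If the declared charset is unknown to Python (LookupError), decode the
    raw payload with `fallback` instead, replacing undecodable bytes.
    The fallback charset is an assumption, not a recommendation.
    """
    try:
        return part.get_content()
    except (LookupError, UnicodeError):
        payload = part.get_payload(decode=True) or b""
        return payload.decode(fallback, errors="replace")


# Usage: a message that declares a charset Python does not recognise.
raw = b'Content-Type: text/plain; charset="x-no-such-charset"\r\n\r\nhola\r\n'
message = message_from_bytes(raw, policy=default)
text = safe_text(message)
```

In the loop from the answer, `part.get_content()` would be replaced by `safe_text(part)`.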