I want to convert an MBOX file to CSV for analysis purposes.
There are various tools available that do this, but I want to do it in Python.

I tried converting it into a Pandas DataFrame and then exporting it into CSV.
Here's the code for that:

import pandas as pd
import mailbox

MBOX = '/content/XYZ.mbox'
mbox = mailbox.mbox(MBOX)

mbox_dict = {}
for i, msg in enumerate(mbox):
    mbox_dict[i] = {}
    for header in msg.keys():
        mbox_dict[i][header] = msg[header]
    mbox_dict[i]['Body'] = msg.as_string().replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').strip()
    
df = pd.DataFrame.from_dict(mbox_dict, orient='index')

df.to_csv('XYZ.csv')

It worked fine, except for the Body column. See image for reference:
[Image: Body column of the CSV file]

It's supposed to contain just the body content, but instead it holds the whole MBOX message dump (except for the From: line). The body also includes quoted text from the earlier mails that are being replied to. I don't want that content in the body either.

How can I extract just the body content from the MBOX message?
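
For context, the reason the Body column holds the full dump is that msg.as_string() serializes the entire message, headers included. A minimal sketch of one possible approach follows, assuming the body you want is the first text/plain part of each message. The helper names extract_body and strip_quoted are hypothetical (not part of the mailbox module), and the quoted-reply removal is only a heuristic that assumes replies are marked with ">" prefixes, an "On ... wrote:" line, or an "-----Original Message-----" separator. It will miss other reply formats, for example an "On ... wrote:" line that wraps onto two lines.

```python
import re

# Heuristic (an assumption): many mail clients introduce quoted text with a
# single line of the form "On <date>, <name> wrote:".
REPLY_HEADER = re.compile(r'^On .+ wrote:\s*$')


def extract_body(msg):
    """Return the first non-attachment text/plain part of msg as a str."""
    # walk() yields the message itself for non-multipart messages,
    # and every nested part for multipart ones.
    for part in msg.walk():
        disposition = str(part.get('Content-Disposition', ''))
        if part.get_content_type() != 'text/plain' or 'attachment' in disposition:
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or 'utf-8'
        try:
            return payload.decode(charset, errors='replace')
        except LookupError:
            # Unknown charset label in the message; fall back to UTF-8.
            return payload.decode('utf-8', errors='replace')
    return ''


def strip_quoted(text):
    """Drop quoted reply text using simple, imperfect heuristics."""
    kept = []
    for line in text.splitlines():
        # Everything after a reply header is treated as the quoted mail.
        if REPLY_HEADER.match(line) or line.startswith('-----Original Message-----'):
            break
        # Lines starting with '>' are treated as quoted text.
        if line.lstrip().startswith('>'):
            continue
        kept.append(line)
    return '\n'.join(kept).strip()
```

With these helpers, the Body line in the loop above could become mbox_dict[i]['Body'] = strip_quoted(extract_body(msg)). Messages that carry only an HTML part would get an empty body under this sketch, so an HTML fallback would need to be added separately.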
