I want to convert an MBOX file to CSV for analysis purposes.
There are various tools available that do this, but I want to do it in Python.
I tried converting it into a Pandas DataFrame and then exporting it into CSV.
Here's the code for that:
import pandas as pd
import mailbox

MBOX = '/content/XYZ.mbox'
mbox = mailbox.mbox(MBOX)

mbox_dict = {}
for i, msg in enumerate(mbox):
    mbox_dict[i] = {}
    for header in msg.keys():
        mbox_dict[i][header] = msg[header]
    mbox_dict[i]['Body'] = msg.as_string().replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').strip()

df = pd.DataFrame.from_dict(mbox_dict, orient='index')
df.to_csv('XYZ.csv')
It worked fine, except for the Body column. See image for reference:
[Image: Body column of the CSV file]
The Body column is supposed to contain just the body content, but instead it contains the whole MBOX message dump (except for the From: ). The body also includes quoted text from the mails being replied to. I don't want that content in the body either.
How can I extract just the body content from an MBOX message?
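For illustration, here is a minimal sketch of one possible direction, not a definitive solution. It assumes that "body" means the first text/plain part of each message, and that quoted replies either start with ">" or follow a line of the form "On ... wrote:". The second assumption is only a heuristic and will miss other mail clients' reply formats. The helper names extract_body and strip_quoted are hypothetical and are not part of the mailbox module.

```python
import re

# Heuristic marker for the start of a quoted reply (assumption: English-style clients).
REPLY_HEADER = re.compile(r'^On .*wrote:\s*$')


def _decode(part):
    """Decode a single non-multipart part to str, falling back to UTF-8."""
    payload = part.get_payload(decode=True)  # bytes, or None if nothing to decode
    if payload is None:
        return ''
    charset = part.get_content_charset() or 'utf-8'
    try:
        return payload.decode(charset, errors='replace')
    except LookupError:  # unknown charset name in the header
        return payload.decode('utf-8', errors='replace')


def extract_body(msg):
    """Return the first text/plain part of a message, without headers."""
    if not msg.is_multipart():
        return _decode(msg)
    for part in msg.walk():
        # walk() also yields the multipart containers; only leaf text parts match here.
        if (part.get_content_type() == 'text/plain'
                and part.get_content_disposition() != 'attachment'):
            return _decode(part)
    return ''


def strip_quoted(text):
    """Drop '>'-quoted lines and everything after an 'On ... wrote:' line."""
    kept = []
    for line in text.splitlines():
        if REPLY_HEADER.match(line.strip()):
            break
        if line.lstrip().startswith('>'):
            continue
        kept.append(line)
    return '\n'.join(kept).strip()
```

In the loop above, the Body assignment would then become something like mbox_dict[i]['Body'] = strip_quoted(extract_body(msg)). Messages that carry only an HTML part would yield an empty body with this sketch.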