I really need your help in solving a problem! Apparently, my knowledge is not sufficient to find a solution. So, I have some msg files that I have already created and saved. Now I need to write a function that can help me create pdfs from msg files (there will be many of them). I'd be very grateful for your help!
4 Answers
Posting the solution that worked for me (as asked by Amey P Naik). As mentioned, I tried multiple modules, but only extract_msg worked for the case at hand. I created two functions for importing the Outlook message text and attachments into a pandas DataFrame: the first creates one folder per email message, and the second imports the data from each message into a DataFrame. Attachments need to be processed separately, using a for loop over the sub-directories in the parent directory. Below are the two functions I created, with comments:
# 1). Import the required modules and set up the working directory
import extract_msg
import os
import pandas as pd
direct = os.getcwd()  # directory passed to the functions below; this is where you store all .msg files
ext = '.msg'  # type of files in the folder to be read
# 2). Create a separate folder per email and extract the data
def content_extraction(directory, extension):
    for mail in os.listdir(directory):
        try:
            if mail.endswith(extension):
                # Create a local 'msg' object for each email in the directory
                msg = extract_msg.Message(os.path.join(directory, mail))
                # Create a separate folder for each email inside the parent folder, save a
                # text file with the email body, and save all attachments in that folder
                msg.save()
        except (UnicodeEncodeError, AttributeError, TypeError):
            pass  # some emails are not processed due to different formats, e.g. emails sent from mobile
content_extraction(direct, ext)
# 3). Import the data into a pandas DataFrame using the extract_msg module.
# Note: this does not import data from the sub-folders inside the parent directory;
# it extracts the information from the .msg files themselves. You can use a loop
# instead to import data directly from the files saved in the sub-folders.
def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, i))
                # These are built-in attributes of the extract_msg.Message class
                my_list.append([msg.filename, msg.sender, msg.to, msg.date, msg.subject, msg.body, msg.message_id])
        except (UnicodeEncodeError, AttributeError, TypeError):
            pass  # skip emails that cannot be parsed
    # Build the DataFrame once, after all files have been read
    df = pd.DataFrame(my_list, columns=['File Name', 'From', 'To', 'Date', 'Subject', 'MailBody Text', 'Message ID'])
    print(df.shape[0], 'rows imported')
    return df
df = DataImporter(direct, ext)
After running these two functions, you will have almost all of the information in a pandas DataFrame, which you can use as needed. If you also need to extract content from attachments, you need to loop over all sub-directories inside the parent directory and read the attachment files according to their format; in my case the formats were .pdf, .jpg, .png, .csv, etc. Each format requires a different technique. For example, getting text out of scanned PDFs or images requires an OCR module such as Pytesseract.
If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment. Also, if there is any scope for improvement in the above code, please feel free to point it out.
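The loop over sub-directories described above could be sketched as follows. This is only an illustration using the standard library: the function name group_attachments_by_type and the skip_extensions parameter are made up for this example, and it assumes the per-email folders were created directly under the parent directory.

```python
import os
from collections import defaultdict


def group_attachments_by_type(parent_dir, skip_extensions=(".msg",)):
    """Walk every sub-folder of parent_dir and group file paths by extension."""
    grouped = defaultdict(list)
    top = os.path.abspath(parent_dir)
    for root, _dirs, files in os.walk(top):
        if root == top:
            continue  # the top level only holds the original .msg files
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            if extension in skip_extensions:
                continue
            grouped[extension].append(os.path.join(root, name))
    return dict(grouped)
```

The returned dictionary maps each extension (for example '.csv' or '.pdf') to the list of matching files, so each group can then be handed to whatever reader suits that format.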

Just for the record, since I just tried this approach: extract_msg now supports native generation of PDF files with a command like this:
python -m extract_msg --pdf email.msg
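Since the question mentions many files, here is a rough sketch of running that same command once per .msg file from Python. The helper names build_pdf_commands and convert_folder are invented for this example, and where the PDFs end up depends on your extract_msg version and options, so treat it as a starting point only.

```python
import subprocess
import sys
from pathlib import Path


def build_pdf_commands(folder):
    """Return one 'python -m extract_msg --pdf <file>' command per .msg file."""
    return [
        [sys.executable, "-m", "extract_msg", "--pdf", str(path.resolve())]
        for path in sorted(Path(folder).glob("*.msg"))
    ]


def convert_folder(folder):
    """Run the commands one by one and return the files that failed."""
    failed = []
    for command in build_pdf_commands(folder):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(command[-1])
    return failed
```

Calling convert_folder on the folder that holds the emails returns the paths of any files the command could not convert, so they can be inspected by hand.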

After trying many approaches, such as aspose, msg2pdf, pywin32 and a few more modules/packages, I concluded that the approach below works for me.
WeasyPrint creates PDF documents from HTML.
extract_msg extracts emails and attachments saved in Microsoft Outlook's .msg files.
Install the required modules
!pip install weasyprint
!pip install extract-msg==0.41.1
Import the required modules
import extract_msg
from weasyprint import HTML
Converting msg to pdf
# Reading msg file
msg = extract_msg.openMsg("c:/abcd/testing.msg")
# saving as html format
with open("c:/abcd/test_case.html","wb") as file:
    file.write(msg.getSaveHtmlBody())
# to create PDF documents from HTML
HTML("c:/abcd/test_case.html").write_pdf("c:/abcd/test_case_output.pdf")
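To handle a whole folder of .msg files, the same three steps can be wrapped in a function. This is a sketch under assumptions: plan_conversions and msg_folder_to_pdf are names made up for this example, it relies on the same extract_msg and WeasyPrint calls used above (with the pinned extract-msg version), and it keeps the intermediate HTML file next to each PDF.

```python
from pathlib import Path


def plan_conversions(src_dir, out_dir):
    """Pair every .msg file in src_dir with the PDF path it should produce."""
    out = Path(out_dir)
    return [(p, out / (p.stem + ".pdf")) for p in sorted(Path(src_dir).glob("*.msg"))]


def msg_folder_to_pdf(src_dir, out_dir):
    """Convert each .msg in src_dir to a PDF in out_dir; return the failures."""
    import extract_msg           # third-party, imported only when converting
    from weasyprint import HTML  # third-party, imported only when converting

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    for msg_path, pdf_path in plan_conversions(src_dir, out_dir):
        try:
            msg = extract_msg.openMsg(str(msg_path))
            try:
                # Same steps as above: save the HTML body, then render it
                html_path = pdf_path.with_suffix(".html")
                html_path.write_bytes(msg.getSaveHtmlBody())
                HTML(str(html_path)).write_pdf(str(pdf_path))
            finally:
                msg.close()
        except Exception as exc:
            # Record the problem and carry on with the remaining files
            failures.append((msg_path.name, repr(exc)))
    return failures
```

For example, msg_folder_to_pdf("c:/abcd", "c:/abcd/pdf") would try every .msg file in c:/abcd and return a list of (file name, error) pairs for the ones that could not be converted.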

Let me offer a solution using the Aspose APIs. This alternative approach is based on the Aspose.Email and Aspose.Words libraries, which are powerful tools for working with email and document conversion.
import aspose.email as ae
import aspose.words as aw
import io
# Load a MSG file using Aspose.Email
msg = ae.MailMessage.load("test.msg")
stream = io.BytesIO()
# Save the message as HTML to a stream
msg.save(stream, ae.SaveOptions.default_html)
stream.seek(0)
# Load an HTML from stream using Aspose.Words
doc = aw.Document(stream)
# Save a document as PDF
doc.save("output.pdf")
stream.close()
We utilize Aspose.Email to load the .msg file, save the email as HTML, and then use Aspose.Words to convert the HTML to PDF.
To run this code, make sure you have installed both the Aspose.Email and Aspose.Words Python packages using pip:
pip install Aspose.Email-for-Python-via-NET
pip install aspose-words