-1

I really need your help in solving a problem! Apparently, my knowledge is not sufficient to find a solution. So, I have some msg files that I have already created and saved. Now I need to write a function that can help me create pdfs from msg files (there will be many of them). I'd be very grateful for your help!

bella

4 Answers

2

Posting the solution that worked for me (as asked by Amey P Naik). As mentioned, I tried multiple modules, but only extract_msg worked for the case in hand. I created two functions for importing the Outlook message text and attachments into a Pandas DataFrame: the first creates one folder per email message, and the second imports the data from each message into the DataFrame. Attachments need to be processed separately, using a for loop over the sub-directories in the parent directory. Below are the two functions, with comments:

    # 1) Import the required modules and set up the working directory

    import extract_msg
    import os
    import pandas as pd
    direct = os.getcwd()  # folder that holds all the .msg files; passed to the functions below
    ext = '.msg'  # type of files in the folder to be read

    # 2) Create a separate folder per email and extract its data

    def content_extraction(directory, extension):
        for mail in os.listdir(directory):
            try:
                if mail.endswith(extension):
                    # Join with the directory so the file is found even if it is not the current working directory
                    msg = extract_msg.Message(os.path.join(directory, mail))
                    # Creates a separate folder for this email inside the parent folder, saves the body
                    # as a text file and writes all attachments into that folder
                    msg.save()
            except (UnicodeEncodeError, AttributeError, TypeError):
                pass  # Some emails (e.g. those sent from mobile) use formats that fail to parse; skip them

    content_extraction(direct, ext)

# 3) Import the data into a pandas DataFrame using the extract_msg module.
# Note: this reads the .msg files themselves, not the sub-folders inside the parent directory;
# you can use a loop over those sub-folders instead to import data from the saved files.

def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, i))
                # These are built-in attributes of the extract_msg.Message class
                my_list.append([msg.filename, msg.sender, msg.to, msg.date, msg.subject, msg.body, msg.message_id])
        except (UnicodeEncodeError, AttributeError, TypeError):
            pass
    # Build the DataFrame once, after all files have been read
    df = pd.DataFrame(my_list, columns=['File Name', 'From', 'To', 'Date', 'Subject', 'MailBody Text', 'Message ID'])
    print(df.shape[0], 'rows imported')
    return df

df = DataImporter(direct, ext)

After running these two functions, you will have almost all of the information inside a Pandas DataFrame, which you can use as needed. If you also need to extract content from the attachments, you need a loop over all sub-directories inside the parent directory that reads each attachment according to its format; in my case the formats were .pdf, .jpg, .png, .csv, etc. Each format requires a different technique. For example, for getting data from the PDFs I needed the Pytesseract OCR module.
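The loop over sub-directories described above could start from something like the following. This is only a minimal sketch: the helper name `group_files_by_extension` is hypothetical, and it assumes the attachments were saved into sub-folders of the parent directory as in step 2. It only collects the file paths per extension; the actual readers (OCR, CSV parsing, and so on) are left out.

```python
import os
from collections import defaultdict


def group_files_by_extension(parent_dir):
    """Walk every sub-directory of parent_dir and group file paths by extension.

    Hypothetical helper: files directly inside parent_dir (the original .msg
    files) are skipped, only the per-email sub-folders are scanned.
    """
    grouped = defaultdict(list)
    for root, _dirs, files in os.walk(parent_dir):
        if root == parent_dir:
            continue  # top level holds the .msg files, not the attachments
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            grouped[extension].append(os.path.join(root, name))
    return dict(grouped)


if __name__ == "__main__":
    # Print how many files of each type were found under the current folder
    for extension, paths in group_files_by_extension(os.getcwd()).items():
        print(extension, len(paths))
```

From the resulting dictionary you can hand each list to whatever reader suits that format, for example the `.csv` paths to `pd.read_csv`.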

If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment. Also, if there is any scope for improvement in the above code, please feel free to point it out.

0

Just for the record, as I just tried this approach: extract_msg now supports generating PDF files natively, with a command like this:

python -m extract_msg --pdf email.msg
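Since the question mentions many files, the same command can be run once per file from a small script. The following is a rough sketch, not part of extract_msg itself: the function names `build_pdf_command` and `convert_folder` are made up for illustration, and it assumes the command above works for a single file in your installation. Where the output ends up depends on your extract_msg version and options.

```python
import subprocess
import sys
from pathlib import Path


def build_pdf_command(msg_path):
    """Return the argument list equivalent to the command shown above."""
    return [sys.executable, "-m", "extract_msg", "--pdf", str(msg_path)]


def convert_folder(folder):
    """Run the command once per .msg file and return the names that failed."""
    failed = []
    for msg_path in sorted(Path(folder).glob("*.msg")):
        result = subprocess.run(build_pdf_command(msg_path))
        if result.returncode != 0:
            failed.append(msg_path.name)  # keep going, report at the end
    return failed


if __name__ == "__main__":
    print("Failed:", convert_folder("."))
```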

pilz
0

After trying many approaches, such as aspose, msg2pdf, pywin32 and a few more modules/packages, I concluded that the approach below is the one that worked for me.

WeasyPrint is a smart solution that helps web developers create PDF documents from HTML.

extract_msg extracts emails and attachments saved in Microsoft Outlook’s .msg files.

Install Required Modules

!pip install weasyprint
!pip install extract-msg==0.41.1

Import Required Modules

import extract_msg
from weasyprint import HTML

Converting msg to pdf

# Reading msg file
msg = extract_msg.openMsg("c:/abcd/testing.msg")

# saving as html format
with open("c:/abcd/test_case.html","wb") as file:
    file.write(msg.getSaveHtmlBody())

#  to create PDF documents from HTML
HTML("c:/abcd/test_case.html").write_pdf("c:/abcd/test_case_output.pdf")
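To apply the same steps to a whole folder of .msg files, a loop along these lines might work. Treat it as a sketch under assumptions: `pdf_path_for` and `convert_all` are hypothetical names, the HTML bytes are handed to WeasyPrint through an in-memory stream instead of an intermediate .html file, and messages that fail to convert (for example ones without an HTML body) are simply collected and skipped.

```python
import io
from pathlib import Path


def pdf_path_for(msg_path, out_dir):
    """Map 'mail.msg' to '<out_dir>/mail.pdf' (hypothetical naming scheme)."""
    return Path(out_dir) / (Path(msg_path).stem + ".pdf")


def convert_all(in_dir, out_dir):
    """Convert every .msg file in in_dir to a PDF in out_dir; return failures."""
    # Imported here so the path helper above works without these packages
    import extract_msg
    from weasyprint import HTML

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for msg_path in sorted(Path(in_dir).glob("*.msg")):
        try:
            msg = extract_msg.openMsg(str(msg_path))
            try:
                html_bytes = msg.getSaveHtmlBody()  # same call as in the steps above
            finally:
                msg.close()
            target = pdf_path_for(msg_path, out_dir)
            HTML(file_obj=io.BytesIO(html_bytes)).write_pdf(str(target))
        except Exception:
            failed.append(msg_path.name)  # skip this file and carry on
    return failed
```

Calling `convert_all("c:/abcd", "c:/abcd/pdf")` would then write one PDF per message and return the names of the files it could not convert.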
thrinadhn
0

Let me offer a solution using the Aspose APIs. This is an alternative approach based on the Aspose.Email and Aspose.Words libraries, which are powerful tools for working with email and document conversions.

import aspose.email as ae
import aspose.words as aw
import io

# Load a MSG file using Aspose.Email
msg = ae.MailMessage.load("test.msg")
stream = io.BytesIO()
# Save the message as HTML to a stream
msg.save(stream, ae.SaveOptions.default_html)
stream.seek(0)
# Load an HTML from stream using Aspose.Words
doc = aw.Document(stream)
# Save a document as PDF
doc.save("output.pdf")
stream.close()

We utilize Aspose.Email to load the .msg file, save the email as HTML, and then use Aspose.Words to convert the HTML to PDF.

To run this code, make sure you have installed both the Aspose.Email and Aspose.Words Python packages using pip:

pip install Aspose.Email-for-Python-via-NET
pip install aspose-words