0

I'm working with the Enron dataset, and I want to extract the clean body of each email into a list, keeping each reply as a separate string in the list.

For the following email:

Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it?   And by the way, don't start with the excuses.   You're 
expected to be a full, gourmet cook.

Kisses, not music, makes cooking a more enjoyable experience.  




"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi


I told you I have a long email address.

I've decided what to prepare for dinner tomorrow.  I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience.

Watch the debate if you are home tonight.  I want a report tomorrow...
Jen

___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com

I want to get the following response:

["So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience.", 
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow.  I hope you aren't 
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight.  I want a report tomorrow...
Jen"]

Where the first element in the list is:

"So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience."

Is there a library capable of doing this?

I have tried the Python email library, but it does not seem to have that functionality, since I get the full body as the response:

import email
message = data_
e = email.message_from_string(message)
print (e.get_payload())

So, what is it? And by the way, don't start with the excuses.
You're \nexpected to be a full, gourmet cook.\n\nKisses, not music, makes cooking a more enjoyable experience. \n\n\n\n\n"Jennifer White" jenwhite7@zdnetonebox.com on 10/17/2000 04:19:20 PM\nTo: jarnold@enron.com\ncc: \nSubject: Hi\n\n\nI told you I have a long email address.\n\nI've decided what to prepare for dinner tomorrow. I hope you aren't\nexpecting anything extravagant because my culinary skills haven't been\nput to use in a while. My only request is that your stereo works. Music\nmakes cooking a more enjoyable experience.\n\nWatch the debate if you are home tonight. I want a report tomorrow...\nJen\n\n___________________________________________________________________\nTo get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,\nall in one place - sign up today at http://www.zdnetonebox.com\n\n\n'

jsbueno
Luis Ramon Ramirez Rodriguez
  • Did my answer solve your problem? If so, please accept my [answer](https://meta.stackoverflow.com/q/5234/234215). If not, please follow-up specifically so any outstanding concerns can be addressed. Thanks – Life is complex Jan 27 '21 at 13:24

3 Answers

2

I'm going to assume that you have all the Enron email messages in a .csv file, which is a common format for this dataset. I noted some data cleansing issues when processing this single message, mostly around the "\n" characters in the message. I'm trying to figure out how to resolve this small issue.

import re as regex

def expunge_doublespaces(raw_string):
   if '  ' not in raw_string:
      return raw_string
   return expunge_doublespaces(raw_string.replace('  ', ' '))


def parse_raw_email_message(raw_message):
   lines = raw_message.splitlines()
   email = {}
   message = ''
   keys_to_extract = ['from', 'to']
   for line in lines:
      if ':' not in line:
        message += line
        email['body'] = message

      else:
         pairs = line.split(':')
         key = pairs[0].lower()
         val = pairs[1].strip()
         if key in keys_to_extract:
            email[key] = val
   return email

###############################################
# change this open section to fit your dataset
###############################################
with open('enron_emails/sample_email.txt', 'r') as in_file:
   parsed_email = parse_raw_email_message(in_file.read())
   for key, value in parsed_email.items():
     if key == "body":
        # This regex inserts a space after 't or after a period/comma
        # whenever the next character is not whitespace.
        first_cleaning = regex.sub(r"(?<=('t)(?=[^\s]))|(?<=[.,])(?=[^\s])", r' ', value)
        cleaned_body = expunge_doublespaces(first_cleaning)
        print(cleaned_body)
        # Printed output:
        # So, what is it? And by the way, don't start with the excuses. You're
        # expected to be a full, gourmet cook. Kisses, not music, makes cooking
        # a more enjoyable experience. I told you I have a long email address.
        # I've decided what to prepare for dinner tomorrow. I hope you aren't
        # expecting anything extravagant because my culinary skills haven't
        # beenput to use in a while. My only request is that your stereo works.
        # Musicmakes cooking a more enjoyable experience. Watch the debate if
        # you are home tonight. I want a report tomorrow. . . Jen
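
The glued words in the output above ("beenput", "Musicmakes") appear because the lines are concatenated with no separator between them. One possible fix, shown here only as a minimal sketch, is to join the pieces with a single space. The helper name `join_body_lines` is hypothetical and not part of the code above.

```python
def join_body_lines(raw_body):
    """Join the lines of an email body with single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so wrapped lines end up separated by
    # exactly one space instead of being glued together.
    return ' '.join(raw_body.split())


sample = "my culinary skills haven't been\nput to use in a while.  Music\nmakes cooking"
print(join_body_lines(sample))
```

Because this also collapses runs of spaces, a separate double-space pass would probably not be needed after it.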

UPDATE

Here is another way to obtain the body of the email message. There are other examples in another question that I answered.

import re as regex
import email

def expunge_doublespaces(raw_string):
   if '  ' not in raw_string:
     return raw_string
   return expunge_doublespaces(raw_string.replace('  ', ' '))

with open('enron_emails/sample_email.txt', 'r') as in_file:
    email_body = ''
    raw_message = in_file.read()

    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)

    # Iterate over all the parts and subparts of the message object tree
    for part in msg.walk():

        # Only process the plain-text parts of the message
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
            # Remove the quoted reply header: the "... HH:MM:SS AM/PM" line
            # followed by the To:, cc: and Subject: lines
            first_cleaning = regex.sub(r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))", r' ',
                                       email_body)
            clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
            print(clean_body)
            # Printed output:
            # So, what is it? And by the way, don't start with the excuses.
            # You're expected to be a full, gourmet cook. Kisses, not music,
            # makes cooking a more enjoyable experience. I told you I have a
            # long email address. I've decided what to prepare for dinner
            # tomorrow. I hope you aren't expecting anything extravagant
            # because my culinary skills haven't been put to use in a while.
            # My only request is that your stereo works. Music makes cooking a
            # more enjoyable experience. Watch the debate if you are home
            # tonight. I want a report tomorrow... Jen
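
Since the question asks for a list with one string per message, the same idea can be turned around: split on the quoted reply header instead of deleting it. The following is only a sketch under assumptions taken from the single sample email: the quoted header is a line ending in "on MM/DD/YYYY HH:MM:SS AM/PM" followed by To:, cc: and Subject: lines, and the footer starts at a line of underscores. The names `QUOTED_HEADER`, `FOOTER_DIVIDER` and `split_thread` are hypothetical, and other quoting styles in the dataset would need their own patterns.

```python
import re

# Assumed quoting style (taken from the sample email only):
#   "Name" <address> on MM/DD/YYYY HH:MM:SS PM
#   To: ...
#   cc: ...
#   Subject: ...
QUOTED_HEADER = re.compile(
    r"^.*\bon \d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n"
    r"To:.*\n"
    r"cc:.*\n"
    r"Subject:.*\n",
    re.MULTILINE,
)

# Assumed footer marker: a line made only of underscores.
FOOTER_DIVIDER = re.compile(r"^_{5,}[ \t]*$", re.MULTILINE)


def split_thread(body):
    """Return one cleaned string per message found in the body."""
    parts = []
    for chunk in QUOTED_HEADER.split(body):
        # Drop everything from the footer divider onward.
        chunk = FOOTER_DIVIDER.split(chunk, maxsplit=1)[0]
        # Collapse newlines and repeated spaces into single spaces.
        text = " ".join(chunk.split())
        if text:
            parts.append(text)
    return parts
```

Calling `split_thread(part.get_payload())` inside the loop above would then give a list with the reply first and the quoted original second, provided the message follows the assumed layout.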
Life is complex
  • Hello, I have Raw_mwssage as a column in a dataframe. I am trying to split each mail into separate rows; my question is here: https://stackoverflow.com/questions/76211703/extract-chain-of-email-text-into-multiple-rows#76212432. Would you please help me with that? – Tpk43 May 10 '23 at 16:11
1

As I see it, you have a few options here.

  1. Go with what you have, but add a second function that does some text processing. For example, text = re.sub(r'[\s]+',' ',text) removes the occurrences of \n; from there you would presumably go on to fix the cases of \' and strip everything from the divider line down. This seems to be the easiest solution, but it has limitations, all of which (judging from your example) can be handled with some regex/grep/awk trickery.

  2. Use another library (as you asked). I'm aware of SpamScope. You can probably guess what it does from the name, but it also parses emails based on their RFC headers. Again, some post-processing may be in order, but it looks like combining the headers (e.g. Date and Body) should do much of what you need.

  3. Web services such as Zapier's parser.
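
As a rough sketch of option 1, the post-processing step could look like the following. The function name `clean_body` is hypothetical, and treating the first line made only of underscores as the start of the footer is an assumption based on the sample email in the question.

```python
import re


def clean_body(text):
    """Strip the footer and collapse whitespace in an email body."""
    # Assumption: the footer starts at the first line made only of
    # underscores, so keep only what comes before it.
    text = re.split(r"^_{5,}[ \t]*$", text, maxsplit=1, flags=re.MULTILINE)[0]
    # Replace every run of whitespace (including \n) with one space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_body("Watch the debate.\nJen\n\n__________\nFREE voicemail\n"))
```

This leaves the quoted reply header (the To:/cc:/Subject: lines) in place; that part would still need its own pattern.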

hrokr
  • 3,276
  • 3
  • 21
  • 39
-1

I'm sorry, but your current email format is impossible to decode, because there is no way to differentiate the headers of the quoted email

"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi

from the body of the email. The body itself could contain that same string for some reason, and then how would you tell which part was the real body and which was the header?

Timothy Chen
  • that actually follows the specification for e-mails - what makes the headers distinct from the body is a blank line with a \r\n (CRLF) - check: https://tools.ietf.org/html/rfc2822#section-3.5 – jsbueno Dec 06 '20 at 03:35
  • the problem there is that the e-mail body is really free-formatted, so some fuzzy matching needs to be done. The inner e-mail could be quoted in a myriad of different ways. This will always be an uphill battle. So, while it is "impossible" to decode with precision, a nice algorithm treating the most common quoting styles could get more than 99% of the e-mails correctly sorted. If one part of the message resembles an actual quote so closely that it is interpreted as one, maybe it just _should_ be counted as a quote. – jsbueno Dec 06 '20 at 03:41
  • but also, according to the question, the actual email does not include \r\n, only \n. – Timothy Chen Dec 07 '20 at 00:06