I want to separate replies and forwards from a thread of emails into conversations.

Here is an example:

On Jul 31, 2013, at 5:15 PM, John Doe wrote:

> example email text
>
>
> *From:* Me [mailto:me@gmail.com]
> *Sent:* Thursday, May 31, 2012 3:54 PM
> *To:* John Doe
> *Subject:* RE: subject
>
> example email text
>
>> Dear David,
>> 
>> Greetings from Doha!
>> Kindly enlighten me. I am confused.
>> 
>> With regards,
>> Smith
>>
>>> Dear Smith,
>>>
>>> Happy New year!
>>> Love
>>>
>>>> Dear Mr Wong,
>>>> Greetings!
>>>> Yours,
>>>> O

The example above is made up, but the format is realistic. Some emails contain multiple conversations.

I have tried https://github.com/zapier/email-reply-parser and other packages, but unfortunately they are not stable enough to put into production.

The pattern is quite clear: the conversations can be separated by counting the number of ">" characters at the start of each line. My initial idea is to go through the whole document, find out how many levels of ">" there are, and then extract the ">", ">>", ">>>" and ">>>>" lines as separate conversations.

Is there a better way to do this?

Thank you very much!

Sean

1 Answer

Here's one extremely simple solution with itertools.groupby, assuming the message text itself never contains '>' (text is the whole email as a single string):

import itertools

# Consecutive lines with the same number of '>' form one group.
for _, v in itertools.groupby(text.splitlines(), key=lambda x: x.count('>')):
    print('\n'.join(v))
    print('-' * 20)

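groupby does the counting for you: it collects consecutive lines that share the same key. Because x.count('>') counts every '>' on the line, not just the quote prefix, a more thorough solution needs something along the lines of key=lambda x: len(re.match(r'>*', x).group(0)), which measures only the leading run of '>' and gives 0 for unquoted lines.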

Output:

> example email text
>
>
> *From:* Me [mailto:me@gmail.com]
> *Sent:* Thursday, May 31, 2012 3:54 PM
> *To:* John Doe
> *Subject:* RE: subject
>
> example email text
>
--------------------
>> Dear David,
>> 
>> Greetings from Doha!
>> Kindly enlighten me. I am confused.
>> 
>> With regards,
>> Smith
>>
--------------------
>>> Dear Smith,
>>>
>>> Happy New year!
>>> Love
>>>
--------------------
>>>> Dear Mr Wong,
>>>> Greetings!
>>>> Yours,
>>>> O
--------------------
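
The more thorough variant mentioned above might look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names quote_depth and split_by_depth are made up for this example, the quote prefix is taken to be a run of '>' characters optionally separated by single spaces or tabs (so "> > text" counts as depth 2), and a quoted line whose own text begins with '>' would still be counted one level too deep.

```python
import itertools
import re

# Leading run of '>' markers, each optionally followed by one space or tab.
# The pattern can match the empty string, so match() never returns None.
QUOTE_PREFIX = re.compile(r'(?:>[ \t]?)*')


def quote_depth(line):
    """Number of '>' markers at the start of the line (0 if unquoted)."""
    return QUOTE_PREFIX.match(line).group(0).count('>')


def split_by_depth(text):
    """Return (depth, body) pairs, one per run of equally quoted lines."""
    parts = []
    for depth, lines in itertools.groupby(text.splitlines(), key=quote_depth):
        # Drop the quote prefix from every line before joining the group.
        body = '\n'.join(line[QUOTE_PREFIX.match(line).end():] for line in lines)
        parts.append((depth, body.strip()))
    return parts
```

Calling split_by_depth(text) on the email gives one pair per quoting level, with the unquoted header line at depth 0. Since only the leading markers are counted, a '>' in the middle of a sentence no longer starts a new group.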
cs95