
I am using the following Python regex code to parse values from the To field of an email:

import re

PATTERN = re.compile(r'''((?:[^(;|,)"']|"[^"]*"|'[^']*')+)''')
list = PATTERN.split(raw)[1::2]

The list should contain the name and address of each recipient, split on either "," or ";" as the separator. Separators inside quotes should be ignored, since they are part of the name, often: "Last Name, First Name"

Most of the time this works well, but in the following case I am getting unexpected behaviour:

"Some Name | Company Name" <name@example.com>

In this case it splits on the "|" character, even though regex tester websites show the pattern matching the name and address as a whole. What am I doing wrong?

Example input would be:

"Some Name | Company Name" <name1@example.com>, "Some Other Name | Company Name" <name2@example.com>, "Last Name, First Name" <name3@example.com>
Vincent
  • It doesn't split anywhere. Gives me an output `['"Some Name | Company Name" <name@example.com>']` – nu11p01n73R Jun 10 '15 at 05:54
  • That's correct, normally there would be multiple of these in a string. I want to single them out. However if I run it on my google app engine, it splits on the | – Vincent Jun 10 '15 at 05:55

2 Answers


This is not a direct answer to your question, but it addresses the problem you seem to be solving, so it may still be helpful:

To parse emails I always make extensive use of Python's email library.

In your case you could use something like this:

from email.utils import getaddresses
from email import message_from_string

msg = message_from_string(str_with_msg_source)
tos = msg.get_all('to', [])
ccs = msg.get_all('cc', [])
resent_tos = msg.get_all('resent-to', [])
resent_ccs = msg.get_all('resent-cc', [])
all_recipients = getaddresses(tos + ccs + resent_tos + resent_ccs)
for name, address in all_recipients:
    # do some postprocessing on name or address if necessary
    print(name, address)

In my experience this has always reliably handled splitting names and addresses in mail headers.
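If only the raw value of a To field is available (for example from a webhook) rather than the full message source, `getaddresses` can be applied to that string directly. The following is a minimal sketch under that assumption; the variable `raw` and the trailing bare address are illustrative additions, not part of the original answer.

```python
from email.utils import getaddresses

# Example To-field value from the question, plus one bare address
# (added here for illustration only).
raw = (
    '"Some Name | Company Name" <name1@example.com>, '
    '"Some Other Name | Company Name" <name2@example.com>, '
    '"Last Name, First Name" <name3@example.com>, '
    'name5@example.com'
)

# getaddresses() expects a list of header values, so wrap the single string.
recipients = getaddresses([raw])

for name, address in recipients:
    print(repr(name), address)
```

Each entry is a `(name, address)` tuple. Commas and `|` inside the quoted display names are kept as part of the name, and a bare address comes back with an empty name.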

myke
  • This seems like a great suggestion. I am using a webhook to parse the emails. Can I also feed the string value from a To-field this? – Vincent Jun 10 '15 at 06:41
  • Not sure if I understand correctly. But if you are asking whether you can use this for single strings, then yes, sure, e.g.: `email.utils.parseaddr('"Mr Smith | Something" <smith@example.com>')` gives you `('Mr Smith | Something', 'smith@example.com')`. – myke Jun 10 '15 at 06:47
  • Got it. Found the answer based on this here as well: http://stackoverflow.com/questions/5426789/method-for-parsing-text-cc-field-of-email-header – Vincent Jun 10 '15 at 07:05

You can use a much simpler regex with lookarounds to split the text.

r'(?<=>)\s*,\s*(?=")'

Regex Explanation

  • \s*,\s* matches , which is surrounded by zero or more spaces (\s*)

  • (?<=>) Look behind assertion. Checks if the , is preceded by a >

  • (?=") Look ahead assertion. Checks if the , is followed by a "

Test

>>> re.split(r'(?<=>)\s*,\s*(?=")', string)
['"Some Name | Company Name" <name1@example.com>', '"Some Other Name | Company Name" <name2@example.com>', '"Last Name, First Name" <name3@example.com>']

Corrections

  • Case 1 In the above example, we used a single delimiter ,. If you wish to split on more than one delimiter, you can use a character class

    r'(?<=>)\s*[,;]\s*(?=")'
    
    • [,;] Character class, matches , or ;

  • Case 2 As mentioned in the comments, if the address part is missing, all we need to do is add " to the look behind

    Example

    >>> string = '"Some Other Name | Company Name" <name2@example.com>, "Some Name, Nothing", "Last Name, First Name" <name3@example.com>' 
    
    >>> re.split(r'(?<=(?:>|"))\s*[,;]\s*(?=")', string)
    ['"Some Other Name | Company Name" <name2@example.com>', '"Some Name, Nothing"', '"Last Name, First Name" <name3@example.com>']
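The comments below raise the case of a bare address with no quotes or angle brackets. As a minimal sketch of how the Case 2 pattern behaves there (the names `SPLIT_RE`, `quoted`, and `mixed` and both sample strings are made up for illustration), note that the lookahead `(?=")` only allows a split when the next recipient starts with a double quote:

```python
import re

# Case 2 pattern from the answer: split on , or ; that follows > or "
# and is followed by an opening double quote.
SPLIT_RE = re.compile(r'(?<=(?:>|"))\s*[,;]\s*(?=")')

# Every recipient starts with a quoted name, so each separator matches.
quoted = ('"Some Other Name | Company Name" <name2@example.com>; '
          '"Some Name, Nothing", '
          '"Last Name, First Name" <name3@example.com>')
quoted_parts = SPLIT_RE.split(quoted)

# The second recipient is a bare address, so the lookahead fails
# and the string is returned unsplit.
mixed = '"Last Name, First Name" <name3@example.com>, name5@example.com'
mixed_parts = SPLIT_RE.split(mixed)

print(quoted_parts)
print(mixed_parts)
```

The first string splits into three recipients, while the second stays in one piece, which is the limitation discussed in the comments.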
    
nu11p01n73R
  • For this case yes, but sometimes there is no name included, just "name@example.com" without the <>. Therefore I figured I would need to split on every , or ; outside of quotes – Vincent Jun 10 '15 at 06:03
  • @Vincent That can also be done. In that case, modify the look behind to `(?<=(>|"))` so that it looks for `>` or `"` before the `,` – nu11p01n73R Jun 10 '15 at 06:06
  • @Vincent I have added an edit. Do see the case 2 section. I hope that is what you are talking about – nu11p01n73R Jun 10 '15 at 06:08
  • Thanks a lot for your help. Final question: what do I include to match if it is a single email address without quotes? – Vincent Jun 10 '15 at 06:10
  • So this would be the final example string: ''"Some Other Name | Company Name" , "Some Name, Nothing", , name5@example.com, "Last Name, First Name" ' – Vincent Jun 10 '15 at 06:14
  • And I noticed your suggestion has some loose "<" and '"' in there. Is there a way to prevent that? – Vincent Jun 10 '15 at 06:14
  • @Vincent Sorry about that. It was because of the capturing groups (the `,` and `>`). I have corrected that in the answer. – nu11p01n73R Jun 10 '15 at 06:22
  • @Vincent I think your second example splits fine with the last regex in case 2. – nu11p01n73R Jun 10 '15 at 06:25
  • I just tried it with a single address and unfortunately it returns an empty list. I also just noticed that some email providers return names in single quotes. I guess it would make sense to check out myke's suggestion of Python's built-in email library, as there are probably more exceptions I don't know about. Nevertheless, thanks a lot for your help! – Vincent Jun 10 '15 at 06:41