Extract email addresses from academic curly braces format

Question

I have a file where each line contains a string that represents one or more email addresses. Multiple addresses can be grouped inside curly braces as follows:

{name.surname, name2.surname2}@something.edu

This means that both name.surname@something.edu and name2.surname2@something.edu are valid addresses (this format is commonly used in scientific papers).

Moreover, a single line can also contain curly brackets multiple times. Example:

{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com

results in:

a.b@uni.somewhere 
c.d@uni.somewhere 
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com

Any suggestions on how I can parse this format to extract all email addresses? I've been trying with regexes, but I'm struggling.

Can you elaborate on how your [mcve] "doesn't work" (and add it)? What were you expecting, and what actually happened? If you got an exception/error, post the line it occurred on and the exception/error details as per the How to Create a [mcve] page. Please [edit] your question to add these details into it or we may not be able to help. — Patrick Artner, Apr 01 '19 at 18:53

PaulMcG · Accepted Answer · 2019-04-02T12:44:57.730

Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).

pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:

import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')

# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE 
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '@' 
                    + email_part('trailing'))

# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
    return ["{}@{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)

# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '@' + email_part)

# the complete list parser looks for a comma-delimited list of compressed 
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)

pyparsing parsers come with a runTests method to test your parser against various test strings:

tests = """\
    # original test string
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com

    # a tricky email containing a quoted string
    {x.y, z.k}@edu.com, "{a, b}"@domain.com

    # just a plain email
    plain_old_bob@uni.elsewhere

    # mixed list of plain and compressed emails
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
"""

email_list_parser.runTests(tests)

Prints:

# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']

# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
['x.y@edu.com', 'z.k@edu.com', '"{a, b}"@domain.com']

# just a plain email
plain_old_bob@uni.elsewhere
['plain_old_bob@uni.elsewhere']

# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']

DISCLOSURE: I am the author of pyparsing.

It sounds like `pyparsing` is effectively a library to simplify building a lexer / tokenizer as I had suggested. Definitely looks like it would be less code and more accurate than trying to wrap logic around regular expressions, and looks a lot easier than implementing an expression grammar by hand. — stevendesu, Apr 02 '19 at 02:39
Yes, PEG parsers are definitely an end run on the lex/yacc tradition. David Beazley's PLY package builds lex/yacc right into your Python code if you like. There are also a number of PEG parsing libs out there besides pyparsing. — PaulMcG, Apr 02 '19 at 13:58
Worked perfectly and indeed in this case is a much better approach than regexes — Gian Luca Scoccia, Apr 02 '19 at 14:41

score 1 · Answer 2 · answered Apr 01 '19 at 19:59

Note

I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the difference is syntax), so I've written my solutions here in JavaScript. Feel free to translate to Python.

The Issue

This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.

For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph on why you shouldn't use it. However, even that regular expression assumes that a domain name taking the form of an IP address can only consist of four integers ranging from 0 to 255; it doesn't allow for IPv6.

Even something as simple as:

{a, b}@domain.com

could get tripped up, because according to the e-mail address specification an e-mail address can contain ANY ASCII characters when they are surrounded by quotes. The following is a valid (single) e-mail address:

"{a, b}"@domain.com

To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the @, after the @, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
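
As a rough illustration of that state-machine idea in Python, here is a minimal sketch that only tracks two states (inside double quotes, inside curly braces) and uses them to split a line on top-level commas. The function name split_top_level is made up for this example, and the sketch assumes braces are not nested and that quoted parts contain no backslash-escaped quotes; it is not a full e-mail tokenizer.

```python
def split_top_level(line):
    """Split `line` on commas that are outside double quotes and curly braces."""
    parts, current = [], []
    in_quotes = False
    in_braces = False
    for ch in line:
        if in_quotes:
            # Inside a quoted local part every character is taken literally.
            if ch == '"':
                in_quotes = False
            current.append(ch)
        elif ch == '"':
            in_quotes = True
            current.append(ch)
        elif ch == "{":
            in_braces = True
            current.append(ch)
        elif ch == "}":
            in_braces = False
            current.append(ch)
        elif ch == "," and not in_braces:
            # A top-level comma ends the current address.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts
```

Each returned piece is either a plain address or a single brace group, which can then be expanded separately.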

Something Rudimentary

Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.

You've probably tried a regular expression like:

/\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/

Match a single opening curly brace...
Followed by one or more instances of:
- One or more non-comma characters...
- Followed by zero or one commas
Followed by a single closing curly brace...
Followed by a single @
Followed by one or more instances of:
- One or more "word" characters...
- Followed by a single .
Followed by one or more alpha characters

This should match something roughly of the form:

{one, two}@domain1.domain2.toplevel

This handles validation; next is the issue of extracting all valid e-mails. Note that we have two sets of parentheses in the name portion of the e-mail address that are nested: (([^,]+),?). This causes a problem for us. When a capture group is repeated, most regular expression engines only return its last repetition. Consider what happens when I run this in JavaScript using my Chrome developer console:

var regex = /\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regex)
Array(4) [ "{one, two}@domain.com", " two", " two", "domain." ]

Well that wasn't right. It found two twice, but didn't find one once! To fix this, we need to eliminate the nesting and do this in two steps.

var regexOne = /\{([^}]+)\}\@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regexOne)
Array(3) [ "{one, two}@domain.com", "one, two", "domain." ]

Now we can use the match and parse that separately:

// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]

Now we can trim these and get our names:

nameMatches = nameMatches.map(name => name.replace(/[, ]/g, ""))
nameMatches
Array(2) [ "one", "two" ]
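
Python's re module behaves the same way with repeated capture groups. The following is a small sketch of both the problem and the two-step workaround; it uses a simplified pattern chosen for this example rather than the exact regular expression above.

```python
import re

text = "{one, two}@domain.com"

# A capture group inside a repetition only remembers its final iteration.
repeated = re.match(r"\{(?:\s*([^,}]+),?)+\}", text)
last_only = repeated.group(1)  # only the last name survives

# Two steps instead: capture the whole brace body once, then split it.
body = re.match(r"\{([^}]+)\}@", text).group(1)
names = [name.strip() for name in body.split(",")]

print(last_only)
print(names)
```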

For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the @, since this has a potential for repeats the same way the name part had a potential for repeats. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):

function getEmails(input)
{
    var emailRegex = /([^@]+)\@(.+)/;
    var emailParts = input.match(emailRegex);

    var name = emailParts[1];
    var domain = emailParts[2];

    var nameList;

    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name.trim()];
    }

    return nameList.map(name => `${name}@${domain}`);
}
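
For readers who want a starting point in Python, a rough port of the function above might look like this. The name get_emails is just a snake_case rendering of getEmails. Two small deviations from the JavaScript are assumptions of this sketch: surrounding whitespace is trimmed, and an input without an @ yields an empty list instead of raising an error.

```python
import re


def get_emails(address):
    """Expand one plain or brace-grouped address into a list of e-mails."""
    # Split at the first '@'; like the JavaScript version, this does not
    # handle an '@' inside a quoted local part.
    match = re.match(r"([^@]+)@(.+)", address.strip())
    if match is None:
        return []
    name = match.group(1).strip()
    domain = match.group(2).strip()

    if name.startswith("{") and name.endswith("}"):
        # The name takes the form "{...}": split the body on commas.
        names = [part.strip() for part in name[1:-1].split(",")]
    else:
        # The name is not surrounded by curly braces.
        names = [name]
    return [f"{n}@{domain}" for n in names if n]
```

For example, get_emails("{one, two}@domain.com") gives one address per name in the group, all sharing the domain.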

Multi-email Lines

This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas -- unless those commas aren't within curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
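
On the lookahead question: one possible approach, sketched here in Python, is a negative lookahead that refuses to split on a comma when a closing brace follows it before any other brace. This assumes braces are never nested and that no quoted name contains a brace or a comma; the names TOP_LEVEL_COMMA and split_addresses are made up for the example.

```python
import re

# Split on a comma only when no '}' follows it before the next '{' or the
# end of the line, i.e. when the comma is not inside a brace group.
TOP_LEVEL_COMMA = re.compile(r",\s*(?![^{}]*\})")


def split_addresses(line):
    """Return the comma-separated addresses, keeping brace groups intact."""
    return [part.strip() for part in TOP_LEVEL_COMMA.split(line) if part.strip()]
```

Because the lookahead looks for the next brace of either kind, a comma between two brace groups is still treated as a separator.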

What can be easily done with regular expressions, however, is finding a block of text that runs from an @ sign to the next comma. Something like: @[^@{]+?,

In the string a@b.com, c@d.com this would match the entire phrase @b.com, - but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:

var emails = "a@b.com, c@d.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c@d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on

This has a potential bug should you have two e-mails in the list with the same domain:

var emails = "a@b.com, c@b.com, d@e.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d@e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a@b.com,", " c", " d@e.com" ]

But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.

However, since splitting one line into multiple e-mails is easier than diving into an e-mail, extracting a name, and parsing the name, we may be able to write a really stupid lexer for just this part:

var inBrackets = false
var emails = "{a, b}@c.com, d@e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
    if (inBrackets && emails[i] === "}")
        inBrackets = false;
    if (!inBrackets && emails[i] === "{")
        inBrackets = true;
    if (!inBrackets && emails[i] === ",")
    {
        split.push(emails.substring(lastSplit, i))
        lastSplit = i + 1 // Skip the comma
    }
}
split.push(emails.substring(lastSplit))
console.log(split)

Once again, this won't be a perfect solution because an e-mail address may exist like the following:

","@domain.com

But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:

function getEmails(input)
{
    var emailRegex = /([^@]+)\@(.+)/;
    var emailParts = input.match(emailRegex);

    var name = emailParts[1];
    var domain = emailParts[2];

    var nameList;

    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name.trim()];
    }

    return nameList.map(name => `${name}@${domain}`);
}

function splitLine(line)
{
    var inBrackets = false;
    var split = [];
    var lastSplit = 0;
    for (var i = 0; i < line.length; i++)
    {
        if (inBrackets && line[i] === "}")
            inBrackets = false;
        if (!inBrackets && line[i] === "{")
            inBrackets = true;
        if (!inBrackets && line[i] === ",")
        {
            split.push(line.substring(lastSplit, i));
            lastSplit = i + 1;
        }
    }
    split.push(line.substring(lastSplit));
    return split;
}

var line = "{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
    finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b@uni.somewhere", "c.d@uni.somewhere", "e.f@uni.somewhere", "x.y@edu.com", "z.k@edu.com" ]

If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets and !inBrackets) and you read one letter at a time but interpret it differently based on your current state.

I went with the python solution but thanks for the detailed step by step explanation — Gian Luca Scoccia, Apr 02 '19 at 14:41

Frenchy · Answer 3 · 2019-04-02T07:06:03.260

A quick solution using re:

test with one text line:

import re

line = '{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com'

com = re.findall(r'(@[^,\n]+),?', line)  # trap each @xx.yyy domain
adrs = re.findall(r'{([^}]+)}', line)  # trap everything inside { }
result = []
for i in range(len(adrs)):
    # append the domain to every name in the group, then split the names
    s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
    result = result + s.split(',')

for r in result:
    print(r)

output in list result:

a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
z.z@edu.com
z.a@edu.com

test with a text file:

import io
import re

data = io.StringIO(u'''\
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, {z.z, z.a}@edu.com
{a.b, c.d, e.f}@uni.anywhere
{x.y, z.k}@adi.com, {z.z, z.a}@du.com
''')

result = []
for line in data:
    com = re.findall(r'(@[^,\n]+),?', line)
    adrs = re.findall(r'{([^}]+)}', line)
    for i in range(len(adrs)):
        s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
        result = result + s.split(',')

for r in result:
    print(r)
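
Note that the code above pairs the i-th brace group with the i-th domain by index, so it relies on every address on the line being a brace group; a plain address such as bob@uni.elsewhere in the middle of the line would shift the pairing. A possible variant, sketched below, captures each local part together with its own domain in a single pattern. The names ADDRESS and extract_emails are made up for this example, and the pattern assumes no nested braces and no quoted local parts.

```python
import re

# Either a brace group or a plain local part, followed by '@' and a domain.
ADDRESS = re.compile(r"(?:\{([^}]*)\}|([^\s,{}@]+))@([^\s,]+)")


def extract_emails(line):
    """Return every address on the line, expanding brace groups."""
    emails = []
    for match in ADDRESS.finditer(line):
        group, plain, domain = match.groups()
        # A brace group is split on commas; a plain name is used as-is.
        names = group.split(",") if group is not None else [plain]
        emails.extend(f"{name.strip()}@{domain}" for name in names if name.strip())
    return emails
```

For a file, extract_emails can be called on each line and the results concatenated.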