Note
I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the different is syntax), so I've written my solutions here in JavaScript. Feel free to translate to Python.
The Issue
This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.
For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph for why you shouldn't use it. However even that regular expression assumes that a domain name taking the form of an IP address can only consist of a tuple of four integers ranging from 0 to 255 -- it doesn't allow for IPv6
Even something as simple as:
{a, b}@domain.com
Could get tripped up because technically according to the e-mail address specification an e-mail address can contain ANY ASCII characters surrounded by quotes. The following is a valid (single) e-mail address:
"{a, b}"@domain.com
To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the @
, after the @
, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
Something Rudimentary
Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.
You've probably tried a regular expression like:
/\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/
- Match a single curly brace...
- Followed by one or more instances of:
- One or more non-comma characters...
- Followed by zero or one commas
- Followed by a single closing curly brace...
- Followed by a single
@
- Followed by one or more instances of:
- One or more "word" characters...
- Followed by a single
.
- Followed by one or more alpha characters
This should match something roughly of the form:
{one, two}@domain1.domain2.toplevel
This handles validating, next is the issue of extracting all valid e-mails. Note that we have two sets of parenthesis in the name portion of the e-mail address that are nested: (([^,]+),?)
. This causes a problem for us. Many regular expression engines don't know how to return matches in this case. Consider what happens when I run this in JavaScript using my Chrome developer console:
var regex = /\{(([^,]+),?)+\}\@(\w+\.)+[A-Za-z]+/
var matches = "{one, two}@domain.com".match(regex)
Array(4) [ "{one, two}@domain.com", " two", " two", "domain." ]
Well that wasn't right. It found two
twice, but didn't find one
once! To fix this, we need to eliminate the nesting and do this in two steps.
var regexOne = /\{([^}]+)\}\@(\w+\.)+[A-Za-z]+/
"{one, two}@domain.com".match(regexOne)
Array(3) [ "{one, two}@domain.com", "one, two", "domain." ]
Now we can use the match and parse that separately:
// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]
Now we can trim these and get our names:
nameMatches.map(name => name.replace(/, /g, "")
nameMatches
Array(2) [ "one", "two" ]
For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the @
, since this has a potential for repeats the same way the name part had a potential for repeats. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):
function getEmails(input)
{
var emailRegex = /([^@]+)\@(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}@${domain}`);
}
Multi-email Lines
This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas -- unless those commas aren't within curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
What can be easily done with regular expressions, however, is finding a block of text containing a post-ampersand comma. Something like: @[^@{]+?,
In the string a@b.com, c@d.com
this would match the entire phrase @b.com,
- but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:
var emails = "a@b.com, c@d.com"
var matches = emails.match(/@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c@d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on
This has a potential bug should you have two e-mails in the list with the same domain:
var emails = "a@b.com, c@b.com, d@e.com"
var matches = emails.match(@[^@{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d@e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a@b.com", " c", " d@e.com" ]
But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.
However since the task of splitting one line into multiple e-mails is easier than diving into the e-mail, extracting a name, and parsing the name: we may be able to write a really stupid lexer for just this part:
var inBrackets = false
var emails = "{a, b}@c.com, d@e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
if (inBrackets && emails[i] === "}")
inBrackets = false;
if (!inBrackets && emails[i] === "{")
inBrackets = true;
if (!inBrackets && emails[i] === ",")
{
split.push(emails.substring(lastSplit, i))
lastSplit = i + 1 // Skip the comma
}
}
split.push(emails.substring(lastSplit))
console.log(split)
Once again, this won't be a perfect solution because an e-mail address may exist like the following:
","@domain.com
But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:
function getEmails(input)
{
var emailRegex = /([^@]+)\@(.+)/;
var emailParts = input.match(emailRegex);
var name = emailParts[1];
var domain = emailParts[2];
var nameList;
if (/\{.+\}/.test(name))
{
// The name takes the form "{...}"
var nameRegex = /([^,]+,?)/g;
var nameParts = name.match(nameRegex);
nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
}
else
{
// The name is not surrounded by curly braces
nameList = [name];
}
return nameList.map(name => `${name}@${domain}`);
}
function splitLine(line)
{
var inBrackets = false;
var split = [];
var lastSplit = 0;
for (var i = 0; i < line.length; i++)
{
if (inBrackets && line[i] === "}")
inBrackets = false;
if (!inBrackets && line[i] === "{")
inBrackets = true;
if (!inBrackets && line[i] === ",")
{
split.push(line.substring(lastSplit, i));
lastSplit = i + 1;
}
}
split.push(line.substring(lastSplit));
return split;
}
var line = "{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b@uni.somewhere", "c.d@uni.somewhere", "e.f@uni.somewhere", "x.y@edu.com", "z.k@edu.com" ]
If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets
and !inBrackets
) and you read one letter at a time but interpret it differently based on your current state.