How can I safely determine if an email address is in a list?

Question

I was tasked with writing a system that determines if a provided email address is in a list. Checking if a string is in a list is usually an easy task, but email addresses are complicated. For example, if I send an email to personname@gmail.com and person.name@gmail.com, both emails will reach the same account. From what I understand, there are several other ways a user can have two different email address strings that will end up reaching the same account (replace the period with an underscore, add a + character after the username, vary letter case, etc).

Users of this system have an incentive to provide multiple email addresses that fool the list check yet lead to the same account (personname@gmail.com and person.name@gmail.com). I want to find some way to determine if two email addresses will both lead to the same email provider account (preferably in Python, though I can port any solution).

My first solution was to try enumerating the aforementioned tricks and reversing them to get email addresses to some common form. For example, remove all underscores and dots, remove everything between the first + and the @ sign, and convert the email to all lowercase. The problem is, I'm not 100% sure that is an exhaustive list of all possible tricks, nor do I know if those tricks work for all providers. Is there a library or common method of performing such a check that is more robust than this method? Am I stuck with having to perform these limited checks and then eat the cost of smarter users managing to successfully deceive my system?
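For reference, the normalization described above might be sketched roughly as follows. This only illustrates the naive approach: the names `naive_canonical` and `is_listed` are hypothetical, and the rules are applied to every domain even though a given provider may not honor them.

```python
def naive_canonical(email: str) -> str:
    """Reduce an address to a common form using the rules from the question.

    Assumption: every provider ignores dots, underscores, "+suffix" tags,
    and letter case. That is not guaranteed to be true.
    """
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    # Drop everything from the first "+" up to the "@".
    local = local.split("+", 1)[0]
    # Remove dots and underscores from the local part.
    local = local.replace(".", "").replace("_", "")
    return f"{local}@{domain}".lower()


def is_listed(email: str, known_emails) -> bool:
    """Check membership after reducing both sides to the common form."""
    canonical = {naive_canonical(e) for e in known_emails}
    return naive_canonical(email) in canonical
```

With this sketch, `is_listed("person.name@gmail.com", ["PersonName@gmail.com"])` is true, but so is any pair of addresses that a provider actually treats as different mailboxes, which is exactly the weakness I am worried about.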

`personname` and `person.name` do not reach the same email account — Sayse, Jan 17 '19 at 22:12
There is probably a published spec for email address formats that describes what is valid as an alias and what is not. You don't have to bother to check if the email provider supports it, because you can reasonably assume it doesn't treat them as different addresses (it just might not deliver them). Once you have the spec, your idea of creating a canonical list of emails and checking against that makes sense. — minboost, Jan 17 '19 at 22:14
As far as I'm aware, the tricks you've described apply to Gmail addresses, but aren't common across email in general. Other services may have adopted them, but there are probably many that treat `person.name@example.com` and `personname@example.com` as different addresses. — Harry Cutts, Jan 17 '19 at 22:16
This is a very broad question and not a great fit for SO as is - you might want to check the docs and try to narrow it down to something specific (code level, especially) you are having trouble with. At a high level, though, the simple answer is 'no'. There is no general, reliable way you can determine if a bunch of email addresses end up in the same mailbox just by examining the strings they're made up of. Whatever you're building should try to avoid depending on such a process. — pvg, Jan 17 '19 at 22:17
What you are doing is trying to find patterns where there are none. And of course you will always find some in the randomness. — Klaus D., Jan 17 '19 at 22:21
@Sayse It appears upon further research that what you said is true, but Gmail specifically ignores them. Hmm, that certainly complicates my problem... — nyx, Jan 18 '19 at 02:58

score -1 · Answer 1 · edited Oct 07 '21 at 13:31

Unfortunately, the behaviors you describe are entirely up to the email provider. Gmail may ignore certain characters but other providers won't, which means your rules could generate false matches. The SMTP specification, RFC 5321, section 2.3.11, explicitly says that you cannot make any assumptions about how email providers interpret email addresses, because the treatment is entirely up to them (see the final clause of the quote):

An address normally consists of user and domain specifications. The standard mailbox naming convention is defined to be "local-part@domain"; contemporary usage permits a much broader set of applications than simple "user names". Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them, the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.

So there are no universal rules for email. The best you can do is to use a separate set of rules for each email provider, which could give you some success, but the solution will never be perfect.
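A minimal sketch of the per-provider idea, in Python, might look like the following. The rule table `PROVIDER_RULES` and the functions `gmail_rule` and `canonicalize` are hypothetical names, and the Gmail behavior encoded here (dots, "+suffix" tags, and case ignored) is an assumption taken from the question and comments, not from any official source. Domains without a rule are left untouched, since nothing can be assumed about how they treat the local part.

```python
from typing import Callable, Dict


def gmail_rule(local: str) -> str:
    # Assumption: Gmail ignores dots, "+suffix" tags, and letter case.
    return local.split("+", 1)[0].replace(".", "").lower()


# Hypothetical rule table: one normalizer per provider domain.
PROVIDER_RULES: Dict[str, Callable[[str], str]] = {
    "gmail.com": gmail_rule,
}


def canonicalize(email: str) -> str:
    """Normalize the local part only where a provider rule is known."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    domain = domain.lower()  # domain names are case-insensitive
    rule = PROVIDER_RULES.get(domain)
    if rule is not None:
        local = rule(local)
    # Unknown providers: keep the local part exactly as given.
    return f"{local}@{domain}"
```

Comparing `canonicalize(a) == canonicalize(b)` then only merges addresses for providers you have explicitly written a rule for, which trades missed duplicates for fewer false matches.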
