I was tasked with writing a system that determines whether a provided email address is in a list. Checking whether a string is in a list is usually easy, but email addresses are complicated. For example, if I send an email to personname@gmail.com and person.name@gmail.com, both emails will reach the same account. From what I understand, there are several other ways a user can produce two different email address strings that end up reaching the same account (replace the period with an underscore, add a + character after the username, vary letter case, etc.).
Users of this system have an incentive to provide multiple email addresses that fool the list check yet lead to the same account (personname@gmail.com and person.name@gmail.com). I want to find some way to determine if two email addresses will both lead to the same email provider account (preferably in Python, though I can port any solution).
My first solution was to try enumerating the aforementioned tricks and reversing them to get email addresses to some common form. For example, remove all underscores and dots, remove everything between the first + and the @ sign, and convert the email to all lowercase. The problem is, I'm not 100% sure that is an exhaustive list of all possible tricks, nor do I know if those tricks work for all providers. Is there a library or common method of performing such a check that is more robust than this method? Am I stuck with having to perform these limited checks and then eat the cost of smarter users managing to successfully deceive my system?
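For reference, here is a minimal sketch of that first attempt. It assumes the three rules above (strip dots and underscores, drop everything from the first + to the @, lowercase) apply to every provider, which is exactly the part I have not verified; on a provider where dots or underscores are significant it would wrongly merge distinct accounts. The names `normalize_email` and `same_account` are made up for illustration.

```python
def normalize_email(address: str) -> str:
    """Reduce an address to a common form using the rules listed above.

    Assumption: every provider ignores dots, underscores, "+tag" suffixes
    and letter case in the local part. This is not verified.
    """
    # Split on the last "@" so only the final part is treated as the domain.
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")

    # Drop everything from the first "+" up to the "@".
    local = local.split("+", 1)[0]

    # Remove dots and underscores from the local part only;
    # the domain must keep its dots.
    local = local.replace(".", "").replace("_", "")

    return f"{local}@{domain}".lower()


def same_account(a: str, b: str) -> bool:
    """True if both addresses reduce to the same normalized form."""
    return normalize_email(a) == normalize_email(b)


print(same_account("personname@gmail.com", "person.name@gmail.com"))
print(normalize_email("Person.Name+promo@Gmail.com"))
```

With this, the list check becomes a lookup of `normalize_email(candidate)` in a set of already-normalized addresses.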