I'm looking for a way to search through a database and find close similarities between email addresses. The only solution I can think of is O(N^2) and involves a nested loop: grab an email address, check it against all the other addresses, and repeat for every address. This will be extremely time-consuming, as I'm dealing with 100,000 email addresses in a database. If it makes a difference, this will be implemented as a background job for a Ruby on Rails app.

Is there a faster way to do this?

I'm really only looking for basic similarities. An example would be

docjohnson@gmail.com
docjohnson1@gmail.com
docjohnson333@gmail.com
docjohnson@hotmail.com

I would want those all marked similar to each other.

Thanks for the help!

EDIT: I'm using a Mongo database connected to ROR via Mongoid, if that changes the game at all.

Kenny Bania
  • I'd suggest adding some information about the type of database you are using and tag your question appropriately. It might open it up to additional experts in case this can be handled by a query. – Marc Baumbach Jan 16 '13 at 21:04
  • Have you tried something along the lines of a [fulltext search or looked into a Lucene index](http://stackoverflow.com/questions/47656/how-do-i-do-full-text-searching-in-ruby-on-rails)? "Basic similarities" is vague, but that may be what you want and a fulltext or Lucene search may be your best bet. – Marc Baumbach Jan 16 '13 at 21:08
  • I just added database info to my question, thanks for the suggestion! I haven't tried anything like that. I'm a bit confused how this would be significantly faster, as if I'm understanding it right I would still need nested loops to check everything. – Kenny Bania Jan 16 '13 at 21:19
  • Do you have an idea of exactly what "similar" means? Probably the first step is to come up with a similarity metric before you can work on the algorithm. – thang Jan 16 '13 at 21:23
  • The full text or Lucene indices would allow you to perform searches and get a certain "relevance" score for each result. You could set a threshold for what is considered "similar." This may be overkill, but those searches will typically be faster and you wouldn't need the O(N^2) loops anymore. – Marc Baumbach Jan 16 '13 at 21:24
  • thang - A good point. While I don't have an exact definition of "similar" for this case, I'm really looking for email addresses that are obvious modifications of each other, such as the above examples: simple character substitutions and added numerals. – Kenny Bania Jan 16 '13 at 21:33
  • Marc - I see now. That could definitely work, I'll look into it more. – Kenny Bania Jan 16 '13 at 21:34

2 Answers

Compute a "signature" for each email address; for instance, a signature might be the first five characters of the username part of the address. Sort all email addresses by signature to bring together those with identical signatures; if your signature algorithm does a good job, each group of addresses sharing a signature should refer to the same person. You'll have to tune the signature algorithm based on your data and your definition of similarity. Because this needs only a sort (or a single pass with a hash table), it avoids the O(N^2) pairwise comparison.
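
As a rough sketch of the idea above, assuming the signature is simply the first five characters of the lowercased username: the helper name `signature` and the sample address `kenny@example.com` are made up for illustration, and `group_by` is used in place of an explicit sort because it buckets equal signatures in one pass.

```ruby
# Hypothetical helper: the signature is the first `length` characters of the
# lowercased username (the part before the "@").
def signature(email, length = 5)
  username = email.downcase.split("@", 2).first.to_s
  username[0, length]
end

emails = %w[
  docjohnson@gmail.com
  docjohnson1@gmail.com
  docjohnson333@gmail.com
  docjohnson@hotmail.com
  kenny@example.com
]

# group_by buckets the addresses through a hash in a single pass, which has
# the same effect as sorting by signature and collecting equal neighbours.
groups = emails.group_by { |email| signature(email) }

# Keep only the signatures shared by more than one address.
similar = groups.select { |_sig, members| members.size > 1 }

similar.each do |sig, members|
  puts "#{sig}: #{members.join(', ')}"
end
```

With 100,000 addresses this is one pass over the data rather than a comparison of every pair; how well the groups match "the same person" depends entirely on the signature chosen.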

user448810
  • This is definitely an idea to consider, though I feel tuning the signature algorithm could be difficult. Better than nested loops, though. – Kenny Bania Jan 16 '13 at 21:25
  • The idea is to pick a simple signature algorithm, run it, see what you get, modify it based on experience, repeat over several iterations, then quit as soon as you're "close enough;" you're trying for close, not perfect. You can do a lot with your signature function; for instance, you might change all instances of gmail.com, yahoo.com or hotmail.com to gmailyahoohotmail, then add the first eight characters of the username, giving a signature of gmailyahoohotmaildocjohns for all of the examples you gave. – user448810 Jan 16 '13 at 21:29
  • I'm beginning to like this idea more and more. It's simple to implement and I think would be accurate enough. – Kenny Bania Jan 16 '13 at 21:37
  • The ingredients of a good signature algorithm are strong knowledge of the data, imagination in building the details of the algorithm, and repeated measurement of results of many similar algorithms. For instance, do you get better clusters with seven, eight, or nine characters of the username? The success of your task will be based on your definition of "close enough." My suggestion is keep it simple, declare success sooner rather than later, and add complexity only slowly over time as experience indicates it is required. – user448810 Jan 16 '13 at 21:53
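
The signature described in the comments above (fold the big webmail domains into one key, then append a username prefix) could be sketched as follows. The method name `folded_signature` and the reporting loop are assumptions for illustration only; the loop just shows one way to compare seven, eight, and nine characters on your own data.

```ruby
# Hypothetical signature: a domain key followed by the first `length`
# characters of the lowercased username.
def folded_signature(email, length = 8)
  # Assumption: these providers are treated as interchangeable.
  folded_domains = %w[gmail.com yahoo.com hotmail.com]
  username, domain = email.downcase.split("@", 2)
  domain_key = folded_domains.include?(domain) ? "gmailyahoohotmail" : domain.to_s
  domain_key + username.to_s[0, length]
end

emails = %w[
  docjohnson@gmail.com
  docjohnson1@gmail.com
  docjohnson333@gmail.com
  docjohnson@hotmail.com
]

# Compare candidate prefix lengths by how the addresses cluster.
[7, 8, 9].each do |length|
  clusters = emails.group_by { |email| folded_signature(email, length) }
  largest = clusters.values.map(&:size).max
  puts "#{length} chars: #{clusters.size} cluster(s), largest has #{largest}"
end
```

Running the loop against a sample of real addresses and eyeballing the largest clusters is a cheap way to decide when the signature is "close enough."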

I suggest that you start with "canonicalizing" the e-mails:

  1. strip trailing digits from the username part, e.g., john123 -> john.

  2. maybe drop some punctuation from the username, e.g., john.smith -> johnsmith.

  3. drop some hosts from the domain part, e.g., mail.foo.com -> foo.com; but not math.mit.edu -> mit.edu.

After you do 1 & 2, collect the original emails into a hash table mapping each canonical username to the original addresses that produced it, so that when you are done, you only need to iterate over the canonical usernames.
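
A minimal sketch of the canonicalization steps above, under some loud assumptions: only dots are dropped from the username, only a leading `mail.` label is stripped from the domain, and the method names and sample addresses are invented for illustration. Your own rules will likely differ.

```ruby
# Steps 1 and 2: strip trailing digits, then drop dots from the username.
def canonical_username(email)
  username = email.downcase.split("@", 2).first.to_s
  username.sub(/\d+\z/, "").delete(".")
end

# Step 3 (assumption: only a leading "mail." label is dropped). The lookahead
# requires another dotted label after it, so a bare "mail.com" is untouched.
def canonical_domain(email)
  domain = email.downcase.split("@", 2)[1].to_s
  domain.sub(/\Amail\.(?=[^.]+\.)/, "")
end

emails = %w[
  john123@foo.com
  john.smith@mail.foo.com
  johnsmith7@foo.com
  John@math.mit.edu
]

# Hash table mapping each canonical username to its original addresses.
by_username = Hash.new { |hash, key| hash[key] = [] }
emails.each { |email| by_username[canonical_username(email)] << email }

# Only the canonical usernames need to be iterated from here on.
by_username.each do |username, originals|
  puts "#{username}: #{originals.join(', ')}" if originals.size > 1
end
```

Building the hash is a single pass over the addresses, so the cost grows linearly with the number of emails rather than quadratically.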

sds