Best method to deduplicate people records in Rails

Question

I am writing a Rails app with a Person model that looks something like this:

  create_table "people", :force => true do |t|
    t.string   "first_name"
    t.string   "last_name"
    t.string   "email"
    t.datetime "created_at", :null => false
    t.datetime "updated_at", :null => false
  end

I have a two-step process as follows:

1. Fill out person records with the names of people. The names may contain unknown duplicates due to nicknames, etc. For example, "tim smith" and "timothy smith".
2. Query an API to get potential email address matches for those people.

After doing that processing, I could have data like:

record 1: first_name: tim last_name: smith email: tim.smith@sampleemail.com

record 2: first_name: timothy last_name: smith email: tim.smith@sampleemail.com

What's the best way in Rails to model that those are duplicates?

UPDATE: CLARIFICATION

After step 2, I know how to find out that those two records are duplicates (i.e. the same person), my question is how to represent that in the model? Should I add a "duplicate_of_person_id" type field and put the id of the first record in that field in the second record? Is there a better way?

score 1 · Answer 1 · answered Dec 12 '13 at 04:43

You could link all the records together. The first scheme that comes to mind is to keep the record with the lowest id as the winner and make all the dupes point to it. You could also do a has_and_belongs_to_many, which would involve a separate table where each record says that these two people are the same. The latter grows quadratically with the number of people, though.

Or, just copy all the information from the second into the first and delete the second.
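As a rough illustration of the first scheme (lowest id wins, dupes point to it), here is a plain-Ruby sketch that runs without Rails. The `Person` struct, the `link_duplicates` and `pairwise_rows` helpers, and the `duplicate_of_person_id` column name (taken from the question) are assumptions for illustration, not part of any existing code. In a real Rails model the pointer would probably be a self-referential association along the lines of `belongs_to :duplicate_of, :class_name => "Person", :foreign_key => "duplicate_of_person_id"`.

```ruby
# Stand-in for the people table plus a hypothetical duplicate_of_person_id column.
Person = Struct.new(:id, :first_name, :last_name, :email, :duplicate_of_person_id)

# Given records already known to be the same person, keep the one with the
# lowest id as the winner and point every other record at it.
def link_duplicates(group)
  winner = group.min_by(&:id)
  group.each do |person|
    person.duplicate_of_person_id = person.equal?(winner) ? nil : winner.id
  end
  winner
end

# Rows a pairwise "these two are the same" join table needs for one group of
# n duplicates: n * (n - 1) / 2, versus n - 1 pointers in the scheme above.
def pairwise_rows(group_size)
  group_size * (group_size - 1) / 2
end

tim     = Person.new(1, "tim", "smith", "tim.smith@sampleemail.com")
timothy = Person.new(2, "timothy", "smith", "tim.smith@sampleemail.com")
link_duplicates([timothy, tim])
```

After this runs, `timothy.duplicate_of_person_id` is 1 and `tim.duplicate_of_person_id` stays nil, so the canonical records are the ones where the column is nil. For a group of 10 duplicates, the pointer scheme stores 9 references while a pairwise table would need `pairwise_rows(10)`, i.e. 45 rows.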

score 0 · Answer 2 · answered Mar 26 '13 at 22:34

Not 100% sure what you're asking for. If you just want to find duplicates, and, say, list them in an array, you could create a method like this:

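# This isn't particularly efficient (one query per duplicated email), but it
# returns an array in which each element is a list of duplicated people
# (assuming we define duplicates by doubled email addresses).
def self.find_duplicates
  # Emails that appear on more than one record; records without an email
  # are skipped so they are not lumped together as one "duplicate" group.
  duplicated_emails = where("email IS NOT NULL").
                      group(:email).
                      having("COUNT(*) > 1").
                      pluck(:email)

  duplicated_emails.map { |email| where(:email => email).to_a }
end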

If you don't want to allow for duplicates, just create a validation in your model:

validates :email, :uniqueness => true

Before that, though, make sure that the emails are all in the same case. You could do something like this, again in the model:

before_validation :format_emails

def format_emails
  # Guard against nil so records without an email still validate.
  self.email = email.downcase if email
end


Sasha


Uniqueness checks should be done in the database. That avoids threading and drifting issues, removes the complicated code needed to handle them, and is more efficient. – scones Mar 26 '13 at 22:45

**Threading**: Two instances of the application send the same data at the same time. Application-level checks won't be able to catch that (this can also be called a `race condition`, I guess). **Drifting**: In a master-slave database setup, a slave that has not yet received the new data set gets asked whether the data exists and truthfully answers no. The data set will then be duplicated. – scones Mar 26 '13 at 22:49
Sorry, should have been more clear. After step 2, I know how to find out that those two records are duplicates (i.e. the same person), my question is how to represent that in the model? Should I add a "duplicate_of_person_id" type field and put the id of the first record in that field in the second record? – DougB Mar 27 '13 at 02:47
