
Example setup

I am linking a dataset to itself to find duplicate entries within it. I do not know how many times a duplicate may appear in the dataset.

Following my blocking, I end up with the following datasets:

[This is an example dataset, not my real data]

1st Dataset: Region AB_1, df1

    ID  FName_1  SName_1  Area_1  Age_1
    1a  Ben      Nevis    AB      30
    2a  Ben      Neviss   AB      30
    3a  Andy     Red      AB      35
    4a  Andy     Redd     AB      35

2nd Dataset: Region AB_2, df2

    ID  FName_2  SName_2  Area_2  Age_2
    1b  Ben      Nevis    AB      30
    2b  Ben      Neviss   AB      30
    3b  Andy     Red      AB      35
    4b  Andy     Redd     AB      35

So, I'm comparing the records within the same dataset to each other.

I compare the above datasets using an EM algorithm based on the Fellegi-Sunter model, with agreement variables "forename", "surname", and "age".

I create my comparison space by comparing every record in Dataset 1 with every record in Dataset 2, i.e. 4 × 4 = 16 possible record pairs (a code sketch of this step follows the pair listing below).

e.g.

Record 1 vs Record 2
1a          1b
1a          2b
1a          3b
1a          4b
2a          1b
2a          2b
2a          3b
2a          4b
3a          1b
3a          2b
3a          3b
3a          4b
4a          1b
4a          2b
4a          3b
4a          4b
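
For concreteness, here is a minimal sketch of how that cross join and the agreement variables might be built in Python with pandas (pandas >= 1.2 for `how="cross"`). The frame and column names mirror the example above but are otherwise assumptions, not your actual code:

```python
import pandas as pd

# Example data standing in for the two blocked datasets
df1 = pd.DataFrame({
    "id_1":    ["1a", "2a", "3a", "4a"],
    "FName_1": ["Ben", "Ben", "Andy", "Andy"],
    "SName_1": ["Nevis", "Neviss", "Red", "Redd"],
    "Age_1":   [30, 30, 35, 35],
})
df2 = df1.rename(columns=lambda c: c.replace("_1", "_2"))
df2["id_2"] = ["1b", "2b", "3b", "4b"]

# Cross join: every record in df1 against every record in df2 (4 * 4 = 16 pairs)
pairs = df1.merge(df2, how="cross")

# Binary agreement vector over the three agreement variables
pairs["agree_fname"] = (pairs["FName_1"] == pairs["FName_2"]).astype(int)
pairs["agree_sname"] = (pairs["SName_1"] == pairs["SName_2"]).astype(int)
pairs["agree_age"]   = (pairs["Age_1"]   == pairs["Age_2"]).astype(int)

print(len(pairs))  # 16
```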

The issue

However, this means that records compared against themselves are passing into my EM algorithm:

e.g.

1a          1b
2a          2b
3a          3b
4a          4b

These are not required; they are just a remnant of forming the comparison space.
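
Mechanically, dropping these self-pairs before training is straightforward if each record carries a stable identifier. A sketch, continuing the hypothetical `pairs` frame from the example above:

```python
# Map 1a/1b back to the same underlying record by stripping the suffix,
# then keep only pairs of two genuinely different records
pairs["rec_1"] = pairs["id_1"].str.rstrip("ab")
pairs["rec_2"] = pairs["id_2"].str.rstrip("ab")
pairs_no_self = pairs[pairs["rec_1"] != pairs["rec_2"]].copy()

print(len(pairs_no_self))  # 12 of the 16 pairs remain
```

In a pure deduplication setting you would typically also keep only one ordering of each pair (e.g. `rec_1 < rec_2`), so that 1a vs 2b and 2a vs 1b are not both scored.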

As the EM algorithm is a "learning algorithm", which optimises the agreement and disagreement weights of each variable based on its input, I am essentially providing it with extra training information.

This is reflected in the results:

If I remove these self-pairs before I run my algorithm, I get 3001 records above a score of 0.9 (using my real dataset).

However, if I remove them after I run my algorithm, I get only 2486 records above a score of 0.9 (using my real dataset).

I.e. the algorithm is more selective when these self-comparisons are included.

Ultimately:

It doesn't make sense to me to include them in the EM, but I'm concerned that removing them will lower the accuracy of my algorithm.

Should I remove these known duplicates before I run the EM?

Chuck

1 Answer


Well, you definitely need to include some examples of matches in the training set.

Yancey mentions that, in his experience, EM starts to exhibit poor convergence when the proportion of matches is below 5%. In the same paper, he goes on to suggest artificially enriching the training set with additional matched pairs.
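
One simple way to do that enrichment is to oversample the pairs already known to be matches until they clear the 5% mark. A minimal sketch under the same assumptions as above, with a hypothetical `is_match` flag on the candidate pairs:

```python
import math

import pandas as pd

def enrich_matches(pairs: pd.DataFrame, target: float = 0.05) -> pd.DataFrame:
    """Oversample known matches until they make up at least `target` of all pairs."""
    n_match = int(pairs["is_match"].sum())
    n_total = len(pairs)
    # Solve (n_match + x) / (n_total + x) >= target for x
    x = math.ceil((target * n_total - n_match) / (1 - target))
    if x <= 0:
        return pairs  # already above the target proportion
    extra = pairs[pairs["is_match"] == 1].sample(n=x, replace=True, random_state=0)
    return pd.concat([pairs, extra], ignore_index=True)
```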

Ultimately, we're trying to build a function that estimates the overall probability that two records are a match, given an agreement vector, learned from a finite subset S of all possible combinations A × B. If there are no matches in S, then I can give you that function immediately: p(a, b) = 0.
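
For reference, in the Fellegi-Sunter framework that function is usually built from per-field m- and u-probabilities; this is the standard textbook form, and your implementation may parameterise it differently. For agreement vector $\gamma = (\gamma_1, \dots, \gamma_k)$, the composite match weight is

$$W(\gamma) = \sum_{i=1}^{k}\left[\gamma_i \log_2\frac{m_i}{u_i} + (1-\gamma_i)\log_2\frac{1-m_i}{1-u_i}\right],$$

where $m_i = P(\gamma_i = 1 \mid \text{match})$ and $u_i = P(\gamma_i = 1 \mid \text{non-match})$. These $m_i$ and $u_i$ are exactly what the EM step estimates, which is why changing the mix of pairs in the training set changes the weights.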

However, you say that you still get scores above 0.9 even after removing the explicit self-pairs. That suggests your dataset contains plenty of natural matches too, i.e. records that don't share an ID but do match on name/age/area. That's good. However, there's no reason to train only on these natural matches. Since your record-linkage algorithm will undoubtedly see many exact matches when run on real data, it should also be exposed to exact matches during training.

Finally, I will say that using the same 0.9 threshold for both runs may not be meaningful. These probabilities are with respect to the training set S, not the real world; and since two different training sets were used, the two scores aren't even comparable to one another. Instead, you should construct a hold-out set of pairs with a known number of true and false matches, then determine a threshold that achieves the trade-off between false positives and false negatives you consider optimal. This is often done by drawing an ROC curve. Only then will you know which classifier generalizes best to real-world data.
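
A sketch of that last step, assuming scikit-learn is available and that the hypothetical `scores`/`labels` arrays come from your hold-out set:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical hold-out set: classifier match scores plus known true labels
scores = np.array([0.97, 0.95, 0.91, 0.88, 0.40, 0.30, 0.10])
labels = np.array([1, 1, 1, 0, 1, 0, 0])  # 1 = true match, 0 = non-match

fpr, tpr, thresholds = roc_curve(labels, scores)

# One common choice: the threshold maximising TPR - FPR (Youden's J);
# weight the two error rates differently if your costs are asymmetric
best = np.argmax(tpr - fpr)
print(f"threshold={thresholds[best]:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```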

olooney
  • Very much appreciate your thought-out response - I understand what you are saying regarding the convergence (we have had such issues in the past). Essentially, these duplicate matches only exist as a result of a `join` operation in Python; were I comparing these records correctly, they wouldn't get through. However, I agree completely that it may be necessary to artificially include more matches in each of our datasets; would including these duplicates be a statistically correct / uniform way of doing so, given that each geographic region would have a different set of duplicates? – Chuck Oct 04 '17 at 08:33