Example set-up
I am linking a dataset to itself to find duplicate entries within it. I do not know how many times a duplicate may appear in the dataset.
After blocking, I end up with the following datasets:
[This is an example dataset, not my real data]
1st Dataset: Region AB_1, df1
ID_1 FName_1 SName_1 Area_1 Age_1
1a Ben Nevis AB 30
2a Ben Neviss AB 30
3a Andy Red AB 35
4a Andy Redd AB 35
2nd Dataset: Region AB_2, df2
ID_2 FName_2 SName_2 Area_2 Age_2
1b Ben Nevis AB 30
2b Ben Neviss AB 30
3b Andy Red AB 35
4b Andy Redd AB 35
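For reference, here is a minimal sketch of this example data in pandas (the frame and column names are taken from the tables above; all code in this question is illustrative, not my production pipeline):

```python
import pandas as pd

# Example records for region AB; IDs, names, and ages as in the tables above.
df1 = pd.DataFrame({
    "ID_1": ["1a", "2a", "3a", "4a"],
    "FName_1": ["Ben", "Ben", "Andy", "Andy"],
    "SName_1": ["Nevis", "Neviss", "Red", "Redd"],
    "Area_1": ["AB"] * 4,
    "Age_1": [30, 30, 35, 35],
})

# df2 holds the same records, relabelled, so the dataset can be
# compared against itself.
df2 = df1.rename(columns=lambda c: c.replace("_1", "_2"))
df2["ID_2"] = ["1b", "2b", "3b", "4b"]
```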
So, I'm comparing the records within the same dataset to each other.
I compare the above datasets using an EM algorithm based on the Fellegi-Sunter model, with agreement variables "forename", "surname", and "age".
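For concreteness, here is a minimal sketch of the kind of EM estimation this describes, assuming binary agreement indicators and conditional independence between the variables (the function and parameter names are my own illustration, not any particular linkage library's API):

```python
import numpy as np

def em_fellegi_sunter(gamma, n_iter=50, p=0.1):
    """Estimate Fellegi-Sunter m- and u-probabilities by EM.

    gamma : (n_pairs, n_vars) array of 0/1 agreement indicators
            (e.g. forename, surname, age agreement per record pair).
    p     : initial guess at the proportion of pairs that are matches.
    Returns the fitted m, u, p and the posterior match probability
    of each pair (a score in [0, 1]).
    """
    n_vars = gamma.shape[1]
    m = np.full(n_vars, 0.9)  # P(variable agrees | pair is a match)
    u = np.full(n_vars, 0.1)  # P(variable agrees | pair is a non-match)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match,
        # under the usual conditional-independence assumption.
        lm = np.prod(m ** gamma * (1 - m) ** (1 - gamma), axis=1)
        lu = np.prod(u ** gamma * (1 - u) ** (1 - gamma), axis=1)
        w = p * lm / (p * lm + (1 - p) * lu)

        # M-step: re-estimate the parameters from the expected labels.
        p = w.mean()
        m = (w[:, None] * gamma).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()

    return m, u, p, w
```

In a set-up like this, the posterior `w` would play the role of the score I threshold at 0.9 below.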
I create my comparison space by comparing every record in Dataset 1 with every record in Dataset 2, i.e. 4 * 4 = 16 possible record pairs; a sketch of generating these pairs in code follows the list.
e.g.
df1 record vs df2 record
1a 1b
1a 2b
1a 3b
1a 4b
2a 1b
2a 2b
2a 3b
2a 4b
3a 1b
3a 2b
3a 3b
3a 4b
4a 1b
4a 2b
4a 3b
4a 4b
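Continuing the sketch above, this comparison space and the agreement vectors can be built with a cross join (pandas `merge(..., how="cross")`, available from pandas 1.2):

```python
import pandas as pd  # df1, df2 as constructed in the earlier sketch

# Full comparison space: every df1 record against every df2 record,
# i.e. 4 * 4 = 16 candidate pairs for the example data.
pairs = df1.merge(df2, how="cross")

# Binary agreement indicators for the three matching variables,
# in the shape expected by em_fellegi_sunter above.
gamma = pd.DataFrame({
    "forename": (pairs["FName_1"] == pairs["FName_2"]).astype(int),
    "surname":  (pairs["SName_1"] == pairs["SName_2"]).astype(int),
    "age":      (pairs["Age_1"] == pairs["Age_2"]).astype(int),
}).to_numpy()
```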
The issue
However, this means that pairs consisting of the same record compared with itself are passing into my EM algorithm:
e.g.
1a 1b
2a 2b
3a 3b
4a 4b
These pairs are not required; they are just a remnant of forming the comparison space.
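If a record identifier is carried through the cross join, these self-pairs can be dropped before the EM step. A sketch, assuming the digit part of the example IDs ("1" in "1a"/"1b") identifies the underlying record; with real data a proper unique record key would be compared instead:

```python
# Drop pairs where both sides are the same underlying record.
# "1a"/"1b" -> "1": the trailing dataset suffix is stripped so the
# remaining key identifies the record itself (an assumption for this
# example; use your real unique identifier in practice).
same_record = pairs["ID_1"].str[:-1] == pairs["ID_2"].str[:-1]
pairs_dedup = pairs[~same_record].reset_index(drop=True)
gamma_dedup = gamma[~same_record.to_numpy()]
```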
As the EM algorithm is a "learning algorithm", which optimises the agreement and disagreement weights for each variable based on its input, I am essentially providing it with extra training information.
This is reflected in the results:
If I remove these pairs before I run my algorithm, I get 3001 records above a score of 0.9 (using my real dataset).
However, if I remove them only after running the algorithm, I get just 2486 records above a score of 0.9.
I.e. the algorithm is more selective when these self-pairs are included, presumably because they are all perfect agreements and so pull the estimated m-probabilities towards 1.
Ultimately:
It doesn't make sense to me to include them in the EM, but I'm concerned that removing them will lower the accuracy of my algorithm.
Should I remove these known self-pairs before I run the EM?