I have datasets with hundreds of thousands of rows (or hundreds of thousands of RDF triples). From a tabular standpoint, each row represents a person's participation in some process. The data are noisy, and what appear to be separate individuals may in reality be the same person. I need to authoritatively assign a new identifier to each unique person modeled in the data, according to rules, but I don't even know if there is a name for this practice.
I am familiar with many kinds of clustering methods, but this seems different to me. I have no idea of the true number of unique individuals, and I don't want to find individuals with a minimal distance between them. I want to find individuals that satisfy some rules provided by my collaborators.
For example, if I have these data:
+-------------+-----+------------+--------+
| Transaction | ID  | DOB        | Gender |
+-------------+-----+------------+--------+
| 1           | 111 | 5/5/1969   | M      |
| 2           | 112 | 6/6/1966   | F      |
| 3           | 113 | 7/7/1970   | F      |
| 4           | 113 | 9/9/1970   | F      |
| 5           | 114 | 2/3/2000   | M      |
| 6           | 114 | 2/4/2000   | F      |
| 7           | 115 | 9/10/2001  | M      |
| 8           | 115 | 11/11/2001 | F      |
+-------------+-----+------------+--------+
And these exhaustive rules:
- people who have the same identifier and the same gender are the same person
- people who have the same identifier and have birth dates within one day of each other are the same person
Then the solution would be:
+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+
| Transaction | ID  | DOB        | Gender | UniqueIdByRules | Notes                                               |
+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+
| 1           | 111 | 5/5/1969   | M      | A               |                                                     |
| 2           | 112 | 6/6/1966   | F      | B               |                                                     |
| 3           | 113 | 7/7/1970   | F      | C               |                                                     |
| 4           | 113 | 9/9/1970   | F      | C               | IDs identical, genders identical                    |
| 5           | 114 | 2/3/2000   | M      | D               |                                                     |
| 6           | 114 | 2/4/2000   | F      | D               | IDs identical, birthdates within one day of another |
| 7           | 115 | 9/10/2001  | M      | E               |                                                     |
| 8           | 115 | 11/11/2001 | F      | F               |                                                     |
+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+
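To make the intended semantics concrete, here is a minimal sketch of one way I could imagine reproducing the table above. It is written in Python only for brevity, and it is not necessarily the right approach. It assumes the dates are month/day/year, treats every rule match as a link between two rows, and gives all linked rows the same label (union-find). Because both rules require the same identifier, rows are only compared within an identifier group. The names `same_person` and `assign_unique_ids` are my own, made up for illustration.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

# (transaction, id, dob, gender); dates are ASSUMED to be month/day/year
ROWS = [
    (1, 111, date(1969, 5, 5), "M"),
    (2, 112, date(1966, 6, 6), "F"),
    (3, 113, date(1970, 7, 7), "F"),
    (4, 113, date(1970, 9, 9), "F"),
    (5, 114, date(2000, 2, 3), "M"),
    (6, 114, date(2000, 2, 4), "F"),
    (7, 115, date(2001, 9, 10), "M"),
    (8, 115, date(2001, 11, 11), "F"),
]


def same_person(a, b):
    """The two rules: same ID and (same gender or DOBs within one day)."""
    same_id = a[1] == b[1]
    same_gender = a[3] == b[3]
    close_dob = abs((a[2] - b[2]).days) <= 1
    return same_id and (same_gender or close_dob)


def assign_unique_ids(rows):
    """Label rows so that rows linked by any chain of rule matches share a label."""
    parent = {r[0]: r[0] for r in rows}  # union-find forest keyed by transaction

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)

    # Both rules need identical IDs, so only compare rows that share an ID.
    by_id = defaultdict(list)
    for r in rows:
        by_id[r[1]].append(r)
    for group in by_id.values():
        for a, b in combinations(group, 2):
            if same_person(a, b):
                union(a[0], b[0])

    # Letters in order of first appearance; only good for a toy example (<= 26 people).
    labels, result = {}, {}
    for r in sorted(rows, key=lambda r: r[0]):
        root = find(r[0])
        if root not in labels:
            labels[root] = chr(ord("A") + len(labels))
        result[r[0]] = labels[root]
    return result


if __name__ == "__main__":
    for transaction, label in assign_unique_ids(ROWS).items():
        print(transaction, label)
```

Under those assumptions this gives the labels in the UniqueIdByRules column. What I can't tell is whether "link, then take connected groups" is really what my collaborators' rules imply, or whether it scales to my data.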
My "best language" is R, but my project's core language is Scala. So I'm especially interested in solutions that could be reasonably implemented in R, Scala, or Java. The original data come as tables but are transformed to RDF triples fairly early in my process, so maybe SWRL is relevant? One of my collaborators has casually suggested PyCLIPS for this kind of problem, so maybe Jess or Drools are relevant?
- What is my problem/task called?
- Are there existing solutions for this, other than exhaustive pairwise comparison?
- Am I going to have problems with transitivity because I have two (or more) rules, and one of them (the birth-date rule) matches on closeness rather than exact equality?
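To illustrate the transitivity worry in the last question, here is a small sketch using three hypothetical records that are not in my data (identifier 116 and transactions 9 to 11 are invented, and dates are assumed to be month/day/year). The first and second match on birth date, the second and third match on gender, but the first and third satisfy neither rule directly.

```python
from datetime import date

# Hypothetical records (transaction, id, dob, gender), invented for illustration.
r1 = (9, 116, date(2000, 2, 3), "M")
r2 = (10, 116, date(2000, 2, 4), "F")
r3 = (11, 116, date(2000, 2, 10), "F")


def same_person(a, b):
    """The two rules: same ID and (same gender or DOBs within one day)."""
    same_id = a[1] == b[1]
    same_gender = a[3] == b[3]
    close_dob = abs((a[2] - b[2]).days) <= 1
    return same_id and (same_gender or close_dob)


print(same_person(r1, r2))  # matched by the birth-date rule
print(same_person(r2, r3))  # matched by the gender rule
print(same_person(r1, r3))  # no rule matches this pair directly
```

If matches are chained, all three records end up as one person even though the first and third differ in both gender and birth date. I don't know whether that is the outcome I should want.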