
I have datasets with hundreds of thousands of rows (or roughly hundreds of thousands of RDF triples). In tabular terms, each row represents a person's participation in some process. The data are noisy, and what appear to be separate individuals may in reality be the same person. I need to authoritatively assign a new identifier to each unique person modeled in the data, according to rules, but I don't even know whether there is a name for this practice.

I am familiar with many kinds of clustering methods, but this seems different to me. I have no idea of the true number of unique individuals, and I don't want to find individuals with a minimal distance between them. I want to find individuals that satisfy some rules provided by my collaborators.

For example, if I have these data:

+-------------+-----+------------+--------+
| Transaction | ID  |    DOB     | Gender |
+-------------+-----+------------+--------+
|           1 | 111 | 5/5/1969   | M      |
|           2 | 112 | 6/6/1966   | F      |
|           3 | 113 | 7/7/1970   | F      |
|           4 | 113 | 9/9/1970   | F      |
|           5 | 114 | 2/3/2000   | M      |
|           6 | 114 | 2/4/2000   | F      |
|           7 | 115 | 9/10/2001  | M      |
|           8 | 115 | 11/11/2001 | F      |
+-------------+-----+------------+--------+

And these exhaustive rules:

  • people who have the same identifier and the same gender are the same person
  • people who have the same identifier and birth dates within one day of one another are the same person

Then the solution would be:

+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+
| Transaction | ID  |    DOB     | Gender | UniqueIdByRules |                        Notes                        |
+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+
|           1 | 111 | 5/5/1969   | M      | A               |                                                     |
|           2 | 112 | 6/6/1966   | F      | B               |                                                     |
|           3 | 113 | 7/7/1970   | F      | C               |                                                     |
|           4 | 113 | 9/9/1970   | F      | C               | IDs identical, genders identical                    |
|           5 | 114 | 2/3/2000   | M      | D               |                                                     |
|           6 | 114 | 2/4/2000   | F      | D               | IDs identical, birthdates within one day of another |
|           7 | 115 | 9/10/2001  | M      | E               |                                                     |
|           8 | 115 | 11/11/2001 | F      | F               |                                                     |
+-------------+-----+------------+--------+-----------------+-----------------------------------------------------+

My "best language" is R, but my project's core language is Scala. So I'm especially interested in solutions that could be reasonably implemented in R, Scala, or Java. The original data come as tables but are transformed to RDF triples fairly early in my process, so maybe SWRL is relevant? One of my collaborators has casually suggested PyCLIPS for this kind of problem, so maybe Jess or Drools are relevant?

  • What is my problem/task called?
  • Are there existing solutions for this, other than exhaustive pairwise comparison?
  • Am I going to have problems with transitivity because I have two (or more) rules, and one of them doesn't require identity?
Mark Miller
  • What is your exact problem? Do you want to produce the final dataframe from the initial one, or do you already have a solution and are just trying to find the name for the process? – Ramesh Maharjan Jul 07 '17 at 14:54
  • thanks, @RameshMaharjan. I haven't implemented anything yet. I typed out the final and initial by hand. I can't think of any way to do the 1-day birth date comparison besides an exhaustive pairwise comparison, which doesn't seem very efficient. And the rules could change in the future, so I'm looking for something generalizable. – Mark Miller Jul 07 '17 at 15:03
  • @MarkMiller You have a minor mistake in your sample data, i.e. id `113` won't match based on the gender rule – UninformedUser Jul 07 '17 at 15:10
  • I'm open to solutions or hints in pure R (with data frames), pure Scala, SparkR (with data frames), Jess, Drools, SWRL... 5 tags isn't enough! – Mark Miller Jul 07 '17 at 15:11
  • @AKSW: thanks, I edited it – Mark Miller Jul 07 '17 at 15:12
  • @MarkMiller Isn't the general task simply called *deduplication*? If so, I guess in RDF this is called *Link Discovery*; in your case the link would be `owl:sameAs`. In the end, you want to find entities that have the same identity even though some data might be conflicting. The final step would be *data fusion*, where you do conflict resolution etc. – UninformedUser Jul 07 '17 at 17:16
  • @AKSW: *deduplication* and *link discovery* have been very useful search terms, although I don't want to use any deduplication method that removes any data. Thanks, I'll post a bare-bones implementation later. – Mark Miller Jul 10 '17 at 15:19
  • @MarkMiller I understand that you don't want to remove data, but in order to do data fusion (which is basically what you want) you need to discover those entities in the dataset that represent the same entity in the real world (resp. domain of interest). There is a lot of research in that area, with very interesting approaches to solve this issue and to reduce the complexity. – UninformedUser Jul 11 '17 at 06:31
  • Still not sure if this is what you need, but you could also reduce it to the problem of link discovery, the goal of which is to find links between entities (with any given semantics) of two datasets `S` and `T`. In your case, you want to find links between `S` and `S`, given that `S` is your dataset. But I guess you already found a sufficient solution – UninformedUser Jul 11 '17 at 06:34

1 Answer


This is called a "for loop" with "if statements".

Sort the data by ID, then iterate over the IDs. If an ID has more than one row, check your conditions with if statements.
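A minimal Java sketch of that idea follows, under several assumptions that are not from the thread. The class and method names (`RuleBasedDedup`, `Row`, `samePerson`, `assignLabels`) are hypothetical. The dates in the question are assumed to be month/day/year. Rows are grouped by ID with a hash map instead of being sorted, which restricts the pairwise checks to rows sharing an ID in the same way. A small union-find structure is added on top of the loop and the if statements so that matches are merged transitively.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; only the two rules come from the question.
class RuleBasedDedup {
    static final class Row {
        final int transaction;
        final int id;
        final LocalDate dob;
        final String gender;

        Row(int transaction, int id, LocalDate dob, String gender) {
            this.transaction = transaction;
            this.id = id;
            this.dob = dob;
            this.gender = gender;
        }
    }

    // The "if statements": both rules require the same ID, plus either the
    // same gender or birth dates at most one day apart.
    static boolean samePerson(Row a, Row b) {
        if (a.id != b.id) {
            return false;
        }
        boolean sameGender = a.gender.equals(b.gender);
        boolean closeDob = Math.abs(ChronoUnit.DAYS.between(a.dob, b.dob)) <= 1;
        return sameGender || closeDob;
    }

    // Union-find lookup with path halving.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Returns one cluster label per row, numbered 0, 1, 2, ... in order of
    // first appearance.
    static int[] assignLabels(List<Row> rows) {
        int n = rows.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }

        // Group row indices by ID so comparisons stay inside each group.
        Map<Integer, List<Integer>> byId = new HashMap<>();
        for (int i = 0; i < n; i++) {
            byId.computeIfAbsent(rows.get(i).id, k -> new ArrayList<>()).add(i);
        }

        // Pairwise comparison, but only among rows that share an ID.
        for (List<Integer> group : byId.values()) {
            for (int x = 0; x < group.size(); x++) {
                for (int y = x + 1; y < group.size(); y++) {
                    int i = group.get(x);
                    int j = group.get(y);
                    if (samePerson(rows.get(i), rows.get(j))) {
                        parent[find(parent, i)] = find(parent, j);
                    }
                }
            }
        }

        Map<Integer, Integer> labelOfRoot = new HashMap<>();
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            int root = find(parent, i);
            Integer label = labelOfRoot.get(root);
            if (label == null) {
                label = labelOfRoot.size();
                labelOfRoot.put(root, label);
            }
            labels[i] = label;
        }
        return labels;
    }
}
```

For the eight rows in the question, `assignLabels` returns `[0, 1, 2, 2, 3, 3, 4, 5]`, which corresponds to A, B, C, C, D, D, E, F. Because of the union-find step, three rows with one ID and birth dates on consecutive days would all receive the same label, even though the first and third are two days apart. Whether that transitive merging is wanted is a question for whoever defines the rules.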

Has QUIT--Anony-Mousse