
Suppose I have a dataset like this:

*(example data table not reproduced here; the rows have a datetime, a string, and a freq column)*

that I need to examine for possible duplicates. Here, the 2nd and 3rd rows are suspected duplicates. I'm aware of string distance methods as well as approximate matches for numeric variables. But have the two approaches been combined? Ultimately, I'm looking for an approach that I can implement in R.

Thomas Speidel

1 Answer


I don't think there is a straightforward approach to this problem. You could treat each column separately: datetime as timestamp proximity, string as string proximity (e.g. Levenshtein distance) and freq as numeric distance. You can then rank the rows on each column individually, from smallest to largest difference. Rows that rank high on all three metrics (i.e. the smallest differences) are the best candidates to be duplicates. You can then choose the threshold below which you consider a case duplicated.
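For illustration, here is a minimal sketch of that idea in base R. It assumes a data frame `df` with columns `datetime` (POSIXct), `string` (character) and `freq` (numeric); the column names, the use of `adist()` for Levenshtein distance, and the equal-weight rank combination are assumptions for the sketch, not a tested recipe:

```r
# All unordered pairs of row indices
pairs <- t(combn(nrow(df), 2))

# Per-column distances for each pair
d_time <- abs(as.numeric(difftime(df$datetime[pairs[, 1]],
                                  df$datetime[pairs[, 2]], units = "secs")))
d_str  <- adist(df$string)[pairs]                     # Levenshtein distances
d_freq <- abs(df$freq[pairs[, 1]] - df$freq[pairs[, 2]])

# Rank each metric (smallest distance = rank 1) and sum the ranks;
# pairs with the lowest combined score are the strongest duplicate candidates
score <- rank(d_time) + rank(d_str) + rank(d_freq)

candidates <- data.frame(row1 = pairs[, 1], row2 = pairs[, 2],
                         d_time, d_str, d_freq, score)
candidates[order(candidates$score), ]                 # inspect and pick a cutoff
```

The equal weighting of the three ranks is arbitrary; in practice you may want to rescale or weight the individual metrics before combining them.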

boski
  • Thanks. This was going to be my default approach. I was hoping something similar to the `fuzzyjoin` package existed, allowing a mixture of string distances and numeric distances depending on the variable. – Thomas Speidel Jul 15 '19 at 14:12
  • When you say timestamp proximity, do you have something specific in mind? I was planning to treat it as `as.numeric(datetime)` – Thomas Speidel Jul 15 '19 at 14:12
  • You can use `POSIXlt` notation and the `difftime` function, like `difftime(time1, time2, units="secs")`, and it will return the time difference in seconds. **edit** I see `as.numeric()` turns time into seconds, thus returning the same result. Still, with `POSIXlt` you can select the output as minutes, hours, ... (see the short example below). – boski Jul 15 '19 at 14:16
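A short example of the `difftime()` / `as.numeric()` point from the last comment; the timestamps here are made up purely for illustration:

```r
# Hypothetical timestamps, only to illustrate the comment above
time1 <- as.POSIXct("2019-07-15 10:00:00")
time2 <- as.POSIXct("2019-07-15 10:02:30")

difftime(time2, time1, units = "secs")   # Time difference of 150 secs
difftime(time2, time1, units = "mins")   # Time difference of 2.5 mins

# as.numeric() on POSIXct gives seconds since the epoch,
# so the plain difference is also in seconds
as.numeric(time2) - as.numeric(time1)    # 150
```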