
Suppose I have a dataset like this:

*(example data table not reproduced here; the rows have a datetime, a string, and a freq column)*

that I need to examine for possible duplicates. Here, the 2nd and 3rd rows are suspected duplicates. I'm aware of string distance methods as well as approximate matches for numeric variables. But have the two approaches been combined? Ultimately, I'm looking for an approach that I can implement in R.

Thomas Speidel

1 Answer


I don't think there is a straightforward approach to this problem. You could treat each column separately: datetime as timestamp proximity, string as string proximity (e.g. Levenshtein distance) and freq as numeric distance. You can then rank the rows on each column individually, from smallest to largest difference. Rows that rank high on all three metrics (i.e. the smallest differences) are the best candidates to be duplicates. You can then choose the threshold below which you consider a case duplicated.
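For illustration, here is a minimal sketch of that idea in base R. It assumes a data frame `df` with columns `datetime` (POSIXct), `string` (character) and `freq` (numeric); the column names, the use of `adist()` for Levenshtein distance, and the equal-weight rank combination are assumptions for the sketch, not a tested recipe:

```r
# All unordered pairs of row indices
pairs <- t(combn(nrow(df), 2))

# Per-column distances for each pair
d_time <- abs(as.numeric(difftime(df$datetime[pairs[, 1]],
                                  df$datetime[pairs[, 2]], units = "secs")))
d_str  <- adist(df$string)[pairs]                     # Levenshtein distances
d_freq <- abs(df$freq[pairs[, 1]] - df$freq[pairs[, 2]])

# Rank each metric (smallest distance = rank 1) and sum the ranks;
# pairs with the lowest combined score are the strongest duplicate candidates
score <- rank(d_time) + rank(d_str) + rank(d_freq)

candidates <- data.frame(row1 = pairs[, 1], row2 = pairs[, 2],
                         d_time, d_str, d_freq, score)
candidates[order(candidates$score), ]                 # inspect and pick a cutoff
```

The equal weighting of the three ranks is arbitrary; in practice you may want to rescale or weight the individual metrics before combining them.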

boski
  • Thanks. This was going to be my default approach. I was hoping something similar to the `fuzzyjoin` package existed, allowing a mixture of string distances and numeric distances depending on the variable. – Thomas Speidel Jul 15 '19 at 14:12
  • When you say timestamp proximity, do you have something specific in mind? I was planning to treat it as `as.numeric(datetime)` – Thomas Speidel Jul 15 '19 at 14:12
  • You can use `POSIXlt` notation and the `difftime` function, like `difftime(time1, time2, units="secs")`, and it will return the time difference in seconds. **edit** I see `as.numeric()` turns time into seconds, thus returning the same result. Still, with `POSIXlt` you can select the output as minutes, hours, ... (see the short example below). – boski Jul 15 '19 at 14:16
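A short example of the `difftime()` / `as.numeric()` point from the last comment; the timestamps here are made up purely for illustration:

```r
# Hypothetical timestamps, only to illustrate the comment above
time1 <- as.POSIXct("2019-07-15 10:00:00")
time2 <- as.POSIXct("2019-07-15 10:02:30")

difftime(time2, time1, units = "secs")   # Time difference of 150 secs
difftime(time2, time1, units = "mins")   # Time difference of 2.5 mins

# as.numeric() on POSIXct gives seconds since the epoch,
# so the plain difference is also in seconds
as.numeric(time2) - as.numeric(time1)    # 150
```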