1

I have two tables that I would like to merge based on an exact match on one variable and a fuzzy match on another.

Consider the two tables below. For each id1 in dt1, I would like to find an id2 in dt2 that matches exactly on size and where the date value in dt2 is either equal to or later than the date field in dt1. If there are multiple matches, I would like to have one randomly selected.

dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"),
c("id1", "size", "date"))

dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4), as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
"2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"),
c("id2", "size", "date"))

The resulting table could look like this:

   id1 size       date  id2
1:   A    2 2013-03-27    1
2:   B    3 2014-05-08    3

or like this (depending on the random selection)

   id1 size       date  id2
1:   A    2 2013-03-27    4
2:   B    3 2014-05-08    3    
Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64
user1389960
  • 433
  • 4
  • 11

2 Answers2

1

I'm not sure that this is typically what most people have in mind when they say 'fuzzy matching'--you want to combine two tables and then do something random with the result of the match, as in:

library(data.table)
library(tidyverse)

set.seed(1234)
dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"),
         c("id1", "size", "date"))

dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4), as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
                                                                   "2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"),
         c("id2", "size", "date"))

dt <- full_join(dt1, dt2, by = "size") %>% 
  filter(date.y >= date.x) %>% 
  group_by(size) %>%
  sample_n(size = 1)
edavidaja
  • 698
  • 1
  • 6
  • 15
1

To join by size and select appropriate date entries we can use a non-equi join:

> # Rename the date columns to make the join step clear:
> setnames(dt1, "date", "date1")
> setnames(dt2, "date", "date2")

> # Non equi-join will give all entries in dt2 matching on size where
> # date2 >= date1:
> dt2[dt1, on=.(size, date2 >= date1)]
   id2 size      date2 id1
1:   4    2 2013-03-27   A
2:   1    2 2013-03-27   A
3:   3    3 2014-05-08   B

I couldn't find a reliable way of doing the random selection step in conjunction with the join. As a hacky solution, we can add a new column to the table above containing the shuffled row numbers, then select the row with the largest shuffled row number per id1:

> joined <- dt2[dt1, on=.(size, date2 >= date1)]
> joined[, selection_column := sample(.I, .N)] 
> filtered <- joined[,.SD[which.max(selection_column)], by=id1]
> filtered[, selection_column := NULL]
> filtered
   id1 id2 size      date2
1:   A   1    2 2013-03-27
2:   B   3    3 2014-05-08

Alternatively, we can use dplyr to do the random selection step:

> library(dplyr)
> dt2[dt1, on=.(size, date2 >= date1)] %>% 
+   group_by(id1) %>% 
+   sample_n(1) %>% 
+   as.data.table()  
   id2 size      date2 id1
1:   4    2 2013-03-27   A
2:   3    3 2014-05-08   B
Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64
  • 1
    What about `dt2[dt1, on=.(size, date >= date), .(id1, id2 = sample(id2,1)), by=.EACHI]` ? – thelatemail Nov 21 '17 at 00:50
  • That works if you don't want to keep the size and date columns, but I couldn't see a way of keeping multiple columns using that approach. – Scott Ritchie Nov 21 '17 at 00:58
  • @ScottRitchie I don't see any "missing" columns in thelatemail's answer. Also, if you want to do the random selection after, it's trivial to do - one way would be: `dt2[dt1, on=.(size, date >= date)][, .SD[sample(.N, 1)], by = id1]`. Also note that you don't need to rename columns. – eddi Nov 21 '17 at 17:21
  • @eddi looks like you're right. I'm not sure why I couldn't get those approaches to work yesterday. – Scott Ritchie Nov 22 '17 at 03:50