0

I'm fairly new to R, and have been sifting through other questions all morning trying to figure this out, but can't find anything related enough or my knowledge of R is not good enough to understand some of the suggested solutions to my problem.

I have two data frames, table A with a list of non-unique identifiers and a date, and table B with the same identifier field, and a start and end date outlining a 3 month date range. In my real data, I have 1.7m records in table A, and 1.6k records in table A (as well many other fields that i'll use for the final analysis). I am expecting the vast majority of records in table A to be unnecessary.

What I want to achieve is to join the two tables together, joining on the identifier, and then only joining if the date in table A falls inside the date range in table B. I want this as an inner join so I lose the unnecessary data.

Example tables:

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

Expected result (we lose person 3 because the date falls outside the range, person 2 has two records because they had two entries in table b with corresponding dates):

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

I've been basing my solution on this web page https://exploratory.io/note/exploratory/How-to-join-two-data-frames-with-date-ranges-moq8hEQ6, however my example adds in the requirement to also join the identifier.

My solution that logically makes sense to me based on my (limited) ability in R:

Joined <- fuzzy_inner_join(b, a, by = c("numberb"="numbera", "datex"="date1", "datex"="date2"),
                           match_fun = list("=", ">=", "<="))

However I get this error message:

Error in which(m) : argument to 'which' is not logical

Thank you in advance for any help here :)

P Meddyyy
  • 15
  • 4
  • Thank you for the responses! I now have an issue with the data set being so large my PC doesn't have enough memory to process the join, so I need to try either splitting the data into smaller chunks and unioning after, or I'll also try the data table suggestions if they are more efficient. – P Meddyyy Jul 09 '22 at 14:27

3 Answers3

2

With backticks for the match_fun and == for the number:

library(tidyverse)
library(fuzzyjoin)

a <- data.frame(
  numbera = c("1", "2", "3", "1"),
  date1 = as.Date(c("10/04/2021", "21/06/2021", "02/10/2021", "17/02/2022"),
    format = "%d/%m/%Y"
  ),
  date2 = as.Date(c("10/07/2021", "21/09/2021", "02/01/2022", "17/05/2022"),
    format = "%d/%m/%Y"
  )
)

b <- data.frame(
  numberb = c("1", "2", "2", "3", "1"),
  datex = as.Date(c("16/05/2021", "01/08/2021", "03/08/2021", "02/09/2021", "17/03/2022"),
    format = "%d/%m/%Y"
  )
)

fuzzy_inner_join(b, a,
  by = c("numberb" = "numbera", "datex" = "date1", "datex" = "date2"),
  match_fun = list(`==`, `>=`, `<=`)
)
#>   numberb      datex numbera      date1      date2
#> 1       1 2021-05-16       1 2021-04-10 2021-07-10
#> 2       2 2021-08-01       2 2021-06-21 2021-09-21
#> 3       2 2021-08-03       2 2021-06-21 2021-09-21
#> 4       1 2022-03-17       1 2022-02-17 2022-05-17

Created on 2022-07-08 by the reprex package (v2.0.1)

Carl
  • 4,232
  • 2
  • 12
  • 24
  • Thank you, marked this as the answer as it directly resolves the question. Good to know it was just syntax rather than anything more complicated!! – P Meddyyy Jul 09 '22 at 14:28
2

data.table supports non-equi joins that are the fastest I've found. The syntax is a bit different from the tidyverse but I think it's worth it for these more complicated joins.

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

library(data.table)

dt_a <- setDT(a) # change the data frame to a data table

dt_b <- setDT(b) # change the data frame to a data table

Joined <- dt_b[ # take table b
  dt_a, # join table a
  .(numberb, datex = x.datex, numbera, date1, date2), # selecting these columns
  on = .(numberb == numbera, datex>=date1, datex<=date2), # joining on these columns
  nomatch = NULL # remove non-matches for an inner join
] |> 
  setDF() # change it back to a data frame for comparison

identical(c, Joined) # TRUE
  • Thank you for the very helpful comments on each part. I'll save this for my future reference. – P Meddyyy Jul 09 '22 at 14:31
  • I am actually astounded at the efficiency of this! I was having difficulty with memory issues with the fuzzy join, as my original data set was 1.7m rows and my work laptop is not the best. I split it into 8 chunks, and even that was too intensive. I have now had a chance to try this approach, and it completed the join in less than a second. My jaw literally hit the floor! Thank you for this, i've now marked it as the answer as ultimately it solved my problem entirely. Don't worry if this too much, is there an easy way to add on all rows to the select part, without adding them manually? Thanks! – P Meddyyy Jul 15 '22 at 15:16
1

Alternate solution using data.table:

library(data.table)
dt <- merge(a,b, by.x = "numbera", by.y = "numberb")
setDT(dt)
dt[date1 <= datex & date2 >= datex]

   numbera      date1      date2      datex
1:       1 2021-04-10 2021-07-10 2021-05-16
2:       1 2022-02-17 2022-05-17 2022-03-17
3:       2 2021-06-21 2021-09-21 2021-08-01
4:       2 2021-06-21 2021-09-21 2021-08-03

akshaymoorthy
  • 326
  • 1
  • 4
  • 1
    Wil give this a go to see if it helps with the memory issue but I suspect the data is just too big. Thanks – P Meddyyy Jul 09 '22 at 14:30