1

I have two large datasets with the only shared feature being a numerical timestamp. I'd like to merge the data frames by this timestamp, but the frequency of data collection doesn't match exactly so I need to allow it to merge with the nearest possible match.

As a simplified example, here's a small data set with a value column, some event, and an ID:

a<-c("150", "164", "175", "183", "195", "200", "205","213")
b<-c("start1","end1","start2", "end2", "start1", "end1", "start2", "end2")
c<-c("A","A","A", "A", "B", "B", "B", "B")

(data<-data.table(value = a, event = b, ID = c))

And I'd like to be able to merge this "data" with this numerical series ("times") by the value column:

(times<-data.frame(value = c(seq(from = 150, to = 213, by = 3))))

So that they merge by the nearest approximate match in the value column to produce this final data frame:

agoal<-c(seq(from = 150, to = 213, by = 3))
bgoal<-c("start1","","","","","end1","", "",
     "start2", "", "", "end2", "", "", "",
     "start1", "", "end1", "start2", "", "", "end2")
cgoal<-c("A","","","","","A","", "",
         "A", "", "", "A", "", "", "",
         "B", "", "B", "B", "", "", "B")

(goal<-data.frame(value = agoal, event = bgoal, ID = cgoal))

Is there a way to do this, especially for a very large dataset (so it doesn't crash R)?

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
celow
  • 75
  • 1
  • 6
  • 1
    Take a look at the “fuzzyjoin” package. – Dave2e Jul 04 '21 at 15:55
  • Thanks! This seemed to work (at least for the sample data): end<-fuzzyjoin::difference_left_join(times, data, by = "value", max_dist = 1, distance_col= "distance") – celow Jul 04 '21 at 16:18
  • 1
    `data.table` provides a roll join method. One clue https://stackoverflow.com/questions/35046161/how-to-do-a-data-table-rolling-join e.g. `data[times,roll = "nearest"]` ( you need to `setkey` on `value` firstly) – Peace Wang Jul 04 '21 at 16:24

2 Answers2

2

data.table provides a roll join solution.

library(data.table)
setkey(data,value)
setkey(times,value)
data[times,roll = "nearest"]
#    value  event ID
# 1:   150 start1  A
# 2:   153 start1  A
# 3:   156 start1  A
# 4:   159   end1  A
# 5:   162   end1  A
# 6:   165   end1  A
# 7:   168   end1  A
# 8:   171 start2  A
# 9:   174 start2  A
#10:   177 start2  A
#11:   180   end2  A
#12:   183   end2  A
#13:   186   end2  A
#14:   189   end2  A
#15:   192 start1  B
#16:   195 start1  B
#17:   198   end1  B
#18:   201   end1  B
#19:   204 start2  B
#20:   207 start2  B
#21:   210   end2  B
#22:   213   end2  B

data:

a<-c("150", "164", "175", "183", "195", "200", "205","213")
b<-c("start1","end1","start2", "end2", "start1", "end1", "start2", "end2")
c<-c("A","A","A", "A", "B", "B", "B", "B")

data<-data.table(value = as.numeric(a), event = b, ID = c)

times<-data.table(value = c(seq(from = 150, to = 213, by = 3)))
Peace Wang
  • 2,399
  • 1
  • 8
  • 15
1

To join by nearest matches without filling in the gaps with approximate matches, fuzzyjoin worked well!

(end<-fuzzyjoin::difference_left_join(times, data, by = "value", max_dist = 1, distance_col= "distance"))
celow
  • 75
  • 1
  • 6