0

I am trying to do a full join on two datasets where the merge by columns differ in their values and I am trying to merge data based on the closest lower value. Example dataset 1

enter image description here

Example dataset 2

enter image description here

What I want it to look like

enter image description here

I have tried to use fuzzy_full_join

fuzzy_full_join(gps.pts, dive.sum, by = c('num' = 'num.1'), match_fun = '>')

but get the following error

Error in which(m) : argument to 'which' is not logical
Rana
  • 25
  • 5

1 Answers1

1

We could use join_by()

dplyr >= 1.1.0

We now could do

Rolling joins

Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for "rolling" the closest match forward/backwards when there isn't an exact match. To construct a rolling join, wrap an inequality with closest().

closest(expr)

expr must be an inequality involving one of: >, >=, <, or <=

For example, closest(x >= y) is interpreted as: For each value in x, find the closest value in y that is less than or equal to that x value.

closest() will always use the left-hand table (x) as the primary table, and the right-hand table (y) as the one to find the closest match in, regardless of how the inequality is specified. For example, closest(y$a >= x$b) will always be interpreted as closest(x$b <= y$a).

See ?join_by()

library(dplyr) #>= 1.1.0

left_join(df1, df2, join_by(closest(num >= num)))

   num.x      Lat      Long num.y dives underwate
1      2 30.89878 -88.78650     1     5      6.77
2      6 30.89760 -88.97650     4     9      5.99
3      8 30.89786 -88.67450     7     1     12.44
4     12 40.88764 -88.76450     9    22     23.20
5     15 30.98786 -88.87645    14    16      1.50
6     18 30.87650 -88.53726    16    11      2.40
7     22 30.62534 -88.66543    20    19     22.40
8     25 30.64536 -88.43434    24     7     18.60
9     29 30.98764 -88.76285    27     9     11.20
10    32 30.87465 -88.23488    31    23    197.70

data.table

library(data.table)

# Convert your dataframes to data.tables
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Set keys for rolling join
setkey(dt1, num)
setkey(dt2, num)

# Perform the rolling join
result <- dt2[dt1, roll = Inf]

# combine
cbind(dt2[,1, with = FALSE], result)

data:


df1 <- structure(list(num = c(2L, 6L, 8L, 12L, 15L, 18L, 22L, 25L, 29L, 
32L), Lat = c(30.89878, 30.8976, 30.897865, 40.88764, 30.98786, 
30.8765, 30.62534, 30.64536, 30.98764, 30.87465), Long = c(-88.7865, 
-88.9765, -88.6745, -88.7645, -88.87645, -88.53726, -88.665433, 
-88.434337, -88.76285, -88.23488)), class = "data.frame", row.names = c(NA, 
-10L))
         
df2 <- structure(list(num = c(1L, 4L, 7L, 9L, 14L, 16L, 20L, 24L, 27L, 
31L), dives = c(5L, 9L, 1L, 22L, 16L, 11L, 19L, 7L, 9L, 23L), 
    underwate = c(6.77, 5.99, 12.44, 23.2, 1.5, 2.4, 22.4, 18.6, 
    11.2, 197.7)), class = "data.frame", row.names = c(NA, -10L
))
TarJae
  • 72,363
  • 6
  • 19
  • 66