I'm trying to join to datasets where the join fields are 1. a numeric column that i want to allow a difference threshold of 0.05, and 2. exact character matches of two other fields. See below for a simplified example of the two datasets and the desired output:
# df1
site genus distance
HA Melaleuca 0.1
HA Melaleuca 0.1
HA Eucalyptus 0.3
HA Melaleuca 0.6
HA Eucalyptus 1.3
HA Eucalyptus 1.55
HA Eucalyptus 1.55
HA Melaleuca 1.75
HA Melaleuca 1.8
HA Melaleuca 1.9
#df2
site genus distance 1998 2008
HA Eucalyptus 0.1 na 4
HA Melaleuca 0.1 4 4
HA Eucalyptus 0.3 4 d
HA Melaleuca 0.65 4 3
HA Melaleuca 1.8 na 4
HA Eucalyptus 1.6 5 4
HA Eucalyptus 2.1 2 3
HA Melaleuca 2.5 4 4
HA Melaleuca 2.6 4 3
HA Eucalyptus 2.7 2 n/a
#desired join output
site genus distance 1998 2008
HA Melaleuca 0.1 na 4
HA Melaleuca 0.1 4 4
HA Eucalyptus 0.3 4 d
HA Melaleuca 0.6 4 3
HA Eucalyptus 1.3 na na
HA Eucalyptus 1.55 5 4
HA Eucalyptus 1.55 5 4
HA Melaleuca 1.75 na na
HA Melaleuca 1.8 na na
HA Melaleuca 1.9 na na
HA Eucalyptus 2.1 2 3
HA Melaleuca 2.5 4 4
HA Melaleuca 2.6 4 3
HA Eucalyptus 2.7 2 n/a
The function difference_full_join()
[fuzzyjoin package] allows to specify a distance threshold (in the above case, it matches any with a "distance" value within 0.05), but I can't within this use a exact character match to make sure that the site
and genus
columns are the same. That second part is easy enough to do in the fuzzy_full_join()
function by specifying ==
as the function to match on, but I don't know how I can write a function for the match_fun = argument so that it performs the same thing. all examples of this function I can find online use simpler terms like <=
.