How to do a fuzzy join and a difference join at the same time

Question

I'm trying to join to datasets where the join fields are 1. a numeric column that i want to allow a difference threshold of 0.05, and 2. exact character matches of two other fields. See below for a simplified example of the two datasets and the desired output:

# df1
site    genus   distance        
HA  Melaleuca   0.1     
HA  Melaleuca   0.1     
HA  Eucalyptus  0.3     
HA  Melaleuca   0.6     
HA  Eucalyptus  1.3     
HA  Eucalyptus  1.55        
HA  Eucalyptus  1.55        
HA  Melaleuca   1.75        
HA  Melaleuca   1.8     
HA  Melaleuca   1.9



#df2
site    genus   distance    1998    2008
HA  Eucalyptus  0.1 na  4
HA  Melaleuca   0.1 4   4
HA  Eucalyptus  0.3 4   d
HA  Melaleuca   0.65    4   3
HA  Melaleuca   1.8 na  4
HA  Eucalyptus  1.6 5   4
HA  Eucalyptus  2.1 2   3
HA  Melaleuca   2.5 4   4
HA  Melaleuca   2.6 4   3
HA  Eucalyptus  2.7 2   n/a

#desired join output
site    genus   distance    1998    2008
HA  Melaleuca   0.1 na  4
HA  Melaleuca   0.1 4   4
HA  Eucalyptus  0.3 4   d
HA  Melaleuca   0.6 4   3
HA  Eucalyptus  1.3 na  na
HA  Eucalyptus  1.55    5   4
HA  Eucalyptus  1.55    5   4
HA  Melaleuca   1.75    na  na
HA  Melaleuca   1.8 na  na
HA  Melaleuca   1.9 na  na
HA  Eucalyptus  2.1 2   3
HA  Melaleuca   2.5 4   4
HA  Melaleuca   2.6 4   3
HA  Eucalyptus  2.7 2   n/a

The function difference_full_join() [fuzzyjoin package] allows to specify a distance threshold (in the above case, it matches any with a "distance" value within 0.05), but I can't within this use a exact character match to make sure that the site and genus columns are the same. That second part is easy enough to do in the fuzzy_full_join() function by specifying == as the function to match on, but I don't know how I can write a function for the match_fun = argument so that it performs the same thing. all examples of this function I can find online use simpler terms like <=.

score 0 · Answer 1 · answered Mar 31 '21 at 04:12

I wanted to post an answer that I found as a workaround.

Because the character matches were indentical, and not fuzzy, I added a numeric factor column to the two original datasets (so Eucalyptus = 1, Melaleuca = 2, and HA = 1 in example above. Then, I did a difference_full_join with a difference of 0.5 - because the factors were a perfect match, this meant that the difference_join was only used on the distance column.

How to do a fuzzy join and a difference join at the same time

1 Answers1