Questions tagged [fuzzyjoin]

An R package for joining tables together on inexact matching.

Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance, regular expression, or custom matching functions. Uses similar syntax as dplyr's joins.
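
As a rough illustration of the three matching styles named above, here is a minimal sketch. It assumes the fuzzyjoin and dplyr packages are installed. The tables `fruit`, `prices`, `patterns`, and `bands` and their columns are hypothetical and are not taken from any question on this page.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical example tables, invented for this sketch
fruit    <- tibble(name = c("apple", "banana", "carrot"), id = 1:3)
prices   <- tibble(name = c("appel", "bananna", "celery"), price = c(1.0, 0.5, 2.0))
patterns <- tibble(regex = c("^a", "an"), label = c("starts with a", "contains an"))
bands    <- tibble(lower = c(0, 2), band = c("low", "high"))

# String distance: keep pairs whose optimal string alignment distance is at most 2
by_distance <- stringdist_inner_join(
  fruit, prices,
  by = "name", max_dist = 2, method = "osa", distance_col = "dist"
)

# Regular expression: values in fruit$name are tested against the patterns in patterns$regex
by_regex <- regex_inner_join(fruit, patterns, by = c(name = "regex"))

# Custom matching function: keep pairs where fruit$id >= bands$lower
by_custom <- fuzzy_inner_join(fruit, bands, by = c(id = "lower"), match_fun = `>=`)
```

Each call returns a data frame holding the columns of both inputs, so the results can be passed on to the usual dplyr verbs. When both tables share a column name, as `name` does in the first call, the result carries `.x` and `.y` suffixes.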

161 questions
4 votes, 1 answer

Fuzzy Logic Join using two columns

I'm using the R package fuzzyjoin to join two data sets. Currently I am joining on one column and would like to join on two. The first dataset has the name of a location and a column called config; the second dataset has the name of a location and two…
steppermotor
3 votes, 2 answers

Joining two datasets by (non-uniform) names

I need to join two datasets, and the only identifier they share is the company name. For example: db1 <- tibble( Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'), …
msn
3 votes, 1 answer

Fuzzy matching two data frames

I want to merge two data frames df1 and df2. df1<-tibble(x=c("FIDELITY FREEDOM 2015 FUND", "VANGUARD WELLESLEY INCOME FUND"),y=c(1,2)) df2<-tibble(x=c("FIDELITY ABERDEEN STREET TRUST: FIDELITY FREEDOM 2015 FUND", "VANGUARD/WELLESLEY INCOME FUND,…
Jane
3 votes, 1 answer

How to join location data (lat,lon)

I have two datasets: one with some locations (lat, lon), called test, and one with the lat/lon information of all zip codes in NYC, called test2. test <- structure(list(trip_count = 1:10, dropoff_longitude = c(-73.959862, …
Maximilian
3 votes, 1 answer

Join two large datasets in R using both exact and fuzzy matching

I'm trying to inner join two datasets: df1 of 50,000 obs looks something like this: Name | Line.1 | Line.2 | Town | County | Postcode …
3 votes, 1 answer

fuzzy LEFT join with R

library(tidyverse) library(fuzzyjoin) df1 <- tibble(col1 = c("apple", "banana", "carrot"), col2 = as.numeric(0:2), col3 = as.numeric(0:2)) #> # A tibble: 3 x 3 #> col1 col2 col3 #> #> 1 apple …
Display name
3 votes, 1 answer

Limiting the range of a merge with roll = "nearest"

I have two databases that I want to merge. From this link (Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table), I know that I can merge these data.tables, when there is no direct match, with the nearest year available, as follows: …
Tom
3 votes, 1 answer

R: Regex_Join/Fuzzy_Join - Join Inexact Strings in Different Word Orders

df1 df2 df3 library(dplyr) library(fuzzyjoin) df1 <- tibble(a =c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog")) df2 <- tibble(b =c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog"), c = c("Fruit", "Clothes",…
rsylatian
3 votes, 1 answer

stringdist_join results in NAs

I am experimenting with the stringdist package in order to make fuzzy joins, and I have run into a problem that I do not understand and cannot find an answer for. I want to join these 2 data tables with the "dl" method and it produces an NA, which I…
Dome
3 votes, 0 answers

How to apply multiple fuzzy joins to the same data frame

I have the following issue with matching different data frames. First, I have the next…
lolo
3 votes, 1 answer

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

I was answering these two questions and got an adequate solution, but I had trouble passing arguments using fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_fun's,…
Arthur Yip
2 votes, 1 answer

Why is fuzzyjoin slower than data.table in R

When I want to join two data frames based on two intervals, I prefer to use the fuzzyjoin package because it is easy to read in my opinion. But when I need to work with large datasets, the fuzzyjoin package is a no-go because it is very slow. The…
Quinten
2 votes, 2 answers

Inexact joining data based on greater equal condition

I have some values in df: # A tibble: 7 × 1 var1 1 0 2 10 3 20 4 210 5 230 6 266 7 267 that I would like to compare to a second dataframe called value_lookup # A tibble: 4 × 2 var1 value 1 0 0 2…
Julian
2 votes, 1 answer

Join with closest value between two values in R

I was working on the following problem. I've got monthly data from a survey, let's call it df: df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5)) ID reported_value anchor_month 1 1200 …
Juan C
2 votes, 1 answer

dplyr::full_join two data frames with part-match in the "by" argument in R

I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree. mydata <- tibble(Item = c("ab_kssv", "ab_kd",…
lilla