Questions tagged [fuzzyjoin]

An R package for joining tables together on inexact matching.

Join tables together based not on whether columns match exactly, but on whether they are similar by some comparison. Implementations include string distance, regular expression, and custom matching functions. The syntax is similar to that of dplyr's joins.
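As a rough illustration of the string-distance case, here is a minimal sketch using `stringdist_inner_join()`. The data frames `left` and `right`, their contents, and the choice of `max_dist = 1` are made up for this example and do not come from any of the questions listed here.

```r
library(fuzzyjoin)

# Hypothetical example data: the same people, spelled slightly differently.
left  <- data.frame(name = c("Jon Smith", "Maria Garcia"))
right <- data.frame(name = c("John Smith", "Maria Garcia", "Peter Chan"),
                    id   = 1:3)

# Keep every pair of rows whose names are within one edit of each other.
# distance_col adds a column holding the computed distance for each pair.
matched <- stringdist_inner_join(left, right,
                                 by = "name",
                                 max_dist = 1,
                                 distance_col = "dist")

print(matched)
```

Because both inputs have a column called `name`, the result carries them as `name.x` and `name.y`, in the same way a dplyr join would. Raising `max_dist` loosens the match, at the cost of more false positives.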

161 questions
0 votes, 1 answer

R FuzzyJoin by clause with variables

I'm trying to adapt the inner join feature of the fuzzyjoin library. The code: JoinedRecs <- DataToUse1 %>% stringdist_inner_join(DataToUse2, by = c(Full.Name1 = "Full.Name2"), max_dist = 2) seems to work when I hard-code the variables in the…
0 votes, 2 answers

R fuzzy join with big dataframes

I would like to do a left_join(df1, df2) based on fuzzy matches. My df1 has 100k rows and my df2 has 25k rows. Basically, I would like to calculate the string similarity between the join_colum of the two data frames using the Jaro-Winkler method. So…
0 votes, 1 answer

R fuzzyjoin loop through a list

I'm new to R and I've been trying to fuzzy-match two large datasets without crashing my computer. At first it took too long, so I split the data frame into a list and used purrr::map, but it's still taking a long time and not working. So now I'm taking…
Sun
  • 157
  • 11
0 votes, 2 answers

I want to join two data frames, one with a year range and the other with a year

I have two data frames that look like this: data1<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith") , max_year=c(2000,1989,2005), min_year=c(1990,1989,2001)) data2<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith")…
sd3184
  • 69
  • 4
0 votes, 1 answer

R: How to stop fuzzyjoin::interval_join from producing duplicates on the edges?

Recently I had to join two dataframes based on their timestamps. The left data contains a fixed timestamp and the right a range. I got it mostly working as you can see in my MWE, but the system tends to produce duplicate results at the crossing…
Someone2
  • 421
  • 2
  • 15
0 votes, 1 answer

How do I match across large data sets using names, gender, race, and a rough estimate of age in R?

I have 2 data sets, each around 20k rows. df1 contains the following information: first name | last name | race | sex | year of birth | unique ID. df2 contains the following: first name | last name | race | sex | age. I would like to join the data sets…
sd3184
  • 69
  • 4
0 votes, 2 answers

Selective left join in R

I want to selectively left join two dataframes based on a joint column and the condition of rows. I saw some similar posts using fuzzyjoin and sqldf, but the previous examples I found are not exactly like mine. Example dfs: df1 <- data.frame(id =…
zeng156
  • 15
  • 4
0 votes, 1 answer

How to do a fuzzy join and a difference join at the same time

I'm trying to join two datasets where the join fields are 1. a numeric column for which I want to allow a difference threshold of 0.05, and 2. exact character matches of two other fields. See below for a simplified example of the two datasets and the…
sleepy
  • 93
  • 9
0 votes, 1 answer

grouping two data frames and looping over with stringdist_join

I want to do a fuzzy match of American counties by decade using stringdist_join. Since county names change over time, I want to match to the correct county name in each decade. If I…
John Clegg
  • 99
  • 8
0 votes, 1 answer

Fuzzyjoin / stringdist_join weight for capitalisation (case) mismatch (stringdist)

Working with R, I'm looking for ways to weight case (i.e., upper vs lower case) in a stringdist_left_join(). Here's a reproducible example: library(tidyverse) library(fuzzyjoin) tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest",…
gladys_c_hugh
  • 158
  • 1
  • 9
0 votes, 1 answer

Fuzzy Join with POSIXct and POSIXt

test1 <- structure(list(trip_count = 1:10, pickup_datetime = structure(c(1357019059, 1357019939, 1357022493, 1357023065, 1357024439, 1357025235, 1357026348, 1357026924, 1357027562, 1357028863), tzone = "UTC", class = c("POSIXct", "POSIXt")),…
Maximilian
  • 89
  • 1
  • 7
0 votes, 2 answers

R: How to cbind data frames in a list by matching column names, or by partial left join?

My problem is: I have a list of 8 data frames with different column names and similar row names, and I want to cbind these data frames by a column match. For example, in this case I need to align the rows of the columns Yd;Yc;Yb;Ya. myList<-list( …
0 votes, 1 answer

Curly curly passing a column name to mutate or regex_left_join returns error, could not find assignment operator `:=`

I am getting an error in the console: Error :=({ : could not find function ":=" I am using only the fuzzyjoin (by David Robinson) and tidyverse packages. The function is accepted with no syntax errors. On execution, the error is thrown. What could…
Jacek Kotowski
  • 620
  • 16
  • 49
0 votes, 1 answer

Fuzzy Left Join exact + partial string match

I'm using a fuzzy_left_join function to match tables with exact + fuzzy matching. One of the match_fun arguments that I'm using involves checking if part of a string is contained inside another string. When only using exact matching, it returns the…
0 votes, 2 answers

Partial match column in dataframe to create new dataframe

I'm running into an issue with encoding and partial matching. I have two data frames, A and B. A was read in with UTF-8 encoding and B with Latin1. This could already be part of the issue, although I'm not sure. This was the only way I knew how to import…
Tom
  • 11
  • 2