Questions tagged [fuzzyjoin]

An R package for joining tables together on inexact matching.

Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance, regular expression, or custom matching functions. Uses similar syntax as dplyr's joins.

161 questions
1
vote
1 answer

Comparing two columns of two dataframes based on partial string match

I have two sample data frames, df1 and df2 as given below. df1 has the list of selected tennis match fixtures with player names(player1_name,player_name2) and the date they were played. Full names are used here for players. df2 has the list of all…
1
vote
0 answers

Fuzzy left join person full names in R - handling tricky edge cases (cannot install fuzzyjoin)

Here is example data with person full names from two tables that need to be left-joined together, with df1 and the left-hand side table, and df2 as the right-hand side: df1 <- data.frame(fullName = 'Michael Gadson', age = 53) %>% …
Canovice
  • 9,012
  • 22
  • 93
  • 211
1
vote
1 answer

Grouping two data frames using stringdist_join

I am currently working on a project and have reached a problem... I am trying to match two data frames based on a candidate's name. I have managed to do this, however with anything more than a max_dist of 2 I start to get duplicate entries. However,…
1
vote
1 answer

How to match two strings by semantics?

library(dplyr) library(fuzzyjoin) df1 <- data.frame(x = c("Socks", "Mouse")) df2 <- data.frame(y = c("Sock", "House")) stringdist_left_join(df1, df2, by = c(x = "y"), max_dist = 1, ignore_case =…
Mirela
  • 11
  • 3
1
vote
1 answer

Join on substring of a key in R

I'm trying to join two tables with some codes in a way where in one column, the keys might be a subset of the original key. Event id date ProductId quantity a xyz 1234567 30 a abc 5826811 20 b def 3619100 10 b ghi 9268420…
hbabbar
  • 947
  • 4
  • 15
  • 33
1
vote
2 answers

Anti_join by pattern

Tibble A contains keys/patterns that may be found in tibble B. My goal is to identify keys/patterns that were not found in tibble B. I want to anti-join two tables by a matching pattern: A <- tibble( colA = c("B12", "B19", "B202", "B87", "B61",…
amk
  • 349
  • 1
  • 9
1
vote
1 answer

Join a value onto an existing data.table without an exact match (matching within an interval)

I have a column in a data.table (DT1) that is a decreasing count (let's say pol_count), and another column for the average age in a population (say AverageAge). I am trying to take my value of pol_count (say 400) and my value of AverageAge (say 85)…
1
vote
0 answers

data.table fuzzy and non-fuzzy matching with characters strings

So my question is practically the same as Lyngbakr's question in which I have two very large data sets and need to join them through exact matches in some columns and fuzzy matches in others. I want the matches to be exact in the date of birth…
cach dies
  • 331
  • 1
  • 14
1
vote
1 answer

Fuzzyjoin with two columns with only one pair of columns requiring fuzzyjoin syntax

I am attempting to inner_join two data frames, each with three columns. The first data frame contains date, variable names, and forecast values while the second data frame contains date, variable names, and actual values. The purpose of my join is…
user3720887
  • 719
  • 1
  • 11
  • 18
1
vote
3 answers

Banding Variable in R

For the following code: x <- data.frame(year = c(1730, 1860, 1941, 2011)) century_bands <- data.frame(min_year = c(1700, 1800, 1900, 2000), max_year = c(1799, 1899, 1999, 2099), century_name =…
Alan
  • 619
  • 6
  • 19
1
vote
3 answers

How to semi_join two dataframes by string column with one being colon-separated

I have two dataframes, dfa and dfb: dfa <- data.frame( gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"), id = c(1:5) ) dfb <- data.frame( gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"), id = c(6:10) ) which look like…
MonkeyBack
  • 61
  • 6
1
vote
1 answer

How do I do one fuzzy and one exact match in a dataframe?

I want to be able to fuzzy match one column and exact match another column. Say I df1 looks like this: And df2 looks like this: I want to fuzzy match the "Name" but exact match the "Year." So "Ashley" and "Ashlee" would be a match. This is what I…
hy9fesh
  • 589
  • 2
  • 15
1
vote
1 answer

Fuzzy join strings on multiple columns in one dataset

I want to fuzzy match one column (df2$brands) to many other colums (df1$F6_1:f6_12) containing the same strings with some small spelling errors. I have two datasets: df1: df1 <- structure(list(F6_1 = c("Braand1", "Brand2", "Brand3", "Brand4",…
jrabensc
  • 25
  • 6
1
vote
2 answers

Match some columns exactly, and some partially with inner_join

I have two dataframes from different sources that refer to the same people, but due to errors from self-reported data, the dates may be slightly off. Example data: df1 <- data.frame(name= c("Ann", "Betsy", "Charlie", "Dave"), dob=…
pgcudahy
  • 1,542
  • 13
  • 36
1
vote
1 answer

Fuzzy merge on multiple variables (all but one with no mispellings)

I need to match two datasets on three variables. Two of the three variables do not present misspellings (by design). The fuzzy match is required only for the third variable. The standard fuyyzmerge generate some issues by fuzzy-joining all three…
MCS
  • 1,071
  • 9
  • 23