Questions tagged [fuzzyjoin]

An R package for joining tables on inexact matches.

Join tables together based not on whether columns match exactly, but on whether they are similar by some comparison. Implementations include string distance, regular expression, and custom matching functions. Uses syntax similar to dplyr's joins.
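As a rough illustration of the two most commonly asked-about join types, here is a minimal sketch. The data frames `cities`, `lookup`, `shipments`, and `patterns` are hypothetical and not taken from any question listed here; `stringdist_inner_join()` and `regex_left_join()` are functions exported by fuzzyjoin.

```r
library(fuzzyjoin)

# Hypothetical example data (made up for illustration)
cities <- data.frame(name = c("London", "Paris", "Berlin"),
                     stringsAsFactors = FALSE)
lookup <- data.frame(name = c("Londn", "Pariss", "Madrid"),
                     country = c("UK", "France", "Spain"),
                     stringsAsFactors = FALSE)

# String-distance join: keep pairs whose names are at most 1 edit apart.
# Both inputs have a column called `name`, so the result gets name.x / name.y.
by_distance <- stringdist_inner_join(cities, lookup, by = "name", max_dist = 1)

shipments <- data.frame(text = c("FedEx USA Ground", "Apple Shipping", "DHL Express"),
                        stringsAsFactors = FALSE)
patterns <- data.frame(regex = c("^FedEx", "Shipping$"),
                       carrier = c("FedEx", "Other"),
                       stringsAsFactors = FALSE)

# Regex join: the column from the second table is treated as a pattern
# and matched against the column from the first table.
# Rows of `shipments` with no matching pattern are kept with NA in `carrier`.
by_regex <- regex_left_join(shipments, patterns, by = c(text = "regex"))

print(by_distance)
print(by_regex)
```

Because every row of one table is compared against every row of the other, these joins can become slow on large inputs, which is the theme of several questions under this tag.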

161 questions
2
votes
3 answers

Join two dataframes on one column that contains substring of other

I am trying to left-join df2 onto df1. df1 is my dataframe of interest, df2 contains additional information I need. Example: #df of interest onto which the other should be joined key1 <- c("London", "Paris", "Berlin", "Delhi") other_stuff <-…
Auream
  • 55
  • 4
2
votes
2 answers

Joining two dataframes on a condition (grepl)

I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes, df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), …
Jess CT
  • 23
  • 3
2
votes
1 answer

test if words are in a string (grepl, fuzzyjoin?)

I need to do a match and join on two data frames if the strings from two columns of one data frame are contained in the string of a column from a second data frame. Example dataframe: First <- c("john", "jane", "jimmy", "jerry", "matt", "tom",…
tchoup
  • 971
  • 4
  • 11
2
votes
1 answer

Function to `interval_left_join` multiple dataframes

I have several dataframes I want to interval_left_join. I could in theory join the dataframes step-by-step but would prefer a function to perform the joins in one go: Data: df1 <- data.frame( line = 1:4, key = c("a", "b", NA, "a"), start =…
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
2
votes
1 answer

Limiting the amount of fuzzy string comparisons by comparing by subgroup

I have two datasets as follows: DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,…
Tom
  • 2,173
  • 1
  • 17
  • 44
2
votes
1 answer

R fuzzy_left_join with time

I am trying to merge two tables with the conditions "ccode" = "Ticker", "Date"="Date" and "Time"= "Timestamp". However, if there is not an exact match of "Time" it should look at "Timeint" (up to -2 minutes). As this is something I can't do with…
tissler56
  • 23
  • 3
2
votes
1 answer

Join data frames based on fuzzy matching of strings

library(tidyverse) library(fuzzyjoin) df1 <- tibble(col1 = c("Apple Shipping", "Banana Shipping", "FedEX USA Ground", "FedEx USA Commercial", "FedEx International"), col2 = 1:5) #> # A tibble: 5 x 2 #> col1 …
Display name
  • 4,153
  • 5
  • 27
  • 75
2
votes
2 answers

Fuzzy join with 2 large data frames

Here is my example: id <- 1:5 names_1 <- c("hannah", "marcus", "fred", "joe", "lara") df_1 <- data.frame(id, names_1) df_1$phonenumberFound <- NA names_2 <- c("hannah", "markus", "fredd", "joey", "paul", "mary", "olivia") phone <- c(123, 234, 345,…
Rami Al-Fahham
  • 617
  • 1
  • 6
  • 10
2
votes
2 answers

R fuzzyjoin on most recent previous record

I want to join two tables A & B by ID and find in B the most recent date that precedes A[date]. After some searching, it seems that fuzzyjoin allows joining on date ranges: library(fuzzyjoin) fuzzy_left_join(A, B, by = c("ID" =…
cicero
  • 508
  • 3
  • 15
2
votes
5 answers

inner_join() with range of values for one of the keys (year)

I have two datasets that are formatted like this: df1 #> Artist Album Year #> 1 Beatles Sgt. Pepper's 1967 #> 2 Rolling Stones Sticky Fingers 1971 and df2 #> Album Year Producer #> 1 Sgt. Pepper's 1966…
Jeremy K.
  • 1,710
  • 14
  • 35
2
votes
1 answer

Join two datasets with nearest start time with interval fuzzy join

I am trying to join two large datasets in R with 'fuzzyjoin::interval_inner_join'. My goal is to join these two tables based on the nearest start and end times. # first dataset viewing <- data.frame(stringsAsFactors=FALSE, id = c("100-16",…
DanG
  • 689
  • 1
  • 16
  • 39
2
votes
1 answer

Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table

Let's assume I have two databases, dfA and dfB. One has individual observations and one has country-level data (which applies to multiple observations from the same year and country). For each of these databases I have created a key…
Tom
  • 2,173
  • 1
  • 17
  • 44
2
votes
1 answer

Doing a "fuzzyjoin" (and non-fuzzyjoin) in combination with a merge in data.table

I am using multiple databases. For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year. Mostly when I merge these datasets I simply do: dfA<- merge(dfA, dfB, by= "matchcode",…
Tom
  • 2,173
  • 1
  • 17
  • 44
2
votes
1 answer

Simultaneous fuzzy and non-fuzzy join

Say I have this data frame: # Set random seed set.seed(33550336) # Number of IDs n <- 5 # Create data frames df <- data.frame(ID = rep(1:n, each = 10), loc = seq(10, 100, by =10)) # ID loc # 1 1 10 # 2 1 20 # 3 1 30 #…
Dan
  • 11,370
  • 4
  • 43
  • 68
2
votes
1 answer

Conditional joining data frames R

I have a somewhat simple problem that I can't quite grasp. I have two data frames, the first one containing just dates (every month for a bunch of years), the second one also with dates and some other data, but just the months for…