Compare 2 Text column of 2 Data frame by consecutive letter in R

Question

I've the following 2 dataset: The original dataset and the matching dataset. The matching dataset is used to match against the original dataset.

Original Dataset
MelbourneAir
MelbourneCityAir
AirMelbourne
LondonAir
londonterminal
Airportlondon
NewyorkAirport
Airnewyork
newyorkterminal
airnorway
terminalnorway
swedenair
airsweden
swedenbus
shopbanana
bananashop
bananaashop
appleshop
shopapple

Original_df = structure(list(df.Original.Data = structure(c(11L, 12L, 1L, 9L, 
10L, 4L, 13L, 2L, 14L, 15L, 3L, 20L, 18L, 5L, 19L, 17L, 8L, 7L, 
6L, 16L), .Label = c("AirMelbourne", "airnewyork", "airnorway", 
"AirportLondon", "airsweden", "appleshop", "bananaashop", "bananashop", 
"LondonAir", "LondonTerminal", "MelbourneAir", "MelbourneCityAir", 
"newyorkairport", "newyorkterminal", "norwayterminal", "shopapple", 
"shopbanana", "swedenair", "swedenbus", "terminalnorway"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

Matching Dataset
MelbourneAirport
LondonTerminal
NewYorkAirport
NorwayTerminal
SwedenAirport
BananaShop
AppleShop

Matching_df = structure(list(df.Matching.Data = structure(c(5L, 4L, 6L, 7L, 
8L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "AppleShop", "BananaShop", "LondonTerminal", 
"MelbourneAirport", "NewYorkAirport", "NorwayTerminal", "SwedenAirport"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

head(matching_df,7)

Does anyone know how I can use R and it's packages (stringdist, stringr, stringi, tidyr, dplyr or base R) to do the matching based on consecutive letters (i.e. as long as the text in original dataset matches the text in the matching dataset by 5(or 6) consecutive letters, R will match it accordingly to the following table? I've been using Excel so far to do the matching but given the limit to how much data excel can handle, using R would be more efficient.

Original Dataset	Matched Dataset
MelbourneAir	MelbourneAirport
MelbourneCityAir	MelbourneAirport
AirMelbourne	MelbourneAirport
LondonAir	LondonTerminal
londonterminal	LondonTerminal
Airportlondon	LondonTerminal
NewyorkAirport	NewYorkAirport
Airnewyork	NewYorkAirport
newyorkterminal	NewYorkAirport
airnorway	NorwayTerminal
terminalnorway	NorwayTerminal
swedenair	SwedenAirport
airsweden	SwedenAirport
swedenbus	SwedenAirport
shopbanana	BananaShop
bananashop	BananaShop
bananaashop	BananaShop
appleshop	AppleShop
shopapple	AppleShop

Thanks in advance!

Can you be more explicit in your criteria? You say you want to match by 5 or 6 consecutive letters. Do you mean if there are any 5 consecutive letters in one string that match 5 letters in the other string then they are a match? So `Airportlondon` should match `NewYorkAirport` - but it doesn't in your example output. Or are you saying you want the best match if there are multiple matches? If so how are you defining best? — SamR, Jul 13 '22 at 12:32
Hi, thanks for the question. I acknowledge that it's not possible to match 100% as the dataset gets larger and more words are included. Based on my description, indeed Airportlondon can match with NewYorkAirport. I'm not looking for a 100% match but ideally around 60%-80% match. 5 consecutive letters in one string that match 5 letters in the other string is the basic condition but if there are better conditions i'm all ears — Luther_Proton, Jul 13 '22 at 13:37

score 1 · Accepted Answer · answered Jul 13 '22 at 12:57

stringdist indeed is a way to calculate the distance between two strings.

#load stringdist package
library(stringdist)
#calculate "Optimal string alignment" distance between two strings
sapply(1:nrow(original_df), function(x) stringdist(original_df$df.Original.Data[x], matching_df$df.Matching.Data[x], method = "osa"))

distlist <- list()
for(i in 1:nrow(original_df)) {
  distlist[[i]] <- stringdist(tolower(original_df$df.Original.Data[i]), tolower(matching_df$df.Matching.Data), method = "osa")
}
minindexes <- sapply(distlist, which.min)
original_df$result <- Matching_df$df.Matching.Data[minindexes]


> original_df
   df.Original.Data           result
1      MelbourneAir MelbourneAirport
2  MelbourneCityAir MelbourneAirport
3      AirMelbourne        AppleShop
4         LondonAir   LondonTerminal
5    LondonTerminal   LondonTerminal
6     AirportLondon        AppleShop
7    newyorkairport   NewYorkAirport
8        airnewyork        AppleShop
9   newyorkterminal   LondonTerminal
10   norwayterminal   NorwayTerminal
11        airnorway       BananaShop
12   terminalnorway    SwedenAirport
13        swedenair    SwedenAirport
14        airsweden        AppleShop
15        swedenbus    SwedenAirport
16       shopbanana       BananaShop
17       bananashop       BananaShop
18      bananaashop       BananaShop
19        appleshop        AppleShop
20        shopapple        AppleShop

Several methods are available, see stringdist-metrics

this is so insightful, thank you! – Luther_Proton Jul 13 '22 at 13:46 — Luther_Proton, Jul 13 '22 at 13:46

Compare 2 Text column of 2 Data frame by consecutive letter in R

1 Answers1