0

I've the following 2 dataset: The original dataset and the matching dataset. The matching dataset is used to match against the original dataset.

Original Dataset
MelbourneAir
MelbourneCityAir
AirMelbourne
LondonAir
londonterminal
Airportlondon
NewyorkAirport
Airnewyork
newyorkterminal
airnorway
terminalnorway
swedenair
airsweden
swedenbus
shopbanana
bananashop
bananaashop
appleshop
shopapple
Original_df = structure(list(df.Original.Data = structure(c(11L, 12L, 1L, 9L, 
10L, 4L, 13L, 2L, 14L, 15L, 3L, 20L, 18L, 5L, 19L, 17L, 8L, 7L, 
6L, 16L), .Label = c("AirMelbourne", "airnewyork", "airnorway", 
"AirportLondon", "airsweden", "appleshop", "bananaashop", "bananashop", 
"LondonAir", "LondonTerminal", "MelbourneAir", "MelbourneCityAir", 
"newyorkairport", "newyorkterminal", "norwayterminal", "shopapple", 
"shopbanana", "swedenair", "swedenbus", "terminalnorway"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))
Matching Dataset
MelbourneAirport
LondonTerminal
NewYorkAirport
NorwayTerminal
SwedenAirport
BananaShop
AppleShop
Matching_df = structure(list(df.Matching.Data = structure(c(5L, 4L, 6L, 7L, 
8L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "AppleShop", "BananaShop", "LondonTerminal", 
"MelbourneAirport", "NewYorkAirport", "NorwayTerminal", "SwedenAirport"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

head(matching_df,7)

Does anyone know how I can use R and it's packages (stringdist, stringr, stringi, tidyr, dplyr or base R) to do the matching based on consecutive letters (i.e. as long as the text in original dataset matches the text in the matching dataset by 5(or 6) consecutive letters, R will match it accordingly to the following table? I've been using Excel so far to do the matching but given the limit to how much data excel can handle, using R would be more efficient.

Original Dataset Matched Dataset
MelbourneAir MelbourneAirport
MelbourneCityAir MelbourneAirport
AirMelbourne MelbourneAirport
LondonAir LondonTerminal
londonterminal LondonTerminal
Airportlondon LondonTerminal
NewyorkAirport NewYorkAirport
Airnewyork NewYorkAirport
newyorkterminal NewYorkAirport
airnorway NorwayTerminal
terminalnorway NorwayTerminal
swedenair SwedenAirport
airsweden SwedenAirport
swedenbus SwedenAirport
shopbanana BananaShop
bananashop BananaShop
bananaashop BananaShop
appleshop AppleShop
shopapple AppleShop

Thanks in advance!

Luther_Proton
  • 348
  • 1
  • 7
  • Can you be more explicit in your criteria? You say you want to match by 5 or 6 consecutive letters. Do you mean if there are any 5 consecutive letters in one string that match 5 letters in the other string then they are a match? So `Airportlondon` should match `NewYorkAirport` - but it doesn't in your example output. Or are you saying you want the best match if there are multiple matches? If so how are you defining best? – SamR Jul 13 '22 at 12:32
  • Hi, thanks for the question. I acknowledge that it's not possible to match 100% as the dataset gets larger and more words are included. Based on my description, indeed Airportlondon can match with NewYorkAirport. I'm not looking for a 100% match but ideally around 60%-80% match. 5 consecutive letters in one string that match 5 letters in the other string is the basic condition but if there are better conditions i'm all ears – Luther_Proton Jul 13 '22 at 13:37

1 Answers1

1

stringdist indeed is a way to calculate the distance between two strings.

#load stringdist package
library(stringdist)
#calculate "Optimal string alignment" distance between two strings
sapply(1:nrow(original_df), function(x) stringdist(original_df$df.Original.Data[x], matching_df$df.Matching.Data[x], method = "osa"))

distlist <- list()
for(i in 1:nrow(original_df)) {
  distlist[[i]] <- stringdist(tolower(original_df$df.Original.Data[i]), tolower(matching_df$df.Matching.Data), method = "osa")
}
minindexes <- sapply(distlist, which.min)
original_df$result <- Matching_df$df.Matching.Data[minindexes]


> original_df
   df.Original.Data           result
1      MelbourneAir MelbourneAirport
2  MelbourneCityAir MelbourneAirport
3      AirMelbourne        AppleShop
4         LondonAir   LondonTerminal
5    LondonTerminal   LondonTerminal
6     AirportLondon        AppleShop
7    newyorkairport   NewYorkAirport
8        airnewyork        AppleShop
9   newyorkterminal   LondonTerminal
10   norwayterminal   NorwayTerminal
11        airnorway       BananaShop
12   terminalnorway    SwedenAirport
13        swedenair    SwedenAirport
14        airsweden        AppleShop
15        swedenbus    SwedenAirport
16       shopbanana       BananaShop
17       bananashop       BananaShop
18      bananaashop       BananaShop
19        appleshop        AppleShop
20        shopapple        AppleShop

Several methods are available, see stringdist-metrics

enter image description here

gaut
  • 5,771
  • 1
  • 14
  • 45