2

I would like to find information in one column based on the other column. So I have some words in one column and complete sentences in another. I would like to know whether it finds the words in those sentences. But sometimes the words are not the same so I cannot use the SQL like function. Thus I think fuzzy matching + some sort of 'like' function would be helpful as the data looks like this:

Names                    Sentences
Airplanes Sarl           Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                Cola Inc. is 50% share of PopsiCo which is part of LaLo.

The data has about 2,000 rows which need a logic to find whether Airplanes Sarl is indeed in the sentence or not, and it also goes for Kidco Ltd. which is in the sentence as 'Kidco.Ltd'.

To simplify matters, I do not need it to search for ALL sentences in the column, it only needs to look for the word Kidco Ltd. and search for it in the same row of the dataframe.

I have already tried it in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

But I got a lot of unicode /ascii errors so I gave up and would like to try in R. Any suggestions on how to go about this in R? I have seen answers on Stackoverflow that would fuzzy match all sentences in the column, which is different from what I want. Any suggestions?

Probs
  • 343
  • 2
  • 6
  • 20
  • can you link us to the answer that fuzzy matched everything ? – moodymudskipper May 29 '17 at 15:09
  • as your table is small you could try levenshtein distances. say d is the distance, n1 the number of characters in col1 and n2 the number of characters in col2. if the name is not at all in the sentence the distance should be closer to n2, if it is there exactly the distance should be n2-n1. Then you'd define a cutoff, I think it may work well. – moodymudskipper May 29 '17 at 15:11

2 Answers2

2

Maybe try tokenization + phonetic matching:

library(RecordLinkage)
library(quanteda)
df <- read.table(header=T, sep=";", text="
Names                    ;Sentences
Airplanes Sarl           ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               ;100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                ;Cola Inc. is 50% share of PopsiCo which is part of LaLo.
Popsi Co.                ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
f <- soundex
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"
tokens <- lapply(tokens, f)
mapply(is.element, soundex(df$Names), tokens)
 # A614  K324  K324  P122  P122 
 # TRUE FALSE  TRUE  TRUE  TRUE 
lukeA
  • 53,097
  • 5
  • 97
  • 100
1

Here's a solution using the method I suggested in the comments, in this example it works well:

library("stringdist")

df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.",
                             "Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.",
                             "Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.",
                             "some company","It is a truth universally acknowledged...",
                             "Hello world",list(NULL)),
                     ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE)

null_elements <- which(sapply(df$Sentences,is.null))
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors
df$dist <- mapply(stringdist,df$Names,df$Sentences)
df$n2 <- nchar(df$Sentences)
df$n1 <- nchar(df$Names)
df$match_quality <- df$dist-(df$n2-df$n1)
cutoff <- 2
df$match <- df$match_quality <= cutoff
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff

# Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • Moody_Mudskipper, the answer is really good! However, if the data in 'Sentences' is NULL then it says that there is a TRUE match. You can try it with the example you gave and then insert anything in 'Names' and leave 'Sentences' empty. – Probs May 29 '17 at 16:23
  • I think it should work fine now, though I didn't have TRUE matches in my case, I had errors if Sentences was NULL, tell me if it works. – moodymudskipper May 30 '17 at 08:33
  • I just came across this while looking for a way to do fuzzy string matching. This is a bit longer but it's got all the details you need to do a quick comparison, and allows for specification of cutoff for 'good' etc. Great help! – Adam_S Jul 25 '18 at 19:27