If I have a set of sentences and I would like to extract the duplicates, I should work like in the following example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my",
"So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my")
sentences[duplicated(sentences)]
which returns:
[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"
[3] "I'm sorry I brought this upon you, my"
But in my case I have sentences that are similar to each other (due to typos, for example) and I would like to select the ones that are more similar to each other. For example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brrrought this upon, my",
"So there I was at mercy of three monstrous troll",
"Today is One Hundred Eleventh birthday",
"I'm sorry I brought this upon you, my")
According to this example, I would like to select one between each of the following pairs:
I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my
Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday
So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll
The levenshteinSim
function in the RecordLinkage
package could help me:
library(RecordLinkage)
levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])
levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])
and so on, return values near 1 for the most similar sentences. I could write a double for loop
and select, e.g., those pairs of sentences that have a Levenshtein edit distance greater than 0.7 (e.g.). But, isn't there a more simple way of doing this?