
If I have a set of sentences and I want to extract the duplicates, I can proceed as in the following example:

sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brought this upon you, my",
               "So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brought this upon you, my")

sentences[duplicated(sentences)]

which returns:

[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"        
[3] "I'm sorry I brought this upon you, my"

But in my case I have sentences that are merely similar to one another (due to typos, for example), and I would like to pick out the ones that are most similar to each other. For example:

sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

According to this example, I would like to keep only one sentence from each of the following pairs:

I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my

Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday

So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll

The levenshteinSim function in the RecordLinkage package could help me:

library(RecordLinkage)

levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])

levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])

and so on, return values near 1 for the most similar sentences. I could write a double for loop and select, say, those pairs of sentences with a Levenshtein similarity greater than 0.7. But isn't there a simpler way of doing this?
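A minimal sketch of that brute-force approach, using outer() to build the full pairwise similarity matrix instead of an explicit double loop (0.7 is the arbitrary threshold from above):

library(RecordLinkage)

# Pairwise Levenshtein similarities between all sentences
sim <- outer(seq_along(sentences), seq_along(sentences),
             function(i, j) levenshteinSim(sentences[i], sentences[j]))

# Index pairs above the threshold; upper.tri() counts each pair once
which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)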

Mark

3 Answers

You could calculate an approximate string distance matrix using adist, which is based on a generalized Levenshtein distance, and then do hierarchical clustering on it using hclust.

ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
data.frame(x=sentences, cl=cutree(hc, h=10))
#                                                       x cl
# 1 So there I was at the mercy of three monstrous trolls  1
# 2         Today is my One Hundred and Eleventh birthday  2
# 3                   I'm sorry I brrrought this upon, my  3
# 4      So there I was at mercy of three monstrous troll  1
# 5                Today is One Hundred Eleventh birthday  2
# 6                 I'm sorry I brought this upon you, my  3
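From here, the original goal of dropping near-duplicates can be finished by keeping one representative sentence per cluster; a minimal sketch:

cl <- cutree(hc, h = 10)
sentences[!duplicated(cl)]  # first sentence of each cluster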

To find an appropriate value for the cut height h in cutree, we may plot the dendrogram.

plot(hc)
abline(h=10, col=2, lty=2)

(Figure: dendrogram of hc, with the cut height h = 10 marked by a dashed red line.)

jay.sf

TLDR: You could use a bag-of-words (BoW) representation to convert these sentences to vectors, then check the pairwise correlations and drop one sentence from each pair whose correlation is too high.

Bag of words
Let's think about the following sentence:

  • Jack is a handsome, handsome man

and assume our entire universe of words is in that sentence. Then we can simply create a vector of word counts for this sentence, with one entry per distinct word, i.e. a vector with 5 features (Jack, is, a, handsome, man).

Then, the corresponding BoW representation is: [1, 1, 1, 2, 1].
Another sentence in this universe could be,

  • Jack Jack handsome handsome man

Again, we can use the same 5-feature vector to represent this sentence:

[2, 0, 0, 2, 1].

Then you can calculate the similarity between these sentences in R.

# Jack is a handsome, handsome man
first <- c(1,1,1,2,1)

# Jack Jack handsome handsome man
second <- c(2,0,0,2,1)

cor(first, second, method = "pearson")
#> [1] 0.559017
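As a minimal sketch of the same idea applied to the original sentences (base R only, with a simple preprocessing step that lowercases and strips punctuation), the count vectors can be built automatically:

# Tokenize: lowercase, strip punctuation, split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", sentences)), "\\s+")

# Count each vocabulary word per sentence (rows = sentences)
vocab <- sort(unique(unlist(words)))
bow <- t(sapply(words, function(w) table(factor(w, levels = vocab))))

# Pairwise Pearson correlations between the sentences
round(cor(t(bow)), 2)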
eonurk

You can generate an embedding for each sentence and then calculate the cosine similarity between them.

Embeddings can be generated using either BERT-based models or GloVe models.

BERT: use a sentence transformer, more specifically its semantic-similarity or paraphrase-mining utilities.

GloVe: tokenize the sentence, remove the stopwords, reduce the words to their lemmas, look up the word embeddings, merge them into one sentence embedding, and then calculate the similarity score, i.e. the cosine distance, between sentences.

A similarity score above roughly 93-95% would give you the list of most similar sentences.
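For the cosine step itself, a minimal sketch in R; the embedding vectors here are hypothetical placeholders standing in for real BERT or GloVe output:

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

emb1 <- c(0.12, 0.85, 0.33)  # hypothetical embedding for sentence 1
emb2 <- c(0.10, 0.80, 0.40)  # hypothetical embedding for sentence 2
cosine_sim(emb1, emb2)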

DSDEV