
I have a data frame with text:

TERM
good morning
hello
morning good
you're welcome
hello
hi

I would like to filter out all duplicates as well as all entries that contain the same words in a different order, so that I get:

TERM
good morning
hello
you're welcome
hi

I know how to get the distance between two strings with stringdist:

stringdist(stringOriginal, stringCompare, method = "qgram")

But since my data frames are very long, I don't want to loop through all entries.

How can I filter out the similar terms?

Thanks, Joerg

  • You could devise a brute force method with `strsplit` and the set functions `union` and `intersect` or `setdiff`. – lmo Dec 20 '16 at 13:45 (see the sketch after these comments)
  • It would be useful to modify the question to include a small example of the kind of data frame you are starting with, along with the desired output. – Keith Hughitt Dec 20 '16 at 13:49
  • 1
    Using `stringdist` you could do: `library(stringdist); sdm <- stringdistmatrix(DF$TERM, DF$TERM, method = "qgram", useNames = "strings"); sdm[!duplicated(sdm),]` – Steven Beaupré Dec 20 '16 at 14:52
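
For reference, here is a minimal sketch of the brute-force route lmo describes in the first comment, assuming `DF` is a data frame holding the TERM column from the question (a reproducible definition appears in the answer below); the `same_words` helper and the pairwise loop are illustrative, not code from any answer here:

# Brute-force sketch: split each TERM into words and drop any row whose
# word set already appeared in an earlier row (O(n^2) pairwise comparisons).
words <- strsplit(DF$TERM, " ")
same_words <- function(a, b) length(setdiff(a, b)) == 0 && length(setdiff(b, a)) == 0
keep <- !sapply(seq_along(words), function(i) {
  i > 1 && any(sapply(seq_len(i - 1), function(j) same_words(words[[i]], words[[j]])))
})
DF[keep, , drop = FALSE]

This compares every pair of rows, so it grows quadratically with the number of terms; the answer below avoids that by sorting each word list once and using duplicated().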

1 Answer


Break each term up into words, sort the words within each record, and keep the rows whose sorted word lists are not duplicated. No packages are used.

subset(DF, !duplicated(lapply(strsplit(TERM, " "), sort)))

giving:

            TERM
1   good morning
2          hello
4 you're welcome
6             hi

Note: The input in reproducible form is:

Lines <- "TERM
good morning
hello
morning good
you're welcome
hello
hi"
DF <- read.csv(text = Lines, as.is = TRUE, strip.white = TRUE)