
I have a data frame with text:

TERM
good morning
hello
morning good
you're welcome
hello
hi

I would like to filter out all duplicates as well as all entries that contain the same words in a different order, so that I get:

TERM
good morning
hello
you're welcome
hi

I know how to get the distance between two strings with stringdist:

stringdist(stringOriginal, stringCompare, method = "qgram")

But since my data frames are very long, I don't want to loop through all entries.

How can I filter out the similar terms?

Thanks, Joerg

  • You could devise a brute force method with `strsplit` and the set functions `union` and `intersect` or `setdiff`. – lmo Dec 20 '16 at 13:45 (see the sketch after these comments)
  • It would be useful to modify the question to include a small example of the kind of data frame you are starting with, along with the desired output. – Keith Hughitt Dec 20 '16 at 13:49
  • 1
    Using `stringdist` you could do: `library(stringdist); sdm <- stringdistmatrix(DF$TERM, DF$TERM, method = "qgram", useNames = "strings"); sdm[!duplicated(sdm),]` – Steven Beaupré Dec 20 '16 at 14:52
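
For reference, here is a minimal sketch of the brute-force route lmo describes in the first comment, assuming `DF` is a data frame holding the TERM column from the question (a reproducible definition appears in the answer below); the `same_words` helper and the pairwise loop are illustrative, not code from any answer here:

# Brute-force sketch: split each TERM into words and drop any row whose
# word set already appeared in an earlier row (O(n^2) pairwise comparisons).
words <- strsplit(DF$TERM, " ")
same_words <- function(a, b) length(setdiff(a, b)) == 0 && length(setdiff(b, a)) == 0
keep <- !sapply(seq_along(words), function(i) {
  i > 1 && any(sapply(seq_len(i - 1), function(j) same_words(words[[i]], words[[j]])))
})
DF[keep, , drop = FALSE]

This compares every pair of rows, so it grows quadratically with the number of terms; the answer below avoids that by sorting each word list once and using duplicated().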

1 Answer


Break each term up into words, sort the words within each record, and keep the rows whose sorted word lists are not duplicated. No packages are used.

subset(DF, !duplicated(lapply(strsplit(TERM, " "), sort)))

giving:

            TERM
1   good morning
2          hello
4 you're welcome
6             hi

Note: The input in reproducible form is:

Lines <- "TERM
good morning
hello
morning good
you're welcome
hello
hi"
DF <- read.csv(text = Lines, as.is = TRUE, strip.white = TRUE)