subset/filter based on a frequency table

Question

I have a df with some text data e.g.

words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry", 
                              "cats dgs",
                              "qhick black fox"))

I can already subset to the rows that contain a spelling error:

library(qdap)
words[check_spelling(words$terms)$row,,drop=F]

But since I have a lot of text data, I want to filter only on the spelling errors that occur more frequently:

> sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T)
qhick 
    2

So I now know that "qhick" is a common misspelling.

How could I then subset words based on this table? So only return rows that contain "qhick"?

Mike H. · Accepted Answer · 2017-06-30T03:12:02.397

The misspelled words are the names of the vector returned by your sort() call (its values are positions in the table, not counts). If you have only one name you can do:

top_misspelled <- sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T)

words[grepl(names(top_misspelled), words$terms), , drop = F]
#            terms
#1 qhick brown fox
#4 qhick black fox

But if you have several, you could collapse them with "|" to build a single alternation pattern for grepl:

words[grepl(paste0(names(top_misspelled), collapse = "|"), words$terms), ,drop = F]
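One caveat with this pattern: grepl matches substrings, so a short target such as "cat" would also match "cats" or "catastrophic". A minimal sketch of one way around that, assuming whole-word matches are what you want, is to wrap the alternation in \\b word boundaries. The `targets` vector below is a hypothetical stand-in for names(top_misspelled) so the example runs without qdap, and "cat" is added only to show the difference.

```r
# Example data from the question
words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry",
                              "cats dgs",
                              "qhick black fox"),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for names(top_misspelled); "cat" is included
# only to demonstrate the substring problem
targets <- c("qhick", "cat")

# Unanchored alternation: "cat" also matches inside "cats"
loose <- grepl(paste0(targets, collapse = "|"), words$terms)

# Alternation wrapped in word boundaries: whole words only
pattern <- paste0("\\b(", paste0(targets, collapse = "|"), ")\\b")
strict <- grepl(pattern, words$terms, perl = TRUE)

words[strict, , drop = FALSE]
```

Here `loose` also flags the "cats dgs" row, while `strict` keeps only the two "qhick" rows. If a target could contain regex metacharacters, it would need escaping before being pasted into the pattern.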

A non-regex option is to split each row into words and return the row if any of its words matches one of your strings of interest:

words[sapply(strsplit(as.character(words[,"terms"]), split=" "), function(x) any(x %in% names(top_misspelled))),
      ,drop = F]

#            terms
#1 qhick brown fox
#4 qhick black fox
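The same word-by-word check may be easier to follow with the per-row test pulled out into a named helper. This is only a sketch: `has_target` and `targets` are hypothetical names (`targets` stands in for names(top_misspelled) so the example runs without qdap), and splitting on a single space assumes the terms are space-separated with no punctuation.

```r
# Example data from the question
words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry",
                              "cats dgs",
                              "qhick black fox"),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for names(top_misspelled)
targets <- c("qhick")

# TRUE if any whole word in `term` is one of `targets`
has_target <- function(term, targets) {
  row_words <- strsplit(term, split = " ", fixed = TRUE)[[1]]
  any(row_words %in% targets)
}

# One logical value per row
keep <- vapply(as.character(words$terms), has_target, logical(1),
               targets = targets, USE.NAMES = FALSE)

words[keep, , drop = FALSE]
```

Because whole words are compared with %in%, a target like "cat" would not match "cats" or "catastrophic".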

edited Jun 30 '17 at 03:12

answered Jun 30 '17 at 02:15


Thanks for answering and sorry for unaccepting. Actually, I want to leave it open for a little while since the regex approach could cause unexpected behavior when a word string is part of another larger word e.g. "cat" in "catastrophic". – Doug Fir Jun 30 '17 at 02:53
No problem, another idea would be to split each row using `strsplit` and then use `sapply` to check if any of the elements in the row match – Mike H. Jun 30 '17 at 03:05
Thank you, that does the trick! I wonder if there's a "dplyr-esque" way of doing this, since I think I can personally follow what's happening with the non-regex method but it's tricky to read. Anyway, thanks again – Doug Fir Jun 30 '17 at 03:45
