R remove stopwords from a character vector using %in%

Question

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.

library(plyr)
library(tm)

stopWords <- stopwords("en")
class(stopWords)

df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."

head(df1)
df1$string1 <- tolower(df1$string1)
str1 <-  strsplit(df1$string1[5], " ")

> !(str1 %in% stopWords)
[1] TRUE

This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector.

What am I doing wrong?

The problem is obvious: string number 5 is grammatically incorrect. :-) OK, well, I think Arun's on the right track, assuming that "word" strictly means a string of characters with no whitespace. After running his code on all elements of `df1$string1`, you could call `unique` if you just want a list of the words rather than their counts. – Carl Witthoft, Mar 06 '13 at 18:58

Arun · Accepted Answer · 2013-03-06T17:24:37.580


You are not accessing the list properly and you're not getting the elements back from the result of %in% (which gives a logical vector of TRUE/FALSE). You should do something like this:

unlist(str1)[!(unlist(str1) %in% stopWords)]

(or)

str1[[1]][!(str1[[1]] %in% stopWords)]
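To see why the original expression returns a single TRUE, here is a minimal sketch. It uses a small hand-written `stop_words` vector (an assumption, standing in for `tm::stopwords("en")`) so that it runs without tm:

```r
# Hypothetical stand-in for tm::stopwords("en"), so no package is needed
stop_words <- c("this", "is", "the", "of", "all", "other")

s <- "this string is the longest string of all the other strings."
str1 <- strsplit(s, " ")   # a list of length 1 holding one character vector

# %in% treats the list as a single element, so the result has length 1
length(str1 %in% stop_words)

# Pull the character vector out of the list first, then subset it
tokens <- str1[[1]]
kept <- tokens[!(tokens %in% stop_words)]
kept
```

The logical vector from `%in%` is only a mask; the words themselves come back once it is used to index `tokens`.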

For the whole data.frame df1, you could do something like:

'%nin%' <- Negate('%in%')   # "not in" operator
lapply(df1[,2], function(x) {
    tokens <- unlist(strsplit(x, " "))   # split one string into words
    tokens[tokens %nin% stopWords]       # keep only the non-stop words
})

# [[1]]
# [1] "string"  "string."
# 
# [[2]]
# [1] "string"   "slightly" "string." 
# 
# [[3]]
# [1] "string"  "string."
# 
# [[4]]
# [1] "string"   "slightly" "shorter"  "string." 
# 
# [[5]]
# [1] "string"   "string"   "strings."

I didn't realize str1 was outputting as a list, I assumed it was a vector, thank you. – screechOwl Mar 06 '13 at 17:26

Thanks for using `Negate` -- I'd completely forgotten about the `funprog` suite of goodies. – Carl Witthoft Mar 06 '13 at 20:55

Using `setdiff` would be even simpler, and you should probably use `lapply` on the results of `strsplit`: `lapply(strsplit(df1$string, " "), setdiff, stopWords)`. The only disadvantage is you get unique words. – hadley Mar 06 '13 at 22:30
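
A small sketch of the trade-off described in this comment, again with a hypothetical `stop_words` vector in place of the tm dictionary. The `%in%` approach keeps repeated words, while `setdiff` returns each remaining word once:

```r
# Hypothetical stand-in for the tm stop word list
stop_words <- c("this", "is", "a")

x <- c("this string is a string", "this is a longer string")
words <- strsplit(x, " ", fixed = TRUE)   # one character vector per input string

# Masking with %in% preserves duplicates and order
with_in <- lapply(words, function(w) w[!(w %in% stop_words)])

# setdiff(w, stop_words) also drops stop words, but de-duplicates the result
with_setdiff <- lapply(words, setdiff, stop_words)

with_in[[1]]
with_setdiff[[1]]
```

If word counts matter later (for example, for term frequencies), the `%in%` version is probably the safer choice.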

`setdiff` calls `%in%` (exactly `match(x, y, 0L) == 0L`). – Artem Klevtsov Jan 05 '16 at 08:53

Artem Klevtsov · Answer 2 · 2016-01-05T09:05:11.443

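First, you should unlist str1 (or use lapply if the input holds more than one string, so that strsplit returns several list elements). Here `words` stands for the stop word vector: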

!(unlist(str1) %in% words)
#>  [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE

Second, a more complete solution:

string <- c("This string is a string.",
            "This string is a slightly longer string.",
            "This string is an even longer string.",
            "This string is a slightly shorter string.",
            "This string is the longest string of all the other strings.")
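rm_words <- function(string, words) {
    stopifnot(is.character(string), is.character(words))
    # split each string on single spaces; fixed = TRUE avoids regex matching for speed
    splitted <- strsplit(string, " ", fixed = TRUE)
    # drop stop words (compared case-insensitively) and re-join the remaining words
    vapply(splitted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}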
rm_words(string, tm::stopwords("en"))
#> [1] "string string."                  "string slightly longer string."  "string even longer string."     
#> [4] "string slightly shorter string." "string longest string strings."

score 0 · Answer 3 · answered Dec 09 '20 at 08:36

Came across this question when I was working on something similar.

Though this has been answered already, I just thought to put up a concise line of code which I used for my problem as well - which will help eliminate all the stop words directly in your dataframe:

df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))
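
The one-liner above calls `strsplit` twice per string. A possible variant, sketched here with a hypothetical `stop_words` vector and a two-row example data frame (both are assumptions for illustration), splits each string once and uses `vapply` to guarantee a character result:

```r
# Hypothetical stand-ins for the tm stop word list and the question's data
stop_words <- c("this", "is", "a", "the")
df1 <- data.frame(id = 1:2,
                  string1 = c("this string is a string.",
                              "this is the longest string."),
                  stringsAsFactors = FALSE)

# Split every row once; strsplit is vectorised over the column
tokens <- strsplit(df1$string1, " ", fixed = TRUE)

# Drop stop words and paste the remaining words back into one string per row
df1$string1 <- vapply(tokens,
                      function(w) paste(w[!(w %in% stop_words)], collapse = " "),
                      character(1))
df1$string1
```

On a large data set, avoiding the second split should save some work, although the difference has not been measured here.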
