
I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here but an answer was not given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit)

StatsSorceress

4 Answers


When I run into tm problems I often end up just editing the original text.

For removing words it's a little awkward, but you can paste together a regex from tm's stopword list.

# build one alternation of word-boundary-wrapped stop words: \bi\b|\bme\b|\bmy\b|...
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
Mhairi McNeill
  • Thanks very much for your reply. I'm getting the error 'string must be an atomic vector', in the line with stringr::str_replace_all. Any idea how to tackle this? – StatsSorceress Jun 12 '16 at 17:56
  • Aha! Just answered my own question: documents1 = paste(c(documents)) Paste that line just before the section of stopwords_regex. Thanks again! – StatsSorceress Jun 12 '16 at 18:00
  • Thanks for the great answer. Reversing the stopwords list before pasting it together helps, like `stopwords_regex = paste(rev(stopwords('en')), collapse = '\\b|\\b')`. – MItrajyoti Mar 30 '19 at 13:43
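
Putting the answer and the comment fix together: str_replace_all() needs a plain character vector, so the cleanest route is to run this step before (or instead of) converting to a Corpus. A minimal end-to-end sketch, assuming only that tm (for stopwords()) and stringr are installed:

library(tm)       # for stopwords()
library(stringr)

documents <- c("She had toast for breakfast",
               "The coffee this morning was excellent",
               "For lunch let's all have pancakes",
               "Later in the day, there will be more talks",
               "The talks on the first day were great",
               "The second day should have good presentations too")

documents <- tolower(documents)                  # lower case
documents <- gsub('[[:punct:]]', '', documents)  # strip punctuation

# one big alternation: \bword1\b|\bword2\b|...
stopwords_regex <- paste0('\\b', paste(stopwords('en'), collapse = '\\b|\\b'), '\\b')
documents <- str_replace_all(documents, stopwords_regex, '')
documents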

Try using the tm_map function to transform the documents; it works in my case.

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "
Ely
  • Thanks Elyasin, but I'm already using the tm package, and it's tm_map(documents, removeWords, stopwords("english")) that's throwing the error. – StatsSorceress Jun 12 '16 at 17:57
  • I know. But have a look at my answer more closely. I got a reasonable result and the command was `documents = tm_map(documents, content_transformer(tolower))` before removing punctuation and stop words. Try it out. – Ely Jun 12 '16 at 18:07
  • I looked again, and it seems I can't use tm_map at all. Sometimes, it gives no errors and I can remove the stopwords via your method, but other times it throws the same error ('the process has forked...'). I've never had an intermittent error like this before. Any ideas? – StatsSorceress Jun 12 '16 at 20:34
  • Which version of R are you using? On which OS? – Ely Jun 12 '16 at 20:35
  • Here is the output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) – StatsSorceress Jun 12 '16 at 22:30
  • Try this `options(mc.cores=1)` in your .Rprofile file, or at the top of your R script. As far as I remember, from a course where participants used tm, this was a workaround preventing that weird error message. – knb Jun 13 '16 at 08:45
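
For completeness, a minimal sketch of knb's workaround. My understanding (an assumption, not confirmed in this thread) is that on macOS tm_map() can hand work to parallel::mclapply(), and the forked child process then hits the CoreFoundation error; pinning the backend to one core avoids the fork:

options(mc.cores = 1)  # force tm_map's parallel backend onto a single core

library(tm)
corpus <- Corpus(VectorSource(documents))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))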

You can use the quanteda package to remove stop words; first make sure your text has been tokenised, then use the following:

library(quanteda)
toks <- tokens(x)  # x is a character vector; tokenise it first
toks <- tokens_select(toks, stopwords("en"), selection = "remove")
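
A fuller sketch with the documents from the question; tokens_remove() is quanteda's shortcut for tokens_select(..., selection = "remove"), and the round-trip back to strings with paste() is my own addition, not part of the original answer:

library(quanteda)

documents <- c("She had toast for breakfast",
               "The coffee this morning was excellent",
               "For lunch let's all have pancakes",
               "Later in the day, there will be more talks",
               "The talks on the first day were great",
               "The second day should have good presentations too")

toks <- tokens(tolower(documents), remove_punct = TRUE)  # tokenise, dropping punctuation
toks <- tokens_remove(toks, stopwords("en"))             # remove the stop words
sapply(toks, paste, collapse = " ")                      # one cleaned string per document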
Aakash

rflashtext could be an option:

library(tm)
library(rflashtext)
library(microbenchmark)
library(stringr)

documents <- c("She had toast for breakfast",
              "The coffee this morning was excellent", 
              "For lunch let's all have pancakes", 
              "Later in the day, there will be more talks", 
              "The talks on the first day were great", 
              "The second day should have good presentations too") |> tolower()

stop_words <- stopwords("en")

# build the flashtext processor, mapping every stop word to a single space
processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))

Output:

processor$replace_keys(documents)
[1] "    toast   breakfast"                 "  coffee   morning   excellent"        "  lunch       pancakes"               
[4] "later     day,   will     talks"       "  talks     first day   great"         "  second day     good presentations  "
# rflashtext
microbenchmark(rflashtext = {
  processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))
  processor$replace_keys(documents)
})

Unit: microseconds
       expr     min       lq     mean   median       uq     max neval
 rflashtext 264.529 268.8515 280.9786 272.8165 282.0745 512.499   100

# stringr
microbenchmark(stringr = {
  stopwords_regex <- sprintf("\\b%s\\b", paste(stop_words, collapse = "\\b|\\b"))
  str_replace_all(documents, stopwords_regex, " ")
})

Unit: microseconds
    expr     min       lq     mean  median       uq     max neval
 stringr 646.454 650.7635 665.9317 658.328 670.7445 937.575   100

# tm 
microbenchmark(tm = {
  corpus <- Corpus(VectorSource(documents))
  tm_map(corpus, removeWords, stop_words)
})

Unit: microseconds
 expr     min      lq     mean  median      uq     max neval
   tm 233.451 239.012 253.3898 247.086 262.143 442.706   100
There were 50 or more warnings (use warnings() to see the first 50)

NOTE: For simplicity, I'm not removing punctuation here.
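
One small follow-up, since every approach in this thread leaves stray spaces where the stop words used to be: a pass with stringr::str_squish() (an extra step I'm adding here, not part of the benchmark) collapses the runs of whitespace:

library(stringr)
str_squish(processor$replace_keys(documents))  # trims ends and collapses repeated spaces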

Shadow JA