text mining with tm, remove words NOT in a list

Question

I am doing text mining with the tm package and removeWords(). I have a list of about 500 relevant words out of several thousand in total. Can I use removeWords() with the logic reversed, so that it removes the words from the Corpus that are NOT in the list?

With Perl, I could do something like this:

$diminishedText = (fullText =! s/$wordlist//g); #not tested

In R, this removes the words in the word list:

text <- tm_map(text, removeWords, wordList)

What would be the correct syntax for doing something like this?

text <- tm_map(text, removeWords, not in wordList)

A minimal working example is expected. A data set (Corpus in this case) and a small list of words to keep in R vector format. The **tm** package has built in Corpus data you can use. Failure to do this is grounds for a question being closed. — Tyler Rinker, Jan 21 '15 at 14:42
I voted to close this question as the OP has not provided a minimal working example. — Tyler Rinker, Jan 26 '15 at 16:27

score 1 · Answer 1 · edited May 23 '17 at 11:33

This feels pretty klunky, but might work. A different possibility is at the end.

library(tm)
library(stringr)

docs <- c("cat", "dog", "mouse", "oil", "crude", "tanker") # starting documents

keepWords <- c("oil", "crude", "tanker") # the words to keep from the starting documents
keeppattern <- paste0("\\b(", paste(keepWords, collapse = "|"), ")\\b") # regex matching any keepWord as a whole word
Text <- unlist(str_extract_all(string = docs, pattern = keeppattern)) # extract only the keepWords, as a character vector
Text.tdm <- TermDocumentMatrix(VCorpus(VectorSource(Text))) # wrap the vector in a corpus, then build the tdm from the keepWords only

Here is another possibility, though I did not work it through: see the question "R remove stopwords from a character vector using %in%".

EDIT: Another method:

tdm.keep <- Text.tdm[rownames(Text.tdm)%in%keepWords, ]
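
The row-subsetting idea can be tried end to end on a toy corpus. The sketch below also shows what I believe is the built-in equivalent, the dictionary entry of the control list, which restricts the terms while the matrix is being built. The documents and the names tdm.full and tdm.dict are made up for illustration, and the text is assumed to be lower case already.

```r
library(tm)

# Toy documents, invented for this sketch
docs <- c("crude oil tanker", "cat dog mouse", "oil and crude prices")
keepWords <- c("oil", "crude", "tanker")
corpus <- VCorpus(VectorSource(docs))

# Option 1: build the full matrix, then keep only the rows whose term is in keepWords
tdm.full <- TermDocumentMatrix(corpus)
tdm.keep <- tdm.full[rownames(tdm.full) %in% keepWords, ]

# Option 2: let tm restrict the terms to keepWords while building the matrix
tdm.dict <- TermDocumentMatrix(corpus, control = list(dictionary = keepWords))

inspect(tdm.dict)
```

Both options leave the Corpus itself untouched; they only limit which terms are counted. Note that the default wordLengths setting drops terms shorter than three characters, so very short keep words would need control = list(wordLengths = c(1, Inf)) as well.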

I also tried negating %in%, but that does not work:

'%nin%' <- Negate('%in%') # define an operator that is the opposite of %in%
Text <- tm_map(crude, removeWords(crude %nin% keepWords))
# Error because removeWords can't take a logical argument; it expects a character vector of words
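
One way to make the negated-%in% idea run is to filter the tokens of each document directly instead of going through removeWords(). This is only a sketch: keepOnly is a hypothetical helper (not part of tm), the documents are invented, and it assumes lower-case text whose words are separated by whitespace with no punctuation attached.

```r
library(tm)

# Toy documents, invented for this sketch
docs <- c("crude oil tanker sails",
          "the cat chased a mouse",
          "oil prices and crude futures")
keepWords <- c("oil", "crude", "tanker")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical helper: split each string on whitespace and keep only
# the tokens that appear in `keep`, then glue them back together
keepOnly <- function(x, keep) {
  vapply(strsplit(x, "\\s+"), function(tokens) {
    paste(tokens[tokens %in% keep], collapse = " ")
  }, character(1))
}

# content_transformer() wraps the helper so tm_map() can apply it to each document
kept <- tm_map(corpus, content_transformer(keepOnly), keep = keepWords)

result <- sapply(kept, as.character)
print(result)
```

Documents that contain none of the keep words end up as empty strings, so it may be worth dropping them before building a term-document matrix.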

score 1 · Answer 2 · answered Nov 22 '15 at 23:54

The text analysis package quanteda has functions for feature selection that are both positive (keep) and negative (remove). Here is an example that keeps just a set of economic words from the US presidential inaugural corpus:

require(quanteda)
dfm(inaugTexts[50:57], keptFeatures = c("tax*", "econom*", "mone*"), verbose = FALSE)
# Document-feature matrix of: 8 documents, 5 features.
# 8 x 5 sparse Matrix of class "dfmSparse"
#               features
# docs           economic taxes tax economy money
#   1985-Reagan         4     2   4       5     1
#   1989-Bush           0     0   0       0     1
#   1993-Clinton        0     0   0       3     0
#   1997-Clinton        0     0   0       2     0
#   2001-Bush           0     1   0       2     0
#   2005-Bush           1     0   0       0     0
#   2009-Obama          0     0   0       3     0
#   2013-Obama          2     0   1       1     0

Here the match was using the default "glob" format, but fixed and regular expression matches for feature selection are also possible. See ?dfm and ?selectFeatures.
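
The call shown in this answer reflects quanteda as it was in 2015. In more recent releases, as far as I know, the keptFeatures argument and the inaugTexts object are gone, and the same selection is written with tokens(), dfm() and dfm_select() on data_corpus_inaugural. A sketch under that assumption; the name dfmKept is made up, and the documents picked by 50:57 depend on the version of the bundled corpus, so no output is shown.

```r
library(quanteda)

# Tokenise eight inaugural addresses from the corpus bundled with quanteda
toks <- tokens(data_corpus_inaugural[50:57])

# Build the document-feature matrix, then keep only features matching the glob patterns
dfmKept <- dfm_select(dfm(toks),
                      pattern = c("tax*", "econom*", "mone*"),
                      selection = "keep",
                      valuetype = "glob")

print(dfmKept)
```

Setting selection = "remove" gives the negative version, and valuetype can be switched to "fixed" or "regex" for the other kinds of match mentioned in the answer.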

score 0 · Answer 3 · edited Jul 28 '15 at 19:14

Maybe you can brute-force it.

Download a dictionary (a plain word list) and remove from it the words that are in wordList.

Then pass what is left of that dictionary to tm_map() as the words for removeWords.
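
A variation that avoids the download is to use the corpus's own vocabulary as the dictionary. The sketch below assumes lower-case text and terms without regex metacharacters; the documents, the name dropWords and the chunk size of 500 are all made up for illustration. The chunking is there because removeWords() combines its words into a single regular expression, which can become too large when several thousand words are passed at once.

```r
library(tm)

# Toy documents, invented for this sketch
docs <- c("crude oil tanker sails",
          "the cat chased a mouse",
          "oil prices and crude futures")
keepWords <- c("oil", "crude", "tanker")
corpus <- VCorpus(VectorSource(docs))

# Every term in the corpus, including short ones (the default minimum length is 3)
vocab <- rownames(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))

# The words to remove are everything in the vocabulary that is NOT a keep word
dropWords <- setdiff(vocab, keepWords)

# Remove the unwanted words in chunks to keep each regex small
for (chunk in split(dropWords, ceiling(seq_along(dropWords) / 500))) {
  corpus <- tm_map(corpus, removeWords, chunk)
}
corpus <- tm_map(corpus, stripWhitespace)

cleaned <- trimws(sapply(corpus, as.character))
print(cleaned)
```

Unlike restricting the terms of a term-document matrix, this changes the documents themselves, which matches the original wish to strip the Corpus down to the listed words.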

answered Jul 28 '15 at 19:04 by Amanpreet Singh · edited Jul 28 '15 at 19:14 by ZygD