
I am doing text mining with the tm package, using removeWords(). I have a list of about 500 relevant words out of several thousand total. Can I use removeWords() with the logic reversed, so that it removes the words from the Corpus that are NOT in the list?

With Perl, I could do something like this:

($diminishedText = $fullText) =~ s/$wordlist//g; # not tested

In R, this removes the words in the word list:

text <- tm_map(text, removeWords, wordList)

What would be the correct syntax for doing something like this?

text <- tm_map(text, removeWords, not in wordList)
David Arenburg
ccc31807
  • A minimal working example is expected: a data set (a Corpus in this case) and a small list of words to keep, in R vector format. The **tm** package has built-in Corpus data you can use. Failure to do this is grounds for a question being closed. – Tyler Rinker Jan 21 '15 at 14:42
  • I voted to close this question as the OP has not provided a minimal working example. – Tyler Rinker Jan 26 '15 at 16:27

3 Answers


This feels pretty clunky, but it might work. A different possibility is at the end.

library(tm)
library(stringr)

docs <- c("cat", "dog", "mouse", "oil", "crude", "tanker") # starting documents

keepWords <- c("oil", "crude", "tanker") # the words to keep from the starting documents
# regex alternation of the keepWords, with word boundaries so "oil" does not match inside "boil"
keeppattern <- paste0("\\b(", paste(keepWords, collapse = "|"), ")\\b")
Text <- unlist(str_extract_all(string = docs, pattern = keeppattern)) # extract only the keepWords, as a vector
Text.tdm <- TermDocumentMatrix(VCorpus(VectorSource(Text))) # build the tdm from the keepWords only

Here is another possibility, but I did not work it through: see the question "R remove stopwords from a character vector using %in%".

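EDIT: Another method is to subset the term-document matrix by row name:

tdm.keep <- Text.tdm[rownames(Text.tdm) %in% keepWords, ]

Negating %in% and handing the result to removeWords() does not work:

'%nin%' <- Negate('%in%') # define an operator that is the opposite of %in%
Text <- tm_map(crude, removeWords(crude %nin% keepWords))
# Error: removeWords() cannot take a logical vector; it needs the text plus a character vector of words to remove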
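Since removeWords() only accepts the words to drop, one way around the failed negation is to skip it and write a small "keep" function instead. The following is a minimal sketch in base R, not something taken from tm itself: `keep_only` is a hypothetical helper name, and it assumes whitespace-separated text whose punctuation has already been stripped.

```r
# Keep only the tokens of each document that appear in `words`.
# Assumption: whitespace-separated text, punctuation already removed.
keep_only <- function(x, words) {
  tokens <- strsplit(x, "\\s+")
  vapply(tokens,
         function(tok) paste(tok[tolower(tok) %in% words], collapse = " "),
         character(1))
}

docs <- c("The crude oil tanker sank", "a cat and a dog")
wordList <- c("oil", "crude", "tanker")
kept <- keep_only(docs, wordList)

# With tm (>= 0.6) the same function could be applied to a corpus:
# text <- tm_map(text, content_transformer(keep_only), words = wordList)
```

Matching is exact after lower-casing, so "oil," with a trailing comma would be dropped; running removePunctuation first avoids that.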
lawyeR

The text analysis package quanteda has functions for feature selection that are both positive (keep) and negative (remove). Here is an example in which we keep just a set of economic words from the US presidential inaugural corpus:

require(quanteda)
dfm(inaugTexts[50:57], keptFeatures = c("tax*", "econom*", "mone*"), verbose = FALSE)
# Document-feature matrix of: 8 documents, 5 features.
# 8 x 5 sparse Matrix of class "dfmSparse"
#               features
# docs           economic taxes tax economy money
#   1985-Reagan         4     2   4       5     1
#   1989-Bush           0     0   0       0     1
#   1993-Clinton        0     0   0       3     0
#   1997-Clinton        0     0   0       2     0
#   2001-Bush           0     1   0       2     0
#   2005-Bush           1     0   0       0     0
#   2009-Obama          0     0   0       3     0
#   2013-Obama          2     0   1       1     0

Here the match was using the default "glob" format, but fixed and regular expression matches for feature selection are also possible. See ?dfm and ?selectFeatures.
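The `keptFeatures` argument and `selectFeatures()` belong to an early quanteda release. As a hedged sketch, assuming quanteda version 3 or later is installed, the same keep-only selection can be written with `tokens()`, `dfm()` and `dfm_select()`. The year filter is a stand-in for the `[50:57]` index, since the bundled corpus (now `data_corpus_inaugural`) has grown since this answer was written.

```r
library(quanteda)

# Assumption: quanteda >= 3, where the inaugural corpus ships as data_corpus_inaugural
recent <- corpus_subset(data_corpus_inaugural, Year >= 1985 & Year <= 2013)

# Tokenize, build the document-feature matrix, then keep only the matching features
dfmat <- dfm(tokens(recent))
kept  <- dfm_select(dfmat, pattern = c("tax*", "econom*", "mone*"),
                    selection = "keep", valuetype = "glob")

featnames(kept)  # the features that survive the selection
```

Setting `selection = "remove"` instead gives the negative selection.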

Ken Benoit

Maybe you can brute-force it.

Download a dictionary (a plain word list for the language) and remove from it the words that are in wordList.

Then pass what is left of that dictionary as the words to remove in tm_map().
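A rough sketch of that idea, with a tiny made-up vector standing in for a real downloaded dictionary (the `dictionary` contents and the sample sentence are illustrative assumptions). The tm part only runs if the package is installed.

```r
# Hypothetical stand-ins: a tiny "dictionary" and the list of words to keep
dictionary <- c("cat", "dog", "mouse", "oil", "crude", "tanker")
wordList   <- c("oil", "crude", "tanker")

# Everything in the dictionary that is NOT on the keep list
wordsToRemove <- setdiff(dictionary, wordList)

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  text <- VCorpus(VectorSource("the cat chased the mouse past the oil tanker"))
  text <- tm_map(text, removeWords, wordsToRemove)
  cleaned <- content(text[[1]])  # "cat" and "mouse" are gone; "the" stays
}
```

Words missing from the dictionary, such as "the" here, are left in place, so the result is only as good as the dictionary's coverage. A very long removal list may also make removeWords() slow, since it builds one regular expression from all the words.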

ZygD