How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

Question

I have a huge corpus, and I'm interested only in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package that includes only the terms I specify up front?

I know I can subset the resulting TermDocumentMatrix, but I want to avoid building the full term-document matrix in the first place because of memory constraints.

Can you remove the terms you don't want before creating the corpus? Alternatively, can you extract the text from the corpus and remove the undesired terms and then re-generate the corpus? — eipi10, Nov 19 '14 at 04:07
I can't remove the terms before creating the corpus, because some of the terms only exist in the post-processed corpus and not in the raw source. If all else fails, I guess I can extract the text from the post-processed corpus, remove the terms, and rebuild the corpus, as you suggested. But that is going to be very time consuming given my data size, so hopefully there is a direct way that can minimise the steps. — Ricky, Nov 19 '14 at 04:40

score 3 · Accepted Answer · edited May 23 '17 at 10:32

You can modify a corpus to keep only the terms you want by building a custom transformation function. See the Vignette for the tm package and the help for the content_transformer function for more information:

library(tm)

# Create a corpus from the text in `doc` (defined at the end of this answer)
corp = VCorpus(VectorSource(doc))

# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern) 
  regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))

(FYI, the second line of code just above is adapted from this SO answer.)

# The pattern we'll search for
keep = "sleep|dream|die"

# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]

Here's the result of running the transformation function:

<<PlainTextDocument (metadata: 7)>>
  c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")

Here's the original text I used to create the corpus:

doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
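
To get from the filtered corpus to the term-document matrix the question asks for, something like the following may work. It is only a sketch under a few assumptions: the two short strings in `docs` are illustrative data, `keep_terms` is a hypothetical variant of the transformer above that collapses the matches back into a single string per document, and the word boundaries in `keep` are my addition so that, for example, "died" would not count as "die". For comparison, the sketch also uses the `dictionary` entry of the `control` list, which the tm documentation for `termFreq` describes as a character vector to tabulate against. Whether that saves enough memory on a very large corpus is something to verify on your own data.

```r
library(tm)

# Illustrative data standing in for a large corpus
docs <- c("To die, to sleep, no more; and by a sleep to say we end",
          "To sleep, perchance to Dream; aye, there's the rub")
corp <- VCorpus(VectorSource(docs))

# Hypothetical variant of the transformer: keep only the matched terms
# and paste them back into ONE string per document
keep_terms <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  paste(unlist(hits), collapse = " ")
})

# Word boundaries are an assumption added here to avoid partial matches
keep <- "\\b(sleep|dream|die)\\b"
small <- tm_map(corp, keep_terms, keep)

# Option 1: build the matrix from the reduced corpus
tdm <- TermDocumentMatrix(small)

# Option 2: leave the corpus alone and restrict the tabulated terms.
# Punctuation is removed so that tokens such as "die," match "die".
tdm_dict <- TermDocumentMatrix(
  corp,
  control = list(dictionary = c("sleep", "dream", "die"),
                 removePunctuation = TRUE)
)

as.matrix(tdm)
as.matrix(tdm_dict)
```

Both matrices should contain rows only for the three terms of interest, so no subsetting of a full matrix is needed afterwards.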

score 2 · Answer 2 · answered Feb 19 '16 at 09:44

Another way to filter a corpus is through document metadata. First, loop over the corpus elements (with index i), check whatever condition you need, and set a metadata attribute such as language on the matching documents. Then filter the corpus on that attribute.

corpusz[[i]]$meta["language"] <- 'tur'

idx <- meta(corpusz, "language") ==  'tur'
filtered <- corpusz[idx]

Now filtered contains only the corpus elements we want.
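
Put together, the loop and the filter might look like the sketch below. The three sample strings and the `turkish` vector are made up for illustration and stand in for whatever check you actually run on each document. The sketch sets the tag with `meta<-` and reads it back with `type = "local"` instead of relying on defaults. Note that this selects whole documents, not individual terms, so it reduces the number of columns in a later term-document matrix, not the number of rows.

```r
library(tm)

# Illustrative corpus; the contents are placeholders
corpusz <- VCorpus(VectorSource(c("merhaba", "hello world", "iyi geceler")))

# Hypothetical per-document check: here documents 1 and 3 are Turkish
turkish <- c(TRUE, FALSE, TRUE)

# Tag every document with a language code in its local metadata
for (i in seq_len(length(corpusz))) {
  meta(corpusz[[i]], "language") <- if (turkish[i]) "tur" else "eng"
}

# Read the document-level tags back and build a logical index
idx <- unlist(meta(corpusz, "language", type = "local")) == "tur"

# Keep only the documents whose tag matches
filtered <- corpusz[idx]
length(filtered)
```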
