2

I have a huge corpus, and I'm only interested in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus with the tm package so that only the terms I specify up front are counted and included?

I know I can subset the resulting TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix in the first place because of memory constraints.

Ricky
  • Can you remove the terms you don't want before creating the corpus? Alternatively, can you extract the text from the corpus and remove the undesired terms and then re-generate the corpus? – eipi10 Nov 19 '14 at 04:07
  • I can't remove the terms before creating the corpus, because some of the terms only exist in the post-processed corpus and not in the raw source. If all else fails, I guess I can extract the text from the post-processed corpus, remove the terms, and rebuild the corpus, as you suggested. But that is going to be very time consuming given my data size, so hopefully there is a direct way that can minimise the steps. – Ricky Nov 19 '14 at 04:40

2 Answers

3

You can modify a corpus to keep only the terms you want by building a custom transformation function. See the Vignette for the tm package and the help for the content_transformer function for more information:

library(tm)

# Create a corpus from the text listed below
corp = VCorpus(VectorSource(doc))

# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern) 
  regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))

(FYI, the second line of code just above is adapted from this SO answer.)

# The pattern we'll search for
keep = "sleep|dream|die"

# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]

Here's the result of running the transformation function:

<<PlainTextDocument (metadata: 7)>>
  c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
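To make the extraction step concrete, here is a minimal base-R sketch (no tm involved) of the same `regmatches`/`gregexpr` idea, extended to tabulate the hits against a fixed term list. The `docs` and `terms` vectors are toy values assumed for illustration, and the pattern matches substrings, so a term like "die" would also match inside "died" unless word boundaries (`\\b`) are added.

```r
# Toy input: two short "documents" (assumed for illustration)
docs <- c("To die, to sleep; to sleep, perchance to Dream",
          "No more; and by a sleep, to say we end")

# Terms known up front, collapsed into one alternation pattern
terms <- c("sleep", "dream", "die")
pattern <- paste(terms, collapse = "|")

# Extract every match per document, ignoring case
hits <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE, ignore.case = TRUE))

# Tabulate against the fixed term list so absent terms are counted as zero
counts <- vapply(
  hits,
  function(h) as.integer(table(factor(tolower(h), levels = terms))),
  integer(length(terms))
)
rownames(counts) <- terms

counts  # rows are terms, columns are documents
```

Because the factor levels are fixed to `terms`, every document gets a count for every term, which is the shape a term-document matrix restricted to those terms would have.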

Here's the original text I used to create the corpus:

doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
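As a possible alternative to transforming the corpus, tm's `termFreq` control list accepts a `dictionary` entry, which restricts tabulation to the listed terms. The sketch below assumes a small toy corpus (`texts` is made up for illustration); whether this saves enough memory on a very large corpus is something to verify on your own data, since each document is still tokenized.

```r
library(tm)

# Toy corpus (assumed for illustration)
texts <- c("To die, to sleep; to sleep, perchance to dream",
           "No more; and by a sleep, to say we end")
corp <- VCorpus(VectorSource(texts))

# Terms known up front
terms <- c("sleep", "dream", "die")

# Only the dictionary terms are tabulated; punctuation is stripped
# so that tokens such as "die," are counted as "die"
tdm <- TermDocumentMatrix(
  corp,
  control = list(dictionary = terms,
                 removePunctuation = TRUE,
                 tolower = TRUE)
)

inspect(tdm)
```

The resulting matrix has one row per dictionary term rather than one row per distinct word in the corpus, so it stays small regardless of vocabulary size.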
eipi10
2

Another way of filtering a corpus: first assign a value to a metadata attribute, say language, by looping over the elements of the corpus with the index i and checking whatever condition you want; then filter the corpus on that metadata attribute.

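for (i in seq_along(corpusz)) {
  # Check whatever condition you need here, then tag the matching documents
  corpusz[[i]]$meta["language"] <- 'tur'
}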

idx <- meta(corpusz, "language") ==  'tur'
filtered <- corpusz[idx]

Now filtered contains only the corpus elements we want.

Vezir