I am trying to build n-grams from a large text corpus (the object is about 1 GB in R) using the great quanteda package. I don't have a cloud resource available, so I am doing the computation on my own laptop (Windows and/or Mac, 12 GB RAM).
If I sample the data down into pieces, the code works and I get a (partial) dfm of n-grams of various sizes. When I run the code on the whole corpus, however, I hit memory limits and get the following error (example code for unigrams, i.e. single words):
> dfm(corpus, verbose = TRUE, stem = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features:
Error: cannot allocate vector of size 1024.0 Mb
In addition: Warning messages:
1: In unique.default(allFeatures) :
Reached total allocation of 11984Mb: see help(memory.size)
It gets even worse when I try to build n-grams with n > 1:
> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage 19925140 is too close to the limit
I found this related post, but it looks like it was an issue with dense matrix coercion that was later solved, so it doesn't help in my case.
Are there better ways to handle this with a limited amount of memory, without having to break the corpus data into pieces?
[EDIT] As requested, sessionInfo() data:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3 quanteda_0.9.4
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.2 assertthat_0.1 Matrix_1.2-3 rsconnect_0.4.2 DBI_0.3.1
[7] parallel_3.2.3 tools_3.2.3 Rcpp_0.12.3 stringi_1.0-1 grid_3.2.3 chron_2.3-47
[13] lattice_0.20-33 ca_0.64