
I am trying to build n-grams from a large corpus of text (object size about 1 GB in R) using the great quanteda package. I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) to do the computation.

If I sample the data down into pieces, the code works and I get a (partial) dfm of n-grams of various sizes, but when I try to run the code on the whole corpus, I unfortunately hit memory limits and get the following error (example code for unigrams, i.e. single words):

> dfm(corpus, verbose = TRUE, stem = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features: 
Error: cannot allocate vector of size 1024.0 Mb

In addition: Warning messages:
1: In unique.default(allFeatures) :
  Reached total allocation of 11984Mb: see help(memory.size)

It gets even worse if I try to build n-grams with n > 1:

> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
     ignoredFeatures = stopwords("english"),
     removePunct = TRUE, removeNumbers = TRUE)

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage  19925140 is too close to the limit

I found this related post, but it looks like it was an issue with dense matrix coercion that was later solved, and it doesn't help in my case.

Are there better ways to handle this with a limited amount of memory, without having to break the corpus data into pieces?

[EDIT] As requested, sessionInfo() data:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3      quanteda_0.9.4  

loaded via a namespace (and not attached):
 [1] magrittr_1.5    R6_2.1.2        assertthat_0.1  Matrix_1.2-3    rsconnect_0.4.2 DBI_0.3.1      
 [7] parallel_3.2.3  tools_3.2.3     Rcpp_0.12.3     stringi_1.0-1   grid_3.2.3      chron_2.3-47   
[13] lattice_0.20-33 ca_0.64
Federico
  • What version of quanteda are you using? Can you send your sessionInfo() output? – Ken Benoit Mar 29 '16 at 23:58
  • @KenBenoit I tried both with a Mac and with a Windows machine. – Federico Mar 31 '16 at 12:17
  • @KenBenoit the comments were unreadable, so I edited and added to the post above. Thanks! – Federico Mar 31 '16 at 12:27
  • I'd be happy to debug this specifically for your issue if you could send a link (e.g. Dropbox) to the texts. Hard to replicate and resolve otherwise. I'd delete the texts as soon as I'm done with the tests. – Ken Benoit Mar 31 '16 at 20:59
  • @KenBenoit much appreciated!! I can send you the link, no problem. Is there a way to send a private message directly here on stackoverflow.com, or shall I DM you on Twitter maybe? – Federico Apr 01 '16 at 10:00
  • @KenBenoit I sent you an email with the link to the dataset and additional context. Thank you so much for looking into this! – Federico Apr 02 '16 at 10:25

2 Answers


Yes there is, exactly by breaking it into pieces, but hear me out. Instead of importing the whole corpus, import a piece of it (is it stored in multiple files? then import file by file; is it one giant txt file? fine, use readLines to read it in chunks). Compute your n-grams, store them in another file, read the next file/chunk, store the n-grams again, and so on. This is more flexible and will not run into RAM issues (though it will take quite a bit more space than the original corpus, of course, depending on the value of n). Later, you can access the n-grams from the files as usual. A rough sketch of the loop is below.
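
For illustration, here is a minimal sketch of that loop, assuming one big text file (the file name and chunk size are made up) and the quanteda 0.9.x arguments used in the question:

library(quanteda)

con <- file("big_corpus.txt", open = "r")    # hypothetical file name
i <- 0
repeat {
  lines <- readLines(con, n = 100000)        # read the next 100,000 lines
  if (length(lines) == 0) break              # stop at end of file
  i <- i + 1
  piece <- dfm(corpus(lines), ngrams = 2, concatenator = " ",
               ignoredFeatures = stopwords("english"),
               removePunct = TRUE, removeNumbers = TRUE)
  saveRDS(piece, sprintf("ngrams_chunk_%03d.rds", i))
  rm(piece); gc()                            # free memory before the next chunk
}
close(con)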

Update as per comment.

As for loading, sparse matrices/arrays sound like a good idea, and come to think of it, they might be a good idea for storage too (particularly if you happen to be dealing with bigrams only). If your data is that big, you'll probably have to look into indexing anyway (that should help with storage: instead of storing the words in each bigram, index all words once and store tuples of indices; see the toy example below). But it also depends on what your "full n-gram model" is supposed to be for. If it is to look up the conditional probability of a (relatively small) number of words in a text, then you could just do a search (grep) over the stored n-gram files; I'm not sure the indexing overhead would be justified for such a simple task. If you actually need all 12 GB worth of n-grams in a model, and the model has to calculate something that cannot be done piece by piece, then you still need a cluster/cloud.
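
A toy illustration of the indexing idea (the words and object names here are made up):

# Map each word to an integer once, then store bigrams as integer pairs
# instead of repeated strings.
words <- c("the", "cat", "sat", "on", "the", "mat")
vocab <- unique(words)                          # one entry per distinct word
ids   <- match(words, vocab)                    # word -> integer index
bigram_ids <- cbind(ids[-length(ids)], ids[-1]) # consecutive index tuples
vocab[bigram_ids[1, ]]                          # recovers "the" "cat"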

One more piece of general advice, one that I frequently give to students as well: start small. Instead of 12 GB, train and test on small subsets of the data. This saves you a ton of time while you figure out the exact implementation and iron out bugs, particularly if you happen to be unsure about how these things work. A quick way to sample is sketched below.
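
For example, to develop on a roughly 5% random sample (texts is assumed to be a character vector of documents):

set.seed(42)                                    # make the sample reproducible
small <- texts[sample(seq_along(texts), length(texts) %/% 20)]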

user3554004
  • Thank you for the suggestion! Once I have the n-grams stored in different .Rdata files, how do you recommend I load them up and join to create the total n-gram model? Just load them in memory and sum up the sparse matrices with the same *n* value, or is there a smarter way? – Federico Mar 31 '16 at 11:00
  • Sampling is good, especially when developing, but now I want to improve the accuracy of the model; R crashes if I sample more than 5% of the dataset, which sounds quite low to me, so I was investigating a better solution to build the complete model by overcoming memory limitations. Thanks again for your input! – Federico Apr 02 '16 at 10:27

Probably too late now, but I had a very similar problem recently (n-grams, R, quanteda and a large text source). I searched for two days and could not find a satisfactory solution; I posted on this forum and others and didn't get an answer. I knew I had to chunk the data and combine the results at the end, but couldn't work out how to do the chunking. In the end I found a somewhat inelegant solution that worked, and answered my own question in the following post here.

I sliced up the corpus using the 'tm' package's VCorpus, then fed the chunks to quanteda using the corpus() function; a rough sketch of the idea is below.
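
A minimal sketch of that approach, assuming texts is a character vector of documents (the name and chunk size are illustrative) and that quanteda's corpus() accepts a tm VCorpus, as described above:

library(tm)
library(quanteda)

# Slice the documents into chunks and build a dfm per chunk.
chunks <- split(texts, ceiling(seq_along(texts) / 100000))
dfms <- lapply(chunks, function(chunk) {
  vc <- VCorpus(VectorSource(chunk))            # tm corpus for this slice
  dfm(corpus(vc), ngrams = 2, concatenator = " ")
})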

I thought I would post it since I provide the code solution there. Hopefully it will save others from spending two days searching.

AndyC