
I am trying to do some dictionary analysis in R using the quanteda package:

toks_label <- tokens_lookup(toks, 
                            dict, 
                            valuetype = "regex",
                            levels = 1,
                            nested_scope = "dictionary")

but when I run this code, R returns the following error message:

Error: cannot allocate vector of size 58.8 Gb

I have gathered that this is a problem of memory available to R. I work with RStudio (version 1.2.5042) and R 4.0.0 on Windows 10, with 12 GB of RAM on my PC and a 1 TB hard drive that is virtually empty (740 GB available). How can I force R to use some of the space on my hard drive as virtual memory?

I have already tried a couple of things: 1) I edited the project .Rprofile so that it starts by setting memory.limit() to 512000 (for 500 GB, is that right?); and 2) I edited my .Renviron file to include the line R_MAX_VSIZE=500Gb. Neither has worked...

I have also tried lowering my ambitions with my dictionary analysis: 1) I have tried running the full dictionary (34 keys and around 300 entries) on a subset of the corpus. Didn't work. 2) I have tried running part of the dictionary on the full corpus and it worked. I conclude from this that my dictionary is too big. Is there a way I could chunk it, or iterate over it?
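
Something like the following per-key loop is the kind of chunking I have in mind (just a sketch: `dict` and `toks` are the objects from the code above, and the names `dfm_list`/`dfm_label` are made up). One caveat I can see is that `nested_scope = "dictionary"` can no longer resolve nesting across keys once the keys are looked up separately:

library(quanteda)

# run tokens_lookup() one key at a time and combine the results into one dfm
# (caveat: nested_scope = "dictionary" cannot resolve nesting across keys
#  once the keys are looked up separately)
dfm_list <- list()
for (key in names(dict)) {
  toks_key <- tokens_lookup(toks,
                            dict[key],           # a one-key dictionary
                            valuetype = "regex",
                            levels = 1,
                            nested_scope = "dictionary")
  dfm_list[[key]] <- dfm(toks_key)
  rm(toks_key)
  gc()                                           # free memory before the next key
}
dfm_label <- do.call(cbind, dfm_list)            # one column per dictionary key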

SebasComm
  • I suspect that you have phrasal patterns in the dictionary that produce a large number of combinations, like `* *`. You can check this using `quanteda:::pattern2list()`. – Kohei Watanabe May 19 '20 at 16:14
  • @KoheiWatanabe, thanks for your answer. I have tried to run `quanteda:::pattern2list()` with the entire dictionary, but it reached my memory limit as well. I then tried to run it on the first section of my dictionary, as follows: `quanteda:::pattern2list(dict_topicsSimpler[["LIQUIDITY"]], types = typ, valuetype = "regex", concatenator = " ", levels = 2, case_insensitive = TRUE)` but that returned a "List of 0". – SebasComm May 19 '20 at 20:30
  • 1
    "List of 0" means no matching patterns in the section. You can check all the sections to find where is the problem. – Kohei Watanabe May 21 '20 at 05:57

1 Answer


In most cases, a little judicious rewriting of your code will allow you to read smaller chunks of data in, process them, write the results out, delete the input (thus freeing up RAM), and read the next chunk. Which is to say, manually store your large datasets on both the input and output sides of the function.
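
For example, a chunked version of the lookup in the question could look roughly like this (a sketch, assuming the `toks` and `dict` objects from the question; each chunk's result is written to disk with `saveRDS()` so the intermediate objects can be released, and the file names are made up for illustration):

library(quanteda)

chunk_size <- 10000                              # documents per chunk
starts <- seq(1, ndoc(toks), by = chunk_size)

for (i in seq_along(starts)) {
  idx <- starts[i]:min(starts[i] + chunk_size - 1, ndoc(toks))
  toks_chunk <- tokens_lookup(toks[idx], dict,
                              valuetype = "regex",
                              levels = 1,
                              nested_scope = "dictionary")
  saveRDS(dfm(toks_chunk), sprintf("lookup_chunk_%03d.rds", i))
  rm(toks_chunk)
  gc()                                           # free RAM before the next chunk
}

# read the chunk results back and stack them by document
dfm_all <- do.call(rbind,
                   lapply(list.files(pattern = "^lookup_chunk_.*\\.rds$"), readRDS))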

Otherwise, depending on your paging size, you could set R_MAX_MEM_SIZE as described here, or install the ff package and learn to use it to treat disk space as though it were RAM ("sort of", as its description says).
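
For what it is worth, the core ff idea is a file-backed vector that you index much like an ordinary one. A minimal sketch of generic ff usage (it will not plug into quanteda's internals by itself):

library(ff)

# a numeric vector of 1e8 elements (~800 MB) backed by a file on disk, not RAM
x <- ff(vmode = "double", length = 1e8)
x[1:5] <- rnorm(5)     # reads and writes go through ff's paging mechanism
x[1:5]

delete(x)              # remove the backing file when done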

Carl Witthoft
  • Thanks Carl. I have tried the R_MAX_MEM_SIZE trick and it did not work. I have dived into the ff package documentation, and--as the R beginner that I am--I honestly did not see how to use it in my process. – SebasComm May 19 '20 at 09:06