
I'm running a 3.5 GB data frame through base R's nls function, trying to fit a sigmoid, and getting out-of-memory errors on a Windows computer with 32 GB of RAM. In short, I'm wondering why nls needs so much RAM to complete the fit, and whether there's anything I can do about it.

Here is my code to create fake data:

print(num_events <- 6e4) # increase or decrease to test crashing

set.seed(10)
library(tidyr)
library(dplyr)
library(pryr) # for object_size

fake_data <- expand_grid( # get all combinations of below vectors
      event_num = 1:num_events,
      typical_ms_variation = seq(-100, 100, by = 50),
      typical_level_variation = seq(0.8, 1.2, by = 0.1)
   ) %>%
   inner_join(
      data.frame( # each event has some random variation
         event_num = 1:num_events,
         event_ms_variation = runif(num_events, -100, 100),
         event_level_variation = runif(num_events, 0.9, 1.1)
      ),
      by = "event_num"
   ) %>%
   mutate(
      ms_variation = typical_ms_variation + event_ms_variation,
      level_variation = typical_level_variation * event_level_variation
   ) %>%
   select(-contains("typical"), -contains("event")) %>%
   expand_grid( # get all combinations of vectors below
      measurement_ms = seq(1, 2000, by = 25)
   ) %>%
   mutate( # sigmoid function
      measured_level = 0 + 10 * level_variation /
         (1 + exp(0.015 * (800 + ms_variation - measurement_ms)))
   )

# Quick stats
fake_data
print(formatC(nrow(fake_data), format = "e", digits = 2))
print(summary(fake_data))

# Display object memory usage in MiB using two methods
print(t(rbind(
   round(sort(sapply(mget(ls()), object.size), decreasing = TRUE) / 1024^2, 1),
   round(sort(sapply(mget(ls()), object_size), decreasing = TRUE) / 1024^2, 1)
)))

print(gc())

With num_events <- 6e4, fake_data contains 120 million rows (6e4 events × 5 ms variations × 5 level variations × 80 measurement times = 1.2e8), and my calls to display memory usage give me this:

             [,1]   [,2]
fake_data  3662.1 3662.1
num_events    0.0    0.0

            used   (Mb) gc trigger   (Mb)   max used   (Mb)
Ncells    710701   38.0    1314908   70.3    1314908   70.3
Vcells 481252381 3671.7 1213867985 9261.1 1201720743 9168.5

When I call nls (takes 5-10 minutes on my computer, YMMV):

fake_nl_model <- nls(formula = measured_level ~ m_min + 
                         m_max / (1 + exp(m_slope * (m_mid - measurement_ms))),
                  data = fake_data,
                  start = list(m_min = 0, m_max = 11, m_slope = 0.01, m_mid = 500))

... my "Working set (memory)" in Windows Task Manager peaks at over 26 GB. And my memory dump at the end:

print(t(rbind(
   round(sort(sapply(mget(ls()), object.size), decreasing = TRUE) / 1024^2, 1),
   round(sort(sapply(mget(ls()), object_size), decreasing = TRUE) / 1024^2, 1)
)))

print(gc())

...gives me:

                [,1]    [,2]
fake_data     3662.1 13275.2
fake_nl_model    0.1  3662.1
num_events       0.0     0.0

             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells     720944    38.6    1314908    70.3    1314908    70.3
Vcells 2221274748 16947.0 5605265951 42764.8 5581275937 42581.8

According to Advanced R, pryr::object_size() "is better than the built-in object.size() because it accounts for shared elements within an object and includes the size of environments." That raises a secondary question: why does a call to nls dramatically increase the reported memory taken up by fake_data, if nls is not supposed to change fake_data? I don't know if this is a clue or a red herring.
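One way I could probe that (a minimal sketch, assuming fake_data and fake_nl_model are both still in the session) is to ask pryr to size the two objects jointly, since object_size() counts shared memory only once:

# If the apparent growth of fake_data is shared memory rather than a copy,
# the joint size should be far less than the sum of the individual sizes,
# because pryr::object_size() counts shared blocks only once.
print(object_size(fake_data))
print(object_size(fake_nl_model))
print(object_size(fake_data, fake_nl_model))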

Anyway, I get crashes with messages like "Error: cannot allocate vector of size 4.2 Gb" when num_events is somewhere above 5e4, but obviously YMMV. I've tried both R 4.2.1 and Microsoft R Open 4.0.1 with similar results. I am open to alternatives to nls, but I have experienced the same problem with minpack.lm::nlsLM. I don't want to use a sample or aggregate subtotals -- I want the full dataset. I can get more RAM, but with a memory inefficiency this large it's hard to know how much more will suffice as the dataset grows.
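For reference, the minpack.lm attempt was essentially a drop-in replacement for the nls call above (a sketch using the same formula and start values; my exact call may have differed slightly):

library(minpack.lm)

# Levenberg-Marquardt fit of the same sigmoid with the same starting values
# as nls() -- memory usage was similarly high in my testing.
fake_nl_model_lm <- nlsLM(
   formula = measured_level ~ m_min +
      m_max / (1 + exp(m_slope * (m_mid - measurement_ms))),
   data = fake_data,
   start = list(m_min = 0, m_max = 11, m_slope = 0.01, m_mid = 500)
)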

Joel Buursma
  • `nls` doesn't seem to be utilizing Windows' virtual memory (I currently have 47 GB of virtual memory allocated). [gslnls::gsl_nls_large](https://cran.r-project.org/package=gslnls) _does_ seem to be using virtual memory, so that's a potential alternative (a sketch of such a call follows these comments), or perhaps improving Base R's use of virtual memory ([for example](https://stackoverflow.com/questions/39876328/forcing-r-and-rstudio-to-use-the-virtual-memory-on-windows)). I'm still surprised, though, that the memory usage of all these functions is so many times greater than my dataset's size. – Joel Buursma Sep 30 '22 at 22:36
  • [This](https://stackoverflow.com/questions/25049175/r-memory-issues-for-extremely-large-dataset) may be related. There are some cases with R where you can comfortably load the entire dataset into RAM, but subsequent operations seem to make wildly inefficient use of memory. For example, datasets in the 1-10 GB RAM range. – Joel Buursma Oct 03 '22 at 13:12
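A sketch of the kind of gsl_nls_large call mentioned in the comment above, assuming gslnls is installed and that its formula interface (fn, data, start) parallels nls:

library(gslnls)

# Large-scale nonlinear least-squares solver from GSL, fitting the same
# model formula with the same starting values as the nls() attempt above.
fake_nl_model_gsl <- gsl_nls_large(
   fn = measured_level ~ m_min +
      m_max / (1 + exp(m_slope * (m_mid - measurement_ms))),
   data = fake_data,
   start = list(m_min = 0, m_max = 11, m_slope = 0.01, m_mid = 500)
)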
