I'm running a 3.5 GB data frame through the base R nls function, trying to fit a sigmoid, and getting out-of-memory errors on a Windows computer with 32 GB of RAM. In short, I'm wondering why nls needs so much RAM to complete the fit and whether there's anything I can do about it.
Here is my code to create fake data:
print(num_events <- 6e4) # increase or decrease to test crashing
set.seed(10)
library(tidyr)
library(dplyr)
library(pryr) # for object_size
fake_data <- expand_grid( # get all combinations of below vectors
  event_num = 1:num_events,
  typical_ms_variation = seq(-100, 100, by = 50),
  typical_level_variation = seq(0.8, 1.2, by = 0.1)
) %>%
  inner_join(
    data.frame( # each event has some random variation
      event_num = 1:num_events,
      event_ms_variation = runif(num_events, -100, 100),
      event_level_variation = runif(num_events, 0.9, 1.1)
    ),
    by = "event_num"
  ) %>%
  mutate(
    ms_variation = typical_ms_variation + event_ms_variation,
    level_variation = typical_level_variation * event_level_variation
  ) %>%
  select(-contains("typical"), -contains("event")) %>%
  expand_grid( # get all combinations of vectors below
    measurement_ms = seq(1, 2000, by = 25)
  ) %>%
  mutate( # sigmoid function
    measured_level = 0 + 10 * level_variation /
      (1 + exp(0.015 * (800 + ms_variation - measurement_ms)))
  )
# Quick stats
fake_data
print(formatC(nrow(fake_data), format = "e", digits = 2))
print(summary(fake_data))
# Display object memory usage in MiB using two methods
print(t(rbind(
  round(sort(sapply(mget(ls()), object.size), decreasing = TRUE) / 1024^2, 1),
  round(sort(sapply(mget(ls()), object_size), decreasing = TRUE) / 1024^2, 1)
)))
print(gc())
With num_events <- 6e4, fake_data contains 120 million rows, and my calls to display memory usage give me this:
[,1] [,2]
fake_data 3662.1 3662.1
num_events 0.0 0.0
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 710701 38.0 1314908 70.3 1314908 70.3
Vcells 481252381 3671.7 1213867985 9261.1 1201720743 9168.5
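That 3662 MiB figure is about what I'd expect for the raw numbers alone; a quick back-of-envelope check:
# 1.2e8 rows x 4 double columns x 8 bytes, in MiB
nrow(fake_data) * ncol(fake_data) * 8 / 1024^2  # ~3662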
When I call nls (takes 5-10 minutes on my computer, YMMV):
fake_nl_model <- nls(
  formula = measured_level ~ m_min +
    m_max / (1 + exp(m_slope * (m_mid - measurement_ms))),
  data = fake_data,
  start = list(m_min = 0, m_max = 11, m_slope = 0.01, m_mid = 500)
)
... my "Working set (memory)" in Windows Task Manager peaks at over 26 GB. And my memory dump at the end:
print(t(rbind(
  round(sort(sapply(mget(ls()), object.size), decreasing = TRUE) / 1024^2, 1),
  round(sort(sapply(mget(ls()), object_size), decreasing = TRUE) / 1024^2, 1)
)))
print(gc())
...gives me:
[,1] [,2]
fake_data 3662.1 13275.2
fake_nl_model 0.1 3662.1
num_events 0.0 0.0
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 720944 38.6 1314908 70.3 1314908 70.3
Vcells 2221274748 16947.0 5605265951 42764.8 5581275937 42581.8
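For scale, the peak numeric allocations reported by gc() work out to roughly a dozen copies of fake_data (a rough way I've been sizing the overhead):
peak_vcells_mb <- 42581.8   # "max used (Mb)" for Vcells above
fake_data_mb   <- 3662.1    # object.size(fake_data) in MiB
peak_vcells_mb / fake_data_mb  # ~11.6 copies' worth of doubles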
According to this, pryr::object_size() "is better than the built-in object.size() because it accounts for shared elements within an object and includes the size of environments." That raises a secondary question: why does a call to nls dramatically increase the size that object_size() reports for fake_data, when fake_data isn't supposed to be modified by nls? I don't know whether that's a clue or a red herring.
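Part of that I think I understand: object_size() reports fake_nl_model at ~3.6 GiB, presumably because the fitted model carries an environment referencing the model variables. Something like this should show it (assuming the model object exposes that environment via m$getEnv(), as I believe stats::nls does):
# inspect the environment carried by the fitted model (counted by object_size, not object.size)
ls(fake_nl_model$m$getEnv())
pryr::object_size(fake_nl_model$m$getEnv())
What that doesn't explain is why the number reported for fake_data itself jumps.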
Anyway, I get crashes with messages like "Error: cannot allocate vector of size 4.2 Gb" once num_events goes somewhere above 5e4, but obviously YMMV. I've tried both R 4.2.1 and Microsoft R Open 4.0.1 with similar results. I'm open to alternatives to nls, but I've hit the same problem with minpack.lm::nlsLM (roughly the call sketched below). I don't want to fit a sample or aggregated subtotals; I want the full dataset. I can get more RAM, but with a memory inefficiency this large it's hard to know how much more will suffice as the dataset grows.
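For reference, my minpack.lm attempt was essentially the same call as the nls one above (a sketch; fake_nl_model_lm is just a placeholder name), and it showed similar memory behaviour:
library(minpack.lm)
fake_nl_model_lm <- nlsLM(
  formula = measured_level ~ m_min +
    m_max / (1 + exp(m_slope * (m_mid - measurement_ms))),
  data = fake_data,
  start = list(m_min = 0, m_max = 11, m_slope = 0.01, m_mid = 500)
)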