11

This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.

Essentially, I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, then loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.

It's then supposed to be saved as an RDS file on AWS S3, from where it's retrieved in a production environment to build predictions.

e.g.

db.connection <- db.connection.object

## Externally defined (part of a package)
get_db_data <- function(db.connection, some.parameters) {
  # Retrieve db data
}

clean_data <- function(db.data, some.parameters) {
  # Cleans and filters data based on parameters
}

lm_model <- function(clean.data) {
  # Builds lm model based on clean.data
}

build_models <- function(db.data, some.parameters) {
  clean.data <- clean_data(db.data, some.parameters)
  lm.model <- lm_model(clean.data)
  # other.parameters: parameter and reference values kept alongside the model
  return(list(lm.model, other.parameters))
}

## Main function
build_model_list <- function(db.connection) {

  clean_and_build_models <- function(some.parameters) {
    db.data <- get_db_data(db.connection, some.parameters)
    build_models(db.data, some.parameters)
  }

  looped.model.object <- llply(some.parameters, clean_and_build_models)

  return(looped.model.object)
}

model.list <- build_model_list(db.connection)

saveRDS(model.list, "~/a_place/model_list.RDS")

The issue is that the 'model.list' object, which is only 10 MB in memory, inflates to many GBs when I save it locally as RDS or try to upload it to AWS S3.

I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.

Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.

This however has some big disadvantages.

Firstly, it's trial and error and follows no clear logical order. With frequent crashes and a couple of hours needed to build the list object, it's a very long debugging cycle.

Secondly, it means my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.

My question to you is:

Has anyone encountered this issue before?

Any hypotheses as to what's causing it?

Has anyone found a logical non-trial-and-error solution to this?

Thanks for your help.

IanCognito
  • Possibly related to this http://r.789695.n4.nabble.com/Model-object-when-generated-in-a-function-saves-entire-environment-when-saved-td4723192.html – kennyB Mar 19 '17 at 02:23

4 Answers

13

It took a bit of digging but I did actually find a solution in the end.

It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:

https://blogs.oracle.com/R/entry/is_the_size_of_your

It turns out that the lm.object$terms component carries an environment attribute (".Environment") that references the objects present in the environment where the model was built. Under certain circumstances, when you call saveRDS, R will try to pull those environment objects into the saved object.

As I had ~0.5GB sitting in the global environment and a list of ~200 lm model objects, the RDS object inflated dramatically, as R was actually trying to compress ~100GB of data.
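A minimal sketch of the effect, assuming a model fitted inside a function (the function `fit_in_function` and the object `big` are hypothetical names used only for illustration). It compares the in-memory size reported by `object.size()`, which does not follow environments, with the number of bytes `serialize()` produces, which is what `saveRDS()` has to write:

```r
# Hypothetical helper: fits a tiny model inside a function that also holds a large object.
fit_in_function <- function() {
  big <- rnorm(1e6)                          # roughly 8 MB, never used by the model
  d <- data.frame(x = 1:10, y = rnorm(10))
  lm(y ~ x, data = d)
}

fit <- fit_in_function()

# The formula was created inside the function, so the terms component
# keeps a reference to that function's environment.
env <- attr(fit$terms, ".Environment")
ls(env)                                      # lists "big" and "d"

in_memory  <- as.numeric(object.size(fit))   # does not follow environments
serialized <- length(serialize(fit, NULL))   # what saveRDS() has to write
c(in_memory = in_memory, serialized = serialized)
```

The serialized size includes `big` even though the model never used it, which would explain how a list that looks small in memory can produce a very large RDS file.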

To test whether this is what's causing the problem, execute the following code:

as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL)))) 

This will tell you if the $terms component is inflating.
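As a hedged illustration of what that check might show, the sketch below runs the same per-component measurement on a toy model (again built by a hypothetical `fit_in_function`). Note that in an lm object the `model` component also carries a "terms" attribute, so it can inflate alongside `terms`:

```r
# Hypothetical helper, used only to create a model whose environment holds a large object.
fit_in_function <- function() {
  big <- rnorm(1e6)
  d <- data.frame(x = 1:10, y = rnorm(10))
  lm(y ~ x, data = d)
}

fit <- fit_in_function()

# Serialized size in bytes of each component of the fitted model.
sizes <- sapply(fit, function(x) length(serialize(x, NULL)))
sort(sizes, decreasing = TRUE)
```

Components that reference the environment stand out by several orders of magnitude, while plain numeric components such as `coefficients` stay tiny.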

The following code will remove the environmental references from the $terms component:

rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment")) 

Be warned, though: it will also remove all the global environment objects it references.

IanCognito
  • I am experiencing a similar problem trying to save a model object. Has anyone come across a different solution? – mhwh Sep 17 '18 at 14:43
9

For model objects you could also simply delete the reference to the environment.

For example:

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group) 

attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")

This unfortunately breaks the model, but you can add an environment back manually after loading it again.

lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()

This helped me in my specific use case and looks a bit safer to me...
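A possible variation, sketched under the assumption that the model was fitted inside a function: point the environment reference at `globalenv()` before saving instead of setting it to NULL, so nothing needs repairing after loading. The helper name `strip_lm_env` is hypothetical, and it also resets the "terms" attribute stored on the `model` component, which holds the same reference:

```r
# Hypothetical helper: redirect the environment references of an lm fit to globalenv().
strip_lm_env <- function(fit) {
  attr(fit$terms, ".Environment") <- globalenv()
  if (!is.null(fit$model)) {
    # The stored model frame carries its own copy of the terms object.
    attr(attr(fit$model, "terms"), ".Environment") <- globalenv()
  }
  fit
}

# Hypothetical model builder whose environment holds a large unused object.
fit_in_function <- function() {
  big <- rnorm(1e6)
  d <- data.frame(x = 1:10, y = rnorm(10))
  lm(y ~ x, data = d)
}

fit   <- fit_in_function()
small <- strip_lm_env(fit)

path <- tempfile(fileext = ".RDS")
saveRDS(small, path)
restored <- readRDS(path)

file.size(path)                                   # size of the saved file in bytes
preds <- predict(restored, newdata = data.frame(x = c(11, 12)))
preds
```

This only covers plain lm objects; other model classes may keep environment references in further components, so the per-component size check is still worth running.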

mhwh
0

Neither of these two solutions worked for me.

Instead I have used:

downloaded_object <- storage_download(connection, "path") 
read_RDS <- readRDS(downloaded_object)  
0

The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it from the model object what was relevant. This might be due to additional (undocumented) environment references associated with using the model class I used.

mm <- felm(formula=formula, data=data, keepX=TRUE, ...)

# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)

attr(mm_cp$terms, ".Environment") <- NULL

saveRDS(mm_cp, file = "path_to_save.RDS")

Then when we need to use it:

mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()

In my case the file went from 5.5G to 13K. Additionally, reading in the original file used to allocate >32G of memory, more than 6 times the file size. The change also reduced execution time significantly (no need to recreate various environments?). Environment references sound like an excellent contender for a new chapter in the R Inferno book.

eyjo