
I am trying to build an R project that generates multiple ggplot2 objects using functions. However, when saving these objects as RDS files, the file sizes are much larger than I expected. Saving a plot created inside a function and saving the same plot created in the global environment give two very different file sizes, even though the two objects occupy the same amount of memory in the R session. For example:

library(ggplot2)
data <- data.frame(x = rnorm(1e6))

p1 <- ggplot(data) + 
  geom_histogram(aes(x = x))

plot_fun <- function(y) {
  p <- ggplot(y) +
    geom_histogram(aes(x = x))
  return(p)
}

p2 <- plot_fun(data)

object.size(p1) # 8 Mb
object.size(p2) # 8 Mb

saveRDS(p1, "plot1.rds")
saveRDS(p2, "plot2.rds")

file.info("plot1.rds", "plot2.rds")

Does anyone know why this happens? Am I returning the object incorrectly from the function?

Ben Bolker
  • See the solution proposed here (with code correction in the comments) for plots. It should also work for this case. https://stackoverflow.com/questions/32192298/small-ggplot-object-1-mb-turns-into-7-gigabyte-rdata-object-when-saved/57315001#57315001 – HoneyBuddha Aug 01 '19 at 20:31

4 Answers


This one is tricky. My initial advice was to use pryr::object_size(), which is more thorough about including the size of objects stored in the environment of an object, but that shows only a tiny difference between the two ggplot objects.

However, ggplot objects contain an environment, the $plot_env component, the contents of which will get stored along with the object.

For p2, $plot_env is the environment created by the call to your function:

ls(p2$plot_env)
# [1] "p" "y"

while p1$plot_env is the global environment itself, which contains the data as well as both plot objects ...

ls(p1$plot_env)
# [1] "data"     "p1"       "p2"       "plot_fun"

But this still seems a bit mysterious to me. p1 (with more objects in its environment) creates the smaller file size (7.4M), while p2 (with fewer objects) creates the larger file size (22M), and p1 naively seems to have more stuff stored:

sapply(p1$plot_env,object.size)
## plot_fun       p1       p2     data 
##     6568  8004632  8004632  8000728 
sapply(p2$plot_env,object.size)
##       p       y 
## 8004632 8000728 

Is this some kind of recursive nightmare where environments are referencing other environments, which all have to get stored? As @Chris says:

p2's environment has a parent environment of the global environment, while p1's environment is the global environment...I imag[in]e what is happening is that, when R needs to serialize an environment that inherits from another env (i.e., a parent env), it saves the parent env along with the child. That would explain why saving p1 would result in a smaller file size as compared to p2
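
A base-R sketch of what seems to be going on, using a plain list that holds an environment as a stand-in for a ggplot object (make_holder, payload and big are hypothetical names, not part of ggplot2). As far as I understand serialize(), the global environment is written only as a reference, whereas an ordinary function environment is written out with its contents:

```r
# Stand-in for a plot object: a list that carries an environment.
make_holder <- function(payload) {
  force(payload)             # evaluate the argument so its value lives in this call's environment
  list(env = environment())  # capture the function's own environment
}

big <- rnorm(1e5)  # about 800 kB of doubles

global_holder <- list(env = globalenv())  # points at the global environment
local_holder  <- make_holder(big)         # points at a function environment that holds big

size_global <- length(serialize(global_holder, NULL))
size_local  <- length(serialize(local_holder, NULL))

size_global  # small: the global environment is written as a reference
size_local   # large: the function environment's contents are written in full
```

If that reading is right, it would also fit the file sizes above: p2 would be written with its own data plus the y and p stored in the function environment, so roughly three copies of the data (22M against 7.4M for p1).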

If I replace the plotting environment of p2 with the global environment, the file size does get smaller ... and I think I didn't break the plotting object.

p2$plot_env <- p1$plot_env
saveRDS(p2, "plot2.rds")
system("ls -lht plot?.rds")
## -rw-r--r--  1 bolker  staff   7.4M 15 Jun 20:15 plot2.rds
## -rw-r--r--  1 bolker  staff   7.4M 15 Jun 20:14 plot1.rds

If your workflow allows it, you might consider storing rendered versions of these plots (as PDF/SVG/whatever) rather than the plot objects themselves ... although the plot objects are certainly more flexible.
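
A minimal sketch of the rendered-file route, assuming ggsave() with a PDF target; the output path and the plot dimensions are placeholders chosen for illustration:

```r
library(ggplot2)

p <- ggplot(data.frame(x = rnorm(1000)), aes(x = x)) +
  geom_histogram(bins = 30)

# Hypothetical output location; any writable path ending in .pdf would do.
out_file <- file.path(tempdir(), "plot1.pdf")

# Render the plot to a file; width and height are in inches by default.
ggsave(out_file, plot = p, width = 6, height = 4)

file.size(out_file)  # size of the rendered file in bytes
```

The rendered file does not carry the plot's data or environments, but it can no longer be modified as a ggplot object.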

Ben Bolker
  • Hey Ben, `p2`'s environment has a parent environment of the global environment, while `p1`'s environment is the global environment...I imagine what is happening is that, when R needs to serialize an environment that inherits from another env (i.e., a parent env), it saves the parent env along with the child. That would explain why saving `p1` would result in a smaller file size as compared to `p2`. – Chris Jun 16 '18 at 00:21
  • yep, looks like you're on it! – Chris Jun 16 '18 at 00:22
  • Thanks for the advice! I am mostly trying to migrate my code into a drake-managed workflow, which advises storing outputs as objects rather than as file outputs. I think this makes sense, as I would eventually like to incorporate these plots in markdown docs as well as rendered graphics files, and retain the flexibility to adjust plotting parameters. Based on what you and Chris mentioned, I think I can fix this by setting a NULL environment when calling ggplot: ggplot(y, environment = NULL). This seems to do the trick. Thanks again! – bc_thaliana Jun 18 '18 at 15:59
  • the only concern would be that you could conceivably break a plot that depends on something in its environment - don't know when this would happen – Ben Bolker Jun 18 '18 at 18:39
  • Correction: the environment = NULL doesn't work in all cases; removing unneeded objects from the function environment before creating the plot object does help, though. – bc_thaliana Jun 18 '18 at 18:43
  • @BenBolker good point; just have to make sure everything in the call to ggplot is contained in data supplied to the function, or if removing objects, that I don't get rid of something important. – bc_thaliana Jun 18 '18 at 18:46
  • Regarding saving rendered versions of the plots, is there a way to do that within the rds, rather than saving an image to a file separate from the rds? E.g. a simplified plot object that is basically an image, with a print method that prints the plot's panel. – dule arnaux Mar 22 '19 at 17:13
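
The suggestion in the comments above, removing unneeded objects from the function environment, could be sketched as follows. The function names, the scratch vector and the serialized_size helper are hypothetical; the sketch only assumes that the plot keeps a reference to the environment of the function that created it:

```r
library(ggplot2)

# Hypothetical helper: bytes written when serializing x (uncompressed).
serialized_size <- function(x) length(serialize(x, NULL))

plot_heavy <- function(df) {
  scratch <- rnorm(1e5)        # large intermediate that the plot does not need
  df$y <- df$x + mean(scratch)
  p <- ggplot(df, aes(x = x, y = y)) + geom_point()
  p                            # scratch is still in the environment the plot captured
}

plot_lean <- function(df) {
  scratch <- rnorm(1e5)
  df$y <- df$x + mean(scratch)
  rm(scratch)                  # drop the intermediate before building the plot
  p <- ggplot(df, aes(x = x, y = y)) + geom_point()
  p
}

small <- data.frame(x = 1:10)
size_heavy <- serialized_size(plot_heavy(small))
size_lean  <- serialized_size(plot_lean(small))

size_heavy - size_lean  # roughly the size of scratch
```

As noted in the comments, this only helps if nothing the plot needs later is removed.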

If you want to get an accurate size for your object, use: length(serialize(p1,NULL)). As stated above, this difference comes from the environments.
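
A small base-R sketch of why the two measures can disagree; serialized_size and make_holder are hypothetical helpers, and the list holding an environment stands in for a ggplot object. object.size() does not count what an environment contains, while serialize() writes it out:

```r
# Hypothetical helper: bytes written when serializing x (uncompressed).
serialized_size <- function(x) length(serialize(x, NULL))

# Stand-in for a plot object that keeps a reference to a function environment.
make_holder <- function(payload) {
  force(payload)             # make sure the value is stored in this environment
  list(env = environment())
}

holder <- make_holder(rnorm(1e5))

in_memory  <- as.numeric(object.size(holder))  # ignores the environment's contents
serialized <- serialized_size(holder)          # includes the environment's contents

c(in_memory = in_memory, serialized = serialized)
```

Note that saveRDS() compresses by default, so the file on disk will usually be smaller than the serialized length.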

T.Gulea

Apologies for coming to this question 4 years late - it's been very useful to me today while trying to work out what some code of mine is doing.

I think I can add something on the difference between the size of p1 and p2. I think p1 is smaller specifically because the environment is GlobalEnv:

  • When the plotting code is wrapped in a function, the function environment is a newly created environment, and plot_env seems to store its entire contents.
  • However, when it is not, the environment is just GlobalEnv, and in this case plot_env seems to store just a reference to GlobalEnv.

See an example to illustrate this here:

library(ggplot2)

# Create a plot inside a function ============================
test_plot <- function(a, b = 2){
  dat <- data.frame("a" = a, "b" = b)
  p <- ggplot(dat, aes(x = a, y = b)) + geom_point()
  return(p)
}
p_function <- test_plot(a = c(45, 46))
saveRDS(p_function, "plot_function.RDS")

# Create the same plot in the GlobalEnv ==================
a <- c(45, 46)
b <- 2
dat <- data.frame("a" = a, "b" = b)
p_free <- ggplot(dat, aes(x = a, y = b)) + geom_point()
saveRDS(p_free, "plot_free.RDS")

# Now read them back in =================================
p_function_read <- readRDS("plot_function.RDS")
p_free_read <- readRDS("plot_free.RDS")

# Examine their environments ===========================
print(names(p_function_read$plot_env))
print(names(p_free_read$plot_env))
# Note that p_free, p_free_read and p_function_read are all in the environment for p_free_read
# That seems weird, given they weren't even around when this plot was made!

# If we clear the environment, what happens?
rm(list = ls())
p_function_read <- readRDS("plot_function.RDS")
print(names(p_function_read$plot_env))

rm(list = ls())
p_free_read <- readRDS("plot_free.RDS")
print(names(p_free_read$plot_env))

# So p_function_read still has everything in its environment from before. But p_free_read has only itself.

# We can further check by adding something fresh to the global environment
hello <- 3
print(names(p_free_read$plot_env))

Do correct me where needed - I'm very new to working with environments at all. Really this should be a comment on the above, but I don't have enough reputation for that...

r_epi

Investigating the ggplot2 object with the help of length(serialize(x, NULL)), I found large environment data in several locations. Replacing those environments with the global environment reduces the RDS file size and, from what I can tell, does not negatively affect the saved object. With x as my ggplot2 plot, which used color mapping, I did this:

x$plot_env <- globalenv()
attr(x$mapping$x, ".Environment") <- globalenv()
attr(x$mapping$y, ".Environment") <- globalenv()
attr(x$mapping$colour, ".Environment") <- globalenv()
attr(x$layers[[1]]$computed_mapping$x, ".Environment") <- globalenv()
attr(x$layers[[1]]$computed_mapping$y, ".Environment") <- globalenv()
attr(x$layers[[1]]$computed_mapping$colour, ".Environment") <- globalenv()

You may have to repeat for other elements if you have additional or different mapping variables.

Mike