
I use the stat_smooth() function in ggplot to graph large data sets. It works fine until I have more than 100,000 rows. Then it returns the error:

'Calloc' could not allocate memory (18446744073673801728 of 8 bytes)

I am working on a server with 48 GB of RAM, and Task Manager shows that memory is still available when the error occurs.

There was a similar question about the ctree() function: "'Calloc 'could not allocate memory" in 64-bit R

Is my problem arising from a limitation in stat_smooth() or ggplot()? Has anyone else tried to run large data sets in either function? Do you have the same issue or have you had success?

Jordan

1 Answer


I found this explanation from Dennis Murphy on the ggplot2 Google Group (https://groups.google.com/forum/#!topic/ggplot2/enavD18MmkY):

Hi:

To expand on Hadley's comments, I would suggest that you consult ?loess and read the chapter in the White book that it cites to gain a better understanding of how the loess procedure operates. It makes no sense to me why one would want to use a span of 1 to fit a loess model with 40K observations.

By its nature, loess is a "local regression" algorithm, with emphasis on the "local". The span argument controls the proportion of the data that should be used to produce each local fit - the wider the span, the smoother the fitted function. If you look carefully at the algorithm itself, you'll discover that it's VERY memory intensive, so if you insist on a loess fit with a large sample size, at least reduce the span. Here is an example to illustrate that you can indeed fit a loess model in ggplot2 with 40K observations.

set.seed(1)  # fix the random noise so the plots are reproducible
x <- seq(0, 100, length.out = 40000)
# A periodic function plus standard normal noise
DF <- data.frame(x = x, y = 1 + sin(x) + 0.5 * cos(2 * x) + rnorm(40000))

library(ggplot2) 

# Uses the default "auto" method to which Hadley referred: 
ggplot(DF, aes(x = x, y = y)) + 
   geom_point(alpha = 0.05, shape = 21) + 
   geom_smooth(size = 1) 

The result of this [gam] fit, which finishes rather quickly, is more or less equivalent to a loess model with a large span (such as 1), but far more computationally efficient. The periodicity is almost completely ignored in the fitted curve because it has been averaged away. To capture the periodicity with a local regression algorithm, you need to reduce the proportion of the data devoted to each local fit. The following call takes roughly 1.5-2 minutes to run (a guess, not a measurement), but it does eventually produce a loess fit on my laptop (12 GB RAM, 64-bit R 3.2.0, i7 chip):

ggplot(DF, aes(x = x, y = y)) + 
   geom_point(alpha = 0.05, shape = 21) + 
   geom_smooth(method = "loess", span = 0.1, size = 1) 

When I ran this in the R GUI, I got the "Not responding" message while R was cranking away mightily, but eventually a graph did appear.

You should be able to get a more accurate local fit by reducing the span further, since span = 0.1 in this example means that it's using approximately 4000 points per local fit, which is far more than it needs for a curve this simple in form. The following call took about 8-10 seconds, with one difference in the specification:

ggplot(DF, aes(x = x, y = y)) + 
   geom_point(alpha = 0.05, shape = 21) + 
   geom_smooth(method = "loess", span = 0.005, size = 1) 

In this call, span = 0.005 means that approximately 200 observations are used in each local fit, which is still fairly large. I would recommend experimenting with slightly smaller and larger spans to see how it affects the fitted loess model. The choice of span should be informed by the number of rows in the input data frame, the shape of the noisy input function and the degree of smoothness desired.
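
The per-fit point counts quoted above follow from multiplying the number of rows by the span. A minimal sketch of that arithmetic in base R (the variable names are illustrative, and loess does its own rounding internally, so treat the results as approximate):

```r
# Approximate number of observations used in each local fit: n * span.
n <- 40000                     # rows in the example data frame
spans <- c(1, 0.1, 0.005)      # spans discussed in this thread
points_per_fit <- round(n * spans)

# Show the span next to the approximate neighbourhood size.
print(data.frame(span = spans, points_per_fit = points_per_fit))
```

By the same arithmetic, a data set of more than 100,000 rows fitted with a span of 0.1 would use at least 10,000 points in every local fit.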

The example was deliberately chosen to illustrate why the choice of span matters in loess. On the other hand, the error message you received indirectly signals that loess is a memory hog and you need to know enough about how it works as an algorithm in order to use it productively.

Dennis
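
To see the effect of span without waiting on a 40K-row fit, here is a minimal sketch that calls base R's loess() directly on a smaller simulated sample of the same periodic function. The sample size (2,000), the seed, and the two span values are assumptions chosen only to keep the run short; they are not from Dennis's post.

```r
set.seed(1)
n <- 2000
x <- seq(0, 100, length.out = n)
small <- data.frame(x = x, y = 1 + sin(x) + 0.5 * cos(2 * x) + rnorm(n))

# Wide span: each local fit uses 75% of the data (the loess() default).
fit_wide <- loess(y ~ x, data = small, span = 0.75)

# Narrow span: each local fit uses about 2% of the data (roughly 40 points).
fit_narrow <- loess(y ~ x, data = small, span = 0.02)

# Compare the residual spread of the two fits. The wide fit averages the
# periodic signal away, so that signal ends up in its residuals.
sd_wide <- sd(residuals(fit_wide))
sd_narrow <- sd(residuals(fit_narrow))
print(c(wide = sd_wide, narrow = sd_narrow))
```

The narrow-span fit should leave the smaller residual spread, because it can follow the sine and cosine terms that the wide-span fit smooths over.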

When I reduce the span, my code runs without error, although even with a span of 0.1 my last data set took 2 hours to run. With a span of 0.01 I got an error telling me to increase the span.
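
Given the two-hour run reported above, one option worth trying is to fit the loess curve on a random subsample while still plotting every point, and to skip the confidence band. This is only a sketch under assumptions: the simulated data frame `big`, the 20,000-row subsample, and the span value are hypothetical and not from my actual data, and subsampling will change the fitted curve slightly.

```r
library(ggplot2)

set.seed(42)
n <- 150000                                   # hypothetical large data set
big <- data.frame(x = seq(0, 100, length.out = n))
big$y <- 1 + sin(big$x) + 0.5 * cos(2 * big$x) + rnorm(n)

# Random subsample used only for the smoother (size is an assumption).
sub <- big[sample(nrow(big), 20000), ]

# All points are drawn from `big`; the loess layer sees only `sub`.
# se = FALSE skips the standard-error band.
p <- ggplot(big, aes(x = x, y = y)) +
  geom_point(alpha = 0.05, shape = 21) +
  geom_smooth(data = sub, method = "loess", formula = y ~ x,
              span = 0.01, se = FALSE)

# Call print(p) to draw the plot.
```

With 20,000 rows and span = 0.01, each local fit uses about 200 points, the same neighbourhood size as Dennis's span = 0.005 example on 40,000 rows.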

Jordan