
I am trying to run a fixed effects regression in R. The linear model works just fine without the fixed effects factor, but when I include the factor (a numeric code for user ID), I get the following error:

Error in rep.int(c(1, numeric(n)), n - 1L) : cannot allocate vector of length 1055470143

I am not sure what the error means, but I fear it may be an issue with how the variable is coded in R.

vijkrishb
  • That sort of error usually means you are running out of working memory. It appears R is trying to allocate a vector of 1 billion values in your instance. – thelatemail Jul 11 '13 at 05:25
  • Should I recode the user ID variable as a string then? Would that make a difference? My dataset is only around 30K observations. – vijkrishb Jul 11 '13 at 05:29
  • You are going to have to provide a little more information about the code you are using so we can figure out the issue. What is your regression call that works, and what is the one that doesn't work? e.g. `glm(y ~ x)`. Are you also able to provide a sample of your data? e.g., `dput(head(putdatanamehere))` – thelatemail Jul 11 '13 at 05:36
  • Before doing any modelling, you should think about what the user ID variable represents, and whether it'll actually add anything to the model. – Hong Ooi Jul 11 '13 at 05:38
  • `reg_lm = lm(Y~ X1 + X2 + X3 + X4 + X5 + X6 + X7 +factor(user_id), data=input_reg)` – vijkrishb Jul 11 '13 at 05:42
  • Sorry, hit enter before I finished typing it all. The model looks something like this: `reg_lm = lm(Y~ X1 + X2 + X3 + X4 + X5 + X6 + X7 +factor(user_id), data=input_reg)` When I run this without the "factor(user_id)" it works just fine and I get results that are close to expectation (so the simple linear model works just fine). The reason I want to account for the user ID variable is that there is A LOT of variability in the observed Y depending on the user ID, so I am trying to control for it and see if anything meaningful can be gleaned from that. Hope that helps. – vijkrishb Jul 11 '13 at 05:48
  • Please edit your answer with this additional code. – Roman Luštrik Jul 11 '13 at 06:10
  • Hi Roman, I am not sure what additional code you are referring to. – vijkrishb Jul 11 '13 at 06:17

2 Answers


I think this is more a statistical problem than a programming one, for two reasons:

First, I am not sure whether you are using cross-sectional data or panel data. If you are using cross-sectional data, it doesn't make sense to control for 30,000 individuals (of course, they will add to the variation).

Second, if you are using panel data, there are good packages, such as plm in R, that do this kind of computation.
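For instance, a minimal sketch of the within (fixed effects) estimator in plm, using the variable names from the model in the comments; the "time" column indexing the panel periods is an assumption, since the question doesn't show one:

library(plm)

# Declare the panel structure: user_id identifies individuals,
# "time" (hypothetical column) identifies the periods
pdata <- pdata.frame(input_reg, index = c("user_id", "time"))

# model = "within" absorbs the user fixed effects instead of
# expanding user_id into ~30,000 dummy variables
fit_fe <- plm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7,
              data = pdata, model = "within")
summary(fit_fe)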

Metrics
  • Yes, you're right. I spent the night digging into the data (the sample was provided to me) and I realize now that while I thought the individual user IDs were sampled (so that there'd be a few 1000 users accounting for the 30K observations), it turns out I had close to 30K individuals, so yes, the data turned out to be cross-sectional. I am currently trying to go back to the source data and sample it correctly, and I think it should run cleanly after that. – vijkrishb Jul 11 '13 at 17:21
  • Why not accept this as answer so that it will be useful for future users? – Metrics Jul 11 '13 at 17:42
  • Oh sorry! Thought I did. We've cleaned up our data sample and plm is working now. :) – vijkrishb Jul 11 '13 at 22:18

An example:

# Simulate 100,000 observations of one covariate and a 1,000-level factor
set.seed(42)
DF <- data.frame(x = rnorm(1e5), id = factor(sample(seq_len(1e3), 1e5, TRUE)))
DF$y <- 100 * DF$x + 5 + rnorm(1e5, sd = 0.01) + as.numeric(DF$id)^2

# lm expands id into 999 dummy variables in the design matrix
fit <- lm(y ~ x + id, data = DF)

This needs almost 2.5 GB of RAM for the R session (add the RAM needed by the OS and that is more than many PCs have available) and takes some time to finish. The result is pretty useless.
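To see where the memory goes, you can build the design matrix yourself; model.matrix creates the same matrix that lm uses internally:

mm <- model.matrix(y ~ x + id, data = DF)
dim(mm)                                # 100000 rows, 1001 columns
format(object.size(mm), units = "Mb")  # a dense double matrix of roughly 760 Mb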

If you don't run into RAM limitations, you can run into limitations on vector length (e.g., if you have even more factor levels), in particular if you use an older version of R.
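For reference, before long vectors were introduced in R 3.0.0, no single vector could hold more than 2^31 - 1 elements:

.Machine$integer.max  # 2147483647, the old cap on vector length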

What happens?

One of the first steps in lm is creating the design matrix using the function model.matrix. Here is a smaller example of what happens with factors:

model.matrix(b ~ a, data = data.frame(a = factor(1:5), b = 2))

#   (Intercept) a2 a3 a4 a5
# 1           1  0  0  0  0
# 2           1  1  0  0  0
# 3           1  0  1  0  0
# 4           1  0  0  1  0
# 5           1  0  0  0  1
# attr(,"assign")
# [1] 0 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$a
# [1] "contr.treatment"

See how n factor levels result in n-1 dummy variables? If you have many factor levels and many observations, this matrix gets huge.
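This also fits the exact number in the question's error (an educated guess from the error signature, not something the question confirms): the rep.int(c(1, numeric(n)), n - 1L) call is how R builds the n-by-n identity matrix behind the treatment contrasts of an n-level factor, a vector of length (n + 1) * (n - 1) = n^2 - 1:

sqrt(1055470143 + 1)  # 32488, consistent with a user_id factor of ~32,000 levels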

What should you do?

I'm quite sure you should use a mixed effects model. There are two important packages that implement linear mixed effects models in R: nlme and the newer lme4.

library(lme4)

# One random intercept per id replaces the 999 dummy variables
fit.mixed <- lmer(y ~ x + (1 | id), data = DF)
summary(fit.mixed)

Linear mixed model fit by REML 
Formula: y ~ x + (1 | id) 
Data: DF 
    AIC     BIC  logLik deviance REMLdev
1025277 1025315 -512634  1025282 1025269
Random effects:
 Groups   Name        Variance   Std.Dev. 
 id       (Intercept) 8.9057e+08 29842.472
 Residual             1.3875e+03    37.249
Number of obs: 100000, groups: id, 1000

Fixed effects:
             Estimate Std. Error t value
(Intercept) 3.338e+05  9.437e+02   353.8
x           1.000e+02  1.180e-01   847.3

Correlation of Fixed Effects:
  (Intr)
x 0.000

This needs very little RAM, computes quickly, and is a more correct model.

See how the random intercept accounts for most of the variance?
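If you want to look at the estimated per-user intercepts themselves:

head(ranef(fit.mixed)$id)  # one intercept deviation per id level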

So, you need to study mixed effects models. There are some nice publications, e.g., Baayen, Davidson & Bates (2008), that explain how to use lme4.

Roland
  • Turns out the sample provided was not really randomized and we got only one observation per user. I had to fix that, and we're now able to run FE models, though I'll take a read of the ME model as well. – vijkrishb Jul 11 '13 at 22:19