
I am trying to run a basic regression model in R. Previously, I always used the lm() function without any issues, but my data frame now seems too large for this function and my machine. After letting lm() run on my dataset for 30 minutes without any sign of progress, I stopped it, which crashed RStudio. The computer I am using has 24 GB of RAM.

My regression model is:

lm(y~var1+var2+var3+var4, data = df)

The data I am trying to run lm() on has n = 100,000 observations, with 4 independent variables (one numeric, three factor), and is normally distributed.
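
For reference, fake data of roughly this shape can be generated as follows (a minimal sketch; the column names, factor levels, and distributions are assumptions, not my actual data):

# Sketch: simulated data of the same size and column types (all names/levels assumed)
set.seed(42)
n  <- 100000
df <- data.frame(
  var1 = factor(sample(c("x1", "x2"), n, replace = TRUE)),
  var2 = factor(sample(c("factor1", "factor2"), n, replace = TRUE)),
  var3 = factor(sample(c("factorx", "factory"), n, replace = TRUE)),
  var4 = rnorm(n),   # the numeric predictor
  y    = rnorm(n)    # normally distributed response
)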

I found that the glm4() function (from the MatrixModels package) is a lot faster and does not crash R in my case. However, calling summary() on the fitted object does not produce a summary table:

library(MatrixModels)

fit <- glm4(y~var1+var2+var3+var4, data = df, sparse = TRUE, family = gaussian)

summary(fit)
  Length    Class     Mode 
       1 glpModel       S4

Extracting the coefficients with head(coef(fit)) does work; however, I would prefer a full summary table.

head(coef(fit))
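
In the meantime, since there is no summary() method for the "glpModel" class that glm4() returns, a summary-style table for a Gaussian fit can be pieced together by hand from the sparse model matrix. A minimal sketch, assuming df and the formula above, and that the coefficient order matches the model matrix columns:

library(Matrix)

# Rebuild the sparse design matrix used by glm4()
X    <- sparse.model.matrix(~ var1 + var2 + var3 + var4, data = df)
beta <- coef(fit)
res  <- df$y - as.numeric(X %*% beta)          # residuals
s2   <- sum(res^2) / (nrow(X) - ncol(X))       # residual variance estimate
se   <- sqrt(diag(solve(crossprod(X))) * s2)   # standard errors from (X'X)^-1
data.frame(estimate = beta, std.error = se, t.value = beta / se)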

I also saw this topic: Is there a faster lm function, in which the functions lm.fit() and .lm.fit() are discussed; however, those functions take a model matrix as input rather than a formula and data frame, so their syntax differs from the other functions. The speedglm() function from the speedglm package returns an error in my case, and most topics on alternatives to lm() and glm() are outdated.
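
For what it's worth, the matrix interface only differs in that you build the design matrix yourself; a minimal sketch with .lm.fit(), using the same assumed variable names as above:

# .lm.fit() skips formula handling and takes a numeric model matrix directly
X   <- model.matrix(~ var1 + var2 + var3 + var4, data = df)
fit <- .lm.fit(X, df$y)
head(fit$coefficients)   # coefficients in the order of the model matrix columns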

What is the best way to run an lm() on a large dataset currently?

  • Why won't the answer on the existing question work for you? This seems like a duplicate as there's really no new information here that would change the answers you already found. You'll just need to test with your data and determine what works for you. There is no clear definition for "best". Use what works for you. – MrFlick Mar 30 '22 at 19:24
  • 1
    Can you spell out more what you're doing? I constructed fake data like yours for n = 100k and ran your `lm(y...` line and it took ~1 second, and n=10M took about 25 seconds. Can you reproduce the problem with fake generated data? – Jon Spring Mar 30 '22 at 19:32
  • 1
    100k isn't really that big at all I'd expect ``lm`` to deal perfectly fine with that unless your processor is 20 years old. – user438383 Mar 30 '22 at 19:33
  • Thanks all for your replies. It seems that the problem is in my dataset, not in my computer (an ~8-year-old 3.5 GHz i7 with 24 GB RAM) or the R function. I have never had any issues with regressions on smaller datasets. I exported the columns I need for the regression to a .csv and imported it again, and now the regression runs in less than 2 seconds as well. Strangely enough, if I run the regression on the exact same columns in my main data frame, lm() fails to run (it hangs). – M1ke Mar 30 '22 at 19:50
  • 3
    How many unique factors do you have per variable? If there many different factors you are not solving 1 linear regression but potential 10's of thousands of regressions. – Dave2e Mar 30 '22 at 22:14
  • @Dave2e Yes, you are right, one of the variables was accidentally specified as a character instead of a factor. Sorry for the amateurish mistake. – M1ke Apr 01 '22 at 10:06

1 Answer


Apparently, it should not be a problem to run a regression on a dataset of ~100,000 observations.

After receiving helpful comments on the main post, I inspected the data type of every column in the data frame (df) with the following command and found that one of the independent variables was coded as a character:

str(df)

$ var1           : chr  "x1" "x2" "x1" "x1"
$ var2           : Factor w/ 2 levels "factor1","factor2": 1 1 1 2
$ var3           : Factor w/ 2 levels "factorx","factory": 2 1 1 2
$ var4           : num  1 8 3 2
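
Following Dave2e's comment, it is also worth checking how many distinct values each predictor has, since a character or factor column with thousands of distinct values means as many dummy variables for lm() to fit. A quick check (column names as above):

# Number of distinct values per predictor
sapply(df[c("var1", "var2", "var3", "var4")], function(x) length(unique(x)))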

Changing var1 to a factor variable:

df$var1 <- as.factor(df$var1)

After changing var1 to a factor variable, the regression indeed runs within a few seconds.
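
With the corrected column type, the original call completes and summary() gives the full table as usual:

fit <- lm(y ~ var1 + var2 + var3 + var4, data = df)
summary(fit)   # standard lm() summary table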

M1ke