Optimize my regression using vectorization instead of a for loop

Question

How would I vectorize this loop? When I have the loop with the backward stepwise regression, it takes over 15 minutes to run through the regression. (My full dataset has over 4000 observations and 20+ independent variables.) Any idea how I would vectorize this? I'm new to the whole concept.

I've looked into making this a function, and then using an ifelse statement for the training and validation. But, I haven't been able to get this to work in the code. Any ideas?

Here is a small dataset:

name <- c("Joe I.", "Joe I.", "Joe I.", "Joe I.", "Jane P.", "Jane P.", "Jane P.", "Jane P.", 
          "John K.", "John K.", "John K.", "John K.") 
name_id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
grade <- c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67)
score <- c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62)
attendance <- c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99)
participation <- c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)

# `class` was never defined as a column above, so it is left out here.
# data.frame() keeps numeric columns numeric, so no as.numeric() conversions are needed.
df <- data.frame(name, name_id, grade, score, attendance, participation)

Here is the loop:

library(MASS)      # stepAIC()
library(magicfor)  # magic_for(), magic_result_as_dataframe()

magic_for(print, silent = TRUE)
for(i in 1:3){
  validation = df[df$name_id == (i),]
  training = df[df$name_id != (i),]
  m = lm(score ~ grade + attendance + participation, data = training)
  stepm <- stepAIC(m, direction = "backward", trace = FALSE)
  pred1 <- predict(stepm, validation)
  print(pred1)
}
options(max.print=999999)
pred1 <- magic_result_as_dataframe()

The only expensive parts of your loop are the calls to `lm` and `stepAIC`. There are faster alternatives to `lm`. See here: https://stackoverflow.com/questions/25416413/is-there-a-faster-lm-function As for `stepAIC`, consider similar functions in the glmnet package, which is built on FORTRAN and is fast (http://www.talkstats.com/threads/faster-stepwise-selection-for-a-large-data-set.52674/) — jdobres, Nov 27 '19 at 04:15
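As a rough illustration of the "faster alternatives to `lm`" point, here is a minimal sketch using base R's `lm.fit()`, which skips the formula handling and works directly on a design matrix. The data is the small example from the question, and the three-predictor formula is an assumption about what the model was meant to be. Note that `lm.fit()` returns a plain list rather than an `lm` object, so it cannot be passed to `stepAIC` as is.

```r
# Toy data: the numeric columns from the question's small example.
grade <- c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67)
score <- c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62)
attendance <- c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99)
participation <- c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
df <- data.frame(grade, score, attendance, participation)

# Full fit through the usual formula interface.
fit_lm <- lm(score ~ grade + attendance + participation, data = df)

# Same fit through lm.fit(): build the design matrix once, then fit on it.
X <- model.matrix(~ grade + attendance + participation, data = df)
fit_fast <- lm.fit(X, df$score)

# Both routes should agree on the coefficients.
all.equal(coef(fit_lm), fit_fast$coefficients)
```

Whether this saves meaningful time depends on how much of the 15 minutes is spent in `lm` itself rather than in the stepwise search, so it is worth profiling before restructuring anything.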

score 0 · Answer 1 · answered Nov 27 '19 at 04:12

I am not sure whether the following code will speed up your program, but please give it a try. Here `df` is first split by `df$name_id`, so that you get one chunk per `name_id`:

dfs <- split(df,df$name_id)
lapply(seq_along(dfs), function(k) {
  validation <- dfs[[k]]
  training <- do.call(rbind, dfs[-k])  # every chunk except the k-th
  m <- lm(score ~ grade + attendance + participation, data = training)
  stepm <- stepAIC(m, direction = "backward", trace = FALSE)
  pred1 <- predict(stepm, validation)
})
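If the goal is to end up with one data frame of predictions (what `magic_result_as_dataframe()` does in the question), the list returned by `lapply` can be bound together directly. The following is a sketch on the question's toy data, assuming `participation` was meant to be a third predictor; the column name `predicted` and the object names `preds` and `result` are made up for illustration.

```r
library(MASS)  # provides stepAIC()

# Toy data from the question, numeric columns only.
df <- data.frame(
  name_id       = rep(1:3, each = 4),
  grade         = c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67),
  score         = c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62),
  attendance    = c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99),
  participation = c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
)

dfs <- split(df, df$name_id)

# One fold per name_id: fit on the other chunks, predict the held-out chunk.
preds <- lapply(seq_along(dfs), function(k) {
  validation <- dfs[[k]]
  training <- do.call(rbind, dfs[-k])
  m <- lm(score ~ grade + attendance + participation, data = training)
  stepm <- stepAIC(m, direction = "backward", trace = FALSE)
  data.frame(
    name_id   = validation$name_id,
    score     = validation$score,
    predicted = unname(predict(stepm, newdata = validation))
  )
})

# Stack the per-fold results into a single data frame.
result <- do.call(rbind, preds)
```

This removes the need for `print` and `magicfor` to capture output, but it does not change how many models are fitted, so on its own it should not be expected to reduce the run time much.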

`lapply` is no faster than `for`-loops. — IRTFM, Nov 27 '19 at 17:16
