R predict glm fit on each column in data frame using column index number

Question

Trying to fit BLR model to each column in data frame, and then predict on new data pts. Have a lot of columns, so cannot identify the columns by name, only column number. Having reviewed the several examples of similar nature on this site, cannot figure out why this does not work.

df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y = rbinom(1000, size = 1, prob = 0.40))

for (i in 1:length(df)-1)
{
        fit <- glm (y ~ df[,i], data = df, family = binomial, na.action = na.exclude)

        new_pts <- data.frame(seq(min(df[,i], na.rm = TRUE), max(df[,i], na.rm = TRUE), len = 200))
        names(new_pts) <- names(df[, i])

        new_pred <- predict(fit, newdata = new_pts, type = "response")

}

The predict() function raises warning message and returns array 1000 elements long, whereas the test data has only 200 elements.

Warning message : Warning message: 'newdata' has 200 lines bu the variables found have 1000 lines

Manuel Bickel · Accepted Answer · 2018-08-12T20:50:43.387

For repeated modelling I use a similar approach as shown below. I have implemented it with data.table, but it could be rewritten to use the base data.frame (the code would then be more verbose, I guess). In this approach I store all the models in a separate object (below I have provided two versions of the code, one more explanatory part, and one more advanced aiming at a clean output).

Of course, you could also write a loop/function that only fits one model per iteration without storing them. From my perspective, its a good idea to save the models, since you probably will have to investigate the models for robustness, etc. and not only predict new values.

HINT: Please also have a look at the answer of @AndS. providing a tidyverse approach. Together with this answer, I think, this is certainly a nice side by side comparison for learning/understanding data.table and tidyverse approaches

# i have used some more simple data to show that the output is correct, see the plots
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))
library(data.table)
setDT(df)
# prepare the data by melting it
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
# also i used a more simple model (in this case lm would also do)
# create model for each variable (formerly columns)
models = setnames(DT[, data.table(list(glm(y ~ x))), by = "variable"], "V1", "model")
# create a new set of data to be predicted
# NOTE: this could, of course, also be added to the models data.table
# as new column via `:=list(...)`
new_pts = setnames(DT[, seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), len = 200), by = variable], "V1", "x")
# add the predicted values
new_pts[, predicted:= predict(models[variable == unlist(.BY), model][[1]], newdata = as.data.frame(x),  type = "response")
        , by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(new_pts[variable == "x1", .(x, predicted)])
points(df$x2, df$y)
lines(new_pts[variable == "x2", .(x, predicted)])

# also the following version of above code is possible
# that generates only one new objects in the environment
# but maybe looks more complicated at first sight
# not sure if this is the best way to do it
# data.table experts might provide some shortcuts
setDT(df)
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
DT = data.table(variable = unique(DT$variable), dat = split(DT, DT$variable))
DT[, models:= list(list(glm(y ~ x, data = dat[[1]]))), by = variable]
DT[, new_pts:= list(list(data.frame(x = dat[[1]][
                                                 ,seq(min(x, na.rm = TRUE)
                                                 , max(x, na.rm = TRUE), len = 200)]
                                    )))
       , by = variable]
models[, predicted:= list(list(data.frame(pred = predict(model[[1]]
                                          , newdata = new_pts[[1]]
                                          ,  type = "response")))),
       by = variable]
plot(df$x1, df$y)
lines(models[variable == "x1", .(unlist(new_pts), unlist(predicted))])
points(df$x2, df$y)
lines(models[variable == "x2", .(unlist(new_pts), unlist(predicted))])

AndS. · Answer 2 · 2018-08-12T20:42:12.243

The answer above does a great job. Here is another option for this sort of thing. First we take the dataframe from wide to long, next we nest the data by group, then we run a model per group, lastly we map out the predicted values from the models and unnest our dataframe. I plotted the predicted values to show that you get a reasonable result. Note that before we unnest the data, we keep the model within the dataframe and we can extract other information we need as well before unnesting.

library(tidyverse)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))

pred_df <- df %>% 
  gather(var, val, -y) %>% 
  nest(-var) %>% 
  mutate(model = map(data, ~glm(y~val, data = .)), 
         predicted = map(model, predict)) %>% 
  unnest(data, predicted)

p1 <- pred_df %>% 
  ggplot(aes(x = val, group = var))+
  geom_point(aes(y = y))+
  geom_line(aes(y = predicted))
p1

EDIT

Here we will keep the models in the data frame and then pull out extra info.

df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(model = map(data, ~glm(y~val, data = .)), 
           predicted = map(model, predict))
#   var   data              model     predicted 
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]>
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]>

Now we can pull out the other info we are interested in

df2 <- df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(model = map(data, ~glm(y~val, data = .)), 
           predicted = map(model, predict)) %>%
    mutate(intercept = map(model, ~summary(.x)$coefficients[[1]]),
           slope = map(model, ~summary(.x)$coefficients[[2]]))
df2
#   var   data              model     predicted  intercept slope    
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>

Then we just unnest to pull out the values, but keep the rest of the info nested.

df2 %>% unnest(intercept, slope)
#   var   data              model     predicted  intercept slope
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]>      0.8  0.200
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]>      3.35 0.173

Another option is to make a function that maps all of the data that we want into a nested list and then we can pull out the elements we want as we need them

get_my_info <- function(dat){
    model <- glm(y~val, data = dat)
    predicted <- predict(model)
    intercept <- summary(model)$coefficients[[1]]
    slope <- summary(model)$coefficients[[2]]
    return(list(model = model,predicted = predicted, intercept = intercept, slope = slope))
}

df3 <- df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(info = map(data, get_my_info))
df3
#   var   data              info      
# 1 x1    <tibble [10 × 2]> <list [4]>
# 2 x2    <tibble [10 × 2]> <list [4]>

and if we want to pull out the predicted values

df3 %>% mutate(pred = map(info, ~.x$predicted))
#   var   data              info       pred      
# 1 x1    <tibble [10 × 2]> <list [4]> <dbl [10]>
# 2 x2    <tibble [10 × 2]> <list [4]> <dbl [10]>

Nice alternative. I am not very familiar with tidyverse and would be interested how you would store the models along with the predicted values, etc. in a tidyverse approach. I would appreciate if you find the time to add this to your answer. — Manuel Bickel, Aug 12 '18 at 19:03
Not a problem. I’ll update in a few minutes, but it’s as easy as omitting the unnest command at the end. — AndS., Aug 12 '18 at 19:40
Just updated. I found the whole nesting concept tricky, but It actually does make things easier, especially when you're applying functions to groups that do not output the same number of rows/elements as the dataset. — AndS., Aug 12 '18 at 20:44
Thank you for the detailed update! This is a nice side by side comparison of the different approaches, which I will certainly come back to when dealing with tidyverse approaches. I hope the votes will more or less be equal between the answers. I have updated my answer with a hint to have a look at yours as well. — Manuel Bickel, Aug 12 '18 at 20:51

R predict glm fit on each column in data frame using column index number

2 Answers2

Linked