Writing loop/function to generate various linear regressions on same dataframe

Question

I am writing loops or functions in R, and I still haven't really understood how to do that. Currently, I need to write a loop/function (not sure which one would be better) to create several linear regression models within the same data frame.

I have data like this:

dataset <- read.table(text = 
"ID  A_2 B_2 C_2 A_1 B_1 C_1 AGE
M1  10  6   6   8   8   9   25
M2  50  69  54  67  22  44  16
M3  5   80  44  78  5   55  18
M4  60  70  52  89  3   56  28
M5  60  5   34  90  80  56  34
M6  55  55  67  60  100 77  54", header = TRUE, stringsAsFactors = FALSE)

I am building models like this:

model1 <- lm(A_2~A_1+age, data=dataset)

model2 <- lm(B_2~B_1+age, data=dataset)

model3 <- lm(C_2~C_1+age, data=dataset)

I need to write a loop which:

takes variable _2 (the dependent variable) and variable _1 (independent variable) and covariates like age ...
creates the lm models, and stores outputs (i.e, T-value, p-value, confidence intervals etc) in a data.frame that I can then print.

Dep_va  Ind_var Convarites  Pvalue  "upper.cI" "low.cI" 

A_2 A_1 age         
B_2 B_1 age         
C_2 C_1 age         
D_2 D_1 age

Hi vinoth. It helps if you add a sample of your data (in r code). You can use the dput function on your data object, then paste the result in the question. — Luis, May 13 '19 at 21:41
See https://stackoverflow.com/questions/56077099/regarding-multiple-linear-models-simultaneously/56078311 — G. Grothendieck, May 13 '19 at 21:58
For automating arbitrary column/variable references, it's typically easier to use the column-slicing syntax in `lm` instead of the formula interface `lm(A_2~A_1+age, ...)`. In that case you might pass (vectors of) column-indices into your function instead of (string) variable names. Vectors of column-indices are better than string names, since you can use arbitrary number of variables; a tiny downside is it might make things less legible. — smci, May 13 '19 at 22:01
Related: [Print R-squared for all of the models fit with lmList](https://stackoverflow.com/questions/23501852/print-r-squared-for-all-of-the-models-fit-with-lmlist), [Generate an array of regression models without for loop](https://stackoverflow.com/questions/39498490/generate-an-array-of-regression-models-without-for-loop). This is a good question but seems like a duplicate; although throwing the package reference `lme4/nlme`::`lmList` into a question without explaining what problem it solves is not a good canonical wording, and we shouldn't hardcode methodology questions to specific packages — smci, May 13 '19 at 22:05
@smci actually i am trying to build mixed model and simple linear model and automated code to select the variable_2 as dependent variable and Variable_1 as the independent variable and run the model and get the p-value and cI of variable — vka, May 13 '19 at 22:13
Right, how does that disagree with what I wrote? The question has essentially been asked in many prior incarnations, not verbatim but essentially the same question. And I recommended above you **stop referring to variables by name** (`variable_2` as dependent variable and `Variable_1` as the independent variable), and **start referring to them simply as vectors of column-indices in your dataframe**. — smci, May 13 '19 at 22:14

Dij · Answer 1 · 2019-05-13T23:19:32.227

0

Given your particular question, where you have variable names with common bases that are labelled with _2 and _1 to designate the dependent and independent variables respectively, you could solve your issue as follows:

var.names <- names(dataset[!names(dataset) %in% c("ID","AGE")])
names.list <- strsplit(var.names,split = "_")
list.of.models <- list()
for (i in 1:length(names.list)) {
DV <- grep(names.list[[i]][1], names(dataset))[1]
IV <- grep(names.list[[i]][1], names(dataset))[2]
  list.of.models[[i]] <- lm(dataset[,DV] ~ dataset[,IV] + AGE, data = dataset)
}

summary(list.of.models[[1]])

Call:
lm(formula = dataset[, DV] ~ dataset[, IV] + AGE, data = dataset)

Residuals:
        1         2         3         4         5         6 
  0.07496  20.42938 -31.36213  10.04093   4.47412  -3.65725 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        -15.0439    31.5482  -0.477    0.666
dataset[, IV]        0.4627     0.3319   1.394    0.258
AGE                  0.8507     0.7314   1.163    0.329

Residual standard error: 22.62 on 3 degrees of freedom
Multiple R-squared:  0.5276,    Adjusted R-squared:  0.2127 
F-statistic: 1.676 on 2 and 3 DF,  p-value: 0.3246

edited May 13 '19 at 23:19

answered May 13 '19 at 21:42

Dij

1,318
1
7
13

1

thanks for the answer its not working for my problem I will clearly explain it i have more than 100 variable in my dataset dependent variable in the model needs to same_name_2 and same_name_1 is the independent variable we need built linear model for it – vka May 13 '19 at 21:49
I updated the loop to fit the data you added to your question. It should work now! – Dij May 13 '19 at 21:51
so are you sayin all of your independent and dependent variables follow the format: [VAR.NAME1]_2 for the dependent variable and [VAR.NAME1]_1 for the independent var? – Dij May 13 '19 at 22:30
Yes could please read questions once again you can clear understand the problem – vka May 13 '19 at 22:42
I'm not sure what you want in the output still, but here is a way to grab your variable names. Given that they are in the same order (where the DV comes first, and the IV comes second for each pair) – Dij May 13 '19 at 23:20
this not working i am getting Error in model.frame.default(formula = df[, DV] ~ df[, IV] + Age, data = df, : invalid type (list) for variable 'df[, DV]' – vka May 14 '19 at 04:44
worked like a charm when i ran it. just copy and paste it – Dij May 14 '19 at 13:50

score 0 · Answer 2 · answered May 13 '19 at 22:26

Here's a tidyverse approach:

library(tidyverse)

dataset %>% 
  gather(col, val, -ID, -AGE) %>%
  separate(col, c("name", "num")) %>%
  spread(num, val) %>%
  group_by(name) %>%
  group_map(~lm(`2` ~ `1` + AGE, data = .x) %>% broom::tidy())

# A tibble: 9 x 6
# Groups:   name [3]
  name  term        estimate std.error statistic p.value
  <chr> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 A     (Intercept)  -15.0      31.5      -0.477  0.666 
2 A     `1`            0.463     0.332     1.39   0.258 
3 A     AGE            0.851     0.731     1.16   0.329 
4 B     (Intercept)   49.1      52.5       0.935  0.419 
5 B     `1`           -0.359     0.801    -0.448  0.685 
6 B     AGE            0.391     2.47      0.159  0.884 
7 C     (Intercept)    5.42     13.9       0.390  0.723 
8 C     `1`            0.932     0.289     3.23   0.0483
9 C     AGE           -0.299     0.470    -0.635  0.570

Explanation:

Move data to long/tidy format with gather()
separate the variable category (in the example data, A, B, etc)
Create a separate column for the IV and DV with spread()
Use group_by() and group_map() to apply lm() to each variable category.

Jason Johnson · Accepted Answer · 2019-05-14T15:02:11.110

Here is a base R approach to the problem with lapply loops.

First if you want to automatically extract the variable names ending in _2 which should be all of your dependent variables you could implement the following code:

dep_vars<-grep("_2$",colnames(dataset),value = T) #This selects all variables ending in `_2` which should all be dependent variables.

reg_vars<-gsub("_2$","",dep_vars) #This removes the `_2` from the dependent variables which should give you the common stem which can be used to select both dependent and independent variables from your data frame.

Then you can use this in your lapply loop for creating your formulas:

full_results <- lapply(reg_vars, function(i) summary(lm(paste0("log(",i,"_2)~",i,"_1+AGE"),data=dataset)))

Now you will have a list of linear regression summaries where you can extract the info you want. I think this is what you want for the output but please clarify if not:

print_results<-lapply(full_results,function(i) data.frame(
                                            Dep_va = names(attributes(i[["terms"]])$dataClasses)[1], 
                                            Ind_var = names(attributes(i[["terms"]])$dataClasses)[2],
                                            Covariates = names(attributes(i[["terms"]])$dataClasses)[3], 
                                            Pvalue = i[["coefficients"]][2,4],
                                            upper.cI = i[["coefficients"]][2,1]+1.96*i[["coefficients"]][2,2],
                                            low.cI = i[["coefficients"]][2,1]-1.96*i[["coefficients"]][2,2]))

That code will give you a list of data frames and if you want to combine it into one data.frame:

final_results<-do.call("rbind",print_results)

Output Results:

Dep_va Ind_var Covariates     Pvalue upper.cI     low.cI
1    A_2     A_1        AGE 0.25753805 1.113214 -0.1877324
2    B_2     B_1        AGE 0.68452053 1.211355 -1.9292236
3    C_2     C_1        AGE 0.04827506 1.497688  0.3661343

Hope that helps! Let me know if you were looking for different output results.

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/193305/discussion-on-answer-by-jason-johnson-writing-loop-function-to-generate-various). — Samuel Liew, May 14 '19 at 06:40
@Jason Johnson i used the exact above code just made one modified which i made my data frame name as df — vka, May 14 '19 at 06:47
@vinoth A N So you must have more than one underscore in your actual variable names so I edited my answer to account for this. Without being able to see the actual column names it is difficult to find the appropriate pattern to use to automatically select the variables but I hope that works for you. The code assumes all variables ending in `_2` are dependent variables and that removing the `_2` from these will leave you with the common name stems in order to select the corresponding `_1` independent variables for your regressions. — Jason Johnson, May 14 '19 at 13:42
@JasonJohnson I have take log transformation on the dependent variable in this step full_results <- lapply(reg_vars, function(i) summary(lm(paste0(i,"_2~",i,"_1+AGE"),data=dataset))) — vka, May 14 '19 at 14:45
@vinothAN Then the trick is to just write that in the `paste0` function. I updated the code for you. — Jason Johnson, May 14 '19 at 15:04
@vinothAN Great to hear! Please accept as an answer if this works for you. — Jason Johnson, May 14 '19 at 15:25
@JasonJohnson I going to try the same method with the mixed model and i will let you know if face some issue — vka, May 14 '19 at 15:28

Writing loop/function to generate various linear regressions on same dataframe

3 Answers3

Linked