solutions for factor levels when using linear models in R

Question

I am running linear models to look at the significance of independent factors involved. The example model is: `

mymod1 <- lm(temp ~ bgrp+psex+tb,data=mydat)
summary(mymod1)`

I look at the summary to check out the significance of each factor:

lm(formula = temp ~ bgrp + psex + tb, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.324459   0.186081 200.581  < 2e-16 ***
bgrp         0.256794   0.066167   3.881 0.000115 ***
psex         0.144669   0.055140   2.624 0.008913 ** 
tb           0.019818   0.009342   2.121 0.034287 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.03675,    Adjusted R-squared: 0.03209 
F-statistic: 7.897 on 3 and 621 DF,  p-value: 3.551e-05

Now, I would like to look at the solutions of the two levels of bgrp (1 and 2) and psex (1 and 2).

I would appreciate if you could help me with this.

Thanking you in advance,

Baz

EDIT:

I ran the first model you suggested and got the following:

mydat$bgrp <- as.factor(mydat$bgrp)

> summary(lm(temp ~ bgrp+psex+tb-1,data=mydat))

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = apirt)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
bgrp1 37.725922   0.135486 278.449  < 2e-16 ***
bgrp2 37.982716   0.129558 293.171  < 2e-16 ***
psex2  0.144669   0.055140   2.624  0.00891 ** 
tb     0.019818   0.009342   2.121  0.03429 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16

From the above coefficient table, bgrp1 and bgrp2 seem to make sense: bgrp1 represents the maternal lines with larger litter sizes, lighter offsprings, which results in lower rectal temperature (37.70 degrees C) of the offspring. On the other hand, bgrp2 represents the terminal lines with smaller litter size, heavier offsprings, which results in higher a rectal temperature (37.98 degrees C). I am just wondering, if the same could be done for psex1 and psex2, but what is presented in the table of coefficients could be due to what you said earlier.

EDIT: Hi Mark,

I tried the two options you suggested and I could see that bgrp1 and psex1 are taking on the same values:

> mybgrp <- lm(formula = temp ~ bgrp+psex+tb-1, data = mydat)
> mybgrp

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = mydat)

Coefficients:
   bgrp1     bgrp2     psex2        tb  
37.72592  37.98272   0.14467   0.01982  

> summary(mybgrp)

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
bgrp1 37.725922   0.135486 278.449  < 2e-16 ***
bgrp2 37.982716   0.129558 293.171  < 2e-16 ***
psex2  0.144669   0.055140   2.624  0.00891 ** 
tb     0.019818   0.009342   2.121  0.03429 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16 

> mypsex <- lm(formula = temp ~ psex+bgrp+tb-1, data = mydat)
> mypsex

Call:
lm(formula = temp ~ psex + bgrp + tb - 1, data = mydat)

Coefficients:
   psex1     psex2     bgrp2        tb  
37.72592  37.87059   0.25679   0.01982  

> summary(mypsex)

Call:
lm(formula = temp ~ psex + bgrp + tb - 1, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
psex1 37.725922   0.135486 278.449  < 2e-16 ***
psex2 37.870591   0.135908 278.649  < 2e-16 ***
bgrp2  0.256794   0.066167   3.881 0.000115 ***
tb     0.019818   0.009342   2.121 0.034287 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16

Thanks!

What do you mean by "solutions"? I've seen the term used in SAS but as I recall, it just means the table of coefficients, like you get from `summary.lm`. — Aaron left Stack Overflow, Mar 11 '12 at 19:05
Is there a way of looking at the coefficients of the different levels of these independent factors? — baz, Mar 11 '12 at 19:12
Define both bgrp and psex as factors. I suppose you could then create two models with the order of bgrp and psex reversed: summary(lm(temp ~ psex+bgrp+tb-1,data=mydat)) and summary(lm(temp ~ bgrp+psex+tb-1,data=mydat)). I believe both models are the exact same model, but would just be displaying different slopes. — Mark Miller, Mar 12 '12 at 13:35
@Mark Miller: That is an excellent approach! I was just wondering of how can I get both the levels for psex. Great work!!! — baz, Mar 12 '12 at 22:10
@Mark Miller, I tried to run the model the order of bgrp and psex reversed and bgrp1 and psex1 are taking on the same values. I have added the table of coefficients as an "EDIT" below my question above. Thanks! — baz, Mar 12 '12 at 23:36
I am not sure I understand. In each of the three snapshots of model output above you gave bgrp before psex. I do not see model output for a model in which psex appears before bgrp in the model statement. — Mark Miller, Mar 13 '12 at 03:45
Oh, wait. Sorry. I see now. I had to scroll down with the third snapshot. — Mark Miller, Mar 13 '12 at 04:43
@Mark Miller: Is there an easy way of avoiding the first levels of the two factors to take on the same values? — baz, Mar 13 '12 at 23:47
No, I do not think so. In fact, the slopes in the 2 models are not independent. The slopes are with respect to the intercept. Switching the order of the variables in the model only changes the intercept term. The value of temp is 37.7 units for psex1 in Bgrp1 when tb is zero regardless of whether psex1 is the intercept (the top model in the third snapshot) or bgrp1 is the intercept (the bottom model in the 3rd snapshot). — Mark Miller, Mar 14 '12 at 02:44
@Mark Miller, this is a very simple question and a bit embarrased to ask. In the model give: mybgrp <- lm(formula = temp ~ bgrp+psex+tb-1, data = mydat), total born (tb) is put in as a linear covariate and after getting its coefficient (0.01982), I am tryng to make sense out of it in regards to temperature (temp), which is the dependent variable. Any help would be appreciated! — baz, Apr 14 '12 at 01:35

Mark Miller · Accepted Answer · 2012-03-11T23:26:57.073

If there are only two levels to a variable (1 vs 2), that is the same as (0 vs 1) and the slope is for one of those 2 levels. The other level of the variable is included in the intercept term.

Maybe try

lm(formula = temp ~ bgrp + psex + tb - 1 , data = mydat)

to remove the intercept and see if that gives you what you want.

Then again, maybe I do not understand your question correctly.

Edit:

When I use fake data and set

bgrp <- as.factor(bgrp)
psex <- as.factor(psex)

without an intercept I get 2 slopes for one of the 2 factors. I believe R is holding the second slope for the second factor = 0.

Edit2:

This model will provide separate slopes for each combination of bgrp and psex. The model includes an interaction between bgrp and psex and then removes the intercept and the bgrp and psex main effects:

mymod2 <- lm(temp ~ bgrp + psex + bgrp * psex + tb - 1 - bgrp - psex)

Edit3:

If you are used to using SAS and try running the same analysis in SAS and R you might find that the two programs initially do not return the same estimates. That may be because SAS and R select different factor levels for the intercept by default. You can alter the default factor level for the intercept in R to match that used by SAS and then you might find that both programs give the same answer.

Compare the following R code with the SAS output from here:

http://support.sas.com/kb/38/384.html

where the SAS code makes use of the option 'solution':

my.data <- matrix(c(
'A', 'F',   9, 25,  
'A', 'F',   3, 19,  
'A', 'F',   4, 18,  
'A', 'F',  11, 28,  
'A', 'F',   7, 23,
'A', 'M',  11, 27,  
'A', 'M',   9, 24,  
'A', 'M',   9, 25,  
'A', 'M',  10, 28,  
'A', 'M',  10, 26,
'D', 'F',   4, 37,  
'D', 'F',  12, 54,  
'D', 'F',   3, 33,  
'D', 'F',   6, 41,  
'D', 'F',   9, 47,
'D', 'M',   5, 36,  
'D', 'M',   4, 36,  
'D', 'M',   7, 40,  
'D', 'M',  10, 46,  
'D', 'M',   8, 42,
'G', 'F',  10, 70,  
'G', 'F',  11, 75,  
'G', 'F',   7, 60,  
'G', 'F',   9, 69,  
'G', 'F',  10, 71,
'G', 'M',   3, 47,  
'G', 'M',   8, 60,  
'G', 'M',  11, 70,  
'G', 'M',   4, 49,  
'G', 'M',   4, 50
), nrow = 30, byrow=T, 
dimnames = list(NULL, c("drug","gender","x","y")));


my.data <- as.data.frame(my.data, stringsAsFactors=F)
my.data

my.data$y      <- as.numeric(my.data$y)
my.data$x      <- as.numeric(my.data$x)
my.data$drug   <- as.factor(my.data$drug)
my.data$gender <- as.factor(my.data$gender)

str(my.data)

my.data$drug   <- relevel(my.data$drug, ref="G")
my.data$gender <- relevel(my.data$gender, ref="M")



my.mod1 <- lm(my.data$y ~ my.data$drug)
my.mod1
summary(my.mod1)

my.mod2 <- lm(my.data$y ~ my.data$drug-1)
my.mod2
summary(my.mod2)

my.mod3 <- lm(my.data$y ~ my.data$drug + my.data$gender + 
                          my.data$drug * my.data$gender )
my.mod3
summary(my.mod3)

my.mod4 <- lm(my.data$y ~ my.data$drug + my.data$gender + 
                          my.data$drug * my.data$gender - 1 )
my.mod4
summary(my.mod4)

my.mod5 <- lm(my.data$y ~ my.data$drug + my.data$x + 
                          my.data$drug * my.data$x )
my.mod5
summary(my.mod5)

my.mod6 <- lm(my.data$y ~ my.data$drug + my.data$x + 
                          my.data$drug * my.data$x - 1 )
my.mod6
summary(my.mod6)

@ Mark Miller: Before I go through your answer, I would like to mention up front that my supervisor uses SAS and I am using R. I am learning to do these analyses in R and do not know anything about SAS...so now you can have some idea of the situation I am in. It is a challenge and I am pushing through. Anyway, I will go through your scripts and would get back to you when I need further help, if that is possible? — baz, Mar 12 '12 at 01:18
Feel free to ask a follow-up question if you wish. You can post it here on this board or use my email on my user page. I can try to help. However, if it is a statistics question you might get the best advice by posting here: http://stats.stackexchange.com/questions — Mark Miller, Mar 12 '12 at 07:07

solutions for factor levels when using linear models in R

1 Answers1