R: How to identify mistake in bivariate linear regression in r

Question

I am new to linear regression. I am running a simple linear regression on two variables with lm. The issue is that I am getting different results. I have done the coding twice to see if the model's output is the same. It isn't, suggesting I have made a mistake in one of the attempts.

The output of my first attempt shows 12 elements in the data overview in the Environment tab. The output of my second attempt shows 14 elements there.

How do I know which output is the right one? Is it the first attempt, since its DV values are 1-7, whereas in the second attempt the DV also includes the value -1, which would be wrong because that code cannot be included in an interval-level variable?

How do I go about identifying the mistake? I noticed the difference in the number of elements in the data overview and started by looking at what differs. I can't see anything, but I am guessing that it has to do with the values -1 and -999. Is this a good place to start? Are there other, better ways?

Many thanks for helping me understand!

Here is the code for my first attempt:

    > reg <-lm(immig.view~edu.degree.level,df1)
    > reg

    Call:
    lm(formula = immig.view ~ edu.degree.level, data = df1)

    Coefficients:
         (Intercept)  edu.degree.level
              4.1734            0.3464

    > dput(head(df1,10))
    structure(list(edu.degree.level = c(1L, 0L, 1L, 1L, 0L, 1L, 1L,
    1L, 1L, 0L), immig.view = structure(c(7, 4, 5, 1, 7, 5, 7, 1,
    3, 1), label = "J1 Do you think immigration is good or bad for Britain's economy?", labels = c(`Not stated` = -999,
    `Don`t know` = -1, `1 Bad for economy` = 1, `2` = 2, `3` = 3,
    `4` = 4, `5` = 5, `6` = 6, `7 Good for economy` = 7), class = "haven_labelled")), row.names = c(NA,
    10L), class = "data.frame")

Here is the code for my second attempt:

    > reg <-lm(immig.view~edu.degree.level,df1)
    > reg

    Call:
    lm(formula = immig.view ~ edu.degree.level, data = df1)

    Coefficients:
                      (Intercept)  edu.degree.levelwithoutdegree
                           4.5198                        -0.3431

    > dput(head(df1,10))
    structure(list(edu.degree.level = structure(c(1L, 2L, 1L, 1L,
    2L, 1L, 1L, 1L, 1L, 2L), .Label = c("withdegree", "withoutdegree"
    ), class = "factor"), immig.view = structure(c(7, 4, 5, 1, 7,
    5, 7, 1, 3, 1), label = "J1 Do you think immigration is good or bad for Britain's economy?", labels = c(`Not stated` = -999,
    `Don`t know` = -1, `1 Bad for economy` = 1, `2` = 2, `3` = 3,
    `4` = 4, `5` = 5, `6` = 6, `7 Good for economy` = 7), class = "haven_labelled")), row.names = c(NA,
    10L), class = "data.frame")

Thanks again.

score 1 · Accepted Answer · answered Apr 08 '20 at 09:56

As I understand it, your problem is that you do not get the same coefficients in your two attempts at linear regression.

The reason for this is that the data you are doing your linear regressions on are different for each attempt.

Education data in first attempt:

    edu.degree.level = c(1L, 0L, 1L, 1L, 0L, 1L, 1L,
    1L, 1L, 0L)

Education data in second attempt:

    edu.degree.level = structure(c(1L, 2L, 1L, 1L,
    2L, 1L, 1L, 1L, 1L, 2L), .Label = c("withdegree", "withoutdegree"
    ), class = "factor")

Yet the response values (immig.view) are the same in both attempts.
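Since the question also asks about the -1 and -999 codes, here is a minimal sketch of one way to check whether such codes actually occur in the response and to set them to missing before fitting. The vector below is made up for illustration; only the code values (-999, -1, 1 to 7) come from the value labels in the question.

```r
# Hypothetical response vector using the codes from the question's labels:
# -999 = "Not stated", -1 = "Don't know", 1..7 = substantive answers
immig.view <- c(7, 4, -1, 1, 7, 5, -999, 1, 3, 1)

# Count how often each code occurs, including the special ones
table(immig.view)

# Recode the non-substantive codes to NA so that lm() drops those rows
immig.clean <- replace(immig.view, immig.view %in% c(-999, -1), NA)

range(immig.clean, na.rm = TRUE)  # 1 7
sum(is.na(immig.clean))           # 2
```

Running `table()` on the response in both attempts would show whether one of them still contains the special codes.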

The regression coefficients are the estimates that give the best-fitting line through your data, so if the data differ between attempts, the estimates will differ too.
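To see how the coding of the predictor alone can change the printed coefficients, the comparison can be sketched on simulated data. Everything below is an assumption made for illustration: the values are random, and the mapping 1 = "withdegree", 0 = "withoutdegree" is only inferred from the first rows of the two dput outputs.

```r
# Simulated data; column names mirror the question, values are made up
set.seed(1)
n <- 200
degree <- rbinom(n, 1, 0.5)              # assumed: 1 = with degree, 0 = without
view <- sample(1:7, n, replace = TRUE)   # substantive answers only

# Coding as in the first attempt: integer 0/1 dummy
df_int <- data.frame(edu.degree.level = as.integer(degree),
                     immig.view = view)

# Coding as in the second attempt: factor with "withdegree" as first level
df_fac <- data.frame(
  edu.degree.level = factor(ifelse(degree == 1, "withdegree", "withoutdegree"),
                            levels = c("withdegree", "withoutdegree")),
  immig.view = view
)

fit_int <- lm(immig.view ~ edu.degree.level, data = df_int)
fit_fac <- lm(immig.view ~ edu.degree.level, data = df_fac)

# Compare how the predictor is stored in each data frame
str(df_int$edu.degree.level)
str(df_fac$edu.degree.level)

# The slope changes sign because the reference group differs:
# 0 (without degree) in the first fit, "withdegree" in the second
coef(fit_int)
coef(fit_fac)

# Elements present in the factor fit but not in the integer fit
setdiff(names(fit_fac), names(fit_int))
```

In this sketch the two fits describe the same group means; only the reference category changes. A fit with a factor predictor also stores a `contrasts` element, and a fit that dropped rows with missing values stores an `na.action` element, so a different element count in the Environment tab is not by itself a sign of an error.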

answered Apr 08 '20 at 09:56

brendbech


Thank you. Yes, the mistake is in the data for the education variable; that's helpful. I'm unclear on what 1L, 0L and 2L mean, and so still unclear on which parts of the data are the wrong ones. Thanks again. – Honey Badger Apr 08 '20 at 10:50
As edu.degree.level is a dummy variable, does this mean that my first attempt is the correct one for the regression, since the values are coded 1 and 0? – Honey Badger Apr 08 '20 at 14:17
The L suffix in 0L, 1L and 2L marks the value as an integer: 0L is 0, 1L is 1, 2L is 2. I don't know your data, and therefore do not know what the 0, 1 and 2 represent, but I'm guessing that 0 stands for basic education, 1 stands for high school level and 2 is higher education or something like that. What is correct to use is up to you and should be based on your prior knowledge of the data. – brendbech Apr 08 '20 at 17:02
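A short sketch of what the L suffix and the integer codes in a dput output mean. The factor below is a hypothetical three-row example; its level names are copied from the second dput in the question.

```r
# The L suffix creates an integer instead of a double
class(1L)          # "integer"
class(1)           # "numeric"
1L == 1            # TRUE: same value
identical(1L, 1)   # FALSE: different storage type

# A factor stores integer codes plus a vector of labels (levels)
edu <- factor(c("withdegree", "withoutdegree", "withdegree"),
              levels = c("withdegree", "withoutdegree"))
as.integer(edu)    # 1 2 1
levels(edu)        # "withdegree" "withoutdegree"

# Cross-tabulating shows which label each code stands for
table(code = as.integer(edu), label = edu)
```

So in a dput of a factor, 1L and 2L are positions in the `.Label` vector rather than measured values.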
