Issue:

The following code:

lm(mpg ~ factor(am), data = mtcars)

Should produce the following output (let's call it output 1), which I do get when running it from an R script in RStudio:

Call:
lm(formula = mpg ~ factor(am), data = mtcars)

Coefficients:
(Intercept)  factor(am)1  
     17.147        7.245  

However, I sometimes get a different output from this exact same code when I run it from an Rmd file (also in RStudio). Let's call this output 2:

Call:
lm(formula = mpg ~ factor(am), data = mtcars)

Coefficients:
(Intercept)  factor(am)1  
     20.770       -3.622  

If I go back to the R script after getting output 2 from the Rmd file, I keep getting output 2 instead of output 1. The only way to get output 1 back is to close RStudio and reopen only the R script.

Why does output 1 make sense to me?

17.147 is the average mpg of the group of cars with automatic transmission (reference group) and 7.245 is the increment in average mpg of the group of cars with manual transmission (which adds up to 24.392).

This can be confirmed with:

tapply(mtcars$mpg, mtcars$am, mean)

       0        1 
17.14737 24.39231 

What do I find weird about output 2?

Besides being a different result from the exact same command, 20.770 doesn't really tell me anything (I believe). It is close to the average mpg of the whole sample, but it is not exactly that. Adding -3.622 to 20.770 gives 17.147, which is the average mpg of the group of cars with automatic transmission, and adding 3.622 to 20.770 gives 24.392, which is the average mpg of the group of cars with manual transmission.

Output 2 differs from output 1, and I never know which one I am going to get, but I haven't seen a third variation.
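That arithmetic can be checked directly in base R. The following is only a minimal sketch; the variable names (`group_means`, `midpoint`, `half_gap`, `overall`) are illustrative and not part of the original post.

```r
# Group means of mpg by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Unweighted mean of the two group means (about 20.770)
midpoint <- mean(group_means)

# Half the gap between the two group means (about 3.622)
half_gap <- (group_means[["1"]] - group_means[["0"]]) / 2

# Mean over all 32 cars (about 20.091), which weights the larger group more
overall <- mean(mtcars$mpg)

round(c(midpoint = midpoint, half_gap = half_gap, overall = overall), 3)
```

So the 20.770 in output 2 appears to sit exactly halfway between the two group means, which is not the same as the whole-sample mean because the two groups have different sizes.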

Additional details:

I'm not loading any packages or running any additional commands in either case.
The mtcars dataset is the one included in base R.
I have R version 3.6.3, RStudio version 1.2.5033 and Windows 10.

Phil
LouJay
    Is your non-deterministic output based on `mtcars` itself, or are you using it merely to demonstrate what's happening with your other dataset? Lacking the whole picture, I wonder if it's something in your code that's inadvertently changing (in side-effect) the dataset in some way, where restarting R results in loading the unchanged dataset. Are you using `data.table`? `dplyr`? Any custom packages? C/C++ (`Rcpp`) code? Are you using `<<-` or `assign` anywhere in the rest of your code? – r2evans Apr 11 '20 at 03:42
    you may have some object loading in the rmd or in rstudio (under tools > global options > general, do you have "load Rdata at startup" checked?). for the first one, which is correct, you shouldn't have `mtcars` in your global environment, but if you changed mtcars and put it in your workspace, you would see "mtcars" after `ls()`. also, when you get the wrong answer, compare `summary(mtcars)` vs `summary(datasets::mtcars)` – rawr Apr 11 '20 at 03:46
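The checks suggested in the comments can be sketched in a couple of lines of base R. This is only an illustration of the idea, run in the session that gives the unexpected output; it is not something either commenter posted.

```r
# TRUE would mean a user-created object named mtcars sits in the workspace
# and masks the packaged dataset
exists("mtcars", envir = globalenv(), inherits = FALSE)

# TRUE means the mtcars that lm() picks up is identical to the packaged copy
identical(mtcars, datasets::mtcars)
```

If the second call returns `TRUE`, the data itself is unchanged and the cause has to be elsewhere, for example in a session option.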

1 Answer

Thanks for your comments, @r2evans and @rawr; they pointed me in the right direction. It turns out the Rmd file was changing the options in a setup chunk, {r setup, include=FALSE}. RStudio runs the chunk labeled setup automatically the first time I execute any chunk in that Rmd file. Because options() applies to the whole R session, the change also carried over to the R script until I restarted RStudio.

I hadn't noticed that, since the code with the issue was in a separate chunk, and that is the one I executed directly. I hadn't used / seen setup chunks before.

The specific code in that setup chunk that was causing the lm() function to behave differently with a factor variable was:

options(contrasts = c("contr.sum", "contr.poly"))

when the default is:

options(contrasts = c("contr.treatment","contr.poly"))
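Both outputs can be reproduced in one session. The sketch below is a minimal illustration, not the code from my Rmd file: the names `old`, `fit_sum` and `fit_trt` are made up for the example, and passing `contrasts =` to `lm()` is one possible way to pin the coding for a single model so it no longer depends on the session option.

```r
# Remember the current setting, then switch to sum-to-zero contrasts
old <- options(contrasts = c("contr.sum", "contr.poly"))
fit_sum <- lm(mpg ~ factor(am), data = mtcars)

# Restore whatever setting was in effect before
options(old)

# Pin treatment contrasts for this model only, whatever the session option is
fit_trt <- lm(mpg ~ factor(am), data = mtcars,
              contrasts = list("factor(am)" = "contr.treatment"))

round(coef(fit_sum), 3)  # same coefficients as output 2
round(coef(fit_trt), 3)  # same coefficients as output 1

# Inspect the session-wide setting when results look unexpected
getOption("contrasts")
```

With `contr.sum`, the intercept is the unweighted mean of the two group means and the coefficient is the first group's deviation from it, which is why the two numbers in output 2 still recover 17.147 and 24.392.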

Just in case this helps anyone. If you feel I need to edit the topic, question or answer for this to be helpful to someone else, let me know.

LouJay