Using gather from tidyr changes my regression results

Question

When I run the code below, everything works as expected

# install.packages("dynlm")
# install.packages("tidyr")
require(dynlm)
require(tidyr)


Time <- 1950:1993

Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658,  6698, 6740, 6931, 
       7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425,  9752, 9602, 9711, 
       10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336, 
       12568, 12903, 13029, 13093, 12899, 13110, 13391)

X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,  
       8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192, 
       11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890, 
       14005, 14101, 14003, 14279, 14341)

data <- data.frame(Time, Y, X)

data_ts <- ts(data, start = 1950, end = 1993, frequency = 1)

Modell <- dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + log(L(X, 3)) 
                     + log(L(X, 4)) + log(L(X, 5)), data = data_ts)
summary(Modell)

My summary output in this case is this

...        
              Estimate  Std. Error t value Pr(>|t|)    
(Intercept)  -0.059109   0.091926  -0.643    0.525    
log(X)        0.883020   0.145754   6.058 9.17e-07 ***
log(L(X))     0.004167   0.211420   0.020    0.984    
log(L(X, 2)) -0.092880   0.207026  -0.449    0.657    
log(L(X, 3)) -0.012016   0.210395  -0.057    0.955    
log(L(X, 4))  0.200596   0.212370   0.945    0.352    
log(L(X, 5))  0.014497   0.144103   0.101    0.920 
...

Now, when I use gather() to define a new data frame for some plots

data_tidyr <- gather(data, "Key", "Value", -Time)

and re-run the above code not changing anything else I get this summary:

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA

I am puzzled by this behaviour as the gather operations (defining a new data frame with columns gathered into rows) has nothing do to with the data set I am using to run my regression (at least this was my impression). Somehow using gather() changes the way calculation is done, but I cannot see how. Help would be much appreciated!

Some number:

"dynlm" version 0.3-3
R version: 3.2.0 (64 bit)

Update

Ok thank you for all the answers and comments so far, but the question remains: WHAT is going on in the environment? I want to know why and how this happens. To me this is something serious, since to my understanding avoiding non-intended side-effects of one function call on others is precicly what functional languages like R are trying to achieve. Now, unless I am missing something here, this behaviour seems to be at odds with that intention.

I also do not get the same results in the first summary output — rawr, Jun 02 '15 at 16:37
I get the first results but the `data_tidyr` is a very different data set. Give a look at it. — SabDeM, Jun 02 '15 at 16:42
@SabDeM yes its a different data set but since this data set is not in any way involved in the regression creating it should not have any impact on the result. Thats precicly whats puzzeling me! — Manuel R, Jun 02 '15 at 17:48
@ManuelS I am sorry there was a misunderstanding. I though the second data frame was used as an argument of the model. Now I can reproduce your code precisely and it puzzles me too. — SabDeM, Jun 02 '15 at 18:14
Can you give your r version and the version of the dynlm package? — Elin, Jun 02 '15 at 18:15
I can only seem to reproduce this if I overwrite the original `data_ts` with the tidied dataset (instead of naming it `data_tidyr`). — aosmith, Jun 02 '15 at 19:08
While I still can't reproduce this (even in R 3.2.0 - maybe add your session info?), I think removing `Time`, `Y`, and `X` from the environment will help you troubleshoot. For example, use `data <- data.frame(Time = 1950:1993, Y = c(5280, ...), X = c(6284, ...))` instead of creating each object individually (note the use of `=` inside `data.frame` to keep your variables within `data`). — aosmith, Jun 03 '15 at 15:25
@aosmith i tried that but unfortunately it didnt change anything. still the same issue... — Manuel R, Jun 03 '15 at 15:44

score 7 · Accepted Answer · answered Jun 04 '15 at 08:31

The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.

The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA  1  2

So the solution is fairly simple. Just overwrite the lag.default function yourself.

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928      

lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05669       0.82128       0.17484            NA  

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928

Thank you very much, this was very helpful! However I think this kind of unexpected side effect behaviour is in general kind of critical, although it is probably impossible to avoid completely. But anyway: thanks! — Manuel R, Jun 04 '15 at 08:56
I agree that this side effect is unfortunate. Especially since there seems to be no way to figure out if there is such an undesired side effect. I have since asked [this question](http://stackoverflow.com/q/30641471/2591234) about how to find methods masking. — shadow, Jun 04 '15 at 12:00

Elin · Answer 2 · 2015-06-03T09:27:56.317

-1

When I ran your first block of code in R 3.1.3 I got the results you are showing as your second set of results with this:

(R Version 3.1.3, dynlm version .3-3).

Time series regression with "ts" data:
Start = 1951, End = 1993

Call:
dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + 
    log(L(X, 3)) + log(L(X, 4)) + log(L(X, 5)), data = data_ts)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.030753 -0.006364  0.001321  0.007939  0.025982 

Coefficients: (4 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01419 on 40 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9972 
F-statistic:  7578 on 2 and 40 DF,  p-value: < 2.2e-16

However when I updated to R 3.2.0 I got a repeat of what you got initially but then settled back into always getting the second results.

Now from your later comments you are also getting the second results consistently. So I think it must be that it is either that there was a typo in the code at some point or that there is something about the first time this runs in an empty environment.

Based on that hypothesis I closed RStudio completely, restarted and ran the first codeblock. In that case I got your initial result again.

So I think the answer to your question has to be there is something odd going on in the environment.

I read the documentation for dynlm and there are a few places where the defaults (if they are coming into play) would cause differences potentially. For example it will take the variables from the environment if no data is specified. It will use either a timeseries object or a dataframe. In your case you have both in the environment (data and data_ts). If you notice on the summary output that I have above it says Time series regression with "ts" data: which means it is running with a ts object. When I am getting the other result (without the NAs, your first result) it says Time series regression with "numeric" data: and in that case it is running with on data which is a dataframe or X and Y directly. So I think that must be the source of the difference. I'm not sure why exactly that would have happened with data_ts explicitly named though.

edited Jun 03 '15 at 09:27

answered Jun 02 '15 at 16:27

Elin

6,507
3
25
47

Yes i got the same "Coefficients: (4 not defined because of singularities)". Still whatever way we look at it, the question remains: why is there a difference? – Manuel R Jun 02 '15 at 17:46
The difference between the first run where I have not yet executed the gather() command and the second run. Apparently you got the second result straight away, so there is no issue for you. Still isnt it odd that running one line of code that is (so it seems) totally unrelated to all the other lines suddently changes the results?? – Manuel R Jun 02 '15 at 17:59
Correct, it would be very odd if that were happening, but running your code as posted above does not produce two different results, it produces the same results. Copy and paste the above code into R (uncommenting the install if needed). The results are like those you are reporting as your second result. Then do the gather(). THen either rerun the dynlm() or rerun the whole thing. Same results. – Elin Jun 02 '15 at 18:08
This is at least interessting to know, since this is exactly what is NOT happening when I run it on my computer. That is precicly the strange thing. By the way: did you run my code in a fresh R session? – Manuel R Jun 02 '15 at 18:14
I was actually going to ask you to rerun making sure you clear anything that is Modell in your current environment. Also I wonder if there is a difference in versions of dynlm since I just downloaded? – Elin Jun 02 '15 at 18:21
I did that and everything should be up to date, since I set up my computer just 1 1/2 weeks ago. Actually the question i was asking was the result of a trial an error process to minimze the number of possible causes for the issue. I boiled it down to this line of code. – Manuel R Jun 02 '15 at 18:31
Suggestion: start with a clear environment and create your data. Make a copy of that data as data.chk. Then run the first dynlim(). THen run the second dynlim() on data and on data.chk.. See if the results differ. (DOn't run the gather() in between) – Elin Jun 02 '15 at 18:46
Ok sorry i missed something. The results are the same now :-) – Manuel R Jun 02 '15 at 19:06
Oh boy but I updated my R and now I'm getting your first results! – Elin Jun 02 '15 at 19:27
I really disagree as you can see by comparing results we are able to see that there is something going on and get an answer. It's not at all a critique or criticism of the question or a request for clarification. – Elin Jun 03 '15 at 01:42

Using gather from tidyr changes my regression results

Update

2 Answers2

Linked