Residuals from first differenced regression on unbalanced panel

Question

I am trying to use plm to estimate a first differenced model on some unbalanced panel data. My model seems to work and I get coefficient estimates, but I want to know if there is a way to get the residual (or fitted value) per observation used.

I have run into two problems: I don't know how to attach residuals to the observations they are associated with, and I seem to get an incorrect number of residuals.

If I retrieve the residuals from the estimated model using model.name$residuals, I get a vector that is shorter than model.name$model.

require(plm)
X <- rnorm(14)
Y <- c(.4,1,1.5,1.3,1,4,5,6.5,7.3,3.7,5,.7,4,6)
Time <- rep(1:5,times=2)
Time <- c(Time, c(1,2,4,5))
ID <- rep(1:2,each=5)
ID <- c(ID,c(3,3,3,3))
TestData <- data.frame("Y"=Y,"X"=X,"ID"=ID,"Time"=Time)
model.name <- plm(Y~X,data=TestData,index = c("ID","Time"),model="fd")

> length(model.name$residuals)
[1] 11
> nrow(model.name$model)
[1] 14

(Note: ID=3 is missing an observation for t=3)

Looking at model.name$model I see it includes all observations, including t=1 for each member of ID. In the first differencing the t=1 observations will be removed, so in this case both IDs with all time periods should have 4 residuals from the remaining time periods. ID=3 should have a residual for t=2, none for t=3 as it is missing, none for t=4 as there is no value to difference (due to the missing t=3 value) and then a residual for t=5.

From this it seems that there should be 10 residuals, but I have 11. I would appreciate any help with why there are this many residuals, and how to connect residuals to the correct index (ID and Time).

Helix123 · Accepted Answer · 2019-02-22T14:10:59.120

The lagging done with model="fd" is based on the neighbouring rows, not the actual value of the time index. Thus, if you have non-consecutive time periods, this will give you unexpected results. To avoid this, do the differencing yourself while respecting the time period when lagging and estimate a pooling model. The unbalancedness of the data is not of concern here.
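The count of 11 versus the expected 10 can be reproduced without plm. The following is a minimal base R sketch, assuming only the ID/Time layout from the question; the helper names `n_row_based`, `n_time_based` and `has_prev` are illustrative and not part of plm.

```r
# Same panel layout as in the question: IDs 1 and 2 have t = 1..5, ID 3 lacks t = 3
ID   <- c(rep(1:2, each = 5), rep(3, 4))
Time <- c(rep(1:5, times = 2), 1, 2, 4, 5)

# Row-based differencing: every row except the first row of each ID gets a difference
n_row_based <- sum(duplicated(ID))

# Time-based differencing: a row gets a difference only if the same ID
# also has an observation at Time - 1
has_prev <- paste(ID, Time - 1) %in% paste(ID, Time)
n_time_based <- sum(has_prev)

n_row_based   # 11
n_time_based  # 10

# The row that row-based differencing uses but time-based differencing drops
# (ID 3, Time 4: it would be differenced against Time 2)
data.frame(ID, Time)[duplicated(ID) & !has_prev, ]
```

The one extra residual therefore comes from ID 3 at t=4, which is differenced against t=2 when neighbouring rows are used.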

Since version 1.7.0 of package plm, the lag() function performs lagging based on the value of the time period by default (the previous default was neighbouring rows). Use this function to do the lagging yourself.

Continuing your example:

pTestData <- pdata.frame(TestData, index=c("ID", "Time"))

# First difference: current value minus the previous period's value
pTestData$Y_diff <- pTestData$Y - plm::lag(pTestData$Y)
pTestData$X_diff <- pTestData$X - plm::lag(pTestData$X)
fdmod <- plm(Y_diff ~ X_diff, data = pTestData, model = "pooling")
length(residuals(fdmod)) # 10
nrow(fdmod$model)        # 10

I explicitly used plm:: when referring to the lag function, as several other packages have a lag function as well (most notably stats and dplyr) and you want to use the one from package plm here. To attach the residuals to the differenced data (the data actually used for computing the model), just do something like: dat <- cbind(fdmod$model, residuals(fdmod))
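To connect each residual to its ID and Time, one option is to pull the index from the fitted model. This is a sketch rather than a definitive recipe: it assumes plm >= 1.7.0 (time-based lag()), adds a `set.seed()` call that is not in the question so the run is reproducible, and the name `res_by_obs` is made up for illustration.

```r
library(plm)

set.seed(1)  # assumption: fixed seed, not part of the original example
TestData <- data.frame(
  Y    = c(.4, 1, 1.5, 1.3, 1, 4, 5, 6.5, 7.3, 3.7, 5, .7, 4, 6),
  X    = rnorm(14),
  ID   = c(rep(1:2, each = 5), rep(3, 4)),
  Time = c(rep(1:5, times = 2), 1, 2, 4, 5)
)
pTestData <- pdata.frame(TestData, index = c("ID", "Time"))

# Time-aware first differences: current value minus the previous period's value
pTestData$Y_diff <- pTestData$Y - plm::lag(pTestData$Y)
pTestData$X_diff <- pTestData$X - plm::lag(pTestData$X)
fdmod <- plm(Y_diff ~ X_diff, data = pTestData, model = "pooling")

# index() on the fitted model gives the ID/Time pairs of the rows actually used
idx <- index(fdmod)
res <- as.numeric(residuals(fdmod))
res_by_obs <- data.frame(
  ID       = idx[[1]],
  Time     = idx[[2]],
  residual = res,
  fitted   = as.numeric(fdmod$model[["Y_diff"]]) - res  # fitted = response - residual
)
res_by_obs
```

Each row of `res_by_obs` refers to the later period of the difference, so a row with Time 5 holds the residual for the change from t=4 to t=5.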

Also, you might be interested in the function is.pconsecutive to check whether the time periods in your data are consecutive for each individual:

is.pconsecutive(pTestData)
#    1     2     3 
# TRUE  TRUE FALSE

Function make.pconsecutive will make your data consecutive by inserting rows with NA values for the missing period.
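A short sketch of make.pconsecutive on the example data, assuming the same layout as in the question (the `set.seed()` call and the name `pCons` are additions for illustration):

```r
library(plm)

set.seed(1)  # assumption: fixed seed, not part of the original example
TestData <- data.frame(
  Y    = c(.4, 1, 1.5, 1.3, 1, 4, 5, 6.5, 7.3, 3.7, 5, .7, 4, 6),
  X    = rnorm(14),
  ID   = c(rep(1:2, each = 5), rep(3, 4)),
  Time = c(rep(1:5, times = 2), 1, 2, 4, 5)
)
pTestData <- pdata.frame(TestData, index = c("ID", "Time"))

# Insert an NA row for every missing period inside an individual's time span
pCons <- make.pconsecutive(pTestData)

nrow(pTestData)         # rows before
nrow(pCons)             # rows after: one more, for ID 3 at Time 3
is.pconsecutive(pCons)  # should now be TRUE for every ID
```

The inserted row carries NA for Y and X, so it only fills the gap in the time index and adds no information to an estimation.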

Thank you, this is very helpful and answers the question. This might be more appropriate as a new question, but do you have a suggestion as to how to resolve the same issue when using plm with "fd" and IVs, or when using pgmm? — Misophist, Sep 08 '16 at 05:18
