use model object, e.g. panelmodel, to flag data used

Question

Is it possible in some way to use a fit object, specifically the regression object I get form a plm() model, to flag observations, in the data used for the regression, if they were in fact used in the regression. I realize this could be done my looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.

Let me illustrate my issue with a minimal working example,

First some packages needed,

# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)

Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,

data("Grunfeld", package = "plm")
dta <- Grunfeld

now I create some semi-random missing values in my data object, dta

dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA

final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,

dta.p <- dta %>% group_by(firm, year)

Now to the regression

plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")

the results, using stargazer,

stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#>                  Dependent variable:    
#>              ---------------------------
#>                          inv            
#> ----------------------------------------
#> value                 0.114***          
#>                        (0.008)          
#>                                         
#> capital               0.237***          
#>                        (0.028)          
#>                                         
#> Constant             -47.962***         
#>                        (9.252)          
#>                                         
#> ----------------------------------------
#> Observations             178            
#> R2                      0.799           
#> Adjusted R2             0.797           
#> F Statistic   348.176*** (df = 2; 175)  
#> ===========================================
#> Note:        *p<0.1; **p<0.05; ***p<0.01

Say I know my data has 200 observations, and I want to find the 178 that was used in the regression.

I am speculating if there's some vector in the plm.reg I can (easily) use to crate a flag i my original data, dta, if this observation was used/not used, i.e. the semi-random missing values I created above. Maybe some broom like tool.

I imagine something like,

dta <- dta %>% valid_reg_obs(plm.reg)

The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,

dta %>% as_tibble()
#> # A tibble: 200 x 6
#>     firm  year   inv value capital plm.reg
#>  * <int> <int> <dbl> <dbl>   <dbl> <lgl>  
#>  1     1  1935   318  3078    2.80 T      
#>  2     1  1936   392  4662   52.6  T      
#>  3     1  1937    NA  5387  157    F      
#>  4     1  1938    NA  2792  209    F      
#>  5     1  1939    NA  4313  203    F      
#>  6     1  1940    NA  4644  207    F      
#>  7     1  1941    NA  4551  255    F      
#>  8     1  1942    NA  3244  304    F      
#>  9     1  1943    NA  4054  264    F      
#> 10     1  1944    NA  4379  202    F      
#> # ... with 190 more rows

Update, I tried to use broom's augment(), but unforunatly it gave me the error message I had hoped would create some flag,

# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta) 
#> Error in data.frame(..., check.names = FALSE) : 
#>   arguments imply differing number of rows: 200, 178

In case you do not need the link to your original data, just look at `plm.reg$model` to get the observations used in estimation. — Helix123, Feb 20 '18 at 20:38

score 1 · Accepted Answer · edited Feb 20 '18 at 07:38

The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:

library(tidyverse)

dta.p %>% 
as.data.frame %>% 
rowid_to_column %>% 
mutate(plm.reg = rowid %in% names(plm.reg$residuals))

for people who use the class pdata.frame() to create an index attribute that describes its individual and time dimensions, you can us the following code, this is from another Baltagi in the ?plm,

# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")

pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>% 
            as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#>    state       year  region reg.re
#>    <fct>       <fct> <fct>  <lgl> 
#>  1 CONNECTICUT 1970  1      T     
#>  2 CONNECTICUT 1971  1      T     
#>  3 CONNECTICUT 1972  1      T     
#>  4 CONNECTICUT 1973  1      T     
#>  5 CONNECTICUT 1974  1      T     
#>  6 CONNECTICUT 1975  1      T     
#>  7 CONNECTICUT 1976  1      T     
#>  8 CONNECTICUT 1977  1      T     
#>  9 CONNECTICUT 1978  1      T     
#> 10 CONNECTICUT 1979  1      T     
#> # ... with 806 more rows

finally, if you are running the first Baltagi without index attributes, i.e. unmodified example from the help file, the code should be,

Grunfeld %>% rowid_to_column %>% 
    mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()

Thanks, David. This is exactly what I hope to find. Thanks again! — Eric Fail, Feb 20 '18 at 07:19

use model object, e.g. panelmodel, to flag data used

1 Answers1