I ran into a strange problem with the plm() function. Here is the code:

library(data.table)
library(tidyverse)
library(plm)


#Data Generation
n <- 500
set.seed(75080)

z   <- rnorm(n)
w   <- rnorm(n)
x   <- 5*z + 50
y   <- -100*z+ 1100 + 50*w
y   <- 10*round(y/10)
y   <- ifelse(y<200,200,y)
y   <- ifelse(y>1600,1600,y)
dt1 <- data.table('id'=1:500,'sat'=y,'income'=x,'group'=rep(1,n))

z   <- rnorm(n)
w   <- rnorm(n)
x   <- 5*z + 80
y   <- -80*z+ 1200 + 50*w
y   <- 10*round(y/10)
y   <- ifelse(y<200,200,y)
y   <- ifelse(y>1600,1600,y)
dt2 <- data.table('id'=501:1000,'sat'=y,'income'=x,'group'=rep(2,n))

z   <- rnorm(n)
w   <- rnorm(n)
x   <- 5*z + 30
y   <- -120*z+ 1000 + 50*w
y   <- 10*round(y/10)
y   <- ifelse(y<200,200,y)
y   <- ifelse(y>1600,1600,y)
dt3 <- data.table('id'=1001:1500,'sat'=y,'income'=x,'group'=rep(3,n))

dtable <- merge(dt1    ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)


# Model 
dtable_p <- pdata.frame(dtable, index = "group")

mod_1 <- plm(sat ~ income, data = dtable_p,model = "pooling")

Error in `[.data.frame`(x, , which) : undefined columns selected

I have checked everything I can think of, but I cannot figure out why this fails. The column names are correct, so why does R report undefined columns? Thank you!

Follow-up: I am adding a second test with the Produc data set that @StupidWolf used to confirm the cause:

data("Produc", package = "plm")
form <- log(gsp) ~ log(pc) 
Produc$group <-  Produc$region
pProduc <- pdata.frame(Produc, index = "group")

Produc$group <- rep(1:48, each = 17)

summary(plm(form, data = pProduc, model = "pooling"))
>Error in `[.data.frame`(x, , which) : undefined columns selected
Helix123
Steve
  • It seems `plm` is expecting a data.frame for the `data`. `plm(sat ~ income, data = as.data.frame(dtable_p), model = "pooling")` should work. – kangaroo_cliff Nov 25 '19 at 00:05
  • No, it doesn't work... – Steve Nov 25 '19 at 00:23
  • It did work for me; there were warning messages, but that's related to the data. – kangaroo_cliff Nov 25 '19 at 00:51
  • I got this error message: `Error in cor(y, haty) : 'x' must be numeric`. Usually there is no need to convert the data set to a data.frame for plm(), and I don't know why it fails only for this one. Every other data set I tested works. Weird... – Steve Nov 25 '19 at 00:54
  • Running `summary(mod_1)` throws another error: `Error in cor(y, haty) : 'x' must be numeric` – Steve Nov 25 '19 at 00:58
  • This is fixed in plm version 2.2-2 on CRAN. – Helix123 Feb 22 '20 at 10:25

1 Answer

This is extremely weird, but the answer is that the index must not be named "group".

I suspect that somewhere in the plm function a "group" column is being added to your data.frame.
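
As a side note on where the message itself comes from: "undefined columns selected" is what base R's `[.data.frame` raises when it is asked for a column name that is not present. The sketch below only reproduces that message on a toy data.frame (the names `toy` and `msg` are made up for illustration); it does not show what plm does internally.

```r
# Toy data.frame with a single column "a" (hypothetical example data)
toy <- data.frame(a = 1:3)

# Ask for a column that does not exist and capture the error message
msg <- tryCatch(
  toy[, "b"],
  error = function(e) conditionMessage(e)
)

print(msg)  # in an English locale: "undefined columns selected"
```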

We can reproduce this with the example dataset:

data("Produc", package = "plm")
form <- log(gsp) ~ log(pc) 
Produc$group = Produc$region
pProduc <- pdata.frame(Produc, index = c("group"))
summary(plm(form, data = pProduc, model = "random"))
Error in `[.data.frame`(x, , which) : undefined columns selected

Using the "region" column from which I copied, it works:

pProduc <- pdata.frame(Produc, index = c("region"))
summary(plm(form, data = pProduc, model = "random"))

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = form, data = pProduc, model = "random")

Unbalanced Panel: n = 9, T = 51-136, N = 816

Effects:
                  var std.dev share
idiosyncratic 0.03691 0.19213 0.402
individual    0.05502 0.23457 0.598
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8861  0.9012  0.9192  0.9157  0.9299  0.9299 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.68180 -0.11014  0.00977 -0.00039  0.13815  0.45491 

Coefficients:
             Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept) -1.099088   0.138395  -7.9417 1.994e-15 ***
log(pc)      1.100102   0.010623 103.5627 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    459.71
Residual Sum of Squares: 30.029
R-Squared:      0.93468
Adj. R-Squared: 0.9346
Chisq: 11647.6 on 1 DF, p-value: < 2.22e-16

For your example, just rename the "group" column and also convert it to a factor to avoid the other errors (for "pooling" it should be treated as categorical, not numeric).

dtable <- merge(dt1    ,dt2, all=TRUE)
dtable <- merge(dtable ,dt3, all=TRUE)
dtable$group = factor(dtable$group)
colnames(dtable)[4] = "GROUP"
dtable_p <- pdata.frame(dtable, index = "GROUP")
summary(plm(sat ~ income, data = dtable_p, model = "pooling"))
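
If you need to do this in more than one place, the rename-and-factor step can be wrapped in a small helper. This is only a sketch in base R: `rename_reserved_index`, its arguments, and the `"_id"` suffix are hypothetical names of my own, not part of plm, and treating `"group"` as the only problematic name is an assumption based on the behaviour seen above.

```r
# Hypothetical helper (not part of plm): if the index column has a name that
# is assumed to be problematic, rename it, then convert the index to a factor.
rename_reserved_index <- function(df, index, reserved = "group", suffix = "_id") {
  df <- as.data.frame(df)  # also accepts a data.table
  if (index %in% reserved) {
    new_name <- paste0(index, suffix)
    stopifnot(!new_name %in% names(df))  # do not overwrite an existing column
    names(df)[names(df) == index] <- new_name
    index <- new_name
  }
  df[[index]] <- factor(df[[index]])
  list(data = df, index = index)
}

# Small made-up data set with the same column names as in the question
toy <- data.frame(
  id     = 1:6,
  sat    = c(1100, 1050, 1200, 1250, 1000, 950),
  income = c(50, 52, 80, 79, 30, 31),
  group  = rep(1:3, each = 2)
)

fixed <- rename_reserved_index(toy, index = "group")
print(fixed$index)        # "group_id"
print(names(fixed$data))

# Intended use with plm (left commented so the sketch runs without plm):
# dtable_p <- plm::pdata.frame(fixed$data, index = fixed$index)
# summary(plm::plm(sat ~ income, data = dtable_p, model = "pooling"))
```

According to the comments, the underlying problem is fixed in plm 2.2-2, so a check such as `utils::packageVersion("plm") >= "2.2-2"` can tell you whether the workaround is still needed.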
StupidWolf
  • OMG~ You got the answer! I tested again with your data set and it confirms you are correct. The word _group_ must be a keyword in the source code. I have added my code for the data you used (17-period panel data) to the question. Thank you! – Steve Nov 25 '19 at 16:05
  • No problem :) It must have been really frustrating for you. This is not the first package I have seen with this kind of (scary) bug. I had to try it with the working example to be sure. – StupidWolf Nov 25 '19 at 16:14
  • This is fixed in plm version 2.2-2 on CRAN. – Helix123 Feb 22 '20 at 10:25