Here's an example showing that the function make.pbalanced from plm is much slower than a manual solution using a couple of lines of dplyr and tidyr.
I create a panel data set of 20,000 units over 21 years (1980-2000) and randomly drop 10% of rows. I then rebalance it two ways: first using make.pbalanced, and then via a "manual" solution using crossing, unique, and left_join.
The plm solution takes about 1.2 minutes, while the manual solution takes about 1.4 seconds.
I'm curious as to why make.pbalanced is so slow. If I push the number of individuals to 50,000, it crashes my computer. These are pretty small sizes for typical panel data sets (for example, longitudinal household surveys).
Here's the code:
library(tidyr)
library(dplyr)
library(plm)
#create a fully balanced panel data set: 20,000 ids x 21 years
id = 1:20000
year = 1980:2000
full_panel = crossing(id, year)
full_panel$x = rnorm(nrow(full_panel))
#take 90% sample to unbalance it
panel_missing = sample_n(full_panel, round(nrow(full_panel)*0.9), replace = FALSE)
#balance panel using make.pbalanced
start_time = Sys.time()
panel_balanced = pdata.frame(panel_missing, index = c("id", "year")) %>%
  make.pbalanced()
end_time = Sys.time()
print(end_time - start_time)
#balance panel manually
start_time = Sys.time()
id = unique(panel_missing$id)
year = unique(panel_missing$year)
panel_balanced = crossing(id, year) %>%
  left_join(panel_missing, by = c("id", "year"))
end_time = Sys.time()
print(end_time - start_time)
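
As an aside, tidyr's complete() looks like it can do the same expansion in a single call. The sketch below follows the same logic as the manual approach, and like it, only restores id/year combinations that still appear at least once in the sampled data:

#balance panel with tidyr::complete (one-call alternative to crossing + left_join)
start_time = Sys.time()
panel_balanced = panel_missing %>%
  complete(id, year)
end_time = Sys.time()
print(end_time - start_time)

Missing x values come back as NA, just as with the left_join approach.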