Here's an example showing that the function make.pbalanced from plm is much slower than a manual solution using a couple of lines of dplyr and tidyr.
I create a panel data set of 20,000 units over 21 years (1980-2000) and randomly drop 10% of rows. I then rebalance it two ways: first using make.pbalanced, and then via a "manual" solution using crossing, unique, and left_join.
The plm solution takes about 1.2 minutes, while the manual solution takes about 1.4 seconds.
I'm curious as to why make.pbalanced is so slow. If I push the number of individuals to 50,000, it crashes my computer. These are pretty small sizes for typical panel data sets (for example, longitudinal household surveys).
Here's the code:
library(tidyr)
library(dplyr)
library(plm)
#create a fully balanced panel data set: 20,000 ids x 21 years
id = 1:20000
year = 1980:2000
full_panel = crossing(id, year)
full_panel$x = rnorm(nrow(full_panel))
#take 90% sample to unbalance it
panel_missing = sample_n(full_panel, round(nrow(full_panel)*0.9), replace = FALSE)
#balance panel using make.pbalanced
start_time = Sys.time()
panel_balanced = pdata.frame(panel_missing, index = c("id", "year")) %>%
  make.pbalanced()
end_time = Sys.time()
print(end_time - start_time)
#balance panel manually
start_time = Sys.time()
id = unique(panel_missing$id)
year = unique(panel_missing$year)
panel_balanced = crossing(id, year) %>%
  left_join(panel_missing, by = c("id", "year"))
end_time = Sys.time()
print(end_time - start_time)
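
As an aside, tidyr's complete() looks like it can do the same expansion in a single call. The sketch below follows the same logic as the manual approach, and like it, only restores id/year combinations that still appear at least once in the sampled data:

#balance panel with tidyr::complete (one-call alternative to crossing + left_join)
start_time = Sys.time()
panel_balanced = panel_missing %>%
  complete(id, year)
end_time = Sys.time()
print(end_time - start_time)

Missing x values come back as NA, just as with the left_join approach.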