3

Suppose I have the following unbalance pandel data:

unbalanced.panel = structure(list(firm = c("A", "A", "A", "A", "B", "B", "A", "A", 
"B", "C", "C"), ind = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1), year = c(2010, 
2011, 2012, 2013, 2011, 2013, 2011, 2012, 2010, 2012, 2013), 
    charac1 = c("x", "x", "x", "x", "y", "y", "z", "z", "g", 
    "h", "h"), var1 = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 
    2)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"))
   firm    ind  year charac1  var1
   <chr> <dbl> <dbl> <chr>   <dbl>
 1 A         1  2010 x          11
 2 A         1  2011 x          12
 3 A         1  2012 x          13
 4 A         1  2013 x          14
 5 B         2  2011 y          15
 6 B         2  2013 y          18
 7 A         2  2011 z          15
 8 A         2  2012 z          29
 9 B         1  2010 g          31
10 C         1  2012 h          13
11 C         1  2013 h           2

Where each unique group (individual) is identified by the combination of firm and ind, i.e. individual "A1" is different from "A2" individual. And the time index is given by year variable.

What I want is to balance the panal data (index = (individual = firm-ind, time = year)) filling the implicit missing gaps with NA's.

The desired result is as follows:

   firm    ind  year charac1  var1
   <chr> <dbl> <dbl> <chr>   <dbl>
 1 A         1  2010 x          11
 2 A         1  2011 x          12
 3 A         1  2012 x          13
 4 A         1  2013 x          14
 5 B         2  2010 y          NA
 6 B         2  2011 y          15
 7 B         2  2012 y          NA
 8 B         2  2013 y          18
 9 A         2  2010 z          NA
10 A         2  2011 z          15
11 A         2  2012 z          29
12 A         2  2013 z          NA
13 B         1  2010 g          31
14 B         1  2011 g          NA
15 B         1  2012 g          NA
16 B         1  2013 g          NA
17 C         1  2010 h          NA
18 C         1  2011 h          NA
19 C         1  2012 h          13
20 C         1  2013 h           2

I tried to use plm::make.pbalanced(unbalanced.panel, balance.type = "fill") but I got the following error:

Error in mode<-(tmp, value = id_orig_typeof) : invalid to change the storage mode of a factor

I even tried to used tidyr::complete(), but it doesn't help to achieve my desired balanced panel.

These are my request: When an unique individual (firm-ind) is missing a year row, time-variant variables (var1) must be filled with NA, but time-invariant variables such as characteristics (charact1) should be filled with the unique value.

What's the problem with tidyr::complete() approach? It doesn't allow me to differentiate between time-invariant and time-variant variables to be filled or expanded. And it doesn't identify uniquely individual-index (in this case firm-ind).

unbalanced.panel >%>
  tidyr::complete(firm, year, nesting(var1))

Above code makes appear a new individual "C2" and fills with NA the time-invariant variables.

Helix123
  • 3,502
  • 2
  • 16
  • 36
Cristhian
  • 361
  • 3
  • 12

2 Answers2

4

We can use complete from the tidyr package. The key is to set nesting properly.

library(dplyr)
library(tidyr)

balanced.panel <- unbalanced.panel %>%  
  complete(nesting(firm, ind, charac1), year = full_seq(year, period = 1))
balanced.panel
# # A tibble: 20 x 5
#  firm    ind charac1  year  var1
#  <chr> <dbl> <chr>   <dbl> <dbl>
#  1 A         1 x        2010    11
#  2 A         1 x        2011    12
#  3 A         1 x        2012    13
#  4 A         1 x        2013    14
#  5 A         2 z        2010    NA
#  6 A         2 z        2011    15
#  7 A         2 z        2012    29
#  8 A         2 z        2013    NA
#  9 B         1 g        2010    31
# 10 B         1 g        2011    NA
# 11 B         1 g        2012    NA
# 12 B         1 g        2013    NA
# 13 B         2 y        2010    NA
# 14 B         2 y        2011    15
# 15 B         2 y        2012    NA
# 16 B         2 y        2013    18
# 17 C         1 h        2010    NA
# 18 C         1 h        2011    NA
# 19 C         1 h        2012    13
# 20 C         1 h        2013     2
www
  • 38,575
  • 12
  • 48
  • 84
  • 1
    How can I use ```full_seq(year, period = 1)``` when year is a ```Date``` class? I mean, in my case year is a yearly Date variable with "%Y-%m-%d" format, using ```full_seq(year, period = 1)``` doesn't fill the complete date window. Besides, I'd like to generalize this approach with other frequency Date variable, such as daily, monthly or quaterly data. Thanks – Cristhian Feb 24 '20 at 01:53
  • 1
    You did not show your year column as a date class in your example, so it is difficult to answer. You can always use seq.Date to generate the date sequence you want. It does not need to use full_seq. – www Feb 24 '20 at 02:26
  • Had the same problem as @Cristhian, but my year variable was an "numeric". – Jorge Paredes Sep 09 '21 at 03:27
2

plm uses two dimensions for panel data (individual, time). First, make a variable that reflects the individual dimension by combining the two variables you have to refer to an individual, let us call this variable idvar. For conciseness, let us call the data set u. For plm's data manipulating functions, it is easier to work on a pdata.frame. Combining all those hints in code, use something like the following:

u <- unbalanced.panel
u$idvar <- paste(u$firm, u$ind)
pu <- pdata.frame(u, index = c("idvar", "year"))
make.pbalanced(pu, balance.type = "fill")

#         firm ind year charac1 var1 idvar
# A 1-2010    A   1 2010       x   11   A 1
# A 1-2011    A   1 2011       x   12   A 1
# A 1-2012    A   1 2012       x   13   A 1
# A 1-2013    A   1 2013       x   14   A 1
# A 2-2010 <NA>  NA 2010    <NA>   NA   A 2
# A 2-2011    A   2 2011       z   15   A 2
# A 2-2012    A   2 2012       z   29   A 2
# A 2-2013 <NA>  NA 2013    <NA>   NA   A 2
# B 1-2010    B   1 2010       g   31   B 1
# B 1-2011 <NA>  NA 2011    <NA>   NA   B 1
# B 1-2012 <NA>  NA 2012    <NA>   NA   B 1
# B 1-2013 <NA>  NA 2013    <NA>   NA   B 1
# B 2-2010 <NA>  NA 2010    <NA>   NA   B 2
# B 2-2011    B   2 2011       y   15   B 2
# B 2-2012 <NA>  NA 2012    <NA>   NA   B 2
# B 2-2013    B   2 2013       y   18   B 2
# C 1-2010 <NA>  NA 2010    <NA>   NA   C 1
# C 1-2011 <NA>  NA 2011    <NA>   NA   C 1
# C 1-2012    C   1 2012       h   13   C 1
# C 1-2013    C   1 2013       h    2   C 1
Helix123
  • 3,502
  • 2
  • 16
  • 36
  • Then, how would you fill the NA gererated in charac1, firm, ind? I am trying `fill()` but get's me back to the unbalanced panel. – Jorge Paredes Sep 09 '21 at 03:54
  • 1
    You can use the usual data frame insert methods (in conjunction with subsetting) of base R. They work also for a pdata.frame. `fill()` is not from base R, not from plm. Likely you mean `tidyr::fill()` - I do not think it is aware of the special pdata.frame structure and it might destroy what you achieved; check its documentation (I haven't). You can convert the pdata.frame in a data frame by `plm::as.data.frame` and then work with anything that works on a data frame. – Helix123 Sep 09 '21 at 06:36