Create balanced data set

Question

I am using R and have a long data set as the one outlined below:

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      0
2015-11-01     14      0
2016-01-01     14      0
...

My goal is to create a "balanced" data i.e. each ID should occur for each of the 10 dates. The variable "Status" for the initially non-occurring observations should be labeled as N/A. In other words, the outcome should look like this:

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      N/A
2015-09-01     13      N/A
2015-11-01     13      N/A
2016-01-01     13      N/A
2016-05-01     13      N/A
2016-08-01     13      N/A
2017-03-01     13      N/A
2017-05-01     13      N/A
2014-10-01     14      N/A
2015-04-01     14      N/A
2015-07-01     14      N/A
2015-09-01     14      N/A
2015-11-01     14      0
2016-01-01     14      0
2016-05-01     14      N/A
2016-08-01     14      N/A
2017-03-01     14      N/A
2017-05-01     14      N/A
...

Thank you for your help!

some inspiration: [*Merge Panel data to get balanced panel data*](https://stackoverflow.com/q/35610652/2204410) — Jaap, Feb 17 '18 at 11:14
My idea would be to create another data set with the 10 dates times the number of observations of the original data. Merge the data sets to obtain the desired output. Not sure how that would have to be done in terms of code. — Neicooo, Feb 17 '18 at 12:44

missuse · Answer 1 · 2018-02-17T17:30:44.277

Here is an approach using tidyverse:

library(tidyverse)
df %>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df) -> df1 #join the original data frame and save to object df1

or save to original object (thanks to Renu's comment):

df %<>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df)

equivalent is:

df %>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df) -> df

The result:

   ID       Date Status
1  12 2014-10-01      1
2  12 2015-04-01      1
3  12 2015-07-01      1
4  12 2015-09-01      1
5  12 2015-11-01      0
6  12 2016-01-01      0
7  12 2016-05-01      0
8  12 2016-08-01      1
9  12 2017-03-01      1
10 12 2017-05-01      1
11 13 2014-10-01      1
12 13 2015-04-01      1
13 13 2015-07-01      0
14 13 2015-09-01     NA
15 13 2015-11-01     NA
16 13 2016-01-01     NA
17 13 2016-05-01     NA
18 13 2016-08-01     NA
19 13 2017-03-01     NA
20 13 2017-05-01     NA
21 14 2014-10-01     NA
22 14 2015-04-01     NA
23 14 2015-07-01     NA
24 14 2015-09-01     NA
25 14 2015-11-01      0
26 14 2016-01-01      0
27 14 2016-05-01     NA
28 14 2016-08-01     NA
29 14 2017-03-01     NA
30 14 2017-05-01     NA

the data:

> dput(df)
structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L, 5L, 6L), .Label = c("2014-10-01", "2015-04-01", 
"2015-07-01", "2015-09-01", "2015-11-01", "2016-01-01", "2016-05-01", 
"2016-08-01", "2017-03-01", "2017-05-01"), class = "factor"), 
    ID = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
    13L, 13L, 13L, 14L, 14L), Status = c(1L, 1L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Date", 
"ID", "Status"), class = "data.frame", row.names = c(NA, -15L
))

Thanks, the code runs fine, but nothing changes in my data set. Am I missing anything? — Neicooo, Feb 17 '18 at 16:40
@Neicooo you need to assign it to an object, check edit. You can assign to a new object like `df1` or rewrite the original object. — missuse, Feb 17 '18 at 17:12
Now, I receive the new object, but it is an exact copy of the original one with around 17k observations. It seems the expansion of creating the missing rows is not working. — Neicooo, Feb 18 '18 at 15:06
@Neicooo I am confidant there is some misunderstanding between my answer and your interpretation but I can not pin point it without an additional data example since your expected output matches mine. Could you try to replicate the example from my answer, Does it work, do your receive the same result as I do? — missuse, Feb 18 '18 at 15:26

score 0 · Accepted Answer · answered Feb 18 '18 at 15:15

0

The following worked for me:

df_b <- data.frame(date = rep(unique(df$date), length(unique(df$id))),
               id = rep(unique(df$id), each = length(unique(df$date))))

balanced_data <- left_join(df_b, df)

answered Feb 18 '18 at 15:15

Neicooo

197
1
9

Create balanced data set

2 Answers2