1

I am using R and have a long data set as the one outlined below:

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      0
2015-11-01     14      0
2016-01-01     14      0
...

My goal is to create a "balanced" data i.e. each ID should occur for each of the 10 dates. The variable "Status" for the initially non-occurring observations should be labeled as N/A. In other words, the outcome should look like this:

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      N/A
2015-09-01     13      N/A
2015-11-01     13      N/A
2016-01-01     13      N/A
2016-05-01     13      N/A
2016-08-01     13      N/A
2017-03-01     13      N/A
2017-05-01     13      N/A
2014-10-01     14      N/A
2015-04-01     14      N/A
2015-07-01     14      N/A
2015-09-01     14      N/A
2015-11-01     14      0
2016-01-01     14      0
2016-05-01     14      N/A
2016-08-01     14      N/A
2017-03-01     14      N/A
2017-05-01     14      N/A
...

Thank you for your help!

Neicooo
  • 197
  • 1
  • 9
  • some inspiration: [*Merge Panel data to get balanced panel data*](https://stackoverflow.com/q/35610652/2204410) – Jaap Feb 17 '18 at 11:14
  • My idea would be to create another data set with the 10 dates times the number of observations of the original data. Merge the data sets to obtain the desired output. Not sure how that would have to be done in terms of code. – Neicooo Feb 17 '18 at 12:44

2 Answers2

1

Here is an approach using tidyverse:

library(tidyverse)
df %>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df) -> df1 #join the original data frame and save to object df1

or save to original object (thanks to Renu's comment):

df %<>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df)

equivalent is:

df %>%
 group_by(ID) %>%
 expand(Date) %>% #in each id expand the dates
 left_join(df) -> df

The result:

   ID       Date Status
1  12 2014-10-01      1
2  12 2015-04-01      1
3  12 2015-07-01      1
4  12 2015-09-01      1
5  12 2015-11-01      0
6  12 2016-01-01      0
7  12 2016-05-01      0
8  12 2016-08-01      1
9  12 2017-03-01      1
10 12 2017-05-01      1
11 13 2014-10-01      1
12 13 2015-04-01      1
13 13 2015-07-01      0
14 13 2015-09-01     NA
15 13 2015-11-01     NA
16 13 2016-01-01     NA
17 13 2016-05-01     NA
18 13 2016-08-01     NA
19 13 2017-03-01     NA
20 13 2017-05-01     NA
21 14 2014-10-01     NA
22 14 2015-04-01     NA
23 14 2015-07-01     NA
24 14 2015-09-01     NA
25 14 2015-11-01      0
26 14 2016-01-01      0
27 14 2016-05-01     NA
28 14 2016-08-01     NA
29 14 2017-03-01     NA
30 14 2017-05-01     NA

the data:

> dput(df)
structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L, 5L, 6L), .Label = c("2014-10-01", "2015-04-01", 
"2015-07-01", "2015-09-01", "2015-11-01", "2016-01-01", "2016-05-01", 
"2016-08-01", "2017-03-01", "2017-05-01"), class = "factor"), 
    ID = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
    13L, 13L, 13L, 14L, 14L), Status = c(1L, 1L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Date", 
"ID", "Status"), class = "data.frame", row.names = c(NA, -15L
))
missuse
  • 19,056
  • 3
  • 25
  • 47
  • Thanks, the code runs fine, but nothing changes in my data set. Am I missing anything? – Neicooo Feb 17 '18 at 16:40
  • @Neicooo you need to assign it to an object, check edit. You can assign to a new object like `df1` or rewrite the original object. – missuse Feb 17 '18 at 17:12
  • 1
    @Neicooo You could also change the first `%>%` to `%<>%` – IceCreamToucan Feb 17 '18 at 17:26
  • Now, I receive the new object, but it is an exact copy of the original one with around 17k observations. It seems the expansion of creating the missing rows is not working. – Neicooo Feb 18 '18 at 15:06
  • @Neicooo I am confidant there is some misunderstanding between my answer and your interpretation but I can not pin point it without an additional data example since your expected output matches mine. Could you try to replicate the example from my answer, Does it work, do your receive the same result as I do? – missuse Feb 18 '18 at 15:26
0

The following worked for me:

df_b <- data.frame(date = rep(unique(df$date), length(unique(df$id))),
               id = rep(unique(df$id), each = length(unique(df$date))))

balanced_data <- left_join(df_b, df)
Neicooo
  • 197
  • 1
  • 9