2

I have an unbalanced panel like the following example:

test <- read.table(
text = "
A   2010-01-01  1   rdm
A   2010-01-10  2   dfg
A   2010-01-14  3   fdgfd
A   2010-02-15  4   fdgfd
A   2010-08-17  5   dg
A   2010-12-19  6   dfg
B   2009-01-01  1   dfg
B   2010-01-01  2   ydg
B   2010-01-10  3   fdgfd
B   2010-01-14  4   dfg
B   2010-02-15  5   dfg
",header=F)
library(data.table)
setDT(test)
names(test) <-  c("ID", "date", "nr", "namecol")

I would like to balance it with regard to date, i.e. every individual (A, B, etc.) has NA rows for the dates where there is no data. I do not know the minimum date per group or the minimum date across groups to begin with. Same with the maximum, but maybe it is faster to just choose a maximum which equals a certain date (as compared to calculating it across groups). The desired output is:

out <- read.table(
text = "
A   2009-01-01  NA  NA
A   2010-01-01  1   rdm
A   2010-01-10  2   dfg
A   2010-01-14  3   fdgfd
A   2010-02-15  4   fdgfd
A   2010-08-17  5   dg
A   2010-12-19  6   dfg
B   2009-01-01  1   dfg
B   2010-01-01  2   ydg
B   2010-01-10  3   fdgfd
B   2010-01-14  4   dfg
B   2010-02-15  5   dfg
B   2010-08-17  NA  NA
B   2010-12-19  NA  NA
",header=F)
setDT(out)
names(out) <-  c("ID", "date", "nr", "namecol")

My data set is very large, so I believe it would be best to do this in data.table (or plyr, reshape2) or something similar suitable.

user3032689
  • 627
  • 1
  • 10
  • 23

1 Answers1

5

We do a cross join (CJ) with unique 'ID', and 'date' of the dataset after setting the key columns as 'ID' and 'date' and then do a join with the original dataset.

setDT(test, key = c("ID", "date"))[CJ(ID, date, unique=TRUE)]
#    ID       date nr namecol
# 1:  A 2009-01-01 NA      NA
# 2:  A 2010-01-01  1     rdm
# 3:  A 2010-01-10  2     dfg
# 4:  A 2010-01-14  3   fdgfd
# 5:  A 2010-02-15  4   fdgfd
# 6:  A 2010-08-17  5      dg
# 7:  A 2010-12-19  6     dfg
# 8:  B 2009-01-01  1     dfg
# 9:  B 2010-01-01  2     ydg
#10:  B 2010-01-10  3   fdgfd
#11:  B 2010-01-14  4     dfg
#12:  B 2010-02-15  5     dfg
#13:  B 2010-08-17 NA      NA
#14:  B 2010-12-19 NA      NA

data

test <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B"), date = structure(c(14610, 14619, 14623, 14655, 
14838, 14962, 14245, 14610, 14619, 14623, 14655), class = "Date"), 
nr = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L), namecol = c("rdm", 
"dfg", "fdgfd", "fdgfd", "dg", "dfg", "dfg", "ydg", "fdgfd", 
"dfg", "dfg")), .Names = c("ID", "date", "nr", "namecol"),
 row.names = c(NA, -11L), class = "data.frame")
akrun
  • 874,273
  • 37
  • 540
  • 662
  • damn this was fast! But it also sets the following to NA: `A 2010-01-01 NA NA`, so something must be wrong? – user3032689 Sep 11 '16 at 11:46
  • @user3032689 In your 'out' there are 13 rows (while here I get 14 rows) without the `A 2009-01-01 NA NA`. So you don't want that combination? – akrun Sep 11 '16 at 11:50
  • Yes, I do. The intended process was to compare all the dates of A with the ones of B (and potential other individuals). Then for each individual, add all the dates which are missing from the other individuals' dates. In this example, A does cover all dates from B, except `2009-01-01`, so this must be added with NAs for the respective cols. With regard to B, B has all dates of A except for `2010-08-17` and `2010-12-19`. Therefore those dates must be added in this case with NAs. – user3032689 Sep 11 '16 at 11:58
  • My idea therefore was to calc a min and max date across groups first and then to add all missing dates with NAs for each individual in the 2nd step. I don't know, but maybe there's a faster way? – user3032689 Sep 11 '16 at 12:00
  • if you are doing a sequence between min max it will be more no of rows – akrun Sep 11 '16 at 12:02
  • ok, hmm, in the end I was looking for the same nr of rows (dates) per each individual. I know it will be a lot more rows in total then. – user3032689 Sep 11 '16 at 12:06
  • @user3032689 I think you created the dataset with the first row as column name – akrun Sep 11 '16 at 12:06