-1

I have a longitudinal dataset in a data table, similar to the simplified example below:

> head(data)
   Country     ID        Date         Value
1:   AT        AT6306    2012-11-01   16.2
2:   AT        AT6306    2012-11-02   12.2
3:   AT        AT6306    2012-11-03   11.3
4:   AT        AT6306    2012-11-04   14.2
5:   AT        AT6306    2012-11-05   17.3
6:   AT        AT6306    2012-11-06   12.5

> tail(data)
   Country     ID        Date         Value
1:   SE        SE0935    2014-06-25   16.2
2:   SE        SE0935    2014-06-26   12.2
3:   SE        SE0935    2014-06-27   11.3
4:   SE        SE0935    2014-06-28   14.2
5:   SE        SE0935    2014-06-29   17.3
6:   SE        SE0935    2014-06-30   12.5

ID is the panel variable, it is entirely unique without overlaps between countries. The date range, looking only at unique values, ranges from 2012-10-23 to 2014-09-30. Clearly, the ranges of Date are not identical for each ID. Moreover, there may be missing values. In order to have a balanced panel, I want to fill in the gaps of my dataset.

Adapting the answer here, as suggsted by @akron, I do the following:

data2 <- data[, CJ(ID=unique(ID), Date=unique(Date))]
setkey(data2, ID, Date)

data.new <- merge(data, data2, by=c("ID", "Date"), all.y = TRUE)
setkey(data.new, ID, Date)

Using the option all.y = TRUE, R thus adds rows for every missing date in data. However, now all fields other than ID and Date are blank, if the row did not exist before in data. That is, my data looks something like this

> head(data.new)
   Country     ID        Date         Value
1:   NA        AT6306    2012-10-23   NA
2:   NA        AT6306    2012-10-24   NA
3:   NA        AT6306    2012-10-25   NA
4:   NA        AT6306    2012-10-26   NA
5:   NA        AT6306    2012-10-27   NA
6:   NA        AT6306    2012-10-28   NA    

I do want the Value to be NA, since it is missing. However, since the Country does not change for a given ID, I would like the field to be filled.

Community
  • 1
  • 1
altabq
  • 1,322
  • 1
  • 20
  • 33
  • Do these countries have the same range of dates? – akrun Nov 03 '14 at 12:56
  • Yes, the range is 2011-11-01 to 2014-06-30. I'll update the question to incorporate that fact – altabq Nov 03 '14 at 12:59
  • Have a look at [this answer](http://stackoverflow.com/a/25858226/559784) from eddi. – Arun Nov 03 '14 at 13:08
  • So, the first command creates a new data table with using a sequence to create a date range without gaps, the second assigns keys (whereby it also sorts the data table) and the third should map the Values on the newly created frame, correct? I got everything working up to the last command, where I get the error: `Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i), : typeof x.Date (double) != typeof i.ID (integer)` – altabq Nov 03 '14 at 13:31
  • I should have noted that the ID is actually a factor that can contain letters and numbers. – altabq Nov 03 '14 at 13:33
  • @akrun, I don't see how you came up with your answer from the link I provided. It uses `CJ()` in `i` in a straightforward manner, after setting keys on `DT`. Why use it in `j`...?!? – Arun Nov 03 '14 at 13:35
  • @altabq, this Q very much feels like a duplicate of the one I linked. Read up on `?CJ`, and look at the usage on that post. It's a direct join. Set keys on `data` to `Country, ID, Date`. And use `DT[CJ(...)]` - that'll fill all missing values with NA by default. – Arun Nov 03 '14 at 13:39
  • @altabq, please provide reproducible data (code that we can copy/paste). Akron, what's the code you tried that gave you the wrong result? From what I can see, your and is a roundabout way of accomplishing `DT[CJ(...)]`. – Arun Nov 03 '14 at 13:50
  • @akrun You are right, the answer is almost exactly the same. I managed to create a new table using merge. I am having one slight problem though. I have a couple more fields that give attributes to the entities. So for each ID, there is the name of the entity and some other things. Is there a way to add that to DT2 using CJ? – altabq Nov 03 '14 at 14:39
  • @altabq It would be better if you provide `dput` output. ie. `dput(head(data,20))` . In the example I created, `Country` is character class. – akrun Nov 03 '14 at 14:43
  • @akrun, I would really like to do that but the data is proprietary. I thus cannot share a specific example. Basically, for each ID, I have a number of factor class columns, each row is unique to the ID. In the context `Country`, which for me is factor class: I use `data.new <- merge(data2, data, by=c("ID", "Date"), all.x = TRUE)` to match the two datasets. However, then Country will be `NA` for dates that did not previously exist in dData`. Is it possible to add the `Country` field already to the `data2` data table? `data2 <- data[, CJ(ID=unique(ID), Date=unique(Date)]` – altabq Nov 03 '14 at 15:19
  • @akrun, yes please. Basically, my problem is that by merging with the option `all.x = TRUE`, new rows are added for values that exist in `data2` (the gapless data table) but not in `data` (the original data table, which may contain gaps). Since `data2` is defined via `data2 <- data[, CJ(ID=unique(ID), Date=unique(Date)]`, it has no values for the other attributes of each entity (e.g. `Country`). Hence, the field contains. I think you should be able to reproduce this by leaving one country blank in the `data` data table. – altabq Nov 03 '14 at 15:38
  • @altabq If you check the output I posted, the Country field is there. – akrun Nov 03 '14 at 15:38
  • @altabq I updated the post. Please check if that works for you. – akrun Nov 03 '14 at 16:42

1 Answers1

1
library(data.table)
DT <- data.table(dat)
setkey(DT, Date, Country, ID)
res <- DT[CJ(seq(min(Date), max(Date), by='1 day'), 
                        unique(Country), unique(ID))]

 head(res)
#    Country   ID       Date Value
#1:      AT  935 2012-11-01    NA
#2:      AT 6306 2012-11-01  16.2
#3:      SE  935 2012-11-01    NA
#4:      SE 6306 2012-11-01    NA
#5:      AT  935 2012-11-02    NA
#6:      AT 6306 2012-11-02  12.2

Update

One option you could do is

DT <- data.table(dat)
DT[,CountryID:= paste(Country,ID)]
setkey(DT, Date, CountryID)
DT1 <- DT[CJ(unique(Date), unique(CountryID))][,
      c('Country', 'ID'):=  list(gsub("[ 0-9]", "", CountryID),
               gsub("[^ 0-9]", "", CountryID)),][,-5L]


head(DT1,3)
#     Country    ID       Date Value
#1:      AT  6306 2012-11-01  16.2
#2:      SE   935 2012-11-01    NA
#3:      AT  6306 2012-11-02  12.2

nrow(DT1)
#[1] 24

data

dat <- structure(list(Country = c("AT", "AT", "AT", "AT", "AT", "AT", 
"SE", "SE", "SE", "SE", "SE", "SE"), ID = c(6306L, 6306L, 6306L, 
6306L, 6306L, 6306L, 935L, 935L, 935L, 935L, 935L, 935L), Date = structure(c(15645, 
15646, 15647, 15648, 15649, 15650, 15669, 15670, 15671, 15672, 
15673, 15674), class = "Date"), Value = c(16.2, 12.2, 11.3, 14.2, 
17.3, 12.5, 16.2, 12.2, 11.3, 14.2, 17.3, 12.5)), .Names = c("Country", 
"ID", "Date", "Value"), row.names = c("1:", "2:", "3:", "4:", 
"5:", "6:", "1:1", "2:1", "3:1", "4:1", "5:1", "6:1"), class = "data.frame")
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you! The problem that I have with e.g. country is that I do not add it into DT as `unique`, since the `ID` is already unique across countries. So, I only use `data2 <- data[, CJ(ID=unique(ID), Date=unique(Date)]`. But then I get NAs for the countries. If I do add `Country=unique(Country)`, the matrix gets much larger than necessary, since the length will then be `unique(Date)*unique(ID)*unique(Country)` instead of `unique(Date)*unique(ID)`, which is the intended size. – altabq Nov 03 '14 at 15:41
  • @altabq Based on your code, I get 24 rows and with `NAs` on Country. So, you want to change the NAs or remove the rows from code to get 24 rows? – akrun Nov 03 '14 at 15:47
  • I would like to change the `NAs` to have the matching `Country` in there. Would it be possible to add e.g. non-unique fields already to the `CJ` table? – altabq Nov 03 '14 at 15:51
  • @altabq Are these `IDs` unique for each `Country` or is there overlaps? – akrun Nov 03 '14 at 16:16
  • the `IDs` are all unique. I should have chosen a better example than in my question, since they look more like `AT6306` and `SE0935`. So, `Country` is not needed for merging, but the information should be retained. – altabq Nov 03 '14 at 16:22
  • I will rewrite my question to the one it basically boils down to now. – altabq Nov 03 '14 at 16:23