Conditional merge, based an event happening between two panel observations

Question

I have a panel dataset: panel and a dataset with a list of events: Events. For the panel dataset, an equal panelID shows that two observations belong together.

panelID = c(1:50)   
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")

n <- 2

library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
                 country = rep(sample(country, length(panelID), replace = T), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 some_NA = sample(0:5, 6),                                             
                 some_NA_factor = sample(0:5, 6),         
                 norm = round(runif(100)/10,2),
                 Income = round(rnorm(10,-5,5),2),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = sample(100,100),
                 Educ = round(rnorm(10,0.75,0.3),2))        
Panel[, uniqueID := .I]                                                                        # Creates a unique ID     
Panel[Panel == 0] <- NA    


Events <- fread(
"Event_Type  country year   
A   NLD   2005
C   NLD   2004       
A   GBR   2006
B   GBR   2003   
A   GRC   2002             
D   GRC   2007",
header = TRUE)

================================================================================ EDIT:

Events <- fread(
"Event_Type  country year   
A   NLD   2005
A   NLD   2004       
A   GBR   2006
A   GBR   2003   
A   GRC   2002             
A   GRC   2007",
header = TRUE)

EDITED DESIRED OUTCOME:

panleID country year 2002  2003  2004 2005 2006 2007 
1       NLD     2002 NA    NA    1    1    NA   NA 
1       NLD     2006 NA    NA    1    1    NA   NA

=======================================================================

I would like the value in the column Event_Type to be added to Panel, if the year of the event is between two panel observation (and in the same country).

As an example, let's take the following panel observation:

panleID country year
1       NLD     2002
1       NLD     2006

Panel will get 4 extra columns A to D. The column A, will get a 1 in the column if the event in country NLD in the year 2005 (first line Events, takes place in one of or between the two years. Since it does this results in the following:

panleID country year A  B  C  D 
1       NLD     2002 1  NA NA NA
1       NLD     2006 1  NA NA NA

I know that merging with the same year goes as follows:

merge(Panel, dcast(Events, iso + country ~ Event_Type),
      by = c("country", "year"))

But how should I make the merge if I want the values to be equal to or in between the two panelID years?

in your example, in 2004 thare is event `C` in NLD also, that is between 2002 and 2006 of panleID ... why A == 1, but C == NA in your result? — Wimpel, Nov 16 '19 at 12:56
I just tried to give a very simple example. Apparently a bit too simple ;) C should also be 1 in that case. My apologies! — Tom, Nov 16 '19 at 12:58

Wimpel · Accepted Answer · 2019-11-16T13:10:45.320

Here is a go at your problem using data.table The code can be shortened, but I always find it useful (expecially on SO) to show all steps in between for easy error-checking and validation.

#first, summarise Panel, to get the time-span of the panelID
Panel.short <- Panel[, .(country = unique(country), 
                         start = min(year), 
                         end = max(year) ), 
                     by = .(panelID)]
#    panelID country start  end
# 1:       1     NLD  2002 2006

#perform left non-equi join
Panel.short.joined <- Events[ Panel.short, on =.(country, year >= start, year <= end), mult = "all"][]
#    Event_Type country year year.1 panelID
# 1:          A     NLD 2002   2006       1
# 2:          C     NLD 2002   2006       1

#cast to wide
Panel.final <- dcast( Panel.short.joined, 
       panelID + country ~ Event_Type, 
       fun.aggregate = length )
#    panelID country A C
# 1:       1     NLD 1 1

#perform update join on the original Panel
Panel[, `:=`(A=0, B=0, C=0, D=0)][ 
  Panel.final, 
  `:=`( A = i.A, C = i.C),   # <- add B = i.B and D = i.D here 
  on = .( panelID )][]
#    panelID country year A B C D
# 1:       1     NLD 2002 1 0 1 0
# 2:       1     NLD 2006 1 0 1 0

updated the last line, to initialise A, B, C and D before the actual update-join. — Wimpel, Nov 16 '19 at 13:09
I have been trying to adapt you answer to the EDIT Events and desired output, (because I made a mistake in the example) but I am doing something wrong. Is there anyway you could show me how your answer looks like for the EDIT? — Tom, Nov 25 '19 at 17:06

score 2 · Answer 2 · answered Nov 16 '19 at 17:11

2

This is similar to @Wimpel but changes the order to:

Cast Events to wide
Update by reference the year range by panelID
Non-equi update join

# cast Event 
Events_cast <- dcast(Events, country + year~Event_Type, length)

# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['Event_Type']]))

# non-equi update join
Panel[Events_cast,
      on = .(country,
             start <= year,
             end >= year),
      (cols) := mget(cols)]

#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

Panel

answered Nov 16 '19 at 17:11

Cole

11,130
1
9
24

Thank you for your answer! I have a quick question. I have been trying to adapt you answer to the EDIT `Events` and desired output, (because I made a mistake in the example) but I am doing something wrong. is there anyway you could show me how your answer looks like for the EDIT? – Tom Nov 25 '19 at 17:04
I have tried to change it to `Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")`, which looks good. But at the `# non-equi update join` I get the error: `Error in [.data.table (Panel, Events, on = .(country, : LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.` – Tom Nov 26 '19 at 07:26
I made a new post for clarity: https://stackoverflow.com/questions/59047386/dcast-merge-based-on-a-column-with-a-value-within-a-certain-range – Tom Nov 26 '19 at 09:13

score 0 · Answer 3 · answered Nov 16 '19 at 22:22

I would consider making use of 'between' and '.SD'. Having trouble following your example, so typically:

DT[between(startYear, endYear, incbounds=FALSE,][, dcast(,.SD, cat1 ~ cat2 ...)]

Note: By passing data.table to cast with .SD, you can use i to subset further.

Conditional merge, based an event happening between two panel observations

3 Answers3

Linked