4

I have two datasets, and what I want may be loosely termed as "outer join on a non-key variable".

Here are the datasets

Dataset 1

oc  oc2 state_id    r_state 
A011    A01 1808    1.00    
A011    A01 1810    0.50    
A012    A01 1810    0.50    
A011    A01 1814    0.33    
A012    A01 1814    0.33    
A013    A01 1814    0.33    

Dataset 2

oc  r_country
A011    0.62
A012    0.14
A013    0.24

what I want as output is following:

oc  oc2 state_id    r_state r_country
A011    A01 1808    1.00    0.62
A012    A01 1808    NA      0.14
A013    A01 1808    NA      0.24
A011    A01 1810    0.50    0.62
A012    A01 1810    0.50    0.14
A013    A01 1810    NA      0.24
A011    A01 1814    0.33    0.62
A012    A01 1814    0.33    0.62
A013    A01 1814    0.33    0.24

Note how the row numbers 2, 3 and 6 has been introduced "artificially". My question is how to do this in R. If I merge by oc, merge function will not create these rows. If I merge by oc2, it will create unnecessary extra rows. Note that oc2 is just a higher level coding of oc. Below are few lines of code to get the above datasets in data.tables. Note that these are sample datasets and actual data contain around 50 different oc2 and each oc2 can have 1 to 9 oc in it. Also, there are 47 different states.

DT1 = data.table(oc = c('A011','A011','A012','A011','A012','A013'),
                 oc2 = rep('A01',6),
                 state_id = c(1808,1810,1810,1814,1814,1814),
                 r_state = c(1, 0.5,0.5,0.33,0.33,0.33)
                )

DT2 = data.table(oc = c('A011','A012','A013'), 
                  r_country = c(0.62, 0.14, 0.24)
                )

Though I am using data.table, base data.frame solution will also do.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Gaurav Singhal
  • 998
  • 2
  • 10
  • 25

2 Answers2

5

Using:

library(zoo) # for the 'na.locf'-function

DT1[CJ(oc = oc, state_id = state_id, unique = TRUE), on = .(oc, state_id)
    ][order(state_id), oc2 := na.locf(oc2), by = 'state_id'
      ][DT2, on = 'oc', r_country := r_country][order(state_id)]

gives:

     oc oc2 state_id r_state r_country
1: A011 A01     1808    1.00      0.62
2: A012 A01     1808      NA      0.14
3: A013 A01     1808      NA      0.24
4: A011 A01     1810    0.50      0.62
5: A012 A01     1810    0.50      0.14
6: A013 A01     1810      NA      0.24
7: A011 A01     1814    0.33      0.62
8: A012 A01     1814    0.33      0.14
9: A013 A01     1814    0.33      0.24

Per @Frank's suggestion, you can also do it without using na.locf from the zoo-package:

DT1[CJ(oc = oc, state_id = state_id, unique = TRUE), on = .(oc, state_id)
    ][DT2, on = .(oc), r_country := i.r_country][DT1, on = .(state_id), oc2 := i.oc2][]
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    Thanks, you have implemented the "process" I was thinking in seconds. I never knew about `CJ` and `on`. Where do I read about them? – Gaurav Singhal Dec 20 '17 at 08:25
  • Also, `on = .(oc, state_id)` throw an error for me. I had to use `on = c('oc', 'state_id')`. May be I have older version of `data.table` – Gaurav Singhal Dec 20 '17 at 13:03
  • @GauravSinghal I'm using version *1.10.4-3*; what version are you using? – Jaap Dec 20 '17 at 13:08
  • Yeah, I am using version *1.9.6*. – Gaurav Singhal Dec 20 '17 at 13:39
  • Zoo-less, this seems to work: `DT1[CJ(oc = oc, state_id = state_id, unique = TRUE), on=.(oc, state_id)][DT2, on=.(oc), r_country := i.r_country][DT1, on=.(oc), oc2 := i.oc2][]`? – Frank Dec 20 '17 at 17:18
  • 1
    @Frank Thx! Added with a minor change: I think the last join should be on `state_id`. – Jaap Dec 20 '17 at 20:18
1

This can be expressed as a single SQL statement using a triple join which joins together these 4 tables:

  • oc which is formed from the unique oc values,
  • state_id which is formed from the unique state_id values,
  • DT1 and
  • DT2

using this code:

library(sqldf)

sql <- "
     with 
       oc as (select distinct oc from DT1),
       state_id as (select distinct state_id from DT1)

     select * 
       from oc
       join state_id
       left join DT1 using (oc, state_id)
       left join DT2 using (oc)
       order by state_id, oc"

sqldf(sql)

giving this output:

    oc state_id  oc2 r_state r_country
1 A011     1808  A01    1.00      0.62
2 A012     1808 <NA>      NA      0.14
3 A013     1808 <NA>      NA      0.24
4 A011     1810  A01    0.50      0.62
5 A012     1810  A01    0.50      0.14
6 A013     1810 <NA>      NA      0.24
7 A011     1814  A01    0.33      0.62
8 A012     1814  A01    0.33      0.14
9 A013     1814  A01    0.33      0.24
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341