I'm having trouble with semi_join from dplyr. I would like to semi-join dfA against dfB. Both dfA and dfB contain duplicate values of the key x. I want to return every row of dfA whose x has at least one match in dfB, including rows of dfA that share the same x.

dfA              dfB               >>     dfC
x    y    z      x    g                   x    y    z   
1    r    5      1    lkm                 1    r    5
1    b    4      1    pok                 1    b    4
2    4    e      2    jij                 2    4    e
3    5    r      2    pop                 3    5    r
3    9    g      3    hhg                 3    9    g
4    3    0      5    trt

What I would like to get is the dfC output above: because there is at least one match on x in dfB, every row of dfA with that x is returned.

semi_join(dfA, dfB, by = "x")
dfC
x    y    z  
1    r    5
2    4    e
3    5    r


inner_join(dfA, dfB, by = "x")
x    y    z    g  
1    r    5    lkm
1    r    5    pok
1    b    4    lkm
1    b    4    pok
2    4    e    jij
2    4    e    pop
3    5    r    hhg
3    9    g    hhg

Neither of these gives me the right result. Any help would be great! Thanks in advance.

Matt W.

2 Answers

Not sure why you need a join; just use `%in%`:

library(data.table)
# setDT() converts dfA to a data.table by reference; then keep rows whose x appears in dfB$x
setDT(dfA)[x %in% dfB$x, ]

# Simple base R approach:
dfA[dfA$x %in% dfB$x, ]
joel.wilson
  • Thanks @joel.wilson - I'm still new to R, so I'm using what I know at this point. Your solution worked, but it also matched my `semi_join` result. Does `semi_join` return duplicates of the ENTIRE ROW, or only one row per duplicated key? My worry was that it deduplicated on the key, but from this example it looks like it keeps the duplicate rows. If so, my example above is wrong. – Matt W. Jan 10 '17 at 16:46
  • @MattW. It returns the duplicates as entire rows, if I'm not wrong. I haven't used `semi_join` much. – joel.wilson Jan 10 '17 at 16:50
  • @joel.wilson That would explain why we came to the same conclusion. And that fixes my issue. Thanks a lot for responding! Massive help. – Matt W. Jan 10 '17 at 16:50
If you're using dplyr and want to keep passing the result down the pipe:

library(dplyr)
dfA %>% filter(x %in% dfB$x)
manotheshark
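
For anyone who wants to check the point raised in the comments, here is a minimal sketch that rebuilds the question's data and compares the three approaches from this thread. The data frames are typed in by hand from the tables in the question, and storing y and z as character columns is an assumption, since they mix letters and digits. The names via_semi_join, via_filter and via_base are made up for illustration.

```r
library(dplyr)

# Data copied from the question; y and z are kept as character (an assumption).
dfA <- data.frame(
  x = c(1, 1, 2, 3, 3, 4),
  y = c("r", "b", "4", "5", "9", "3"),
  z = c("5", "4", "e", "r", "g", "0"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  x = c(1, 1, 2, 2, 3, 5),
  g = c("lkm", "pok", "jij", "pop", "hhg", "trt"),
  stringsAsFactors = FALSE
)

# Three ways to keep the rows of dfA whose x appears in dfB.
via_semi_join <- semi_join(dfA, dfB, by = "x")
via_filter    <- dfA %>% filter(x %in% dfB$x)
via_base      <- dfA[dfA$x %in% dfB$x, ]

# semi_join keeps every matching row of dfA, duplicates of the key included.
nrow(via_semi_join)  # 5
via_semi_join$x      # 1 1 2 3 3

# Compare contents only, ignoring row names and other attributes.
all.equal(via_semi_join, via_filter, check.attributes = FALSE)
all.equal(via_semi_join, via_base, check.attributes = FALSE)
```

Unlike inner_join, semi_join never adds columns from dfB and never repeats a row of dfA, which is why it should give the dfC shown in the question rather than one row per key.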