0

I have two data frames: codes and supply. Codes (shown below) is comprised of four fields: state,codetype,code,codetitle supply has 12 columns but three of them are state,codetype, and code

An example of this is below

    state     codetype    code    codetitle
      32          15     123456     Something
      32          15     123455     Something Else
      32          10     123455     Something Else

From there, I use the following code to concatenate the item

    supply1<- supply%>%mutate(supply1= paste0(state,codetype,code))
    codes1<- codes%>%mutate(codes1= paste0(state,codetype,code))

My question is how do I find out what combinations of state,codetype,code are in supply1 but not codes1. I would use excel and the match function to do this but there are 1.9 million rows and that exceeds the capacity of Excel.

Have looked at documentation about antijoin. However, being that there is no common field such as ID, getting a bit confused.

Tim Wilcox
  • 1,275
  • 2
  • 19
  • 43
  • 2
    Sounds like you are looking for an "anti join". There is a function called `anti_join` in the `dplyr` package. Alternatively, you can just use `setdiff(supply1$supply1, codes1$codes1)` on the vectors you just created. – Vincent Oct 12 '20 at 18:42
  • 1
    @Vincent is right initially (`anti_join`), though `setdiff` will work if you're judging based on one column only. Try `anti_join(supply1, codes1, by = c("state", "codetype", "code"))`. It's feasible in base R using `merge`, though it requires an extra step (and a temp column). – r2evans Oct 12 '20 at 20:37

2 Answers2

3

tidyverse

library(dplyr)
anti_join(supply, codes, by = c("state", "codetype", "code"))
#   state codetype   code   codetitle
# 1    34       15 123459 Something_4

base R

codes$code_rn <- seq_len(nrow(codes))
supply$supply_rn <- seq_len(nrow(supply))
temp <- merge(codes, supply, by = c("state", "codetype", "code"))
temp
#   state codetype   code    codetitle.x code_rn codetitle.y supply_rn
# 1    32       15 123455 Something_Else       2 Something_3         2
# 2    32       15 123456      Something       1 Something_2         1
supply[ !supply$supply_rn %in% temp$supply_rn, ]
#   state codetype   code   codetitle supply_rn
# 3    34       15 123459 Something_4         3

(and some column clean-up)

alternative base R

This is effectively what you were starting with:

supply_id <- with(supply, paste(state, codetype, code, sep = "|"))
supply_id
# [1] "32 15 123456" "32 15 123455" "34 15 123459"
codes_id <- with(codes, paste(state, codetype, code, sep = "|"))
codes_in
# [1] "32|15|123456" "32|15|123455" "32|10|123455"
supply[!supply_id %in% codes_id,]
#   state codetype   code   codetitle supply_rn
# 3    34       15 123459 Something_4         3

data

codes <- read.table(header = TRUE, text="
    state     codetype    code    codetitle
      32          15     123456     Something
      32          15     123455     Something_Else
      32          10     123455     Something_Else")
supply <- read.table(header = TRUE, text="
    state     codetype    code    codetitle
      32          15     123456     Something_2
      32          15     123455     Something_3
      34          15     123459     Something_4")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    I kept the example small. In reality, my supply table has 234,966 rows. While the code operates with no errors, it says that I reached my max and that there are essentially 234,944 rows in the result set. – Tim Wilcox Oct 12 '20 at 21:32
  • I understand. It might be preferable then to use the `paste0(..., sep="|")` option (some separation string) and use `!a %in%` b`. See my edit in 2 minutes – r2evans Oct 12 '20 at 21:45
2

With data.table, we do a join on the columns after converting the 'supply' to a data.table (setDT). By negating (!), we check for the elements that are not matching in 'codes' dataset

library(data.table)
setDT(supply)[!codes, on = c("state", "codetype", "code")]
akrun
  • 874,273
  • 37
  • 540
  • 662