
How can I, from the following two data frames:

df1:

cat1        cat2
a           NA
b           NA
c           NA
d           NA
e           NA

df2:

cat1        cat2
c           1
d           2

produce the following result as efficiently as possible?

cat1        cat2
a           NA
b           NA
c           1
d           2
e           NA

When I do:

df3 <- rbind(df2,df1[!(df1$cat1 %in% df2$cat1),])
merge(df1,df3,all.y=TRUE)

I get the desired data frame. But is there not a neater, perhaps more efficient way to do this? (This is just dummy data; in reality I have 700k rows of data.)

    You should write code to produce a sufficiently large example, probably as a function of the number of rows or similar, and taking care regarding whether `cat1` values are unique or not. – Frank Apr 12 '17 at 17:38
  • Fwiw, this is probably the fastest way: http://stackoverflow.com/q/14720923/ If that works, we could close the question. – Frank Apr 12 '17 at 17:45
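For reference, a minimal sketch of the data.table update join that the comment above links to (my illustration, not code from the thread; it assumes df1 and df2 are the data frames shown in the question, and the cat2 columns may need converting to a common type first):

library(data.table)
setDT(df1)   # convert the data frames to data.tables by reference
setDT(df2)
# update join: for every row of df1 whose cat1 has a match in df2,
# overwrite cat2 with the matching value (i.cat2 is df2's cat2 column)
df1[df2, on = "cat1", cat2 := i.cat2]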

1 Answer


How about this:

df1 <- read.table(text="cat1        cat2
a           NA
b           NA
c           NA
d           NA
e           NA", header=TRUE, stringsAsFactors=FALSE)

df2 <- read.table(text="cat1        cat2
c           1
d           2", header=TRUE, stringsAsFactors=FALSE)

df1[df1$cat1 %in% df2$cat1, ] <- df2

  cat1 cat2
1    a   NA
2    b   NA
3    c    1
4    d    2
5    e   NA
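
This works here because the rows of df1 that match happen to appear in the same order as the rows of df2. If that ordering cannot be guaranteed, a match()-based assignment avoids the assumption (a sketch, assuming every cat1 in df2 also occurs in df1):

idx <- match(df2$cat1, df1$cat1)  # positions of df2's keys within df1$cat1
df1$cat2[idx] <- df2$cat2         # overwrite cat2 only, in the matching rows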

EDIT

I ran a microbenchmark on your solution, my solution, and the data.table solution from another answer, and mine is the fastest by far.

The mean time over 1000 runs of df1[df1$cat1 %in% df2$cat1, ] <- df2 is 80 microseconds. This compares to 690 microseconds for the data.table solution and a combined 1062 microseconds (rbind plus merge) for your solution. So effectively, my solution is an order of magnitude faster.

library(microbenchmark)
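# Note: df3 (built with the rbind() call from the question) and the keyed
# data.tables dat/dat1 from the other answer must already exist for the
# merge() and join expressions below to run.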
res <- microbenchmark(
   rbind(df2, df1[!(df1$cat1 %in% df2$cat1), ]),
   merge(df1, df3, all.y=TRUE),
   df1[df1$cat1 %in% df2$cat1, ] <- df2,
   dat1[dat][, 1:2, with=T],
   times=1000L)

> print(res)
Unit: microseconds
                                         expr     min       lq      mean   median       uq      max neval
 rbind(df2, df1[!(df1$cat1 %in% df2$cat1), ]) 242.395 260.3555 279.3699 268.3550 277.5615 2817.263  1000
                merge(df1, df3, all.y = TRUE) 679.488 724.1640 783.2416 740.1625 761.5940 6756.541  1000
         df1[df1$cat1 %in% df2$cat1, ] <- df2  63.392  72.1450  80.0050  75.1640  80.5975 2017.334  1000
                   dat1[dat][, 1:2, with = T] 602.816 649.6040 690.9846 665.3010 691.2615 3264.319  1000

EDIT2

Another microbenchmark, this time with 100,000 data points and including the setkeyv steps of data.table. The base indexing (df[df$cat1 %in% df1$cat1, ] <- df1) is slightly faster (7 milliseconds on average) than the combined data.table steps (7.4 milliseconds), but not by much. Which approach is more efficient will depend on OP's actual data set.

library(data.table)
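# build a 100,000-row target table (cat2 all NA) and a 10,001-row lookup
# table whose keys are a random subset of the target's keys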
dat <- data.table(cat1=c(paste0("a",1:100000)),cat2=rep(NA,100000))
dat1 <- data.table(cat1=c(paste0("a",sample(1:100000,10001))),cat2=1:10001)
setkeyv(dat,"cat1")
setkeyv(dat1,"cat1")
df <- data.frame(dat)
df1 <- data.frame(dat1)

library(microbenchmark)
res <- microbenchmark(
   merge(df,df1,all.y=TRUE),
   df[df$cat1 %in% df1$cat1, ] <- df1,
   setkeyv(dat,"cat1"),
   setkeyv(dat1,"cat1"),
   dat1[dat][,1:2,with=T],
   times=100L)
print(res)
Unit: microseconds
                               expr       min        lq       mean    median         uq       max neval  cld
       merge(df, df1, all.y = TRUE) 96573.600 98317.435 115509.544 102872.81 130325.979 195910.42   100    d
 df[df$cat1 %in% df1$cat1, ] <- df1  4329.293  4785.601   7059.100   5054.74   5632.501  40521.16   100   c 
               setkeyv(dat, "cat1")  1166.073  1568.211   1928.071   1766.36   1913.329  14256.59   100 ab  
              setkeyv(dat1, "cat1")   215.253   296.935    434.589    443.05    506.629   1279.54   100 a   
         dat1[dat][, 1:2, with = T]  3531.004  4020.242   5024.882   4195.72   4587.026  34787.45   100  bc 
Pierre Lapointe
    If you did your benchmark on the tiny dataset shown here, it's not necessarily meaningful. – Frank Apr 12 '17 at 17:40
  • @Frank You are right. I added another edit where I reran the microbenchmark with 100,000 data points and included the `data.table` overhead `setkeyv` steps. In this case, the base `df[df$cat1 %in% df1$cat1, ] <- df1` indexing is still faster but only slightly so. It will depend on OP's actual data set. – Pierre Lapointe Apr 12 '17 at 18:28
  • Okay, thanks. I've put my benchmark with data.table update join in chat: http://chat.stackoverflow.com/transcript/message/36605566#36605566 It doesn't require sorting/keying and will correctly handle cases where lookups in `df1$cat1` do not appear in `df$cat1`, appear multiple times, etc. – Frank Apr 12 '17 at 18:37
  • Thanks! Slightly faster. Marked answer – Mike Haye Apr 13 '17 at 16:43