dplyr's setdiff
setdiff(iris, iris2)
It is a little fussier than an anti_join since it is of the traditional set operations ilk and needs matching columns, but this is a good use case.
The complete example:
iris <-
iris %>%
mutate(rown = 1:n())
iris2 <-
iris[-c(3, 9, 30, 40), ]
setdiff(iris, iris2)
> setdiff(iris, iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species rown
1 4.7 3.2 1.3 0.2 setosa 3
2 4.4 2.9 1.4 0.2 setosa 9
3 4.7 3.2 1.6 0.2 setosa 30
4 5.1 3.4 1.5 0.2 setosa 40
However, I reran Federico's base R solution, too, and it turns out that the which
isn't the only contributor to the significant time reduction. Using the base R method to generate a rownumbers column instead of a mutate is what reduces much of the processing time.
In any case, if you just drop an iris$rown<-seq(1,dim(iris)[1],1)
into your original code, and
don't bother renumbering your rows, you would see a healthy drop in your processing time. which
still barely wins by a whisker, though I personally find that the sacrifice of microseconds here is well worth the ease and readability of setdiff(bigTbl, smallTbl)
.
The complete set of assays went like this:
antjn_set_rwnms=function(){
iris <- iris %>% mutate(rown = 1:n())
iris2 <- iris [-c(3, 9, 30, 40), ]
anti_join(iris , iris2) %>% set_rownames(.$rown) %>% select(-rown)
}
antjn_rwnum=function(){
iris <- iris %>% mutate(rown = row_number())
iris2 <- iris[-c(3, 9, 30, 40), ]
anti_join(iris, iris2)
}
antjn_1n=function(){
iris <- iris %>% mutate(rown = 1:n())
iris2 <- iris[-c(3, 9, 30, 40), ]
anti_join(iris, iris2)
}
stdff_seq=function(){
iris$rown<-seq(1,dim(iris)[1],1)
iris2 <- iris[-c(3, 9, 30, 40), ]
setdiff(iris, iris2)
}
stdff_rwnum=function(){
iris <- iris %>% mutate(rown = row_number())
iris2 <- iris[-c(3, 9, 30, 40), ]
setdiff(iris, iris2)
}
stdff_1n=function(){
iris <- iris %>% mutate(rown = 1:n())
iris2 <- iris[-c(3, 9, 30, 40), ]
setdiff(iris, iris2)
}
whch_1n=function(){
iris <- iris %>% mutate(rown = 1:n())
iris2 <- iris[-c(3, 9, 30, 40), ]
iris[which(!(iris$rown %in% iris2$rown)),]
}
whch_seq=function(){
iris$rown<-seq(1,dim(iris)[1],1)
iris2 <- iris[-c(3, 9, 30, 40), ]
iris[which(!(iris$rown %in% iris2$rown)),]
}
antjn_seq=function(){
iris$rown<-seq(1,dim(iris)[1],1)
iris2 <- iris[-c(3, 9, 30, 40), ]
anti_join(iris, iris2)
}
microbenchmark(
antjn_set_rwnms(),
antjn_rwnum(),
antjn_1n(),
stdff_1n(),
stdff_rwnum(),
whch_1n(),
antjn_seq(),
stdff_seq(),
whch_seq(),
times = 10000L)
Yielding the following results:
Unit: microseconds
expr min lq mean median uq max neval
antjn_set_rwnms() 2858.014 3343.828 4268.2085 3891.226 4890.2260 97763.88 10000
antjn_rwnum() 2273.410 2612.540 3401.6724 3043.615 3897.2130 89484.50 10000
antjn_1n() 2278.114 2617.886 3371.6463 3048.748 3899.7790 75887.23 10000
stdff_1n() 1390.305 1591.302 2117.1687 1852.385 2465.6405 88135.68 10000
stdff_rwnum() 1383.034 1581.466 2073.9099 1811.329 2421.8065 59467.88 10000
whch_1n() 1347.539 1550.247 2083.1022 1810.261 2405.1275 76800.70 10000
antjn_seq() 1168.779 1406.555 1847.2465 1621.024 2112.1845 87611.37 10000
stdff_seq() 336.564 421.668 567.4554 480.684 660.2985 64252.05 10000
whch_seq() 285.246 382.323 514.1107 437.063 595.7225 16619.50 10000