2

I have two huge data frames where one table contains fewer observations. Anti-join from dplyr works perfect and fast, but I would to keep the original row names if possible. Any function/package which allows this with the speed of anti_join? My current solution

library(magrittr)
library(dplyr)

iris
iris2 <- iris[-c(3, 9, 30, 40), ]

iris <- iris %>% mutate(rown = 1:n())

anti_join(iris, iris2) %>% set_rownames(.$rown) %>% select(-rown)

EDIT: Looking for a solution where I dont need to add rownumber.

MLEN
  • 2,162
  • 2
  • 20
  • 36
  • 1
    There is `row_number()` in dplyr i.e. instead of `1:n()`, you can use `mutate(rown = row_number())` – akrun May 08 '17 at 07:07
  • 1
    Just keep it as a column, which in the long run will be more useful anyway unless you're doing a lot with matrices: `library(tidyverse); iris %>% rownames_to_column('i') %>% anti_join(iris2)` – alistaire May 08 '17 at 07:30
  • @MLEN - it appears like you have changed the question to a different question: now trying to drop rownumbers from getting involved. Everyone already answered the old question and can't erase everything they have submitted. If you found their work helpful in answering your original question, can you select an answer for that question? Then if you want to ask a new question, please resubmit your new question as a new question. That way you will be able to get answers to your new question. – leerssej May 12 '17 at 18:25

2 Answers2

1

what about plain R?

iris$rows<-seq(1,dim(iris)[1],1)
iris2 <- iris[-c(3, 9, 30, 40), ]
iris[which(!(iris$rows %in% iris2$rows)),]

Time Comparison

iris <- iris %>% mutate(rown = 1:n())
iris$rows<-seq(1,dim(iris)[1],1)
 iris2 <- iris[-c(3, 9, 30, 40), ]


    f=function(){
      anti_join(iris, iris2) %>% set_rownames(.$rown) %>% select(-rown)
    }


f2=function(){
  iris[which(!(iris$rows %in% iris2$rows)),]   
}

    microbenchmark(f(),f2())

 expr      min        lq      mean   median       uq      max neval
  f() 1531.523 1832.0710 2384.1375 2407.061 2828.096 4084.061   100
 f2()  104.869  168.0125  238.7027  235.396  290.284  513.185   100

Without using any rows number

mtcars 
mtcars2<-mtcars[-c(3,9,40),]
mtcars[which(!row.names(mtcars) %in% row.names(mtcars2)),]

Hope it helps

Federico Manigrasso
  • 1,130
  • 1
  • 7
  • 11
  • It will not work for my real data as it does not have matching "rown". – MLEN May 08 '17 at 08:00
  • Sorry, I don't get why this should not work in your case, you just have to adapt this approach to your data, even if you have a character ID. – Federico Manigrasso May 08 '17 at 08:09
  • You can start iris and create a row number, then you divide it into 2 frames. I start with 2 frames for my data in particular without any "rows" – MLEN May 08 '17 at 08:09
  • Ok, you are right, but I guess you should have a column that you can use as a match, I mean a unique ID – Federico Manigrasso May 08 '17 at 08:10
  • in particular I wonder if u can do something like this mtcars mtcars2<-mtcars[-c(3,9,40),] mtcars[which(!row.names(mtcars) %in% row.names(mtcars2)),] ( I added this at the end of the post) – Federico Manigrasso May 08 '17 at 08:21
1

dplyr's setdiff

setdiff(iris, iris2)

It is a little fussier than an anti_join since it is of the traditional set operations ilk and needs matching columns, but this is a good use case.

The complete example:

iris <- 
    iris %>% 
    mutate(rown = 1:n())

iris2 <-
    iris[-c(3, 9, 30, 40), ] 

setdiff(iris, iris2)
> setdiff(iris, iris2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species rown
1          4.7         3.2          1.3         0.2  setosa    3
2          4.4         2.9          1.4         0.2  setosa    9
3          4.7         3.2          1.6         0.2  setosa   30
4          5.1         3.4          1.5         0.2  setosa   40

However, I reran Federico's base R solution, too, and it turns out that the which isn't the only contributor to the significant time reduction. Using the base R method to generate a rownumbers column instead of a mutate is what reduces much of the processing time.

In any case, if you just drop an iris$rown<-seq(1,dim(iris)[1],1) into your original code, and don't bother renumbering your rows, you would see a healthy drop in your processing time. which still barely wins by a whisker, though I personally find that the sacrifice of microseconds here is well worth the ease and readability of setdiff(bigTbl, smallTbl).

The complete set of assays went like this:

antjn_set_rwnms=function(){
    iris  <- iris  %>% mutate(rown = 1:n())
    iris2 <- iris [-c(3, 9, 30, 40), ]
    anti_join(iris , iris2) %>% set_rownames(.$rown) %>% select(-rown)
}
antjn_rwnum=function(){
    iris <- iris %>% mutate(rown = row_number())
    iris2 <- iris[-c(3, 9, 30, 40), ]
    anti_join(iris, iris2)
}
antjn_1n=function(){
    iris <- iris %>% mutate(rown = 1:n())
    iris2 <- iris[-c(3, 9, 30, 40), ]
    anti_join(iris, iris2)
}
stdff_seq=function(){
    iris$rown<-seq(1,dim(iris)[1],1)
    iris2 <- iris[-c(3, 9, 30, 40), ]
    setdiff(iris, iris2)
}
stdff_rwnum=function(){
    iris <- iris %>% mutate(rown = row_number())
    iris2 <- iris[-c(3, 9, 30, 40), ]
    setdiff(iris, iris2)
}
stdff_1n=function(){
    iris <- iris %>% mutate(rown = 1:n())
    iris2 <- iris[-c(3, 9, 30, 40), ]
    setdiff(iris, iris2)
}
whch_1n=function(){
    iris <- iris %>% mutate(rown = 1:n())
    iris2 <- iris[-c(3, 9, 30, 40), ]
    iris[which(!(iris$rown %in% iris2$rown)),]
}
whch_seq=function(){
    iris$rown<-seq(1,dim(iris)[1],1)
    iris2 <- iris[-c(3, 9, 30, 40), ]
    iris[which(!(iris$rown %in% iris2$rown)),]
}
antjn_seq=function(){
    iris$rown<-seq(1,dim(iris)[1],1)
    iris2 <- iris[-c(3, 9, 30, 40), ]
    anti_join(iris, iris2)
}


microbenchmark(
               antjn_set_rwnms(),
               antjn_rwnum(),
               antjn_1n(),
               stdff_1n(),
               stdff_rwnum(),
               whch_1n(),
               antjn_seq(),
               stdff_seq(),
               whch_seq(),
               times = 10000L)

Yielding the following results:

Unit: microseconds
              expr      min       lq      mean   median        uq      max neval
 antjn_set_rwnms() 2858.014 3343.828 4268.2085 3891.226 4890.2260 97763.88 10000
     antjn_rwnum() 2273.410 2612.540 3401.6724 3043.615 3897.2130 89484.50 10000
        antjn_1n() 2278.114 2617.886 3371.6463 3048.748 3899.7790 75887.23 10000
        stdff_1n() 1390.305 1591.302 2117.1687 1852.385 2465.6405 88135.68 10000
     stdff_rwnum() 1383.034 1581.466 2073.9099 1811.329 2421.8065 59467.88 10000
         whch_1n() 1347.539 1550.247 2083.1022 1810.261 2405.1275 76800.70 10000
       antjn_seq() 1168.779 1406.555 1847.2465 1621.024 2112.1845 87611.37 10000
       stdff_seq()  336.564  421.668  567.4554  480.684  660.2985 64252.05 10000
        whch_seq()  285.246  382.323  514.1107  437.063  595.7225 16619.50 10000
leerssej
  • 14,260
  • 6
  • 48
  • 57