
TL;DR

What I would like is essentially an anti-join: removing rows of a data.table based on two columns lat, lon that exactly match the lat, lon columns of another data.table.

The Problem

Suppose I have the following data.table "dt.master" with over 1 million rows, containing an id and the coordinates lat, lon of a certain location:

id    lat      lon
1     43.23    5.43
2     43.56    4.12
3     52.14   -9.85
4     43.56    4.12
5     43.83    9.43
...   ...      ...

What I would like to do is remove the rows that match certain pairs of coordinates. You could think of those pairs as being put on the following blacklist (again a data.table, named "dt.blacklist"):

lat      lon
43.56    4.12
11.14   -5.85

In this case, when applying the blacklist, the answer would have to be:

id    lat      lon
1     43.23    5.43
3     52.14   -9.85
5     43.83    9.43
...   ...      ...  

As straightforward as it seems, I cannot get it right.
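For reproducibility, here is a minimal sketch of the two tables above (only the rows shown are included):

```r
library(data.table)

# Example data as shown in the question
dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)

dt.blacklist <- data.table(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)
```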

What I did so far

  • Using merge, like so:

    dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x=c("lat", "lon"), by.y=c("lat", "lon"))
    

    But that yields the rows that match and is thus an inner join. I thought about removing rows based on this result by using subset:

    subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)
    

    But this only partially works: in the above example only 1 of the 2 matching rows is removed. Somehow it only removes the first "hit".

  • Using a quick-and-dirty solution: concatenating lat and lon into a new column named "C" in both data tables and then removing matches as such:

    dt.master[C != dt.blacklist$C]
    

    Yet, the same issue arises where only 1 of the two rows is removed.
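For what it's worth, both attempts fail for the same reason: `!=` compares vectors element by element, recycling the shorter one, so it does not test membership. A small base-R illustration (toy vectors, not the question's data):

```r
x   <- c(1, 2, 3, 2)
bad <- c(2, 9)

# Element-wise comparison with recycling: x is compared against
# c(2, 9, 2, 9), so neither 2 in x lines up with the 2 in bad
x[x != bad]       # all four elements survive

# A membership test is what is actually wanted here
x[!(x %in% bad)]  # keeps only 1 and 3
```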

h3rm4n
dkreeft

2 Answers


I think you are looking for this:

dt.master[!dt.blacklist, on = .(lat,lon)]

The output:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43

Thanks to a warning in the comments: joining on floating point values can have unintended side effects. By converting the coordinates to integers you can prevent that. The join will as a result look a bit more complicated:

dt.master[, (2:3) := lapply(.SD,function(x) as.integer(x*100)), .SDcols = 2:3
          ][!dt.blacklist[, (1:2) := lapply(.SD,function(x) as.integer(x*100))], on = .(lat,lon)
            ][, (2:3) := lapply(.SD, `/`, 100), .SDcols = 2:3][]

The output is the same:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43
h3rm4n

We can use fsetdiff from data.table:

fsetdiff(dt.master[, -1], dt.blacklist)
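For context: fsetdiff keeps only the rows of the first table that do not occur in the second, and both tables must have the same columns, which is why the id column is dropped first (and is therefore absent from the result). A self-contained sketch using the question's table names:

```r
library(data.table)

dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.table(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)

# Both arguments need identical columns, so drop id before the set difference;
# note the result therefore no longer contains id
res <- fsetdiff(dt.master[, -1], dt.blacklist)
res  # the lat/lon pairs not on the blacklist
```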

or we can use anti_join from dplyr:

library(dplyr)
anti_join(dt.master, dt.blacklist)
#  id   lat   lon
#1  1 43.23  5.43
#2  3 52.14 -9.85
#3  5 43.83  9.43
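A small addition: the join columns can also be named explicitly via the `by` argument, which avoids the "Joining, by = ..." message. A self-contained sketch with the question's table names:

```r
library(dplyr)

dt.master <- data.frame(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.frame(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)

# Naming the join columns explicitly documents the key and silences the message
res <- anti_join(dt.master, dt.blacklist, by = c("lat", "lon"))
res$id  # ids of the surviving rows
```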
akrun