
TL;DR

What I would like is essentially an anti-join: removing rows of a data.table based on two columns lat, lon that exactly match the lat, lon columns of another data.table.

The Problem

Suppose I have the following data.table "dt.master" with over 1 million rows, containing an id and the coordinates lat, lon of a certain location:

id    lat      lon
1     43.23    5.43
2     43.56    4.12
3     52.14   -9.85
4     43.56    4.12
5     43.83    9.43
...   ...      ...

What I would like to do is remove the rows that match certain pairs of coordinates. You could think of those pairs as being put on the following blacklist (again a data.table, named "dt.blacklist"):

lat      lon
43.56    4.12
11.14   -5.85

In this case, when applying the blacklist, the answer would have to be:

id    lat      lon
1     43.23    5.43
3     52.14   -9.85
5     43.83    9.43
...   ...      ...  

As straightforward as it seems, I cannot get it right.
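For reproducibility, here is a minimal sketch of the two tables above (only the rows shown are included):

```r
library(data.table)

# Example data as shown in the question
dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)

dt.blacklist <- data.table(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)
```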

What I did so far

  • Using merge, like so:

    dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x=c("lat", "lon"), by.y=c("lat", "lon"))
    

    But that yields the rows that match and is thus an inner join. I thought about removing rows based on this result by using subset:

    subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)
    

    But this only partially works: in the above example only 1 of the 2 matching rows is removed. Somehow it only removes the first "hit".

  • Using a quick-and-dirty solution: concatenating lat and lon into a new column named "C" in both data tables and then removing matches as such:

    dt.master[C != dt.blacklist$C]
    

    Yet, the same issue arises where only 1 of the two rows is removed.
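For what it's worth, both attempts fail for the same reason: `!=` compares vectors element by element, recycling the shorter one, so it does not test membership. A small base-R illustration (toy vectors, not the question's data):

```r
x   <- c(1, 2, 3, 2)
bad <- c(2, 9)

# Element-wise comparison with recycling: x is compared against
# c(2, 9, 2, 9), so neither 2 in x lines up with the 2 in bad
x[x != bad]       # all four elements survive

# A membership test is what is actually wanted here
x[!(x %in% bad)]  # keeps only 1 and 3
```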

h3rm4n
dkreeft

2 Answers


I think you are looking for this:

dt.master[!dt.blacklist, on = .(lat,lon)]

The output:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43

Thanks to a warning in the comments: joining on floating point values can have unintended side effects. By converting the coordinates to integers you can prevent that. The join will as a result look a bit more complicated:

dt.master[, (2:3) := lapply(.SD,function(x) as.integer(x*100)), .SDcols = 2:3
          ][!dt.blacklist[, (1:2) := lapply(.SD,function(x) as.integer(x*100))], on = .(lat,lon)
            ][, (2:3) := lapply(.SD, `/`, 100), .SDcols = 2:3][]

The output is the same:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43
h3rm4n

We can use fsetdiff from data.table:

fsetdiff(dt.master[, -1], dt.blacklist)
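For context: fsetdiff keeps only the rows of the first table that do not occur in the second, and both tables must have the same columns, which is why the id column is dropped first (and is therefore absent from the result). A self-contained sketch using the question's table names:

```r
library(data.table)

dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.table(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)

# Both arguments need identical columns, so drop id before the set difference;
# note the result therefore no longer contains id
res <- fsetdiff(dt.master[, -1], dt.blacklist)
res  # the lat/lon pairs not on the blacklist
```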

or we can use anti_join from dplyr:

library(dplyr)
anti_join(dt.master, dt.blacklist)
#  id   lat   lon
#1  1 43.23  5.43
#2  3 52.14 -9.85
#3  5 43.83  9.43
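A small addition: the join columns can also be named explicitly via the `by` argument, which avoids the "Joining, by = ..." message. A self-contained sketch with the question's table names:

```r
library(dplyr)

dt.master <- data.frame(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.frame(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)

# Naming the join columns explicitly documents the key and silences the message
res <- anti_join(dt.master, dt.blacklist, by = c("lat", "lon"))
res$id  # ids of the surviving rows
```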
akrun