TL;DR
This image of a left outer join depicts exactly what I would like: removing rows of a data.table
based on two columns lat, lon
that exactly match the lat, lon
columns of another data.table
.
The Problem
Suppose I have the following data.table
"dt.master"
with over 1 million of rows containing an id
and coordinates of a certain location lat, lon
:
id lat lon
1 43.23 5.43
2 43.56 4.12
3 52.14 -9.85
4 43.56 4.12
5 43.83 9.43
... ... ...
What I would like to do is remove the rows that match a certain pair of coordinates. You could think of that pair of coordinates as being put on the following blacklist (again a data.table
named "dt.blacklist"
):
lat lon
43.56 4.12
11.14 -5.85
In this case, when applying the blacklist, the answer would have to be:
id lat lon
1 43.23 5.43
3 52.14 -9.85
5 43.83 9.43
... ... ...
As straightforward as it seems, I cannot get it right.
What I did so far
Using
merge
, like such:dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x=c("lat", "lon"), by.y=c("lat", "lon"))
But that yields the rows that match and is thus an inner join. I thought about removing rows based on this result by using
subset
:subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)
But the issue is that it partially works as then only 1 row in the above example is removed and not 2 rows as I would like. Somehow it only removes the first "hit".
Using a quick and dirty solution by concatenating
lat, lon
to a new column named"C"
in both data tables and then removing it as such:dt.master[C != dt.blacklist$C]
Yet, the same issue arises where only 1 of the two rows is removed.