I have two large dataframes of longitude/latitude coordinates, CoastalStates_Tax and CoastalStates, which are mostly the same except CoastalStates_Tax has a few million more coordinates. I want to figure out which rows in CoastalStates_Tax are not in CoastalStates, but still need to be able to track the indices of the missing rows wherever they are in the Tax dataset.

This is what CoastalStates_Tax looks like:

  RecordID_b PROPERTY LEVEL LONGITUDE PROPERTY LEVEL LATITUDE
1  132381977                -77.06421                39.16937
2  132381978                -77.18106                39.08811
3  132381979                -77.03353                39.02414
4  132381980                -77.09930                39.00716
5  132381981                -77.25450                39.10422
6  132381982                -77.02797                39.08087

And CoastalStates:

  RecordID PROPERTY LEVEL LONGITUDE PROPERTY LEVEL LATITUDE
1        1                -80.24787                25.85063
2        2                -80.14940                25.84582
3        3                -80.13115                25.85699
4        4                -80.37275                25.77741
5        5                -80.12095                25.82633
6        6                -80.39949                25.73273

I tried dplyr's `anti_join` function with `anti_join(CoastalStates_Tax, CoastalStates, by = c("PROPERTY LEVEL LONGITUDE", "PROPERTY LEVEL LATITUDE"))`, but it only gives me 4,635,393 rows. The difference in row counts between the two datasets is 4,637,029, so I'm missing about 1,600 rows and I can't figure out why. Am I misusing `anti_join`, and if so, are there any suggestions on other ways to go about this?
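For reference, here is a minimal toy example (made-up coordinates, not from the real datasets) of how `anti_join` behaves: it returns the rows of the first dataframe that have no match in the second, so the `RecordID_b` column comes along and the missing rows stay trackable:

library(dplyr)

tax <- data.frame(RecordID_b = 1:3,
                  LON = c(-77.064, -77.181, -80.000),
                  LAT = c(39.169, 39.088, 25.000))
coast <- data.frame(RecordID = 1:2,
                    LON = c(-77.064, -77.181),
                    LAT = c(39.169, 39.088))

# Rows of tax with no (LON, LAT) match in coast
anti_join(tax, coast, by = c("LON", "LAT"))
#   RecordID_b LON LAT
# 1          3 -80  25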

  • The problem is joining on `numeric` fields. How many decimal places of longitude and latitude are significant for you? – MKR May 19 '18 at 20:59
  • @MKR I see. It looks like each coordinate actually extends to at least 20 decimal places. I would ideally be able to consider each coordinate with all decimal places included. Does `anti_join` automatically round those coordinates to the five decimal places I'm seeing when I print the rows of the datasets? – Alissa May 19 '18 at 21:14
  • It's actually a problem with the `double` precision capability of the hardware (see the sketch below). Why do you want to keep all decimal places in your join? I have provided an answer that rounds to 3 decimal places, but you can go up to 6 digits for sure; I don't think any lower decimal matters much. You can keep the original columns as is and add another set just for joining. – MKR May 19 '18 at 21:19
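To make the precision issue concrete, here is a minimal sketch (made-up values, not from the datasets above) of how two doubles can print identically yet fail the exact equality test that a join performs on its keys:

x <- 0.1 + 0.2
y <- 0.3
x == y                      # FALSE: the underlying binary doubles differ
print(x, digits = 20)       # 0.30000000000000004441
print(y, digits = 20)       # 0.29999999999999998890
round(x, 6) == round(y, 6)  # TRUE once both sides are rounded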

1 Answer

My suggestion would be to round both longitude and latitude to 3 decimal places (accurate to roughly 110 meters) and then convert them to character before joining on those columns.

An attempt could be:

library(dplyr)

# Round both coordinate columns to 3 decimal places, then convert to
# character so the join compares exact strings instead of doubles
CoastalStates_Tax %>%
  mutate_at(vars(starts_with("PROPERTY.LEVEL")),
            funs(as.character(round(., 3)))) %>%
  anti_join(CoastalStates %>%
              mutate_at(vars(starts_with("PROPERTY.LEVEL")),
                        funs(as.character(round(., 3)))),
            by = c("PROPERTY.LEVEL.LONGITUDE", "PROPERTY.LEVEL.LATITUDE"))

#   RecordID_b PROPERTY.LEVEL.LONGITUDE PROPERTY.LEVEL.LATITUDE
# 1  132381977                  -77.064                  39.169
# 2  132381978                  -77.181                  39.088
# 3  132381979                  -77.034                  39.024
# 4  132381980                  -77.099                  39.007
# 5  132381981                  -77.254                  39.104
# 6  132381982                  -77.028                  39.081
MKR
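As the comments suggest, you can also keep the original full-precision columns untouched and add rounded character keys just for the join. Here is a minimal sketch of that variant, assuming dplyr 1.0 or later (where `across()` replaces the since-deprecated `funs()`); the `add_keys` helper and the `_key` suffix are made up for illustration:

library(dplyr)

# Add character join keys alongside the originals instead of overwriting them
add_keys <- function(df) {
  df %>%
    mutate(across(starts_with("PROPERTY.LEVEL"),
                  ~ as.character(round(.x, 3)),
                  .names = "{.col}_key"))
}

missing_rows <- anti_join(add_keys(CoastalStates_Tax),
                          add_keys(CoastalStates),
                          by = c("PROPERTY.LEVEL.LONGITUDE_key",
                                 "PROPERTY.LEVEL.LATITUDE_key"))

# missing_rows keeps RecordID_b and the full-precision coordinates,
# so the indices of the unmatched rows remain trackable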