
I have two datasets: one with some locations (lat/lon), called `test`, and one with the lat/lon information of all zip codes in NYC, called `test2`.

test <- structure(list(trip_count = 1:10,
                       dropoff_longitude = c(-73.959862, -73.882202, -73.934113, -73.992203, -74.00563,
                                             -73.975189, -73.97448, -73.974838, -73.981377, -73.955093),
                       dropoff_latitude = c(40.773617, 40.744175, 40.715923, 40.749203, 40.726158,
                                            40.729824, 40.763599, 40.754135, 40.759987, 40.765224)),
                  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
test2 <- structure(list(latitude = c(40.853017, 40.791586, 40.762174, 
40.706903, 40.825727, 40.739022, 40.750824, 40.673138, 40.815559, 
40.754591), longitude = c(-73.91214, -73.94575, -73.94917, -73.82973, 
-73.81752, -73.98205, -73.99289, -73.81443, -73.90771, -73.976238
), borough = c("Bronx", "Manhattan", "Manhattan", "Queens", "Bronx", 
"Manhattan", "Manhattan", "Queens", "Bronx", "Manhattan")), class = "data.frame", row.names = c(NA, 
-10L))

I am now trying to join these two datasets so that in the end I get one borough for every `trip_count`. So far I have used `difference_left_join` for that, like this:

test %>% fuzzyjoin::difference_left_join(test2, by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"), max_dist = 0.01)

Even though this approach works, as the datasets get larger this join creates a lot of multiple matches, so I end up with a dataset that is sometimes ten times as large as the initial one, `test`. Does anyone have a different approach to solving this without creating multiple matches? Or is there any way I can force the join to always use just one match for every row in `test`? I would highly appreciate it!

EDIT: Solving this related question, R dplyr left join - multiple returned values and new rows: how to ask for the first match only?, would also solve mine. So maybe one of you has an idea about that!

Maximilian

1 Answer


You could use the `geo_join` functions, return the distance between matches, and then filter down to the closest match.

library(fuzzyjoin)
library(dplyr)

answer <- geo_left_join(test, test2,
                        by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"),
                        max_dist = 2, distance_col = "dist") %>%
          select(-"longitude", -"latitude")

answer  %>% group_by(trip_count) %>% slice_min(dist)
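Note that `slice_min()` keeps ties by default, so a trip that is exactly equidistant from two boroughs would still appear twice. If you want to force exactly one row per `trip_count`, a small variant using `slice_min()`'s `with_ties` argument (my addition, not part of the original answer) would be:

# break distance ties arbitrarily so each trip keeps exactly one borough
answer %>%
  group_by(trip_count) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()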

You may want to adjust the value of `max_dist` down to reduce the number of matches; that should improve performance, but setting it too low may generate too many NAs.

Update
Rounding to 3 decimal places is at most a 70 meter / 230 ft offset. Rounding to fewer decimal digits reduces the number of unique points but increases the maximum offset.
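As a back-of-the-envelope check of that figure (assuming roughly 111.32 km per degree of latitude and NYC at about 40.7 degrees north):

# worst-case error from rounding to 3 decimals is half of 0.001 degrees
lat_err <- 0.0005 * 111320                          # ~56 m north-south
lon_err <- 0.0005 * 111320 * cos(40.7 * pi / 180)   # ~42 m east-west
sqrt(lat_err^2 + lon_err^2)                          # ~70 m diagonal offset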

Here is how I would handle rounding the drop-off location and performing the join. It adds complexity, but may help with the memory issues. I have not shown the `group_by` route here, but that could also work (see the sketch after the code below).

# create a unique id for each rounded lon & lat
test$hash <- paste(round(test$dropoff_longitude, 3), round(test$dropoff_latitude, 3))
# positions of the unique ids
uniques <- which(!duplicated(test$hash))
# create a reduced-size data frame
reduced <- data.frame(hash = test$hash,
                      dropoff_longitude = round(test$dropoff_longitude, 3),
                      dropoff_latitude = round(test$dropoff_latitude, 3))[uniques, ]

# Perform the matching here on the reduced data frame,
# using the geo_left_join approach above or something else.
# This example just adds on a column of letters as the "matched" result.
reduced$matched <- letters[1:nrow(reduced)]

# merge back to the original data set
test %>% left_join(reduced[, c("hash", "matched")], by = "hash")
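If you would rather avoid the hash column, here is a rough sketch of the `group_by` route. The column names `lon_r`/`lat_r` and the reuse of `geo_left_join` with `max_dist = 2` are my assumptions, not something prescribed above; the idea is simply to geo-match each distinct rounded location once and then join the borough back onto every trip.

# round once, geo-match only the distinct rounded locations,
# then join the borough back onto every trip sharing a rounded location
rounded <- test %>%
  mutate(lon_r = round(dropoff_longitude, 3),
         lat_r = round(dropoff_latitude, 3))

lookup <- rounded %>%
  distinct(lon_r, lat_r) %>%
  geo_left_join(test2,
                by = c("lon_r" = "longitude", "lat_r" = "latitude"),
                max_dist = 2, distance_col = "dist") %>%
  group_by(lon_r, lat_r) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(lon_r, lat_r, borough)

rounded %>% left_join(lookup, by = c("lon_r", "lat_r"))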
Dave2e
  • Hey Dave! Thanks for this approach! It seems to work, however, it does not prevent the joined dataset from getting way larger than the initial one. What I am searching for is an approach to stop the join from adding new rows in the first place since my original dataset is quite large and all the additional rows are killing my RAM. – Maximilian Dec 05 '20 at 19:20
  • If your data set is that large, consider sorting the data and breaking it down into smaller sections. It would improve performance if you compared the lower latitudes with the Staten Island neighborhoods and the upper latitudes with the Bronx neighborhoods. Another possibility, depending on the desired level of accuracy, is rounding the latitude and longitude values down to 3 or 4 digits and then grouping the similar drop-off locations together. I am sure hundreds of points are within a few feet of Grand Central's entrance. – Dave2e Dec 05 '20 at 19:33
  • Hey Dave, thanks a lot! After rounding lon/lat, what do you mean by grouping? Like actually creating subsets of the grouped locations, or do you think `group_by(dropoff_longitude)` would be enough? – Maximilian Dec 05 '20 at 20:35