Ok here an attempt :) I ran a few times against memory limits, but it probably depends on how close your stores are. I found a nice data set with all companies in San Francisco so ran the code fully on 30k * 50k combinations BUT my criteria was set on roughly 250 m apart.
libraries used
library(data.table)
library(geosphere)
just to prepare a similar data set
# downloaded all stores in SF
# https://towardsdatascience.com/plotting-spatial-data-in-r-a38a405a07f1
dt <- fread("sf_business_dataset.csv")
# prepare and convert the data comparable to your format
dt <- dt[, .SD, .SDcols = c("Location Id", "Business Location")]
dt[, c("LATITUDE", "LONGITUDE") := lapply(tstrsplit(gsub(".*\\((.+)\\)", "\\1", `Business Location`), ","), as.numeric)]
dt <- dt[!is.na(LATITUDE) & !is.na(LONGITUDE), .(STORE_ID = `Location Id`, LATITUDE, LONGITUDE)]
# randomize, not sure how it was ordered, so we get a randomized A and B list.
dt[, rIndex := sample(1:.N, .N, replace = F)]
setorder(dt, rIndex)
dt[, rIndex := NULL]
Lets play around, with n of 10000 we create your size, make smaller to just test the code. After this we have the sample data to play with.
# simulate your sample sizes
n <- 10000
STORE_LIST_A <- dt[1:(3*n)]
STORE_LIST_B <- dt[(3*n+1):((3+5)*n)]
solution
This looks a bit redundant, but roughly 0.008 decimal difference equals about 1km, we are not interested in any distance calculation exceeding that. Note on the SF data I made it smaller as we would have too many combinations! I decided to use foverlaps
as post filtering on cartesian join did result in 2.5 bilion combinations, good luck :)
r <- 0.002 # set to 0.008 roughly for 1km
Note that in A we set the ranges, in B we "clone" the start/end. We basically split the data and do the latitudes and longitudes seperately, if either one of them is more than 1km apart, the distance between A and B will be too!
STORE_LIST_A[, LATITUDE_start := LATITUDE - r][, LATITUDE_end := LATITUDE + r]
STORE_LIST_A[, LONGITUDE_start := LONGITUDE - r][, LONGITUDE_end := LONGITUDE + r]
STORE_LIST_B[, LATITUDE_start := LATITUDE][, LATITUDE_end := LATITUDE]
STORE_LIST_B[, LONGITUDE_start := LONGITUDE][, LONGITUDE_end := LONGITUDE]
setkey(STORE_LIST_A, LATITUDE_start, LATITUDE_end)
setkey(STORE_LIST_B, LATITUDE_start, LATITUDE_end)
close_lat <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 62M matches
setkey(STORE_LIST_A, LONGITUDE_start, LONGITUDE_end)
setkey(STORE_LIST_B, LONGITUDE_start, LONGITUDE_end)
close_lon <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 53M matches
Now we have to inner join the two as the distance is only small enough if both distances between longitude and latitude are small enough. This below on this data resulted in "only" 5.3M matches for which you have to calculate the distance.
close_enough <- close_lat[close_lon, on = .(STORE_ID, i.STORE_ID), .(STORE_ID, LONGITUDE, LATITUDE, i.STORE_ID, i.LONGITUDE, i.LATITUDE), nomatch = 0]
Do some cleaning up and filter on the actually calculated distances, note that this is still pretty slow! But you know we only have to calculate this for 5.6M cases here and not for 1500M :)
close_enough[, distance := distHaversine(c(LONGITUDE, LATITUDE), c(i.LONGITUDE, i.LATITUDE)), .(STORE_ID, i.STORE_ID)][distance < 1000]
sample output (from a smaller test run)
STORE_ID LONGITUDE LATITUDE i.STORE_ID i.LONGITUDE i.LATITUDE distance
1: 0375071-01-001 -122.49 37.761 0465561-01-001 -122.49 37.761 51.190
2: 0925437-02-001 -122.49 37.781 1030867-06-151 -122.49 37.780 142.786
3: 0434463-01-001 -122.49 37.780 1030867-06-151 -122.49 37.780 136.233
4: 0460546-01-001 -122.48 37.782 0357002-01-001 -122.48 37.783 174.721
5: 0433580-02-001 -122.48 37.764 0412914-01-001 -122.48 37.763 91.625
---
500: 1020837-02-151 -122.39 37.762 1005253-08-141 -122.39 37.760 173.533
501: 0436520-02-001 -122.39 37.760 0371484-02-001 -122.39 37.760 0.000
502: 0474822-01-001 -122.39 37.760 0371484-02-001 -122.39 37.760 33.514
503: 1020837-02-151 -122.39 37.762 0371484-02-001 -122.39 37.760 173.533
504: 1022102-03-151 -122.39 37.740 0470578-01-001 -122.39 37.738 238.217