
Using location data of stores, I'm trying to find 'competitors' -- defined as other stores within a certain distance.

I'm using geosphere::distm and some matrix operations like below. The problem is that my matrix is pretty big (100,000 x 100,000), so it takes a very long time (or my memory can't handle it). Is there a way to make the code below more efficient? The input file looks just like locations_data (but bigger). The desired output is the data.table competitors, in which each row contains a pair of competitors. I'm new to writing efficient code in R and wanted to ask for some help.

locations_data<-cbind(id=1:100, longitude=runif(100,min=-180, max=-120), latitude=runif(100, min=50, max=85))

library(geosphere)
mymatrix <- distm(locations_data[, 2:3])

library(data.table)
analyze_competitors <- function(mymatrix){
    # flag every pair closer than 1,000,000 m (1,000 km)
    mymatrix2 <- matrix(as.numeric(mymatrix < 1000000), nrow(mymatrix), ncol(mymatrix))
    competitors <- which(mymatrix2 == 1, arr.ind = TRUE)
    competitors <- data.table(competitors)
    return(competitors)
}

competitors<-analyze_competitors(mymatrix)
wyatt
    This is a difficult problem to solve efficiently. Here is a solution to the memory problem: https://stackoverflow.com/questions/58540031/r-and-spark-compare-distance-between-different-geographical-points/58567531#58567531. Some improvement can be made by reducing the size of the search list down to just the locations which are close by (i.e., within x degrees of longitude or latitude) – Dave2e Jan 17 '20 at 21:15
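Following the comment's suggestion, here is a rough sketch of that prefiltering idea. The 1,000 km threshold and 10-degree latitude window are illustrative assumptions (10 degrees of latitude is about 1,110 km everywhere, so no true pair is excluded); a longitude window would also help, but degrees of longitude shrink toward the poles, so it would need widening at high latitudes.

```r
library(geosphere)
library(data.table)

set.seed(1)
locations_data <- data.table(id = 1:100,
                             longitude = runif(100, min = -180, max = -120),
                             latitude  = runif(100, min = 50, max = 85))

find_competitors <- function(dt, max_dist_m = 1e6, window_deg = 10) {
  pairs <- list()
  for (i in seq_len(nrow(dt))) {
    # cheap filter: keep only points in a latitude band around point i
    # (and with a larger id, so each pair is reported once)
    cand <- dt[abs(latitude - dt$latitude[i]) < window_deg & id > dt$id[i]]
    if (nrow(cand) == 0) next
    # exact great-circle distances only for the surviving candidates
    d <- distm(as.matrix(dt[i, .(longitude, latitude)]),
               as.matrix(cand[, .(longitude, latitude)]))
    hit <- cand$id[d[1, ] < max_dist_m]
    if (length(hit) > 0)
      pairs[[length(pairs) + 1]] <- data.table(id1 = dt$id[i], id2 = hit)
  }
  rbindlist(pairs)
}

competitors <- find_competitors(locations_data)
```

This never holds more than one row of the distance matrix in memory at a time, at the cost of a loop over all points.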

1 Answer


If you want a smaller matrix, consider splitting the data with a grid based on longitude and/or latitude. For example, this produces two new columns with labels for a 5 x 5 grid.

# converting your example data to a tibble
locations_data <- tibble::as_tibble(locations_data)
# create a numeric grid spanning the extent of your latitude and longitude
locations_data$long_quant <- findInterval(locations_data$longitude,
                                          quantile(locations_data$longitude, probs = seq(0, 1, .2)),
                                          rightmost.closed = TRUE)
locations_data$lat_quant <- findInterval(locations_data$latitude,
                                         quantile(locations_data$latitude, probs = seq(0, 1, .2)),
                                         rightmost.closed = TRUE)

You could then create multiple smaller matrices using a subset of locations_data.
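A hedged sketch of that subsetting step, with the same quantile-based cell labels computed as above (the `cell_competitors` helper is my own illustrative name). One caveat: competitor pairs that straddle a cell boundary are missed, so in practice the cells would need to overlap or be followed by a second pass along the borders.

```r
library(geosphere)
library(data.table)

set.seed(1)
locations_data <- data.table(id = 1:100,
                             longitude = runif(100, min = -180, max = -120),
                             latitude  = runif(100, min = 50, max = 85))
locations_data[, long_quant := findInterval(longitude,
                                            quantile(longitude, probs = seq(0, 1, .2)),
                                            rightmost.closed = TRUE)]
locations_data[, lat_quant := findInterval(latitude,
                                           quantile(latitude, probs = seq(0, 1, .2)),
                                           rightmost.closed = TRUE)]

# run the pairwise-distance step on one grid cell at a time, so each
# distm call sees roughly 1/25 of the points instead of all of them
cell_competitors <- function(cell) {
  m <- distm(as.matrix(cell[, .(longitude, latitude)]))
  # upper triangle only: drops self-pairs and duplicate (j, i) entries
  idx <- which(m < 1e6 & upper.tri(m), arr.ind = TRUE)
  data.table(id1 = cell$id[idx[, 1]], id2 = cell$id[idx[, 2]])
}

competitors <- locations_data[, cell_competitors(.SD),
                              by = .(long_quant, lat_quant),
                              .SDcols = c("id", "longitude", "latitude")]
```

Each cell's matrix is about 25 times smaller in each dimension, which shrinks the peak memory use by roughly a factor of 625.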

SEAnalyst