Hi, I have a dataset of coordinates and I am trying to assign a cluster id to each location based on a 50-mile radius. Here is the structure of the dataset:

g_lat<- c(45.52306, 40.26719, 34.05223, 37.38605, 37.77493)
g_long<- c(-122.67648,-86.13490, -118.24368, -122.08385, -122.41942)
df<- data.frame(g_lat, g_long)

I want to create a cluster id that groups locations lying within 50 miles of each other. How can I achieve this? Thanks so much. Below is the expected output.

g_lat      g_long       clusterid
45.52306   -122.67648   1
40.26719    -86.13490   2
34.05223   -118.24368   3
37.38605   -122.08385   4
37.77493   -122.41942   4
user3570187

2 Answers

g_lat<- c(45.52306, 40.26719, 34.05223, 37.38605, 37.77493)
g_long<- c(-122.67648,-86.13490, -118.24368, -122.08385, -122.41942)
df<- data.frame(point = c(1:5), longitude = g_long, latitude = g_lat)

library(sf)
my.sf.point <- st_as_sf(x = df, 
                        coords = c("longitude", "latitude"),
                        crs = "+proj=longlat +datum=WGS84")

#distance matrix in meters (the unit sf uses for longitude/latitude coordinates)
st_distance(my.sf.point)

#which points are within 50 miles (~80467.2 meters) of each other
l <- st_is_within_distance(my.sf.point, dist = 80467.2 )

l
# Sparse geometry binary predicate list of length 5, where the predicate was `is_within_distance'
# 1: 1
# 2: 2
# 3: 3
# 4: 4, 5
# 5: 4, 5

df$within_50 <- rowSums(as.matrix(l))-1

df
#   point longitude latitude within_50
# 1     1 -122.6765 45.52306         0
# 2     2  -86.1349 40.26719         0
# 3     3 -118.2437 34.05223         0
# 4     4 -122.0838 37.38605         1
# 5     5 -122.4194 37.77493         1


m <- as.matrix(l)
colnames(m) <- c(1:nrow(df))
rownames(m) <- c(1:nrow(df))
df$points_within_50 <- apply( m, 1, function(u) paste( names(which(u)), collapse="," ) )
df$clusterid <- dplyr::group_indices(df, df$points_within_50) 

#   point longitude latitude within_50 points_within_50 clusterid
# 1     1 -122.6765 45.52306         0                1         1
# 2     2  -86.1349 40.26719         0                2         2
# 3     3 -118.2437 34.05223         0                3         3
# 4     4 -122.0838 37.38605         1              4,5         4
# 5     5 -122.4194 37.77493         1              4,5         4
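Grouping on the pasted neighbor string works when every member of a group sees exactly the same set of neighbors. With a chain of points (A near B, B near C, but A not near C) the strings differ, so the points would get different ids. A minimal sketch of one alternative, assuming the sparse result `l` can be indexed like a plain list of integer vectors: label the connected components of the neighbor list. `component_ids` is a hypothetical helper written for illustration, not an sf function.

```r
# Hypothetical helper (not part of sf): label connected components of a
# neighbor list such as the one returned by st_is_within_distance().
component_ids <- function(nb) {
  n <- length(nb)
  id <- integer(n)            # 0 means "not yet assigned"
  next_id <- 0L
  for (start in seq_len(n)) {
    if (id[start] != 0L) next # already reached from an earlier point
    next_id <- next_id + 1L
    queue <- start
    # Breadth-first walk over everything reachable from `start`
    while (length(queue) > 0) {
      cur <- queue[1]
      queue <- queue[-1]
      if (id[cur] != 0L) next
      id[cur] <- next_id
      nbrs <- nb[[cur]]
      queue <- c(queue, nbrs[id[nbrs] == 0L])
    }
  }
  id
}

# Neighbor list matching the sparse predicate output printed above
nb <- list(1L, 2L, 3L, c(4L, 5L), c(4L, 5L))
component_ids(nb)
#> [1] 1 2 3 4 4

# A chain: 1-2 and 2-3 are close, 1-3 are not; all three share one id
chain <- list(c(1L, 2L), c(1L, 2L, 3L), c(2L, 3L))
component_ids(chain)
#> [1] 1 1 1
```

Under that assumption, `df$clusterid <- component_ids(l)` would attach the ids to the data frame. Note that members of one component can be more than 50 miles apart when they are linked through intermediate points.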
Wimpel
  • Thanks, how do I append this to the main data frame to get a cluster id for the locations that are within 50 miles? – user3570187 Sep 16 '18 at 19:55
  • I meant: how do I add a column called clusterid showing that locations 4 and 5 are within 50 miles of each other? See my expected output, with cluster id 4 for the last 2 rows. Thanks so much for the help – user3570187 Sep 16 '18 at 20:01
  • @user3570187 see the updated answer. You can treat `l` as a matrix, so just sum the rows to get the number of points within 50 miles (TRUE = 1). Don't forget to subtract 1! – Wimpel Sep 16 '18 at 20:09
  • Thanks. The above solution shows how many locations are within 50 miles, but the row totals don't tell me which location is close to which. There can be many such groups, so I need to know which locations are within 50 miles of each other. I need a cluster id that identifies the locations that are close together; see the expected output. Thanks so much for your help! – user3570187 Sep 16 '18 at 20:20
  • For example, the Bronx, NYC and Queens, NYC will have a row total of 1, and Mountain View and San Francisco will have a row total of 1 as well, so how do I uniquely identify which one is close to which? Thanks again – user3570187 Sep 16 '18 at 20:22

You can create a 2D matrix of the distances between the locations. The geosphere package has a function, `distm()`, that does the heavy lifting for you; it returns meters, so dividing by 1609.34 converts to miles.

library(geosphere)
library(magrittr)

g_lat <- c(45.52306, 40.26719, 34.05223, 37.38605, 37.77493)
g_long <- c(-122.67648,-86.13490, -118.24368, -122.08385, -122.41942)
m <- cbind(g_long, g_lat) 

(matrix <- distm(m) / 1609.34)
#>           [,1]     [,2]      [,3]      [,4]      [,5]
#> [1,]    0.0000 1872.882  825.4595  562.3847  534.8927
#> [2,] 1872.8818    0.000 1812.5862 1936.5786 1946.4373
#> [3,]  825.4595 1812.586    0.0000  315.2862  347.3751
#> [4,]  562.3847 1936.579  315.2862    0.0000   32.5345
#> [5,]  534.8927 1946.437  347.3751   32.5345    0.0000
matrix < 50 
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,]  TRUE FALSE FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE
#> [3,] FALSE FALSE  TRUE FALSE FALSE
#> [4,] FALSE FALSE FALSE  TRUE  TRUE
#> [5,] FALSE FALSE FALSE  TRUE  TRUE
colSums(matrix < 50)
#> [1] 1 1 1 2 2

Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
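The column sums above count neighbors but do not yet give a cluster id. One possible way to get from a distance matrix in miles to ids, offered as a sketch rather than the definitive method, is single-linkage hierarchical clustering cut at 50 miles with base R's `hclust()` and `cutree()`. To keep the example free of package dependencies, the distances here come from a hand-written haversine function (`haversine_miles`, a hypothetical helper that assumes a spherical Earth), so they differ slightly from the `distm()` values; the matrix `distm(m) / 1609.34` from the code above could be passed to `as.dist()` instead.

```r
g_lat  <- c(45.52306, 40.26719, 34.05223, 37.38605, 37.77493)
g_long <- c(-122.67648, -86.13490, -118.24368, -122.08385, -122.41942)

# Hypothetical helper: haversine great-circle distance in miles.
# Assumes a spherical Earth with a mean radius of 3958.8 miles.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Pairwise distance matrix in miles
idx <- seq_along(g_lat)
d <- outer(idx, idx, function(i, j) {
  haversine_miles(g_lat[i], g_long[i], g_lat[j], g_long[j])
})

# Single linkage merges groups whose closest members are nearest;
# cutting the tree at h = 50 keeps only merges made at 50 miles or less
hc <- hclust(as.dist(d), method = "single")
clusterid <- cutree(hc, h = 50)

data.frame(g_lat, g_long, clusterid)
```

For this data, rows 4 and 5 (about 32.5 miles apart) end up with the same id and the other three rows each get their own. Single linkage chains points together, so two members of a cluster may be more than 50 miles apart if intermediate points connect them; `method = "complete"` would instead require every pair in a cluster to be within the cutoff.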
Birger