I am trying to understand OP's answer post, specifically the output table.
Focusing on the first row of the output table, I see that SC550650NE is reported to have 20 instances of other points within 200 of (X=255085.6, Y=465070.1) and a sum-count of 24.
> Data[ID=="SC550650NE"]
X Y UID Count ID distance
1: 255085.6 465070.1 832751 1908 SC550650NE 0.03272809
So I calculate the distance of all the points relative to (X=255085.6, Y=465070.1) and then display those within 200, in ascending order of distance. I count 29 instances, not 20. Also, when I sum the counts, I get 25,463, not 24.
> Data[,distance:=sqrt((X-255085.6)^2 + (Y-465070.1)^2)]
> Data[distance <= 200][order(distance)]
X Y UID Count ID distance
1: 255085.6 465070.1 832751 1908 SC550650NE 0.03272809
2: 255093.6 465124.7 1371253 1513 SC550651SE 55.16468957
3: 255034.2 465011.8 3512521 396 SC550650SW 77.73592845
4: 255049.8 464993.6 2603353 255 SC550649NW 84.47194005
5: 255091.7 465162.7 193893 1784 SC550651NE 92.82767523
6: 254968.1 465083.5 2245518 186 SC549650NE 118.25898594
7: 255180.5 464970.4 3376500 826 SC551649NE 137.66330325
8: 255148.5 464947.4 3473247 1526 SC551649SW 137.86046404
9: 255136.8 465202.7 2560816 212 SC551652SW 142.12504000
10: 255089.6 465224.3 2788575 1429 SC550652SE 154.24575868
11: 255213.6 465158.4 2051343 331 SC552651NW 155.52946932
12: 255234.1 465015.9 601896 1175 SC552650SW 158.08971475
13: 255231.4 464993.8 2562794 1209 SC552649NW 164.57602726
14: 255236.8 465003.9 2579822 624 SC552650SW 165.05736525
15: 255163.8 465219.0 2793140 491 SC551652SE 168.19129349
16: 254997.8 465224.2 3686855 1948 SC549652SE 177.38964783
17: 255055.7 464895.0 3850088 1904 SC550648NE 177.60704244
18: 254952.1 465189.1 4992168 668 SC549651NE 178.80180844
19: 255237.6 465165.5 3778801 950 SC552651NW 179.50253647
20: 255148.1 464900.4 2842521 204 SC551649SW 180.86815284
21: 255035.4 464891.2 933305 854 SC550648NW 185.79738782
22: 255245.6 465175.2 1453263 446 SC552651NW 191.44182598
23: 255227.8 465200.5 1557782 851 SC552652SW 192.90236356
24: 254894.7 465031.3 657336 1247 SC548650SE 194.79510938
25: 255102.8 464875.4 3051082 148 SC551648NW 195.41964698
26: 254914.2 464971.4 3090917 28 SC549649NW 197.75429388
27: 255030.4 464879.7 895456 109 SC550648NW 198.20424559
28: 255281.9 465101.9 3970383 1810 SC552651SE 198.85139040
29: 255108.3 465268.3 1968621 431 SC551652NW 199.51847902
X Y UID Count ID distance
> Data[distance <= 200, sum(Count)]
[1] 25463
Edit
5 million rows
So after ill-advisedly using the comments for discussion (sorry, SO!), we have a better sense of the situation. @Chris has a solution that gives an approximation, but that approximation was a concession made to lower the computation time. During the discussion, we also learned that the distance metric may be a square or a circle, and that 4 cores are available on the machine.
I will present two solutions, one for the circle metric and one for the square metric. The circle metric is faster.
Both follow the approach in this SO post.
Circle metric
- e.g.
distance = sqrt( (x1-x2)^2 + (y1-y2)^2 ) <= 200
library(tictoc)
library(data.table)
Pick one of the num.rows values below. Start small and work up to get a sense of timing. I have 32GB as well (on a mac), so I'm hoping it is comparable. Roughly, doubling the number of rows led to a ~4x increase in computation time (see the rough extrapolation sketch just after the list of options).
### Pick one:
###
num.rows <- 4e4            ### 12 seconds on 4 nodes
### num.rows <- 0.078125e6 ### 43 seconds on 4 nodes
### num.rows <- 0.15625e6  ### 154 seconds on 4 nodes
### num.rows <- 0.3125e6   ### 10.5 min on 4 nodes
### num.rows <- 0.625e6    ### 44 min on 4 nodes
### num.rows <- 1.25e6     ### ~3 hours
### num.rows <- 2.5e6      ### ~12 hours
### num.rows <- 5.0e6      ### ~42.53 hours on 4 nodes VERIFIED.
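## Optional rough sketch (my addition, not in the original answer): the work is
## all-pairs, so runtime grows roughly quadratically in num.rows. `est_hours` is a
## hypothetical helper for extrapolating a full-size run from a small timing run.
est_hours <- function(n_target, n_ref = 4e4, secs_ref = 12) {
  secs_ref * (n_target / n_ref)^2 / 3600
}
est_hours(5e6)   ## ~52 hours predicted from the 4e4 timing; ~42.5 hours observed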
set.seed(123)
# Input data
tbl.data <- data.table(
X=runif(num.rows, min=200000, max=400000),
Y=runif(num.rows, min=400000, max=500000),
UID = 1:num.rows,
Count = sample(1:2000,num.rows,replace=TRUE))
setkey(tbl.data, UID)
tbl.data
setkey(tbl.data, UID)
This approach depends on passing the entire X, Y, and Count columns of the dataset. We denote them Xvec, Yvec, and Cvec, respectively. From there we make a function, instances(), that first calculates the distance of one point (x.in, y.in) to all the other points in (Xvec, Yvec), then calculates the instances and the sumCount for each radius (the density "calculation" is done at the end and doesn't need to be done within this step or be parallelized).
This function, instances(), is made to be deployed within a data.table, and is set up to do so in the function instances_stacked(). Since we are passing the entire X, Y, and Count columns of the dataset and operating row-wise on the original dataset, we can tackle the problem in parallel. If parallel::detectCores() returned 5 million, we'd have one core per row and be done in a jiffy. However, we have 4, so we take a moment to make a variable group_quarters that will split the dataset into 4 chunks via foo2()'s deployment of split_df() (both of these adapted from the example in the aforementioned SO post).
Xvec <- tbl.data$X
Yvec <- tbl.data$Y
Cvec <- tbl.data$Count
## based on https://stackoverflow.com/a/44814035/2727349
instances <- function(x=Xvec, y=Yvec, count=Cvec, x.in, y.in){
  ## distance from the point (x.in, y.in) to every point in (x, y)
  calcdist <- sqrt( (x-x.in)^2 + (y-y.in)^2 )
  ## for each radius: how many points fall within it, and the sum of their Counts
  truth_1600  <- calcdist <= 1600
  result_1600 <- c("instances_1600" = sum( truth_1600 ), "sumCount_1600" = sum( count*truth_1600 ))
  truth_0800  <- calcdist <= 800
  result_0800 <- c("instances_0800" = sum( truth_0800 ), "sumCount_0800" = sum( count*truth_0800 ))
  truth_0400  <- calcdist <= 400
  result_0400 <- c("instances_0400" = sum( truth_0400 ), "sumCount_0400" = sum( count*truth_0400 ))
  truth_0200  <- calcdist <= 200
  result_0200 <- c("instances_0200" = sum( truth_0200 ), "sumCount_0200" = sum( count*truth_0200 ))
  result <- c(result_0200, result_0400, result_0800, result_1600)
  return(result)
}
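## Optional sanity check (my addition, not in the original answer): instances() for a
## single point should agree with a direct data.table filter at the 200 radius.
instances(x.in = Xvec[1], y.in = Yvec[1])
tbl.data[sqrt((X - Xvec[1])^2 + (Y - Yvec[1])^2) <= 200,
         .(instances_0200 = .N, sumCount_0200 = sum(Count))]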
instances_stacked <- function(DT, x=Xvec, y=Yvec, count=Cvec){
  ## deploy instances() row-wise (by UID) within a data.table
  DT[,
     as.list(instances( x.in=X, y.in=Y)),
     by=c("UID")
  ]
}
split_df <- function(d, var) {
base::split(d, get(var, as.environment(d)))
}
## define group_quarters roughly as 25% population
tbl.data[UID < 0.25 * num.rows , group_quarters:="1st q"]
tbl.data[UID < 0.50 * num.rows & UID >= 0.25 * num.rows, group_quarters:="2nd q"]
tbl.data[UID < 0.75 * num.rows & UID >= 0.50 * num.rows, group_quarters:="3rd q"]
tbl.data[ UID >= 0.75 * num.rows, group_quarters:="4th q"]
tbl.data[, uniqueN(group_quarters)]
tbl.data[, table(group_quarters)]
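## Optional (my addition): preview the 4 chunks that foo2() will split the data into.
chunks <- split_df(tbl.data, "group_quarters")
length(chunks)        ## 4
sapply(chunks, nrow)  ## roughly num.rows/4 rows per chunk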
foo2 <- function(dt) {
dt2 <- split_df(dt, "group_quarters")
require(parallel)
cl <- parallel::makeCluster(4) ## 4 chosen based on parallel::detectCores()
print(cl)
clusterExport(cl, varlist= "instances_stacked") ## this was the `foo` in https://stackoverflow.com/a/44814035/2727349
clusterExport(cl, varlist= "instances")
clusterExport(cl, varlist= c("Xvec","Yvec","Cvec"))
clusterExport(cl, varlist= "dt2", envir = environment())
clusterEvalQ(cl, library("data.table"))
dt2 <- parallel::parLapply(cl, X= dt2, fun=instances_stacked)
parallel::stopCluster(cl)
return(rbindlist(dt2))
}
The call to foo2() will deploy the analysis in parallel and combine everything back together into results_wide. We can then do the density calculation and view some results.
tic("foo2() runs on 4 data chunks (1/4 sized) in parallel: ")
results_wide <- foo2(tbl.data)
## next 4 lines done after the parallel part
results_wide[,calculation_1600:=sumCount_1600/(instances_1600*0.25)]
results_wide[,calculation_0800:=sumCount_0800/(instances_0800*0.25)]
results_wide[,calculation_0400:=sumCount_0400/(instances_0400*0.25)]
results_wide[,calculation_0200:=sumCount_0200/(instances_0200*0.25)]
toc()
## let's see the results
head(results_wide)
## for num.rows<-5e6,
## should match the intro example at top of the post:
results_wide[UID==832751]
Which gives (after 42 1/2 hours):
> head(results_wide)
UID instances_0200 sumCount_0200 instances_0400 sumCount_0400 instances_0800
1: 1 28 25348 115 110519 510
2: 2 38 37791 123 128978 498
3: 3 28 26958 123 131816 506
4: 4 33 33861 125 113013 514
5: 5 31 36492 123 130939 522
6: 6 37 40887 136 141213 480
sumCount_0800 instances_1600 sumCount_1600 calculation_1600 calculation_0800
1: 521004 2025 2015967 3982.157 4086.306
2: 499358 2001 1999339 3996.680 4010.908
3: 507117 2013 2003329 3980.783 4008.830
4: 513712 2065 2115524 4097.867 3997.759
5: 516877 2054 1990644 3876.619 3960.743
6: 491969 1952 2001461 4101.355 4099.742
calculation_0400 calculation_0200
1: 3844.139 3621.143
2: 4194.407 3978.000
3: 4286.699 3851.143
4: 3616.416 4104.364
5: 4258.179 4708.645
6: 4153.324 4420.216
## for num.rows <- 5e6, should match the
## intro example at top of the post for
## Data[ID=="SC550650NE"]
> results_wide[UID==832751]
UID instances_0200 sumCount_0200 instances_0400 sumCount_0400
1: 832751 29 25463 122 110993
instances_0800 sumCount_0800 instances_1600 sumCount_1600 calculation_1600
1: 480 467032 2048 2054509 4012.713
calculation_0800 calculation_0400 calculation_0200
1: 3891.933 3639.115 3512.138
Square metric
- e.g. Question Post's
(X-200, X+200) & (Y-200, Y+200)
...NOTE... I overloaded some function names, so it is best to put this code in a separate script and run it in a separate R session from the circle metric code.
Similar idea to the circle metric approach above, but the execution is slightly different because it is based on the bounds as opposed to the distance. Basically, we make a super stack of the data and calculate the boundaries specific to each distance. These four "boundaries" make a natural candidate for chunking the data for parallelizing -- this code goes one step further and breaks each of those 4 in half, resulting in 8 groups. We still run on 4 nodes, though. I tested a bunch on 8 nodes because my testing environment allowed for it, and I developed this code before I knew the OP had a 4-core environment.
library(data.table)
library(tictoc)
library(dplyr)
set.seed(123)
### Pick one:
### num.rows <- 4e4 ### 15 seconds on 8 nodes | 23 seconds on 4 nodes
num.rows <- 1e5 ### 86 seconds on 8 nodes | 140.034 seconds on 4 nodes
### num.rows <- 2e5 ### 347 seconds on 8 nodes
### num.rows <- 4e5 ### 1370 seconds on 8 nodes
### num.rows <- 8e5 ### 1370*4 = 90 minutes?
### num.rows <- 1.6e6 ### 1370*4*4 = 360 minutes = 6 hours (confirmed: 21742.233 / 60 / 60 = 6 hours)
### num.rows <- 2.5e6 ### 47523 / 60 / 60 = 13 hours and 12 minutes VERIFIED.
### num.rows <- 5.0e6 ### 4*47523 / 60 / 60 = ~53 hours estimated.
# Input data
tbl.data <- data.table(
X=runif(num.rows, min=200000, max=400000),
Y=runif(num.rows, min=400000, max=500000),
UID = 1:num.rows,
Count = sample(1:2000,num.rows,replace=TRUE))
setkey(tbl.data, UID)
tbl.data
setkey(tbl.data, UID)
Make a superstack:
tbl.data.1600 <- copy(tbl.data)
tbl.data.0800 <- copy(tbl.data)
tbl.data.0400 <- copy(tbl.data)
tbl.data.0200 <- copy(tbl.data)
tbl.data.1600[, within_metric:=1600]
tbl.data.0800[, within_metric:= 800]
tbl.data.0400[, within_metric:= 400]
tbl.data.0200[, within_metric:= 200]
tbl.data.1600[, group:=1]
tbl.data.0800[, group:=2]
tbl.data.0400[, group:=3]
tbl.data.0200[, group:=4]
tbl.data.stack <- copy(rbind(tbl.data.1600, tbl.data.0800, tbl.data.0400, tbl.data.0200))
setkey(tbl.data.stack, group, UID)
tbl.data.stack[,X_low_bound:=X-within_metric]
tbl.data.stack[,X_upp_bound:=X+within_metric]
tbl.data.stack[,Y_low_bound:=Y-within_metric]
tbl.data.stack[,Y_upp_bound:=Y+within_metric]
Xvec <- tbl.data.stack[group==1]$X
Yvec <- tbl.data.stack[group==1]$Y
Cvec <- tbl.data.stack[group==1]$Count
I include this commented chunk to show how one could run it without parallelizing. Go ahead and skip to the next block for now.
###################
## BEGIN ##########
###################
## no parallel. ###
## just stacked ###
###################
# instances_stacked <- function(x=Xvec,y=Yvec, count=Cvec, xlb, xub, ylb, yub){
#
# truth <- data.table::between(x, xlb, xub) & data.table::between(y, ylb, yub)
#
# c("instances" = sum( truth ), "sumCount"= sum( Cvec*truth))
#
# }
#
#
# tic("external default vector approach -- STACKED -- all 4: ")
# results_stacked <-
# tbl.data.stack[,
# as.list(instances_stacked( xlb=X_low_bound, xub=X_upp_bound, ylb=Y_low_bound, yub=Y_upp_bound)),
# by=c("group","UID")
# ]
#
# results_stacked[,calculation:=sumCount/(instances*0.25)]
# toc()
#
# head(results_stacked)
###################
## END. ##########
###################
## no parallel. ###
## just stacked ###
###################
Below is the parallel code. I broke the data into 8 groups, but changed the makeCluster() line to 4 nodes for your computer.
###################
## BEGIN ##########
###################
## SOCKET. ###
## ROCKET. ###
###################
## follow the leader: https://stackoverflow.com/a/44814035/2727349
## step 1: change instances_stacked (which is our `foo`) so that it returns a list and
##         takes a data.table as an argument.
## step 2: use split_df -- keep in mind might need to make a concatenated grouping of "group" and "UID"
instances <- function(x=Xvec, y=Yvec, count=Cvec, xlb, xub, ylb, yub){
  ## TRUE where a point falls inside the square bounds [xlb, xub] x [ylb, yub]
  truth <- data.table::between(x, xlb, xub) & data.table::between(y, ylb, yub)
  c("instances" = sum( truth ), "sumCount" = sum( count*truth ))
}
instances_stacked <- function(DT, x=Xvec,y=Yvec, count=Cvec){
DT[,
as.list(instances( xlb=X_low_bound, xub=X_upp_bound, ylb=Y_low_bound, yub=Y_upp_bound)),
by=c("within_metric","UID")
]
}
split_df <- function(d, var) {
base::split(d, get(var, as.environment(d)))
}
## define grps to be distance-bound and UID combo
tbl.data.stack[,grps:=paste0(within_metric,"_",UID)]
tbl.data.stack[, uniqueN(grps)]
## define group_half to be distance-bound and UID < 1/2 num.rows
tbl.data.stack[,group_half:=paste0(within_metric,"_", (UID < 0.5 * num.rows)+0L)]
tbl.data.stack[, uniqueN(group_half)]
foo2 <- function(dt) {
#dt2 <- split_df(dt, "grps")
#dt2 <- split_df(dt, "group")
dt2 <- split_df(dt, "group_half")
require(parallel)
##cl <- parallel::makeCluster(min(nrow(dt), parallel::detectCores()))
#cl <- parallel::makeCluster(tbl.data.stack[, uniqueN(group_half)])
cl <- parallel::makeCluster(4)
print(cl)
clusterExport(cl, varlist= "instances_stacked") ## this was the `foo` in https://stackoverflow.com/a/44814035/2727349
clusterExport(cl, varlist= "instances")
clusterExport(cl, varlist= c("Xvec","Yvec","Cvec"))
clusterExport(cl, varlist= "dt2", envir = environment())
clusterEvalQ(cl, library("data.table"))
dt2 <- parallel::parLapply(cl, X= dt2, fun=instances_stacked)
parallel::stopCluster(cl)
return(rbindlist(dt2))
}
tic("external default vector approach -- SOCKET -- all 4 in parallel: ")
results_stacked_socket <- foo2(tbl.data.stack)
results_stacked_socket[,calculation:=sumCount/(instances*0.25)]
toc()
setkey(results_stacked_socket, UID, within_metric)
head(results_stacked_socket)
###################
## END. ##########
###################
## SOCKET. ###
## ROCKET. ###
###################
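As an optional spot check (my addition, not part of the original answer), one UID's result at the 200 bound can be compared against a direct bounds filter on the unstacked data:
## spot check UID 1 at the 200 bound against a direct filter
p <- tbl.data[UID == 1]
tbl.data[X >= p$X - 200 & X <= p$X + 200 &
         Y >= p$Y - 200 & Y <= p$Y + 200,
         .(instances = .N, sumCount = sum(Count))]
results_stacked_socket[UID == 1 & within_metric == 200]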