I have an iterative partitioning method that assigns a label to each observation and keeps splitting until every partition contains no more than the specified minimum number of observations.
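To make the labelling rule concrete, here is a minimal sketch of a single split step on a toy data set. The six points and the name `toy` are made up for illustration; the quadrant digits (1 to 4) and the `prior.quadrant` column follow the convention used in my function.

```r
library(data.table)

# Hypothetical toy data: six points, all starting in the root partition 'q'.
toy <- data.table(x = c(1, 2, 3, 4, 5, 6),
                  y = c(6, 1, 5, 2, 4, 3),
                  quadrant = "q")

# Remember the current label, then split each partition at its mean x and mean y.
# Digit appended: 1 = upper right, 2 = upper left, 3 = lower right, 4 = lower left.
toy[, prior.quadrant := quadrant]
toy[, quadrant := ifelse(x <= mean(x) & y <= mean(y), paste0(prior.quadrant, 4),
                  ifelse(x <= mean(x) & y >  mean(y), paste0(prior.quadrant, 2),
                  ifelse(x >  mean(x) & y <= mean(y), paste0(prior.quadrant, 3),
                                                      paste0(prior.quadrant, 1)))),
    by = prior.quadrant]

# Count the observations in each new partition.
toy[, counts := .N, by = quadrant]
print(toy)
# quadrant is now q2, q4, q2, q3, q1, q3
```

Repeating this step on every partition that is still too large gives the longer labels, such as `q1111141`.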
Using data.table, I have run into issues incorporating '{' and ':='. My current solution is the following:
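library(data.table)

part.test <- function(x, y, min.obs = 4) {
  # Every observation starts in the root partition 'q'
  PART <- data.table(x = as.numeric(x), y = as.numeric(y),
                     quadrant = 'q', prior.quadrant = 'q',
                     key = c('quadrant', 'x', 'y'))
  PART[, counts := .N, by = quadrant]
  setkey(PART, counts, quadrant, x, y)

  while (TRUE) {
    # Recount the observations in each partition
    PART[, counts := .N, by = quadrant]
    # Stop once no partition holds more than min.obs observations
    if (sum(PART$counts > min.obs) == 0) break

    # Rows to split again (>=, so partitions with exactly min.obs rows are selected too)
    min.obs.rows <- PART[counts >= min.obs, which = TRUE]
    PART[min.obs.rows, prior.quadrant := quadrant]
    # Append a digit 1-4 according to the position relative to the partition means
    PART[min.obs.rows, quadrant :=
           ifelse(x <= mean(x) & y <= mean(y), paste0(quadrant, 4),
           ifelse(x <= mean(x) & y >  mean(y), paste0(quadrant, 2),
           ifelse(x >  mean(x) & y <= mean(y), paste0(quadrant, 3),
                                               paste0(quadrant, 1)))),
         by = quadrant]
  }
  return(PART[])
}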
Here is an example:
> set.seed(123);x=rnorm(1e5);y=rnorm(1e5)
> part.test(x,y)
                 x          y   quadrant prior.quadrant counts
     1: 2.45670228  2.4710128   q1111141        q111114      1
     2: 2.36216477  2.3211246   q1111144        q111114      1
     3: 2.03019608  3.1102172   q1111212        q111121      1
     4: 2.18349873  2.7801719   q1111213        q111121      1
     5: 2.14224180  2.5529947   q1111231        q111123      1
    ---
 99996: 0.51221861  0.1992352 q143234342      q14323434      4
 99997: 0.08995397 -0.6415489 q324423131      q32442313      4
 99998: 0.09069140 -0.6427392 q324423131      q32442313      4
 99999: 0.09077251 -0.6406127 q324423131      q32442313      4
100000: 0.09077963 -0.6413572 q324423131      q32442313      4
> system.time(part.test(x,y))
   user  system elapsed
   3.45    0.00    3.53
What is the best way to improve this performance using data.table?
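One possible direction, shown only as a sketch and not timed against the function above: the nested `ifelse()` calls pick a digit from two comparisons, so the same digit can be computed arithmetically. The helper names `label_nested` and `label_arith` are hypothetical and exist only to check that the two forms agree.

```r
# Hypothetical helper: the quadrant digit as chosen by the nested ifelse() rule.
label_nested <- function(x, y) {
  mx <- mean(x); my <- mean(y)
  ifelse(x <= mx & y <= my, 4L,
  ifelse(x <= mx & y >  my, 2L,
  ifelse(x >  mx & y <= my, 3L, 1L)))
}

# Hypothetical helper: the same digit from arithmetic on the two comparisons.
# 1 + (left half) + 2 * (lower half) yields 1, 2, 3 or 4.
label_arith <- function(x, y) {
  1L + (x <= mean(x)) + 2L * (y <= mean(y))
}

# Check that both rules agree on random data.
set.seed(1)
x.chk <- rnorm(1000)
y.chk <- rnorm(1000)
stopifnot(all(label_nested(x.chk, y.chk) == label_arith(x.chk, y.chk)))
```

Inside the grouped assignment this would read `paste0(quadrant, 1L + (x <= mean(x)) + 2L * (y <= mean(y)))`, which evaluates each mean once per group instead of several times. Whether it makes a measurable difference on the 1e5-row example is an open question.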
EDIT: I have moved the setkey outside of the loop per Frank's comment.