1

I have an iterative method of partitioning that assigns a label to each observation and continues until all partitions are less than or equal to the specified minimum observations.

Using data.table I have run into issues incorporating '{' and ':='. My current solution is the following:

part.test <- function(x, y, min.obs=4){
    PART = data.table(x=as.numeric(x),y=as.numeric(y),quadrant='q',prior.quadrant='q',key = c('quadrant','x','y'))
    PART=PART[,counts := .N,quadrant]
    setkey(PART,counts,quadrant,x,y)

    i=0L
    while(i>=0){

    PART=PART[,counts := .N,quadrant]
    l.PART=sum(PART$counts>min.obs)
    if(l.PART==0){break}
    min.obs.rows=PART[counts>=min.obs,which=TRUE]

    PART[min.obs.rows, prior.quadrant := quadrant]
    PART[min.obs.rows, quadrant :=
          ifelse( x <= mean(x) & y <= mean(y), paste0(quadrant,4),
            ifelse(x <= mean(x) & y > mean(y), paste0(quadrant,2),
              ifelse(x > mean(x) & y <= mean(y), paste0(quadrant,3), paste0(quadrant,1)))), 
        by=quadrant]

    i=i+1
    }

    return(PART[])

}

Here is an example:

> set.seed(123);x=rnorm(1e5);y=rnorm(1e5)
> part.test(x,y)
                 x          y   quadrant prior.quadrant counts
     1: 2.45670228  2.4710128   q1111141        q111114      1
     2: 2.36216477  2.3211246   q1111144        q111114      1
     3: 2.03019608  3.1102172   q1111212        q111121      1
     4: 2.18349873  2.7801719   q1111213        q111121      1
     5: 2.14224180  2.5529947   q1111231        q111123      1
   ---                                                       
 99996: 0.51221861  0.1992352 q143234342      q14323434      4
 99997: 0.08995397 -0.6415489 q324423131      q32442313      4
 99998: 0.09069140 -0.6427392 q324423131      q32442313      4
 99999: 0.09077251 -0.6406127 q324423131      q32442313      4
100000: 0.09077963 -0.6413572 q324423131      q32442313      4
> system.time(part.test(x,y))
   user  system elapsed 
   3.45    0.00    3.53 

What is the best way to improve this performance using data.table?

EDIT: I have moved the setkey outside of the loop per Frank's comment.

Fred Viole
  • 153
  • 7
  • Read `?GForce`, use arithmetic instead of `ifelse`; and don't sort/`setkey` in the loop. Your approach is good apart from those things, I think. – Frank May 03 '17 at 15:50
  • I thought `GForce` is an automatic optimization. I will re-read. By using arithmetic instead of `ifelse`, am I forced to call `PART[]` for each operation? Also is there a temporary way to store each mean rather than calculating it for each operation? I will edit the question and move the `setkey` out of the loop, thanks Frank! – Fred Viole May 03 '17 at 16:13
  • Yeah, GForce is only in effect if the `j` expression is simple. You can see if it's active by adding `verbose = TRUE` in the `DT[...]` call. Re storing the means; yeah, I'd use a table, as seen below. – Frank May 03 '17 at 16:18

1 Answers1

1

Elaborating on my comment, here's some improvement:

f <- function(x, y, min.obs = 4){
    DT = data.table(x,y,q="q")[, counts := .N]

    while(TRUE){
        DT[counts >= min.obs, counts := .N, by=q]
        if (max(DT$counts) == min.obs) break

        w = DT[counts >= min.obs, which=TRUE]
        mDT = DT[w, lapply(.SD, mean), by=q, .SDcols = x:y]
        DT[mDT, on=.(q), q_new := {
          lox = x.x <= i.x
          loy = x.y <= i.y
          1L + lox + loy*2L
        }]

        DT[w, q := paste0(q, q_new)]
        DT[, q_new := NULL ]
    }

    setorder(DT[], counts, q, x, y)[]
}


system.time(res <- part.test(x,y))
#    user  system elapsed 
#    2.65    0.00    2.66 

system.time(fres <- f(x,y))
#    user  system elapsed 
#    0.65    0.05    0.70

# verify they match
fsetequal(
  zf <- setnames(copy(fres), "q", "quadrant"), 
  z <- copy(res)[, prior.quadrant := NULL ]
) # TRUE

Maybe why it's faster:

  • GForce is used to compute the means in mDT.
  • Arithmetic is used instead of ifelse.
  • Keying/sorting is only done once.

It could probably be even faster than this.

Frank
  • 66,179
  • 8
  • 96
  • 180