
I'd like to take an increasing sequence of numbers (e.g. a series of times)

set.seed(41);  d <- seq(1:100) + runif(100, 0, 1)

and, if the difference between two sequential numbers is below a threshold, merge them into a single point by taking the mean of the two, then continue through the sequence, combining again whenever necessary. I resorted to functions I usually avoid, `while` and `ifelse`, to write a quick-and-dirty function; it works, but it isn't fast. Can you solve this task 1) more efficiently and 2) without invoking a `for` or `while` loop? Is there some built-in function, perhaps with even more functionality, that is well-suited for such a task?

combine_points <- function(x, th = 0.5)
{
    i <- 1                              # start i at 1
    while (min(diff(x)) < th)           # initiate while loop
    {
        ifelse(x[i+1] - x[i] < th,      # logical condition
               x[i] <- x[i+1] <-
                   mean(c(x[i+1], x[i])),  # assignment if TRUE
               (x[i] <- x[i]))          # assignment if FALSE
        x <- sort(unique(x))            # get rid of the duplicated entry created
                                        # when the ifelse condition was TRUE
        # increment i, or reset i to 1 if it gets too large
        ifelse(i == length(x), i <- 1, i <- i + 1)
    }
    return(x)
}

newd <- combine_points(d)  
th <- 0.5
which(diff(newd) < th)

integer(0)

Update: benchmarks of the solutions so far.

I benchmarked with a larger sample vector, and the Rcpp solution suggested by @Roland is slower than my first while loop when the vector gets long. I made an improvement to the initial while loop and wrote an Rcpp version of it, too. The benchmark results are below. Note that @flodel's answer is not directly comparable because it takes a fundamentally different approach to combining, but it is definitely very fast.

set.seed(41);  d <- seq(1:4000) + runif(4000, 0, 1)
library(microbenchmark)
microbenchmark(
     combine_points.Frank(d,th=0.5),
     combine_points.Frank2(d,th=0.5),
     combine_points_Roland(d,th=0.5),
     combine_points_Roland2(d,th=0.5))
Unit: milliseconds
                                expr       min        lq    median        uq        max neval
   combine_points.Frank(d, th = 0.5) 2115.6391 2154.5038 2174.5889 2193.8444  7884.1638   100
  combine_points.Frank2(d, th = 0.5) 1298.2923 1323.2214 1341.5357 1357.4260 15538.0872   100
  combine_points_Roland(d, th = 0.5) 2497.9106 2506.5960 2512.3591 2519.0036  2573.2854   100
 combine_points_Roland2(d, th = 0.5)  494.8406  497.3613  498.2347  499.8777   544.9743   100

The following improved while loop, combine_points.Frank2, is considerably faster than my first attempt, and its Rcpp translation (combine_points_Roland2, below) is the fastest so far:

combine_points.Frank2 <- function(x, th = 0.5)
{
    i <- 1
    while (min(diff(x)) < th)
    {
        if (x[i+1] - x[i] >= th) {
            i <- i + 1
        } else {
            x[i] <- x[i+1] <- mean(c(x[i+1], x[i]))
            x <- unique(x)          # i stays where it is
        }
    }
    return(x)
}
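
As a quick check (my addition, mirroring the earlier `which(diff(newd) < th)` test), the improved loop should also leave no gaps below the threshold:

which(diff(combine_points.Frank2(d, th = 0.5)) < 0.5)
# expected: integer(0)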

Rcpp version

library(Rcpp)
cppFunction('
NumericVector combine_points_Roland2(NumericVector x, double th) {
  int i = 0;
  while (min(diff(x)) < th)
  {
    if ((x[i+1] - x[i]) >= th)
    {
      i = i + 1;
    }
    else {
      x[i] = (x[i+1] + x[i]) / 2;
      x[i+1] = x[i];
      x = sort_unique(x);
      // i stays where it is
    }
  }
  return x;
}
')
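
A quick sanity check (not part of the original post): the Rcpp translation should agree with the R loop above, since it implements the same merging rule.

all.equal(combine_points.Frank2(d, th = 0.5), combine_points_Roland2(d, 0.5))
# expected: TRUE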
Jota
  • `combine_points(c(0.1, 0.01, 0.2, 5))` results in `[1] 0.2 5.0` is this the correct result? Since you seem to want to apply the algorithm recursively (and recursion is not a good idea in R) I don't think you can avoid a loop. – Roland Mar 22 '14 at 15:51
  • Imagine `x <- c(0.1, 0.2, 0.3)`: do you want the output to be the mean of all three: `0.2`? Your description seems to do `mean(c(mean(c(0.1, 0.2)), 0.3))` instead, which gives `0.225`. – flodel Mar 22 '14 at 15:55
  • @Roland I had messed up calculating the mean, but I edited to fix it. I should note that your input vector would be unrealistic for me. Every sequential entry should be greater than the previous entry. – Jota Mar 22 '14 at 16:02
  • @flodel Good question, it is something I was considering myself and hoped there was a function that could do both types of combining routines. However, in this example, 0.225 would be the correct result. I'm not entirely sure what the best way to do the combining of unrealistically close points is, but I know there is a lower limit on what is possible, and I need to address such problems. – Jota Mar 22 '14 at 16:03
  • My answer would return `0.2`. To return `0.225` requires a process where computing iteration `i` depends on previous iterations. So I don't think you can do without a `for` loop or recursion, which will be slow. – flodel Mar 22 '14 at 16:05
  • Your function should translate to C++ quite easily. I would use Rcpp here (and probably do it with recursion). – Roland Mar 22 '14 at 16:08
  • @Roland Thanks for the tip. I just looked up some tutorials (o jeez, there are kind of too many), and will give it a shot. This is a good opportunity to learn something new! – Jota Mar 22 '14 at 16:26

3 Answers


See if this does what you want:

combine_points <- function(x, th=0.5) {
  # start a new group wherever the gap to the previous point exceeds th
  group <- cumsum(c(FALSE, diff(x) > th))
  # average the points within each group
  unname(sapply(split(x, group), mean))
}

combine_points(c(-1, 0.1, 0.2, 0.3, 1, 1.5, 2.0, 2.5, 3.0, 10), 0.5)
# [1] -1.0  0.2  2.0 10.0
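
To unpack how this works (an illustrative breakdown, not part of the original answer): `diff(x) > th` marks the positions where a new group starts, `cumsum` turns those marks into group ids, and `split`/`sapply` average each group.

x <- c(-1, 0.1, 0.2, 0.3, 1, 1.5, 2.0, 2.5, 3.0, 10)
cumsum(c(FALSE, diff(x) > 0.5))
# [1] 0 1 1 1 2 2 2 2 2 3
sapply(split(x, cumsum(c(FALSE, diff(x) > 0.5))), mean)
#    0    1    2    3
# -1.0  0.2  2.0 10.0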
flodel
  • +1 That is excellent! Clever use of `cumsum`. Ultimately, this is a post-processing step in a binary classification problem. I'll pit the output of this method against the other combine routine to see which performs better in relation to "ground-truth" data. – Jota Mar 22 '14 at 16:23
  • Are you sure this is what you want? For example: `combine_points(c(seq(1,5), seq(7,-1)), th = 1.1)` gives `c(3,3)`. If you want randoms that are guaranteed to be spaced by the threshold I would be more careful. – Robert Krzyzanowski Mar 23 '14 at 00:35
  • First line in OP's question mentions "an increasing sequence of numbers". – flodel Mar 23 '14 at 00:43

Here is a translation of your function into Rcpp. It uses sugar functions, which are very convenient, but often there are faster alternatives (RcppEigen or RcppArmadillo are good for that). And of course the algorithm could be improved.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector combine_points1(NumericVector x, double th) {
  int i=0;
  while(min(diff(x)) < th)   
  {
    if ((x[i+1] - x[i]) < th)
    {
      x[i] = (x[i+1] + x[i])/2;
      x[i+1] = x[i];
    } 
    x = sort_unique(x);
    if(i <= x.size())
    {
      i = i+1;
    }
    else {
      i=1;
    }
  }
   return x;
}

I recommend using RStudio for writing Rcpp functions and sourcing them.
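
For example (the file name here is just illustrative), saving the code above in a .cpp file and sourcing it makes the exported function available in R:

library(Rcpp)
sourceCpp("combine_points1.cpp")   # compiles the file and exports combine_points1()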

all.equal(combine_points1(d, 0.5),
          combine_points(d, 0.5))
#[1] TRUE

library(compiler)
combine_points_comp <- cmpfun(combine_points) 

library(microbenchmark)
microbenchmark(combine_points1(d, 0.5),
               combine_points_comp(d, 0.5),
               combine_points(d, 0.5))

# Unit: microseconds
#                        expr      min        lq    median        uq       max neval
#     combine_points1(d, 0.5)  652.772  664.6815  683.1315   714.653  1030.171   100
# combine_points_comp(d, 0.5) 8344.839 8692.0880 9010.1470 10627.049 14117.553   100
#      combine_points(d, 0.5) 8996.768 9371.0805 9687.0235 10560.226 12800.831   100

A speed-up by a factor of 14 without real effort.

Roland
  • Thanks a lot! I was struggling to get it right. I notice that you set i=1 initially, but I thought the first index in C++ was 0. Is that not the case? Also, using std::vector instead of NumericVector should increase the speed, too, right? – Jota Mar 22 '14 at 19:18
  • I'm not seeing the performance gains when I use a larger sample vector (see my update). I'm wrapping the code in `cppFunction('')`, instead of the way you're doing it above, but that shouldn't be a problem, right? For my real data, I'll be using this on 100s of vectors, each around 40,000 to 70,000 in length. – Jota Mar 22 '14 at 22:20
  • As I wrote, this could probably be made much more efficient. The slow part of this is `x = sort_unique(x);`, which forces a resize/copy of the vector. Think about a better algorithm. My answer was just intended to show you how to use Rcpp. – Roland Mar 23 '14 at 10:15

Here is something faster. It avoids resizing/copying x in the loop.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector combine_points_Roland3(NumericVector x, double th) {
  int i = 0, j;
  int n(x.size());            // n tracks the current "logical" length of x
  while (i < n - 1)
  {
    if ((x[i+1] - x[i]) >= th)
    {
      i = i + 1;              // gap is large enough, move on
    }
    else
    {
      // merge x[i] and x[i+1] into their mean, then shift the tail left by one
      x[i] = (x[i+1] + x[i]) / 2;
      n = n - 1;
      for (j = i+1; j < n; j++)
      {
        x[j] = x[j+1];
      }
    }
  }
  // copy the first n (still valid) elements into the result
  NumericVector y(n);
  for (i = 0; i < n; i++) {
    y[i] = x[i];
  }
  return y;
}

An R implementation of the same algorithm:

combine_points_Roland3R <- function(x, th) {
  i <- 1
  n <- length(x)

  while(i < n) {
    if ((x[i+1] - x[i]) >= th) {
      i <- i + 1;
    } else {
      x[i] <- (x[i+1] + x[i])/2
      n <- n-1
      x[(i+1):n] <- x[(i+2):(n+1)]
    }
  }
  x[1:n]
}

set.seed(41);  d <- seq(1:4000) + runif(4000, 0, 1)
x2 <- combine_points_Roland2(d, 0.5)
x3 <- combine_points_Roland3(d, 0.5)
all.equal(x2, x3)
#TRUE
x4 <- combine_points_Roland3R(d, 0.5)
all.equal(x2, x4)
#TRUE

Benchmarks:

library(microbenchmark)
microbenchmark(combine_points_Roland2(d, 0.5),
               combine_points_Roland3(d, 0.5), 
               combine_points_Roland3R(d, 0.5))

# Unit: microseconds
#                            expr       min         lq      median          uq        max neval
#  combine_points_Roland2(d, 0.5) 126458.64 131414.592 132355.4285 133422.2235 147306.728   100
#  combine_points_Roland3(d, 0.5)    121.34    128.269    140.8955    143.3595    393.582   100
# combine_points_Roland3R(d, 0.5)  17564.24  18626.878  19155.6565  20910.2935  68707.888   100
Roland
  • When I get to either loading `microbenchmark`, calling the `microbenchmark` function, or calling `system.time(replicate(100, combine_points_Roland3(d, 0.5)))`, R (3.0.1 and 3.0.3, 64-bit Windows 8) crashes every time. – Jota Mar 23 '14 at 17:05
  • I don't have Windows at home and don't have access to Windows 8 at all, so I can't investigate that. I cannot see how the function could interfere with the microbenchmark package, but I consider myself a beginner with C++, so maybe I'm overlooking something. Might be worth a separate question. – Roland Mar 23 '14 at 17:17
  • I think I figured it out. The culprit was the `n+1` in the final for loop. Changing it to `n` solves the problem, and shouldn't change the output, if I understand correctly. I think I was having the same problem as in [this other question](http://stackoverflow.com/questions/21890613/rcpp-function-crashes). Anyway, this is by far the fastest answer, and I'm going to accept it once you change the offending `n+1`, unless something faster comes along. – Jota Mar 24 '14 at 00:02
  • @Frank Done. I've added an R implementation for completeness. – Roland Mar 24 '14 at 08:17