13

Consider a sorted vector x that is bounded between min and max. Below is an example of such x where min could be 0 and max could be 12:

x = c(0.012, 1, exp(1), exp(1)+1e-55, exp(1)+1e-10,
       exp(1)+1e-3, 3.3, 3.33333, 3.333333333333333, 3+1/3, 5, 5, 10, 12)

5 and 5 as well as exp(1) and exp(1)+10^(-55) have exactly the same value (to the level of accuracy of a float number). Some other entry differ largely and some others differ only by a small amount. I would like to consider an approximation to equality test

ApproxEqual = function(a,b) abs(a-b) < epsilon

, where epsilon could be 1e-5 for example.

Goal

I would like to modify the values of the variable x "as little as possible" to ensure that no two values in x are "approximatively equal" and x is still bounded between min and max.

I am happy to let you decide what "as little as possible" really mean. One could for example minimize the sum of square deviations between the original x and the expected variable output.

Example 1

x_input = c(5, 5.1, 5.1, 5.1, 5.2)
min=1
max=100

x_output = c(5, 5.1-epsilon, 5.1, 5.1+epsilon, 5.2)

Example 2

x_input = c(2,2,2,3,3)
min=2
max=3

x_output = c(2, 2+epsilon, 2+2*epsilon, 2+3*epsilon, 3-epsilon,3)

Of course, in the above case if (3-epsilon) - (2+3*epsilon) < epsilon is TRUE, then the function should throw an error as the problem has no solution.

Side Note

I would love if the solution is quite performant. answer could make could use of Rcpp for example.

Remi.b
  • 17,389
  • 28
  • 87
  • 168
  • 1
    not quite what you want but straight forward `sort(jitter(x_input, amount=1e-5))` – user20650 May 30 '16 at 23:20
  • @user20650 I can't tell from the man page, but does jitter guarantee no collisions? – Shape May 30 '16 at 23:30
  • look at the very last line of the code of `jitter` - it is random draws from a uniform dist. – so possible but unlikely – user20650 May 30 '16 at 23:50
  • 2
    I would do ^^^ but maybe `ifelse(duplicated(x_input), jitter(x_input, amount = 1e-5), x_input)` instead. and you can work in your tolerance into the conditional instead of `duplicated` – rawr May 31 '16 at 00:58
  • Thanks for the comment @rawr. Randomly adding small values and sorting won't ensure the values will be strictly increasing (and not to the level of the approximation I want to do). I tried to implement it in a big while loop that breaks whenever the output is as expected but it sometimes run for (seemingly?) ever as I have relatively big vectors that may contain 5 or 10 very close values. Please note that I edited the post to specify bounds to `x` to generalize the function. Thanks for your help! – Remi.b Jun 04 '16 at 01:48
  • why not add the tags `c`, and `c++` (any solution is c or c++ will work in rcpp) and `language-agnostic` (its really the algorithm you're after rather than a specific implementation). This will broaden you're audience. – dww Jun 04 '16 at 18:13
  • Just did. Thanks for the advice – Remi.b Jun 04 '16 at 18:25
  • Added code to my answer below. You may find it helpful. Works for all sorted arrays (including the examples given). – cameronroytaylor Jun 10 '16 at 23:59

4 Answers4

6

I doubt this is possible without iterating, because shifting some points away from neighbours that are too close could cause the moved points to bunch up closer to their other neighbours. Here is one solution that only changes those values that are necessary to arrive at a solution, and moves them by the smallest distance it can to ensure a minimum gap of epsilon.

It uses a function that assigns a force to each point depending whether we need to move it away from a neighbour that is too close. Direction (sign) of the force indicates whether we need to increase or decrease the value of that point. Points that are sandwiched between other too-near neighbours do not move, but their outside neighbours both move away from the central point (this behaviour to move as few points as we can by as little as possible). The force assigned to the end points is always zero, because we do not want overall range of x to change

force <- function(x, epsilon){
 c(0, sapply(2:(length(x)-1), function(i){ (x[i] < (x[i-1]+epsilon)) - (x[i] > (x[i+1]-epsilon)) }), 0)
}

Next, we need a function to shift points, depending on the force acting on them. Positive forces cause them to move to epsilon higher than the previous point. Negative forces shift them downwards.

move <- function(x, epsilon, f){
  x[which(f==-1)] <- x[which(f==-1)+1] - epsilon 
  x[which(f==1)]  <- x[which(f==1)-1] + epsilon
  # Next line deals with boundary condition, and prevents points from bunching up at the edges of the range
  # I doubt this is necessary, but included out of abundance of caution. Could try deleting this line if performance is an issue.
  x <- sapply(1:(length(x)), function(i){x[i] <- max(x[i], head(x,1)+(i-1)*epsilon); x[i] <- min(x[i], tail(x,1)-(length(x)-i)*epsilon)})
  x
}

Finally, the function separate is used to iteratively calculate force and move points until a solution is found. It also checks for a couple of edge cases before iterating.

separate <- function(x,epsilon) {
  if (epsilon > (range(x)[2] - range(x)[1]) / (length(x) - 1)) stop("no solution possible")
  if (!(all(diff(x)>=0))) stop ("vector must be sorted, ascending")

  initial.x <- x
  solved <- FALSE

  ##################################
  # A couple of edge cases to catch
  ##################################
  # 1. catch cases when vector length < 3 (nothing to do, as there are no points to move)
  if (length(x)<3) solved <- TRUE
  # 2. catch cases where initial vector has values too close to the boundaries 
  x <- sapply(1:(length(x)), function(i){
    x[i] <- max(x[i], head(x,1)+(i-1)*epsilon)
    x[i] <- min(x[i], tail(x,1)-(length(x)-i)*epsilon)
  })

  # Now iterate to find solution
  it <- 0
  while (!solved) {
    it <-  it+1
    f <- force(x, epsilon)
    if (sum(abs(f)) == 0) solved <- TRUE
    else x <- move(x, epsilon, f)
  }
  list(xhat=x, iterations=it, SSR=sum(abs(x-initial.x)^2))
}

Testing this on the example provided by OP:

x = c(0.012, 1, exp(1), exp(1)+1e-55, exp(1)+1e-10, exp(1)+1e-3, 3.3, 3.33333, 3.333333333333333, 3+1/3, 5, 5, 10, 12)
epsilon <- 1e-5

separate(x, epsilon)
# $xhat
# [1]  0.012000  1.000000  2.718272  2.718282  2.718292  2.719282  3.300000  3.333323  3.333333  3.333343
# [11]  4.999990  5.000000 10.000000 12.000000
#
# $iterations
# [1] 2
#
# $SSR
# [1] 4.444424e-10

Edit 1

Lines were added to function separate in response to comment to catch a couple of edge cases -

A) where vector passed to function has length < 3

separate(c(0,1), 1e-5)
# $xhat
# [1] 0 1
# 
# $iterations
# [1] 0
# 
# $SSR
# [1] 0

B) where passed vector has several values at the boundaries

separate(c(0,0,0,1), 1e-5)
# [1] "it = 1, SSR = 5e-10"
# $xhat
# [1] 0e+00 1e-05 2e-05 1e+00
# 
# $iterations
# [1] 1
#
# $SSR
# [1] 5e-10
dww
  • 30,425
  • 5
  • 68
  • 111
  • Thanks. Can you please comment a little bit? What is `F`, what is `x` and do you have boundaries (`min`, `max`)? – Remi.b Jun 04 '16 at 19:07
  • `F` is shorthand for `FALSE`. `x` is the name of the sorted vector, as per your original question. The boundaries are the 1st and last elements of x, which will never shift (because the range of x should remain constant). – dww Jun 04 '16 at 19:21
  • Looks like an interesting solution. A few issues though: `separate(c(0,0,0,1),1e-5)` runs forever and `separate(c(0,1),1e-5)` causes an error. – Remi.b Jun 05 '16 at 00:13
  • some lines added to catch these edge cases – dww Jun 05 '16 at 01:39
  • Cool +1! I have tried `set.seed(12);separate(runif(12,0,1),1e-6)` as well and it runs for seemingly ever. – Remi.b Jun 05 '16 at 03:32
  • 1
    @Remi.b, you assume in your question that the vector is already sorted, so shouldn't that be: `set.seed(12); separate(sort(runif(12, 0, 1)), 1e-6)`, in which case the process is finished in no time flat? (mainly because there is no processing to be done, the sequence already meets the requirement) – AkselA Jun 05 '16 at 06:19
  • Yes, @AkselA, you are right. The algorithm does assume a sorted vector (as specified in OP). It is of course trivial to add a `x <- sort(x)` command, if needed, to generalise this to work also on unsorted vectors. Or could check for vectors that fail this test with `if (!(all(diff(x)>=0))) stop ("vector must be sorted, ascending")` – dww Jun 05 '16 at 12:10
  • Oops that's right. Sorry for the useless comment. Thanks a lot for your answer! I'll wait a little bit to see if other answers pop up and to let other users "peer-review" your question and I'll accept it then. – Remi.b Jun 05 '16 at 14:58
3

This was a fun challenge, and I think I've worked out a solution. It's a bit ugly and convoluted and could do with some streamlining, but it seems to return what Remi asked for.

library(magrittr)

xin <- c(0.012, 1, exp(1), exp(1)+10^(-55), exp(1)+10^(-10),
    exp(1)+10^(-3), 3.3, 3.33333, 3.333333333333333, 3+1/3, 5, 5, 10, 12)

tiebreaker <- function(x, t=3) {
    dif <- diff(x) %>% round(t)
    x[dif==0] <- x[dif==0] + 
        seq(-10^-t, -10^-(t+0.99), 
        length.out=length(x[dif==0])) %>% sort
    x
}

xout <- tiebreaker(xin)

diff(xin) > 0.0001
# TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE

diff(xout) > 0.0001  #it makes close matches less close
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

xin == xout  #but leaves already less close matches as they were
# TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE

EDIT: I wrapped it up into a simple function. tr sets the threshold for what's considered a close match, in decimal points.

AkselA
  • 8,153
  • 2
  • 21
  • 34
  • 3
    your answer requires magrittr? which you used `%>%` once? why? – rawr May 31 '16 at 01:00
  • 1
    Twice, because i prefer piping over nested parentheses. If you want to avoid using magrittr it isn't hard to change. – AkselA May 31 '16 at 01:05
  • This does not constrain values to stay within the range (e.g. try it on `tiebreaker(c(0,0,0,1))`). Neither does it minimise the change, because it has a bias that values are always shifted downwards (try `tiebreaker(c(0,1,1,1,2))`. Also, not clear that you can use it for generalised values of epsilon that are not powers of 10. – dww Jun 10 '16 at 21:58
  • Yes, it absolutely has its shortcomings. It will also not work well on long sequences with many close values. It's a quick and dirty solution, but it should suffice unless you have very demanding requirements. – AkselA Jun 11 '16 at 09:39
2

Assuming that the values are sorted in ascending order, it seems easiest to do this with two for-loops. The first for-loop observes each number, and the second (inner) for-loop compares with all numbers before each number. If ApproxEqual is true, 1e-5 is added in the inner for-loop to the value parsed by the outer for-loop.

Here's code that does the trick:

x = c(5, 5.1, 5.1, 5.1, 5.2)

epsilon <-1e-5
ApproxEqual = function(a,b) abs(a-b) < epsilon

for (i in 1:length(x)){
  if (i>1){
    for (j in 1:(i-1)){
      if (ApproxEqual(x[i],x[j])){
        x[i]=x[i]+epsilon
      }
    }
  }
}

print(x)

This gives

> print(x)
[1] 5.00000 5.10000 5.10001 5.10002 5.20000
cameronroytaylor
  • 1,578
  • 1
  • 17
  • 18
2
  • It is not always possible to modify the values of variables to ensure that no two values are approximately equal and still bounded between min and max without modifying min or max. E. g. min=0, max=epsilon/2.

  • You might iteratively find nearest neighbours and change their values (if needed and if possible) to make those not approximately equal. Algorithms for searching nearest neighbours are well known. https://en.wikipedia.org/wiki/Nearest_neighbor_search

Jan Korous
  • 666
  • 4
  • 11