14

I would like to find out the three closest numbers in a vector. Something like

v = c(10,23,25,26,38,50)
c = findClosest(v,3)
c
23 25 26

I tried sort(colSums(as.matrix(dist(v))))[1:3], and it kind of works, but it selects the three numbers with the smallest total distance to all other elements, not the three numbers that are closest to each other.
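To illustrate the difference, here is a minimal sketch with a made-up vector `x` (not from the original question) in which the most "central" values and the tightest group are clearly different. It uses `order()` instead of `sort()` only so that the values, rather than the named sums, are returned.

```r
# Hypothetical data: a tight cluster (1, 2, 3) far away from the bulk of the values
x <- c(1, 2, 3, 100, 110, 120, 130, 140)

# Total distance from each element to all the others
total_dist <- colSums(as.matrix(dist(x)))

# The three elements with the smallest totals are the most central ones,
# not the three that are closest to each other (which would be 1, 2, 3)
central <- x[order(total_dist)[1:3]]
central
# [1] 100 110 120
```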

There is already an answer for MATLAB, but I do not know how to translate it to R:

%finds the index with the minimal difference in A
minDiffInd = find(abs(diff(A))==min(abs(diff(A))));
%extract this index, and its neighbor index from A
val1 = A(minDiffInd);
val2 = A(minDiffInd+1);

How to find two closest (nearest) values within a vector in MATLAB?

Shaido
Terry
  • 2
    If you replace `find` with `which` (and use `[` for array/matrix indexing), the MATLAB answer will work in R, but obviously it only finds the closest 2. Can you clarify what you mean _exactly_ by "finding the closest values in a vector"? The MATLAB answer only works if the vector is sorted; is that a fair assumption? Your title says "two" but your example uses "3", which is it? A solution working for arbitrary `n` is much harder than one that only works for 2. The MATLAB answer does not extend to >2 numbers, is that why you're asking? – asachet Jul 31 '19 at 08:34
  • Hi, yes, I fixed the title. In my case I need three. Following your suggestion I have adapted the code from MATLAB and it works, but it only finds the two closest numbers. How should I adapt it to also find the third? The vector can be sorted; it is just a group of numbers and I have to pick the three closest replicas. – Terry Jul 31 '19 at 08:47
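The translation suggested in the comments can be sketched roughly as follows. It assumes the vector is sorted first, and, like the MATLAB original, it only finds the closest pair.

```r
# The approach assumes a sorted vector
A <- sort(c(10, 23, 25, 26, 38, 50))

# Index (or indices, if there are ties) of the smallest gap between neighbours
minDiffInd <- which(abs(diff(A)) == min(abs(diff(A))))

# That element and its right-hand neighbour
val1 <- A[minDiffInd]
val2 <- A[minDiffInd + 1]
c(val1, val2)
# [1] 25 26
```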

5 Answers

16

My assumption is that, for the n nearest values, the only thing that matters is the difference v[i] - v[i - (n - 1)] in the sorted vector, i.e. the span of each run of n consecutive values. So the task reduces to finding the minimum of diff(x, lag = n - 1L).

findClosest <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}

findClosest(v, 3L)

[1] 23 25 26
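As a rough walk-through of the intermediate values for the example vector (the names `spans` and `start` are only illustrative):

```r
v <- c(10, 23, 25, 26, 38, 50)
x <- sort(v)
n <- 3L

# Span of every window of n consecutive sorted values: x[i + n - 1] - x[i]
spans <- diff(x, lag = n - 1L)
spans
# [1] 15  3 13 24

# The window with the smallest span starts at position 2
start <- which.min(spans)
x[seq.int(start, length.out = n)]
# [1] 23 25 26
```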
Cole
7

Let's define "nearest numbers" by "numbers with minimal sum of L1 distances". You can achieve what you want by a combination of diff and windowed sum.

You could write a much shorter function but I wrote it step by step to make it easier to follow.

v <- c(10,23,25,26,38,50)

#' Find the n nearest numbers in a vector
#'
#' @param v Numeric vector
#' @param n Number of nearest numbers to extract
#'
#' @details "Nearest numbers" defined as the numbers which minimise the
#'   within-group sum of L1 distances.
#'   
findClosest <- function(v, n) {
  # Sort and remove NA
  v <- sort(v, na.last = NA)

  # Compute L1 distances between closest points. We know each point is next to
  # its closest neighbour since we sorted.
  delta <- diff(v)

  # Compute sum of L1 distances on a rolling window with n - 1 elements
  # Why n-1 ? Because we are looking at deltas and 2 deltas ~ 3 elements.
  withingroup_distances <- zoo::rollsum(delta, k = n - 1)

  # Now it's simply finding the group with minimum within-group sum
  # And working out the elements
  group_index <- which.min(withingroup_distances)
  element_indices <- group_index + 0:(n-1)

  v[element_indices]
}

findClosest(v, 2)
# 25 26
findClosest(v, 3)
# 23 25 26
asachet
  • Thanks, I have implemented it and it works great! Thanks also for the explanation, I understood the logic behind it. – Terry Jul 31 '19 at 10:50
  • 1
    Interestingly, this solution can very easily be extended to use another norm such as L2 instead of L1, if you want to penalise larger gaps more. For example, (10,20,30) and (50,55,70) are equally near according to L1 (10+10=5+15) but the first group is better according to L2 (10^2+10^2 < 5^2+15^2). – asachet Aug 01 '19 at 07:04
  • Very interesting. Actually, I think I am going to give it a try, since I would like to find the three numbers with minimum variance between them. L1 does not allow that, whereas L2 would allow me to select the group with minimum variance. Thanks very much! – Terry Aug 01 '19 at 16:24
  • 1
    You're welcome, you just have to use `delta^2` in the `rollsum` – asachet Aug 01 '19 at 16:29
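The L2 variant discussed in these comments could look roughly like the sketch below. The function name `findClosestP` and its `p` argument are hypothetical, and the rolling sum is done with base R's `embed()` rather than `zoo::rollsum`, purely to keep the example free of dependencies. It assumes `n >= 2`.

```r
# Hypothetical generalisation: p = 1 gives the L1 criterion, p = 2 the L2 one
findClosestP <- function(v, n, p = 1) {
  v <- sort(v, na.last = NA)
  delta <- diff(v)^p
  # Rolling sum over n - 1 consecutive gaps; embed() puts one window per row
  window_sums <- rowSums(embed(delta, n - 1))
  start <- which.min(window_sums)
  v[start + 0:(n - 1)]
}

x <- c(50, 55, 70, 100, 110, 120)
findClosestP(x, 3, p = 1)
# [1] 50 55 70      (tie on L1: 5 + 15 = 10 + 10, so the first window wins)
findClosestP(x, 3, p = 2)
# [1] 100 110 120   (10^2 + 10^2 < 5^2 + 15^2)
```

Note that this minimises the sum of squared gaps between neighbours, which is not necessarily the same as minimising the variance of the group, since the variance also depends on the distances between non-adjacent members.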
7

A base R option. The idea is to sort the vector first, then subtract the i-th element from the (i + n - 1)-th element of the sorted vector for every i, and select the group with the minimum difference.

closest_n_vectors <- function(v, n) {
  v1 <- sort(v)
  inds <- which.min(sapply(head(seq_along(v1), -(n - 1)), function(x)
    v1[x + n - 1] - v1[x]))
  v1[inds:(inds + n - 1)]
}

closest_n_vectors(v, 3)
#[1] 23 25 26

closest_n_vectors(c(2, 10, 1, 20, 4, 5, 23), 2)
#[1] 1 2

closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 2)
#[1] 65 67

closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 3)
#[1]  1 19 23

In case of a tie, this returns the group with the smallest values, because which.min returns the index of the first minimum.
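If every tied group is wanted instead of only the first, one possible sketch is the following. The helper name `all_closest_n` is hypothetical, and the exact `==` comparison is only safe for integer-valued data; a tolerance would be needed for general floating-point input.

```r
# Hypothetical helper: return every window of n sorted values with minimal span
all_closest_n <- function(v, n) {
  v1 <- sort(v)
  spans <- diff(v1, lag = n - 1)
  # Exact comparison; fine for integer-valued data only
  starts <- which(spans == min(spans))
  lapply(starts, function(i) v1[i:(i + n - 1)])
}

groups <- all_closest_n(c(19, 23, 45, 67, 89, 65, 1), 3)
groups
# [[1]]
# [1]  1 19 23
#
# [[2]]
# [1] 45 65 67
```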


BENCHMARKS

Since we have quite a few answers, it is worth benchmarking all the solutions so far.

set.seed(1234)
x <- sample(100000000, 100000)

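# Note: identical() compares only its first two arguments; the third and fourth
# are matched to its num.eq and single.NA parameters, so this call effectively
# checks antoine against Sotos only.
identical(findClosest_antoine(x, 3), findClosest_Sotos(x, 3), 
          closest_n_vectors_Ronak(x, 3), findClosest_Cole(x, 3))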
#[1] TRUE

microbenchmark::microbenchmark(
    antoine = findClosest_antoine(x, 3),
    Sotos = findClosest_Sotos(x, 3), 
    Ronak  = closest_n_vectors_Ronak(x, 3),
    Cole = findClosest_Cole(x, 3),
    times = 10
)



#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#antoine  148.751  159.071  163.298  162.581  167.365  181.314    10  b 
#  Sotos 1086.098 1349.762 1372.232 1398.211 1453.217 1553.945    10   c
#  Ronak   54.248   56.870   78.886   83.129   94.748  100.299    10 a  
#   Cole    4.958    5.042    6.202    6.047    7.386    7.915    10 a  
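For anyone wanting to reproduce part of this without extra packages, here is a minimal sketch. It assumes the benchmarked functions are simply renamed copies of the code from the corresponding answers, and it uses a smaller sample and `system.time()` instead of microbenchmark, so timings will not be comparable with the table above.

```r
# Assumption: renamed copies of the code from the answers above
findClosest_Cole <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}
closest_n_vectors_Ronak <- function(v, n) {
  v1 <- sort(v)
  inds <- which.min(sapply(head(seq_along(v1), -(n - 1)), function(x)
    v1[x + n - 1] - v1[x]))
  v1[inds:(inds + n - 1)]
}

set.seed(1234)
x <- sample(1e6, 1e4)   # smaller than the original sample, to keep the run short

# Both functions pick the first window with minimal span, so they should agree
same <- identical(findClosest_Cole(x, 3), closest_n_vectors_Ronak(x, 3))
same

# Rough timing over 10 repetitions each (results depend on the machine)
system.time(for (i in 1:10) findClosest_Cole(x, 3))
system.time(for (i in 1:10) closest_n_vectors_Ronak(x, 3))
```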
Ronak Shah
  • 1
    @Cole I am not sure about `cld` either but I have it in my output. Yes, @Rui's solution was not `identical`. I didn't check that earlier. – Ronak Shah Jul 31 '19 at 11:18
  • It could be reasonable to throw in an `abs()` when computing the difference to account for negative numbers – boski Aug 29 '19 at 07:46
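As far as I can tell, `abs()` is only needed when the vector is not sorted first: after an ascending sort every lagged difference is non-negative, whatever the sign of the values. A small check with a made-up vector, using the lag-based function from the accepted-style answer above:

```r
findClosest <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}

# Hypothetical data containing negative numbers
x <- c(5, -10, -3, 40, -1)

# Differences of an ascending vector are never negative
all(diff(sort(x)) >= 0)
# [1] TRUE

findClosest(x, 2L)
# [1] -3 -1
```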
6

One idea is to use the zoo library to do a rolling operation on the (sorted) vector, i.e.

library(zoo)
m1 <- rollapply(v, 3, by = 1, function(i)c(sum(diff(i)), c(i)))
m1[which.min(m1[, 1]),][-1]
#[1] 23 25 26

Or make it into a function,

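findClosest <- function(vec, n) {
    # Sort first so that each rolling window holds n neighbouring values
    vec1 <- sort(vec)
    # Each row of m1: sum of the gaps within a window, followed by the window's values
    m1 <- zoo::rollapply(vec1, n, by = 1, function(i) c(sum(diff(i)), c(i)))
    # Pick the row with the smallest sum and drop the sum column
    return(m1[which.min(m1[, 1]), ][-1])
}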

findClosest(v, 3)
#[1] 23 25 26
Sotos
0

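To apply it by group in a data frame (here data, grouped by var1 and var2, with the values in val):

library(dplyr)

data %>%
  group_by(var1, var2) %>%
  do(data.frame(closest = findClosest(.$val, 3)))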
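A base R alternative to the grouped pipeline, sketched under the assumption of a single grouping column. The data frame and its column names below are made up for illustration, and `findClosest` is the lag-based version from an earlier answer.

```r
findClosest <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}

# Hypothetical data frame with one grouping column and one value column
data <- data.frame(
  var1 = rep(c("a", "b"), each = 5),
  val  = c(10, 23, 25, 26, 50,   1, 7, 8, 9, 30)
)

# One vector of the n closest values per group
closest_by_group <- lapply(split(data$val, data$var1), findClosest, n = 3L)
closest_by_group
# $a
# [1] 23 25 26
#
# $b
# [1] 7 8 9
```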
hnguyen