We are looking for an efficient algorithm to solve the following problem:
Given two increasingly sorted arrays. Find the closest corresponding elements in each array that difference is below a user given threshold. But only the closest of possible candidates (in the range of
array1[i] +/- threshold
) should be returned. The second closest could be matched to another element but matches to more than one element are not allowed. If two elements inarray1
have the same distance toarray2[j]
the first (leftmost) match should be reported.The arrays can contain duplicated values. There the first (leftmost) match should be reported (and all the others ignored/not matched).
Examples:
x: 1, 3, 5, 6, 8
y: 3, 4, 5, 7
threshold: 1
output: NA, 1, 3, 4, NA
(index of y that matches best to x)
x: 1, 1.5, 2, 2.1, 5, 6.1, 7.2
y: 4.6, 4.7, 4.8, 4.9, 5, 6, 7, 8
threshold: 3
output: NA, NA, NA, 1, 5, 6, 7
(index of y that matches best to x)
x: 1, 1, 1, 2, 2, 2
y: 1, 2
threshold: 0
output: 1, NA, NA, 2, NA, NA
(index of y that matches best to x, for duplicates choose to first one)
x: 1, 2
y: 1, 1, 1, 2, 2, 2
threshold: 0
output: 1, 4
(index of y that matches best to x, for duplicates choose to first one)
We use this to find the closest matching values between two m/z-values (mass-to-charge ratios) while comparing mass spectra.
Currently we iterate through both arrays and lookahead the differences for the next two elements and correct the previous element if a closer one was found. But this fails for more than two duplicate elements in a row (second example):
Our current implementation (C code as part of an R package): https://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c#L73-L129
A commented version below:
SEXP C_closest_dup_closest(SEXP x, SEXP table, SEXP tolerance, SEXP nomatch) {
/* x is the first array of doubles */
double *px = REAL(x);
const unsigned int nx = LENGTH(x);
/* table is the second array of doubles where x should be matched against */
double *ptable = REAL(table);
const unsigned int ntable = LENGTH(table);
/* user given tolerance threshold */
double *ptolerance = REAL(tolerance);
/* integer array to store the results */
SEXP out = PROTECT(allocVector(INTSXP, nx));
int* pout = INTEGER(out);
/* integer that should returned if no valid match or a closer one was found */
const unsigned int inomatch = asInteger(nomatch);
/* indices */
unsigned int ix = 0, ixlastused = 1;
unsigned int itbl = 0, itbllastused = 1;
/* differences: current, difference to next element of x and table, respectively */
double diff = R_PosInf, diffnxtx = R_PosInf, diffnxttbl = R_PosInf;
while (ix < nx) {
if (itbl < ntable) {
/* difference for current pair */
diff = fabs(px[ix] - ptable[itbl]);
/* difference for next pairs */
diffnxtx =
ix + 1 < nx ? fabs(px[ix + 1] - ptable[itbl]) : R_PosInf;
diffnxttbl =
itbl + 1 < ntable ? fabs(px[ix] - ptable[itbl + 1]) : R_PosInf;
if (diff <= ptolerance[ix]) {
/* valid match, add + 1 to convert between R/C index */
pout[ix] = itbl + 1;
if (itbl == itbllastused &&
(diffnxtx < diffnxttbl || diff < diffnxttbl))
pout[ixlastused] = inomatch;
ixlastused = ix;
itbllastused = itbl;
} else
pout[ix] = inomatch;
if (diffnxtx < diff || diffnxttbl < diff) {
/* increment the index with the smaller distance */
if (diffnxtx < diffnxttbl)
++ix;
else
++itbl;
} else {
/* neither next x nor next table item offer a better match */
++ix;
++itbl;
}
} else
pout[ix++] = inomatch;
}
/* R provided MACRO to free allocated memory */
UNPROTECT(1);
return out;
}
Could anybody give us a hint for a better algorithm?