0

We are looking for an efficient algorithm to solve the following problem:

Given two increasingly sorted arrays. Find the closest corresponding elements in each array that difference is below a user given threshold. But only the closest of possible candidates (in the range of array1[i] +/- threshold) should be returned. The second closest could be matched to another element but matches to more than one element are not allowed. If two elements in array1 have the same distance to array2[j] the first (leftmost) match should be reported.

The arrays can contain duplicated values. There the first (leftmost) match should be reported (and all the others ignored/not matched).

Examples:

x: 1, 3, 5, 6, 8
y: 3, 4, 5, 7
threshold: 1
output: NA, 1, 3, 4, NA
(index of y that matches best to x)

x: 1, 1.5, 2, 2.1, 5, 6.1, 7.2
y: 4.6, 4.7, 4.8, 4.9, 5, 6, 7, 8
threshold: 3
output: NA, NA, NA, 1, 5, 6, 7
(index of y that matches best to x)

x: 1, 1, 1, 2, 2, 2
y: 1, 2
threshold: 0
output: 1, NA, NA, 2, NA, NA
(index of y that matches best to x, for duplicates choose to first one)

x: 1, 2
y: 1, 1, 1, 2, 2, 2
threshold: 0
output: 1, 4
(index of y that matches best to x, for duplicates choose to first one)

We use this to find the closest matching values between two m/z-values (mass-to-charge ratios) while comparing mass spectra.

Currently we iterate through both arrays and lookahead the differences for the next two elements and correct the previous element if a closer one was found. But this fails for more than two duplicate elements in a row (second example):

Our current implementation (C code as part of an R package): https://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c#L73-L129

A commented version below:

SEXP C_closest_dup_closest(SEXP x, SEXP table, SEXP tolerance, SEXP nomatch) {
    /* x is the first array of doubles */
    double *px = REAL(x);
    const unsigned int nx = LENGTH(x);

    /* table is the second array of doubles where x should be matched against */
    double *ptable = REAL(table);
    const unsigned int ntable = LENGTH(table);

    /* user given tolerance threshold */
    double *ptolerance = REAL(tolerance);

    /* integer array to store the results */
    SEXP out = PROTECT(allocVector(INTSXP, nx));
    int* pout = INTEGER(out);

    /* integer that should returned if no valid match or a closer one was found */
    const unsigned int inomatch = asInteger(nomatch);

    /* indices */
    unsigned int ix = 0, ixlastused = 1;
    unsigned int itbl = 0, itbllastused = 1;
    /* differences: current, difference to next element of x and table, respectively */
    double diff = R_PosInf, diffnxtx = R_PosInf, diffnxttbl = R_PosInf;

    while (ix < nx) {
        if (itbl < ntable) {
            /* difference for current pair */
            diff = fabs(px[ix] - ptable[itbl]);
            /* difference for next pairs */
            diffnxtx =
                ix + 1 < nx ? fabs(px[ix + 1] - ptable[itbl]) : R_PosInf;
            diffnxttbl =
                itbl + 1 < ntable ? fabs(px[ix] - ptable[itbl + 1]) : R_PosInf;

            if (diff <= ptolerance[ix]) {
                /* valid match, add + 1 to convert between R/C index */
                pout[ix] = itbl + 1;
                if (itbl == itbllastused &&
                        (diffnxtx < diffnxttbl || diff < diffnxttbl))
                    pout[ixlastused] = inomatch;
                ixlastused = ix;
                itbllastused = itbl;
            } else
                pout[ix] = inomatch;

            if (diffnxtx < diff || diffnxttbl < diff) {
                /* increment the index with the smaller distance */
                if (diffnxtx < diffnxttbl)
                    ++ix;
                else
                    ++itbl;
            } else {
                /* neither next x nor next table item offer a better match */
                ++ix;
                ++itbl;
            }
        } else
            pout[ix++] = inomatch;
    }

    /* R provided MACRO to free allocated memory */
    UNPROTECT(1);
    return out;
}

Could anybody give us a hint for a better algorithm?

sgibb
  • 25,396
  • 3
  • 68
  • 74
  • Index [ + 1]? So the elements of `y` are only allowed to be used once? Sounds like an optimization problem that could use dynamic programming. – Neil Dec 26 '21 at 15:45
  • You could just do a binary search for each element of `x` in `y`. Test which of the elements adjacent to the binsearch result in `y` is closer to the element of `x`, then check whether it is within tolerance. If so, remove the element from `y` (to avoid matching it multiple times), store it as a match, continue with the next element of `x`. For optimization: Because `x` is sorted, you can adjust your starting position in `y` forwards for the binsearch occasionally (however note this pathological case: a later element of `x` can match an earlier element of `y`, if tolerance permits). – EOF Dec 26 '21 at 15:48
  • @Neil: indeed each element is just allowed to be used once. – sgibb Dec 26 '21 at 15:51
  • @EOF: thanks for your suggestion. I guess I have to remove the duplicates before to ensure they are not matched. – sgibb Dec 26 '21 at 15:54
  • It is not clear for me how to handle the case with duplicates in both arrays. – Damien Dec 26 '21 at 15:57
  • x: [2,3,4,5,6], y: [1,2,3,4,5], would the matches be 2-1, ... 6-5, or 2-2, 5-5, and not match 6? I think you need a metric to answer this, right? – Neil Dec 26 '21 at 16:00
  • @Damien: in case of duplicates always the first occurrence (leftmost) element should be matched, e.g. for `x: 1, 1, 2, 2; y 1, 1, 2, 2; threshold: 0;` the output should be `1, 3. – sgibb Dec 26 '21 at 16:00
  • @Neil: `2-2, 3-3, 4-4, 5-5` and not matched: `1, 6`; We just use the absolute distance/difference as metric? Or did I misunderstand? – sgibb Dec 26 '21 at 16:03
  • Then you have one less match. If you are okay with that, (_ie_, a cost of 0 for unmatched numbers), I think a greedy algorithm would be fine. – Neil Dec 26 '21 at 16:08
  • @Neil: I am fine when not all elements are matched. That's by intention. I am not a trained programmer, I have to read about greedy algorithms to answer whether it could be useful or not. – sgibb Dec 26 '21 at 16:12
  • [2,3,4,5,6] and [6,7,8,9,10] should only match 6-6; it's greedy because each step takes the local optimum, _vs_ say you could double your matches with a slightly less optimum 5-6, 6-7. Greedy generally is quicker. – Neil Dec 26 '21 at 16:22
  • 1
    @Neil: "[2,3,4,5,6] and [6,7,8,9,10] should only match 6-6;" correct! Thanks, we don't care for more possible matches and run time is important because in spectra comparison/spectra library search etc. this function would be called thousand times. – sgibb Dec 26 '21 at 16:31

0 Answers0