I have ordered lists of numbers (like barcode positions, spectral lines) that I am trying to compare for similarity. Ideally, I would like to compare two lists to get a value from 1.0 (match) degrading gracefully to 0.

The lists could be offset by an arbitrary amount, and that should not degrade the match. The diffs between adjacent items are the most applicable characterization.

Due to noise in the system, some items may be missing (alternatively, extra items may be inserted, depending on point of view).

The diff values may be reordered.

The diff values may be scaled.

Multiple transformations above may be applied and each should reduce similarity proportionally.

Here is some test data:

# deltas
d = [100+(i*10) for i in range(10)]  # [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
d_swap = d[:4] + [d[5]] + [d[4]] + d[6:]  # [100, 110, 120, 130, 150, 140, 160, 170, 180, 190]
# absolutes
a = [1000+j for j in [0]+[sum(d[:i+1]) for i in range(len(d))]]  # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [i+3000 for i in a]  # [4000, 4100, 4210, 4330, 4460, 4600, 4750, 4910, 5080, 5260, 5450]
a_rm = a[:2] + a[3:]  # [1000, 1100, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_add = a[:7] + [(a[6]+a[7])//2] + a[7:]  # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1830, 1910, 2080, 2260, 2450]
a_swap = [1000+j for j in [0]+[sum(d_swap[:i+1]) for i in range(len(d_swap))]]  # [1000, 1100, 1210, 1330, 1460, 1610, 1750, 1910, 2080, 2260, 2450]
a_stretch = [1000+j for j in [0]+[int(sum(d[:i+1])*1.1) for i in range(len(d))]]  # [1000, 1110, 1231, 1363, 1506, 1660, 1825, 2001, 2188, 2386, 2595]
a_squeeze = [1000+j for j in [0]+[int(sum(d[:i+1])*0.9) for i in range(len(d))]]  # [1000, 1090, 1189, 1297, 1414, 1540, 1675, 1819, 1972, 2134, 2305]

Sim(a, a_offs) should be 1.0 since offset is not considered a penalty.
Sim(a, a_rm) and Sim(a, a_add) should be about 0.91 because 10 of 11 or 11 of 12 match.
Sim(a, a_swap) should be about 0.96 because one diff is out of place (possibly with a further penalty based on distance if moved more than one position).
Sim(a, a_stretch) and Sim(a, a_squeeze) should be about 0.9 because diffs were scaled by about 1 part in 10.
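
Since the comparison is meant to run on first differences, the offset case falls out for free — a minimal sketch (the `deltas` helper name is my own):

```python
def deltas(xs):
    """First differences of an ordered list."""
    return [b - a for a, b in zip(xs, xs[1:])]

a = [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [x + 3000 for x in a]

# A constant offset cancels in the subtraction, so the delta sequences
# are identical and Sim(a, a_offs) = 1.0 can hold by construction.
print(deltas(a) == deltas(a_offs))  # True
```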

I am thinking of something like difflib.SequenceMatcher, but one that works on numeric values with fuzzy comparison instead of hard-compared hashables. It would also need to retain some awareness of the diff (first-derivative) relationship.
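
For reference, quantizing the deltas into coarse buckets does make them hashable, so SequenceMatcher applies as-is — but the fuzziness then becomes a hard threshold at the bucket edge rather than a graceful degradation. A sketch with an arbitrary bucket width of 5 (the `quantized_sim` name and bucket size are illustrative, not part of the question's data):

```python
import difflib

def deltas(xs):
    return [b - a for a, b in zip(xs, xs[1:])]

def quantized_sim(xs, ys, bucket=5):
    # Round each delta to the nearest multiple of `bucket` so the values
    # become hashable for SequenceMatcher; near-equal deltas then compare
    # equal, but anything past the bucket edge is a hard miss.
    qa = [round(d / bucket) for d in deltas(xs)]
    qb = [round(d / bucket) for d in deltas(ys)]
    return difflib.SequenceMatcher(None, qa, qb).ratio()

a = [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [x + 3000 for x in a]
a_rm = a[:2] + a[3:]

print(quantized_sim(a, a_offs))  # 1.0 -- identical quantized deltas
print(quantized_sim(a, a_rm))    # < 1.0 -- one point removed merges two deltas
```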

This seems to be a dynamic programming problem, but I can't figure out how to construct an appropriate cost metric.
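
One candidate cost metric — an assumption on my part, not a known-good answer — is a Needleman–Wunsch-style alignment over the delta sequences: substituting delta p for delta q costs their relative difference |p − q| / max(p, q), a gap (missing or extra delta) costs a full unit, and the total cost is normalized by the longer delta sequence and inverted into [0, 1]. A sketch (the `sim` name, gap cost, and normalization are all illustrative choices):

```python
def deltas(xs):
    return [b - a for a, b in zip(xs, xs[1:])]

def sim(xs, ys):
    """Alignment-based similarity over first differences (a sketch).

    Substituting delta p for delta q costs |p - q| / max(p, q), so
    identical deltas are free and wildly different ones cost ~1; a
    gap (missing or extra delta) costs a full unit.  The total cost
    is normalized by the longer delta sequence and inverted to 0..1.
    """
    da, db = deltas(xs), deltas(ys)
    n, m = len(da), len(db)
    # dp[i][j] = minimal cost of aligning da[:i] against db[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)              # i unmatched leading deltas
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p, q = da[i - 1], db[j - 1]
            sub = abs(p - q) / max(abs(p), abs(q), 1)
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitute
                           dp[i - 1][j] + 1.0,       # delta missing in ys
                           dp[i][j - 1] + 1.0)       # extra delta in ys
    return 1.0 - dp[n][m] / max(n, m, 1)

a = [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [x + 3000 for x in a]
a_rm = a[:2] + a[3:]
a_stretch = [1000, 1110, 1231, 1363, 1506, 1660, 1825, 2001, 2188, 2386, 2595]

print(sim(a, a_offs))     # 1.0 -- delta sequences identical
print(sim(a, a_rm))       # penalized for the gap and the merged delta
print(sim(a, a_stretch))  # penalized ~1/11 per delta for the 1.1x scale
```

This scores the removed-point case somewhat below the hoped-for ~0.91, because a missing point fuses two deltas into their sum and the recurrence sees that as a gap plus a bad substitution; a dedicated merge/split move (matching one delta against the sum of two adjacent ones) would model that more faithfully, and reordered deltas would still need separate handling.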

verbamour
  • Can you also have a scaling factor? This would complicate things. For instance, is `[1, 2, 3]` *very* similar to `[1000, 2000, 3000]` ? If not, I have some ideas ... – Prune May 02 '18 at 20:59
  • Yes. As a_stretch and a_squeeze show, scaling is something I have to accommodate. In your case (`Sim([1, 2, 3], [1000, 2000, 3000])`), the scaling is a factor of 1000, so that would be a 0.001 match. However, if you have a technique that would do everything but scaling, I would be interested. – verbamour May 03 '18 at 14:21
  • Apologies; I read too fast. I'll think about that today. – Prune May 03 '18 at 15:56
  • How "regular" are the list elements? I'm thinking of how to normalize the data to recognize a variety of situations, align the elements that are supposed to match, etc. Do we get outliers, such as (1, 2000, 3000, 4000) vs (1001, 2001, 3001, 4001)? You're trying to recognize a full linear transformation with gaps and transpositions. – Prune May 03 '18 at 16:35
  • In my test data, I ensured the diffs (deltas) were distinct and evenly spaced for demonstration purposes. In real data, I would expect the diffs to be more randomly distributed, with collisions (duplicate diffs) and a potentially large range of diff sizes. I thought about comparing distribution of diffs, but I don't want to lose ordering information. I would score `Sim([1, 2000, 3000, 4000], [1001, 2001, 3001, 4001])` as 0.75 since three latter 1000 diffs matched perfectly, but the 1999 diff did not match the first 1000 diff. – verbamour May 03 '18 at 17:30

0 Answers