I have ordered lists of numbers (like barcode positions or spectral lines) that I am trying to compare for similarity. Ideally, comparing two lists would yield a score of 1.0 for a perfect match, degrading gracefully toward 0.0 as the lists diverge.
The lists could be offset by an arbitrary amount, and that should not degrade the match; the diffs between adjacent items are therefore the most applicable characterization.
Due to noise in the system, some items may be missing (alternatively, extra items may be inserted, depending on point of view).
The diff values may be reordered.
The diff values may be scaled.
Several of the transformations above may be applied at once, and each should reduce the similarity proportionally.
Here is some test data:
# deltas
d = [100+(i*10) for i in range(10)] # [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
d_swap = d[:4] + [d[5]] + [d[4]] + d[6:] # [100, 110, 120, 130, 150, 140, 160, 170, 180, 190]
# absolutes
a = [1000+j for j in [0]+[sum(d[:i+1]) for i in range(len(d))]] # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_offs = [i+3000 for i in a] # [4000, 4100, 4210, 4330, 4460, 4600, 4750, 4910, 5080, 5260, 5450]
a_rm = a[:2] + a[3:] # [1000, 1100, 1330, 1460, 1600, 1750, 1910, 2080, 2260, 2450]
a_add = a[:7] + [(a[6]+a[7])//2] + a[7:] # [1000, 1100, 1210, 1330, 1460, 1600, 1750, 1830, 1910, 2080, 2260, 2450]
a_swap = [1000+j for j in [0]+[sum(d_swap[:i+1]) for i in range(len(d_swap))]] # [1000, 1100, 1210, 1330, 1460, 1610, 1750, 1910, 2080, 2260, 2450]
a_stretch = [1000+j for j in [0]+[int(sum(d[:i+1])*1.1) for i in range(len(d))]] # [1000, 1110, 1231, 1363, 1506, 1660, 1825, 2001, 2188, 2386, 2595]
a_squeeze = [1000+j for j in [0]+[int(sum(d[:i+1])*0.9) for i in range(len(d))]] # [1000, 1090, 1189, 1297, 1414, 1540, 1675, 1819, 1972, 2134, 2305]
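As a sanity check on the data above, the deltas can be recovered from the absolutes with a small helper (the name diffs is mine); it also shows that the offset case disappears entirely in diff space:

def diffs(seq):
    # Adjacent differences (first derivative) of an ordered list.
    return [b - a for a, b in zip(seq, seq[1:])]

assert diffs(a) == d              # recovers the original deltas exactly
assert diffs(a) == diffs(a_offs)  # a pure offset is invisible in diff space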
Sim(a, a_offs) should be 1.0, since an offset is not penalized.
Sim(a, a_rm) and Sim(a, a_add) should be about 0.91, because 10 of 11 items (or 11 of 12) still match.
Sim(a, a_swap) should be about 0.96, because one diff is out of place (possibly with a further penalty based on distance if it has moved more than one position).
Sim(a, a_stretch) and Sim(a, a_squeeze) should be about 0.9, because the diffs were scaled by about 1 part in 10.
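For the stretch/squeeze cases specifically, the scale factor can be estimated as the ratio of the mean diffs, and its deviation from 1.0 maps directly onto the expected penalty. A minimal sketch using the diffs helper above (scale_factor is my own name, and the linear penalty is just one plausible choice):

def scale_factor(xs, ys):
    # Uniform scale between two diff sequences, estimated from their means.
    return (sum(ys) / len(ys)) / (sum(xs) / len(xs))

s = scale_factor(diffs(a), diffs(a_stretch))  # ~1.1
print(1.0 - abs(s - 1.0))                     # ~0.9, matching the target above

The squeeze case gives s of about 0.9 and the same score; dividing each diff by its sequence mean would then leave a scale-free residual to compare for the remaining transformations.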
I am thinking of something like difflib.SequenceMatcher, but one that works on numeric values with fuzzy comparison instead of hard-compared hashables. It would also need to retain some awareness of the diff (first-derivative) relationship.
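One way to shoehorn the problem into SequenceMatcher as-is would be to quantize the diffs into coarse buckets so that near-equal values hash equal, though that trades real fuzziness for brittle bucket edges (two close diffs straddling a boundary still compare unequal). A rough sketch:

from difflib import SequenceMatcher

def quantize(ds, step=10):
    # Round each diff to the nearest multiple of step so close values collide.
    return [round(x / step) for x in ds]

m = SequenceMatcher(None, quantize(diffs(a)), quantize(diffs(a_rm)))
print(m.ratio())  # ~0.84: sees the missing item, but the merged 230 diff costs extra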
This seems to be a dynamic programming problem, but I can't figure out how to construct an appropriate cost metric.
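For what it's worth, the direction I would experiment in is a Needleman-Wunsch-style global alignment over the diff sequences: pairing two diffs costs their relative difference, skipping a diff (a missing or extra item) costs a fixed gap penalty, and the total cost is normalized into a [0, 1] score. This is only a sketch under assumptions of my own (the gap_cost value and the linear normalization), and it does not hit the target numbers above without tuning:

def align_sim(xs, ys, gap_cost=1.0):
    # cost[i][j] = minimum cost of aligning xs[:i] with ys[:j]
    n, m = len(xs), len(ys)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            denom = max(abs(xs[i-1]), abs(ys[j-1])) or 1   # guard zero diffs
            sub = abs(xs[i-1] - ys[j-1]) / denom           # 0.0 identical .. ~1.0 disjoint
            cost[i][j] = min(cost[i-1][j-1] + sub,         # pair the two diffs
                             cost[i-1][j] + gap_cost,      # unmatched diff in xs
                             cost[i][j-1] + gap_cost)      # unmatched diff in ys
    return 1.0 - cost[n][m] / max(n, m)

print(align_sim(diffs(a), diffs(a_offs)))  # 1.0: the offset costs nothing
print(align_sim(diffs(a), diffs(a_rm)))    # ~0.85: one gap plus one distorted diff

Note that removing an item merges two diffs into one (110 + 120 -> 230 in a_rm), so a merge move that pairs one diff against the sum of two adjacent ones would model the rm/add cases better, and an adjacent-transposition move (as in Damerau-Levenshtein distance) would cover the swap case, which plain alignment charges as two mismatches.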