
I have a given list of values and a collection of lists (lists A, B, and C) with similar values. I'm trying to find a way to return the list that most closely matches the given list. I'd like to use a least squares fit as the distance metric.

given = [0, 1, 2, 3, 4, 5]
A = [0.1, 0.9, 2, 3.3, 3.6, 5.1]
B = [-0.1, 0.9, 2.1, 3.1, 3.9, 5]
C = [0, 1.1, 2, 2.9, 4, 5.1]

So in this case, it would return C as the closest match to given.

I thought I could incorporate something like:

match = [min([val[idx] for val in [A,B,C]], key=lambda x: abs(x-given[idx])) for idx in range(len(given))]

But that only returns the closest value for each list element. I'm not sure how to then identify list C as the closest point-by-point match.

Also, if the lists are different lengths, I really don't know what to do if I'm not comparing them index by index. For example:

given = [0, 1, 2, 3, 4, 5]
A = [0.1, 0.9, 2, 3.3, 3.6, 2, 5.1, 3, 6.8, 7.1, 8.2, 9]
B = [-0.1, 0.9, 2.1, 3.1, 3.9]
C = [-1.7, -1, 0, 1.1, 2, 2.9, 4, 5.1, 6, 7.1, 8]

would still return C as the closest match.

I'm also using Numpy but haven't found anything useful. Any help would be greatly appreciated!

Joe Flip
  • I think you should begin by formalizing the required distance metric. In other words, what is it *exactly* that makes `given` closer to `C` than to `A` or `B`? Without this, the question is too vague to be answerable. – NPE Nov 26 '12 at 15:25
  • See this question on SO: http://stackoverflow.com/questions/9365184/computing-similarity-between-two-lists – asthasr Nov 26 '12 at 15:26
  • @NPE is right. I agree, some distance metric should be selected. – crow16384 Nov 26 '12 at 15:27

2 Answers


A pure-Python solution isn't the most efficient, but here's one implementation that uses the sum of squared differences (least squares) as the distance metric.

def distance(x,y):
    return sum( (a-b)**2 for a,b in zip(x,y) )

given = [0, 1, 2, 3, 4, 5]
A = [0.1, 0.9, 2, 3.3, 3.6, 5.1]
B = [-0.1, 0.9, 2.1, 3.1, 3.9, 5]
C = [0, 1.1, 2, 2.9, 4, 5.1]

min((A,B,C),key=lambda x:distance(x,given))

If the inputs are NumPy arrays (np.ndarray) of the same size, distance could be written as:

def distance(x,y):
    return ((x-y)**2).sum()
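
With NumPy, all equal-length candidates can also be scored in one pass. This is a minimal sketch, assuming every candidate has the same length as given; the names `names`, `stacked`, and `sse` are made up for illustration.

```python
import numpy as np

given = np.array([0, 1, 2, 3, 4, 5], dtype=float)
A = [0.1, 0.9, 2, 3.3, 3.6, 5.1]
B = [-0.1, 0.9, 2.1, 3.1, 3.9, 5]
C = [0, 1.1, 2, 2.9, 4, 5.1]

names = ["A", "B", "C"]                      # illustrative labels
stacked = np.array([A, B, C], dtype=float)   # shape (3, 6), one row per candidate

# Broadcasting subtracts `given` from every row; axis=1 sums the squares per row.
sse = ((stacked - given) ** 2).sum(axis=1)

# argmin returns the row index with the smallest sum of squared errors.
best = names[int(np.argmin(sse))]
print(best)  # prints "C"
```

The sums of squared errors come out at roughly 0.28 for A, 0.05 for B, and 0.03 for C, so C is the closest match.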
mgilson
  • @JoeFlip -- Yes. It starts from the beginning of the list. I really don't know how you want to handle sequences which have un-equal lengths, but `itertools.izip_longest` *might* be useful for that case (instead of my `zip` above). – mgilson Nov 26 '12 at 15:39
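
For the unequal-length case, one possible metric (an assumption, since the question does not define one) is to slide the shorter list along the longer one and keep the smallest mean squared error over all alignments. The helper name `best_window_mse` is hypothetical, and the sketch assumes non-empty lists.

```python
def best_window_mse(x, y):
    """Smallest mean squared error over every alignment of the
    shorter list inside the longer one (assumes non-empty lists)."""
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    n = len(short)
    return min(
        # Dividing by n keeps scores comparable across different window lengths.
        sum((a - b) ** 2 for a, b in zip(short, long_[start:start + n])) / float(n)
        for start in range(len(long_) - n + 1)
    )

given = [0, 1, 2, 3, 4, 5]
candidates = {
    "A": [0.1, 0.9, 2, 3.3, 3.6, 2, 5.1, 3, 6.8, 7.1, 8.2, 9],
    "B": [-0.1, 0.9, 2.1, 3.1, 3.9],
    "C": [-1.7, -1, 0, 1.1, 2, 2.9, 4, 5.1, 6, 7.1, 8],
}

# Pick the candidate name whose best alignment has the lowest score.
closest = min(candidates, key=lambda name: best_window_mse(given, candidates[name]))
print(closest)  # prints "C"
```

Under this metric C wins because its slice starting at index 2 lines up with given almost exactly. A different definition of "closest" (for example, penalizing the unmatched elements) could rank the lists differently.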

You can use the sum of the squared errors. I made a quick example:

from copy import copy

def squaredError(a, b):
    r = copy(a)

    for i in range(len(a)):
        r[i] -= b[i]
        r[i] *= r[i]

    return sum(r)

given = [0, 1, 2, 3, 4, 5]
A = [0.1, 0.9, 2, 3.3, 3.6, 5.1]
B = [-0.1, 0.9, 2.1, 3.1, 3.9, 5]
C = [0, 1.1, 2, 2.9, 4, 5.1]

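# print() with a single argument works in both Python 2 and Python 3
print(squaredError(given, A))
print(squaredError(given, B))
print(squaredError(given, C))

# Pair each candidate with its error; min() compares the error first
match = min(map(lambda x: (squaredError(given, x), x), [A,B,C]))[1]
print(match)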
Fred