
I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.

Right now, I'm only comparing min, max, and sums of the arrays, but I was wondering if there is a better algorithm out there? Any type of hashing algorithm would need to be insensitive to small floating point differences between the arrays.

EDIT: What I'm trying to do is to verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data and relatively insensitive to small differences between each element.

CookieOfFortune
  • Do you want to know if they are identical if the floating point values are approximated? – user1952500 Mar 05 '13 at 21:21
  • "Similarities" in what sense? Like what about comparing their means and standard deviations? – BrenBarn Mar 05 '13 at 21:23
  • @user1952500 pretty much yes, but I'm not sure how different the values might be. – CookieOfFortune Mar 05 '13 at 21:26
  • @BrenBarn I am already comparing them that way, I would like to just have one number that will tell me if the arrays are similar and how similar they are. – CookieOfFortune Mar 05 '13 at 21:28
  • @CookieOfFortune: Do you really need one number? You can pass around a tuple as easily as a number, and you can write a "close_enough" function that takes that tuple as easily as you can write one that takes a number… – abarnert Mar 05 '13 at 21:29
  • @abarnert I suppose a tuple would be ok too. But what combination of numbers do I pass around? I'm already doing the basic statistical measurements (mean, stddev, min/max). – CookieOfFortune Mar 05 '13 at 21:30
  • This isn't really a programming question but a math/statistics question. There are different ways to characterize "how similar" two sets of numbers are. Taking one set of numbers and adding 100 to all of them produces another set that is similar in overall "shape" but different in overall magnitude. No one can tell you what to calculate until you can explicitly characterize what you mean by "similar". – BrenBarn Mar 05 '13 at 21:37
  • @BrenBarn You're right, I will post this to the stats exchange. – CookieOfFortune Mar 05 '13 at 21:38
  • What are you _actually_ trying to do here? What do the numbers represent? There's a very good chance that whatever you're trying to design has already been designed, and is actually much more complicated than just reducing things to a single number, and whatever you come up with from first principles without reading the research or even really thinking through the problem will not be very useful. – abarnert Mar 05 '13 at 21:46
  • @abarnert I have added an update about what I'm trying to calculate and what properties I'm sensitive to. – CookieOfFortune Mar 05 '13 at 21:52
  • @CookieOfFortune: Your updated question still just describes them as "arrays generated by two systems". Without knowing anything at all about those systems or the arrays they generate, or what those values represents, or anything else, nobody can give you anything other than vague and generic answers which will most likely be pretty bad for your actual application. – abarnert Mar 05 '13 at 22:00
  • @abarnert Sometimes the arrays are images, other times they're a list of numbers, it varies depending on what I'm running. I am indeed looking for something general but there does not seem to be a simple solution. – CookieOfFortune Mar 05 '13 at 22:01
  • @CookieOfFortune: If they're images, there are image-fingerprinting algorithms, or more generally image-comparison algorithms, that are going to do a lot better than any general-purpose hashing or comparison algorithm. Is there a reason you have to treat "images" and "list of numbers" the same? – abarnert Mar 05 '13 at 22:17
  • @abarnert Mainly for simplicity and curiosity's sake. – CookieOfFortune Mar 05 '13 at 22:26

2 Answers


I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.

For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".

from statistics import mean, stdev

def mean_stdev(a):
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return (diff < 0.25 * stdev_a and diff < 0.25 * stdev_b)

Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.

Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.


In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:

diffs = [a - b for a, b in zip(seq, sorted(seq))]

This gives you the difference between each element and the element that would be there in sorted position. You can build a stdev-like measure out of this (square each value, average, sqrt), or take the mean absolute diff, etc.
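Both measures could be sketched like this (a minimal illustration; `disorder_measures` is a made-up helper name, not anything standard):

```python
import math

def disorder_measures(seq):
    """How 'out-of-order' a sequence is, via element-wise diffs against its sorted form."""
    diffs = [a - b for a, b in zip(seq, sorted(seq))]
    n = len(diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / n)  # stdev-like: square, average, sqrt
    mad = sum(abs(d) for d in diffs) / n            # mean absolute diff
    return rms, mad
```

An already-sorted sequence scores (0, 0); the more shuffled the values are, the larger both numbers get.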

Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
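The index-distance variant, for instance, might look like this (`rank_displacement` is a hypothetical name for illustration):

```python
def rank_displacement(seq):
    """Mean distance between each element's index and its index in sorted order."""
    # Indices of the elements in the order they would appear when sorted
    order = sorted(range(len(seq)), key=lambda i: seq[i])
    # rank[i] = position element i would occupy in the sorted sequence
    rank = [0] * len(seq)
    for sorted_pos, orig_idx in enumerate(order):
        rank[orig_idx] = sorted_pos
    return sum(abs(i - r) for i, r in enumerate(rank)) / len(seq)
```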

abarnert
  • I guess the main issue with the main statistical measurements is that they're not sensitive to things like shift (where the statistical distribution is similar but the contents of the array are in a different order). I will make an edit to my question to address this. – CookieOfFortune Mar 05 '13 at 21:35
  • @CookieOfFortune: OK, I can edit my answer to deal with that. – abarnert Mar 05 '13 at 21:37

Depends entirely on your definition of "compare their similarities".

What features do you want to compare? What features can you identify? Are there identifiable patterns? e.g. in this set there are 6 critical points, there are 2 discontinuities... etc...

You've already mentioned comparing the min/max/sum; and means and standard deviations have been talked about in comments too. These are all features of the set.

Ultimately, you should be able to take all these features and build an n-dimensional descriptor, for example [min, max, mean, std, ...].

You can then compare these n-dimensional descriptors to define whether one is "less", "equal" or "more" than the other. If you want to classify other sets into whether they are more like "set A" or more like "set B", you could look into classifiers.
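As a rough sketch of that idea (the descriptor's contents and the Euclidean metric here are just one arbitrary choice, not a recommendation):

```python
from statistics import mean, stdev

def descriptor(a):
    """A simple n-dimensional feature vector for a data set: [min, max, mean, stdev]."""
    return [min(a), max(a), mean(a), stdev(a)]

def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptors; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(d1, d2)) ** 0.5
```

In practice you would likely want to normalize each feature before computing the distance, so that no single feature dominates.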

See:

Classifying High-Dimensional Patterns Using a Fuzzy Logic

Support Vector Machines

Meirion Hughes