Similarity measures for doubles

Question

I am designing a matching system and want to compute the similarity between pairs of numbers. Assume we have two sets of numbers:

15 13 17 100

1 14 15 105 27 30

I would now like to compute the similarity a) between these two sets of numbers and b) between each pair of individual numbers (for example sim(15,1), sim(13,1), etc.), in each case as a value between 0 and 1.

My question is whether similarity measures for this task exist in the literature. A Java implementation would be even more welcome.

UPDATE:

There are a large number of measures for string similarity (e.g. Levenshtein distance), but I could not find anything equivalent for numbers.

The goal is to use this in a matching system which should return the similarity of two database rows between 0 and 1.

Thank you in advance!

By similarity measure, do you mean subtracting one from the other and taking the absolute value? — Eddie Martinez, Jul 08 '14 at 17:56
Can you provide additional details as to what you are using this for? It's a rather odd question, as the similarity between two numbers needs to be defined using some sort of constraint. Eduardo suggests one such constraint above: distance. — Sesame, Jul 08 '14 at 18:06
A note to keep in mind, distance and similarity are inverses. Low distance = high similarity. — pbible, Jul 08 '14 at 18:11
Maybe you can consider the numbers (or sequences of numbers) as Strings, then you might have a look at this: http://en.wikipedia.org/wiki/Levenshtein_distance — Renato, Jul 08 '14 at 18:26
Actually, I am looking for something like Levenshtein distance for numbers ;-). So I want to know if there are standardized ways of computing the similarity between two numbers or two sets of numbers. Of course I can come up with a lot of ad hoc methods like Min(a,b) / Max(a,b) or something like that. However, I would like to know if there are standard ways of doing this that I can use as references. — user1729603, Jul 08 '14 at 18:33
There is a vast amount of literature on computing the similarity of strings. However, I have not found good sources on the similarity of numbers! — user1729603, Jul 08 '14 at 18:36
The purpose is to build a matching system for databases which returns a confidence value indicating how similar two items, e.g. database rows, are. — user1729603, Jul 08 '14 at 22:31
So the proposed approach would have to return reasonable results for any arbitrary number set. — user1729603, Jul 08 '14 at 22:32
When do you consider two numbers similar? If they are within 1%? If they are less than 100 apart? You could try the absolute difference, or take their ratio, or use logarithms, or compute Levenshtein distance on their bit patterns. — tobias_k, Jul 09 '14 at 09:08

Patricia Shanahan · Accepted Answer · 2014-07-09T22:47:24.273

The bad news, as you pointed out, is that it has to work for arbitrary number sets. The good news is that you do have a sample from the number set.

You need to take into account the range and distribution of numbers in the whole column.

Suppose row A has value 1 in a particular column, and row B has value 3. Consider two different cases:

All rows have value 1, 2, or 3, with roughly equal frequency. In this case, row A and row B are dissimilar in that column.
All rows have values from the range 1 through 100, again with roughly equal frequency. Now row A and row B are quite similar in that column - most pairs of rows have values that differ by more than 2.
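
One simple way to make the two cases above concrete is to scale the absolute difference by the observed range of the column. The following is only a minimal sketch under that assumption; the class name RangeSimilarity and its parameters are hypothetical and it is not a measure taken from the literature.

```java
// Hypothetical helper: similarity of two values relative to the column's range.
final class RangeSimilarity {
    private RangeSimilarity() {}

    /**
     * Returns a value in [0, 1]: 1 minus the absolute difference of a and b,
     * scaled by the column range (columnMax - columnMin).
     */
    static double similarity(double a, double b, double columnMin, double columnMax) {
        double range = columnMax - columnMin;
        if (range <= 0.0) {
            // Assumption: a column with a single distinct value makes every pair identical.
            return 1.0;
        }
        double scaled = Math.abs(a - b) / range;
        // Clamp in case a or b lies outside the observed range.
        return 1.0 - Math.min(1.0, scaled);
    }
}
```

With values 1 and 3, RangeSimilarity.similarity(1, 3, 1, 3) gives 0.0 for the first case, while RangeSimilarity.similarity(1, 3, 1, 100) gives 1 - 2/99, roughly 0.98, for the second. This uses only the range, so it ignores the shape of the distribution and is sensitive to outliers.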

In the context of a database you may have additional information about the database design that should inform your row similarity measure. Even without that, you can look at the distribution of numbers in a numeric column and ask "What is the probability of two independent rows being this similar in this column by chance?".
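
The "by chance" question can be approximated directly from the column's sample. The sketch below is one possible frequency-based reading of it, not the Bayesian method from the papers mentioned in this answer: it counts the fraction of value pairs in the column that are at least as far apart as the two values being compared. The class name ChanceSimilarity is hypothetical, and the quadratic loop assumes the column sample is small enough to enumerate.

```java
// Hypothetical helper: empirical, distribution-aware similarity for one numeric column.
final class ChanceSimilarity {
    private ChanceSimilarity() {}

    /**
     * Returns the fraction of distinct pairs of column values whose absolute
     * difference is at least |a - b|. A result near 1 means that a randomly
     * chosen pair is rarely closer than a and b are.
     */
    static double similarity(double a, double b, double[] column) {
        if (column.length < 2) {
            throw new IllegalArgumentException("need at least two column values");
        }
        double d = Math.abs(a - b);
        long pairs = 0;
        long atLeastAsFar = 0;
        // Enumerate every unordered pair of sampled values: O(n^2).
        for (int i = 0; i < column.length; i++) {
            for (int j = i + 1; j < column.length; j++) {
                pairs++;
                if (Math.abs(column[i] - column[j]) >= d) {
                    atLeastAsFar++;
                }
            }
        }
        return (double) atLeastAsFar / pairs;
    }
}
```

For a column holding 1, 2, 3 once each, comparing 1 and 3 gives 1/3; for a column holding each of 1 through 100 once, the same comparison gives 0.98. For large tables, sampling pairs or precomputing a histogram of differences would be a natural substitute for the full enumeration.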

I found some papers in this general area by searching for "Bayesian pairwise similarity". In particular, although it is written for a different domain, "Measuring similarity between gene expression profiles: a Bayesian approach" may contain some relevant ideas.

That's roughly what I meant by an ad hoc strategy. But maybe this is the best way to go... Let's wait and see if someone knows of published scientific work on this. — user1729603, Jul 09 '14 at 16:51
