I am designing a matching system and want to compute the similarity between pairs of numbers. Assume we have two sets of numbers:

15 13 17 100

1 14 15 105 27 30

I would now like to compute the similarity a) between these two sets of numbers and b) between each pair of individual numbers (for example sim(15,1), sim(13,1), etc.), in each case getting back a similarity value between 0 and 1.

My question is whether similarity measures for this task exist in the literature. If there is a Java implementation of them, I would appreciate that even more.

UPDATE:

There are many measures for string similarity (e.g. the Levenshtein distance), but I could not find anything equivalent for numbers.

The goal is to use this in a matching system which should return the similarity of two database rows between 0 and 1.

Thank you in advance!

user1729603
  • By similarity measure, do you mean subtracting one number from the other and taking the absolute value? – Eddie Martinez Jul 08 '14 at 17:56
  • Can you provide additional details as to what you are using this for? It's a rather odd question, as the similarity between two numbers needs to be defined using some sort of constraint. Eduardo suggests one such constraint above: distance. – Sesame Jul 08 '14 at 18:06
  • A note to keep in mind, distance and similarity are inverses. Low distance = high similarity. – pbible Jul 08 '14 at 18:11
  • Maybe you can consider the numbers (or sequences of numbers) as Strings, then you might have a look at this: http://en.wikipedia.org/wiki/Levenshtein_distance – Renato Jul 08 '14 at 18:26
  • Actually, I am looking for something like Levenshtein distance for numbers ;-). So I want to know if there are standardized ways of computing the similarity between two numbers or two sets of numbers. Of course I can come up with a lot of ad hoc methods like Min(a,b) / Max(a,b) or something like that. However, I would like to know if there are standard ways of doing this that I can use as references. – user1729603 Jul 08 '14 at 18:33
  • There is a vast amount of literature on computing the similarity of strings. However, I have not found good sources on the similarity of numbers! – user1729603 Jul 08 '14 at 18:36
  • What is the purpose? There may be domain-specific ideas. – Patricia Shanahan Jul 08 '14 at 18:50
  • The purpose is to build a matching system for databases which returns a confidence value for how similar two items (e.g. database rows) are. – user1729603 Jul 08 '14 at 22:31
  • So the proposed approach would have to return reasonable results for any arbitrary number set. – user1729603 Jul 08 '14 at 22:32
  • When do you consider two numbers similar? If they are within 1%? If they are less than 100 apart? You could try the absolute difference, or take their ratio, or use logarithms, or do Levenshtein on their bit patterns. – tobias_k Jul 09 '14 at 09:08

1 Answer

The bad news, as you pointed out, is that it has to work for arbitrary number sets. The good news is that you do have a sample from the number set.

You need to take into account the range and distribution of numbers in the whole column.

Suppose row A has value 1 in a particular column, and row B has value 3. Consider two different cases:

  1. All rows have value 1, 2, or 3, with roughly equal frequency. In this case, row A and row B are dissimilar in that column.
  2. All rows have values from the range 1 through 100, again with roughly equal frequency. Now row A and row B are quite similar in that column - most pairs of rows have values that differ by more than 2.
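The two cases can be made concrete with a range-normalized similarity. This is only a minimal sketch of one ad hoc measure, not an established standard: the class name `RangeSimilarity`, the method `similarity`, and the formula 1 - |a - b| / (max - min) are assumptions chosen for illustration, with `min` and `max` taken from the whole column.

```java
// Hypothetical helper: similarity of two values relative to the column's range.
final class RangeSimilarity {

    /**
     * Returns 1 - |a - b| / (max - min), clamped to [0, 1].
     * min and max are the smallest and largest values seen in the column.
     */
    static double similarity(double a, double b, double min, double max) {
        double range = max - min;
        if (range <= 0) {
            // Assumption: a constant column carries no information, so treat values as identical.
            return 1.0;
        }
        double s = 1.0 - Math.abs(a - b) / range;
        return Math.max(0.0, Math.min(1.0, s));
    }

    public static void main(String[] args) {
        // Case 1: column values lie in 1..3, so 1 and 3 are as far apart as possible.
        System.out.println(similarity(1, 3, 1, 3));   // 0.0
        // Case 2: column values lie in 1..100, so 1 and 3 are close (about 0.98).
        System.out.println(similarity(1, 3, 1, 100));
    }
}
```

This uses only the range, so a single outlier in the column can make every other pair look similar; it ignores how the values are distributed within the range.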

In the context of a database you may have additional information about the database design that should inform your row similarity measure. Even without that, you can look at the distribution of numbers in a numeric column and ask "What is the probability of two independent rows being this similar in this column by chance?".
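One possible reading of that question, sketched in Java: score a pair by the fraction of all value pairs in the column that differ by more than the pair in question. The class name `EmpiricalSimilarity`, the method name, and the choice of a strict "more than" comparison are assumptions for illustration, not a published measure.

```java
// Hypothetical helper: similarity based on the empirical distribution of a column.
final class EmpiricalSimilarity {

    /**
     * Returns the fraction of unordered pairs of column values whose absolute
     * difference is strictly greater than |a - b|. The result lies in [0, 1]:
     * values near 1 mean that most pairs in the column are farther apart than a and b.
     */
    static double similarity(double a, double b, double[] column) {
        if (column.length < 2) {
            throw new IllegalArgumentException("need at least two column values");
        }
        double d = Math.abs(a - b);
        long farther = 0;
        long total = 0;
        // Brute force over all pairs: O(n^2), fine for a sample of the column.
        for (int i = 0; i < column.length; i++) {
            for (int j = i + 1; j < column.length; j++) {
                total++;
                if (Math.abs(column[i] - column[j]) > d) {
                    farther++;
                }
            }
        }
        return (double) farther / total;
    }

    public static void main(String[] args) {
        double[] small = {1, 2, 3};
        double[] large = new double[100];
        for (int i = 0; i < large.length; i++) {
            large[i] = i + 1; // values 1..100, each once
        }
        System.out.println(similarity(1, 3, small)); // no pair differs by more than 2
        System.out.println(similarity(1, 3, large)); // 4753 of 4950 pairs differ by more than 2
    }
}
```

Because the loop is quadratic in the number of values, a large column would need a random sample or a sorted-array approach. Note also that under this definition two equal values in a column where most values are equal get a low score, which matches the "by chance" framing but may not be what every matching system wants.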

I found some papers in this general area by searching for "Bayesian pairwise similarity". In particular, although it addresses a different domain, "Measuring similarity between gene expression profiles: a Bayesian approach" may contain some relevant ideas.

Patricia Shanahan
  • That's roughly what I meant by an ad hoc strategy. But maybe this is the best way to go... Let's wait and see if someone knows of some published scientific work on that. – user1729603 Jul 09 '14 at 16:51