I need some help defining a custom similarity measure.

I have a dataset whose elements are defined by 4 attributes. As an example, consider the following two items:

Element 1:

A1: "R1", "R3", "R4", "R7"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb"


Element 2:

A1: "R1", "R2"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb", "ccc", "ddd", "eee", "fff"

I have to implement a similarity measure that satisfies the following conditions:

1 - If the A2 value is the same, the two elements must belong to the same cluster.

2 - If two elements have at least one common value on A4, the two elements must belong to the same cluster.

I need to use a sort of weighted Jaccard measure. Is it mathematically correct to define a similarity measure that sums the Jaccard distance of each attribute and then adds a high weight when conditions 1 and 2 are satisfied for A2 and A4?

If so, how can I transform the similarity matrix into a distance matrix?

betto86
  • `Is it mathematically correct to define ... ` well that's certainly not a programming question. There's a couple of things a transformation must fulfill to be a metric. You can look it up, then you have to check... Probably off-topic here. – cel Sep 29 '15 at 16:08

1 Answer

(1) Distance = 1 - similarity. This is a common conversion, and it works as long as the similarity is scaled to the [0, 1] range.

(2) Summing the distances of the attributes is valid, although you may wish to scale it back to the [0, 1] range.

(3) Putting a high weight is not correct for what you've described. If the A2 or A4 values show a match, simply set the distance to 0. The clustering is a requirement, not merely strong advice. Is there some other semantic to your distance function, that you didn't want to take this route?
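
Points (1) to (3) can be sketched in Python as follows. This is only an illustration under a few assumptions: each element is stored as a dict of sets, the helper names `jaccard_distance`, `summed_distance` and `element_distance` are made up for this example, and all four attributes get equal weight.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def summed_distance(x, y, attrs=("A1", "A2", "A3", "A4")):
    """Point (2): sum the per-attribute distances, then scale to [0, 1]."""
    return sum(jaccard_distance(x[k], y[k]) for k in attrs) / len(attrs)


def element_distance(x, y):
    """Point (3): an A2 match or a shared A4 value forces distance 0."""
    if x["A2"] == y["A2"] or x["A4"] & y["A4"]:
        return 0.0
    return summed_distance(x, y)


# The two elements from the question, with each attribute stored as a set.
e1 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb"}}
e2 = {"A1": {"R1", "R2"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"}}

print(summed_distance(e1, e2))   # graded distance, ignoring the two rules
print(element_distance(e1, e2))  # 0.0, because A2 matches
```

For a whole similarity matrix `S` with entries in [0, 1], the same conversion applies entry by entry: `D[i][j] = 1 - S[i][j]`.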

FYI, the basic requirements for the distance function D of a metric are:

D(a, a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
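
The three conditions can be checked by brute force on a small sample. The sketch below is not a proof, only a test over the items it is given; `check_metric_axioms` and the toy data are hypothetical. It also shows, on made-up items, that a distance forced to 0 whenever A2 or A4 matches can break the triangle inequality.

```python
from itertools import product


def check_metric_axioms(items, dist, tol=1e-12):
    """Return the violated conditions as (name, witnesses) tuples."""
    violations = []
    for a in items:
        if abs(dist(a, a)) > tol:                      # D(a, a) = 0
            violations.append(("identity", (a,)))
    for a, b in product(items, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:         # D(a, b) = D(b, a)
            violations.append(("symmetry", (a, b)))
    for a, b, c in product(items, repeat=3):
        if dist(a, b) + dist(b, c) + tol < dist(a, c):  # triangle inequality
            violations.append(("triangle", (a, b, c)))
    return violations


def forced_zero_distance(x, y):
    """Toy distance: 0 if A2 matches or A4 values overlap, else 1."""
    a2_x, a4_x = x
    a2_y, a4_y = y
    return 0.0 if a2_x == a2_y or a4_x & a4_y else 1.0


# Made-up items as (A2 value, set of A4 values).
a = ("H1", frozenset({"x"}))
b = ("H1", frozenset({"y"}))
c = ("H2", frozenset({"y"}))

violations = check_metric_axioms([a, b, c], forced_zero_distance)
print([name for name, _ in violations])
```

Here `a` and `b` share A2, `b` and `c` share an A4 value, but `a` and `c` share neither, so D(a,b) + D(b,c) = 0 while D(a,c) = 1.
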
Prune
  • Thank you Prune for all the good hints :) Maybe it's better to treat the clustering condition as very strong advice. If I set the distance to 0, I lose the distance information from all the other attributes. Even if these attributes have a small weight, it's important for me to preserve these differences. What about normalizing the other attributes' distance into a range of [0, 0.5] and giving 0.25 to A2 and 0.25 to A4? I know that cases where only A2 and A4 match will cause some problems, but given the nature of the elements I'm working with, that is a very rare condition. – betto86 Sep 30 '15 at 08:07
  • The suggested metric still doesn't work in general; your requirements insist that *either* A2 or A4 matching *must* have precedence over any other factors, combined. You could give them each 0.34 and leave 0.32 for the remainder, canting the clustering algorithm with a threshold of 0.34 or less. One problem here is that you're trying to handle three disjoint requirements -- two boolean and one gradient -- with a single gradient metric. Are you also writing your own clustering algorithm? You could also handle the boolean requirements there with a pair, such as (True, 0.28). – Prune Sep 30 '15 at 15:58
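
A rough Python sketch of the two options in the last comment, under stated assumptions: the 0.32 "remainder" is taken to be the mean Jaccard similarity of A1 and A3, the 0.34 threshold is read as a similarity threshold, and the names `weighted_similarity` and `paired_distance` are invented for this example.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_similarity(x, y, w_a2=0.34, w_a4=0.34, w_rest=0.32):
    """Boolean A2/A4 terms plus a graded remainder (assumed: A1 and A3)."""
    a2_match = x["A2"] == y["A2"]
    a4_share = bool(x["A4"] & y["A4"])
    rest = (jaccard_similarity(x["A1"], y["A1"])
            + jaccard_similarity(x["A3"], y["A3"])) / 2
    return w_a2 * a2_match + w_a4 * a4_share + w_rest * rest


def paired_distance(x, y, attrs=("A1", "A2", "A3", "A4")):
    """Keep the boolean rule and the graded distance apart, as a pair."""
    must_link = x["A2"] == y["A2"] or bool(x["A4"] & y["A4"])
    graded = 1.0 - sum(jaccard_similarity(x[k], y[k]) for k in attrs) / len(attrs)
    return (must_link, graded)


# The two elements from the question, plus a made-up third element that
# matches e1 on A1 and A3 but on neither A2 nor A4.
e1 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb"}}
e2 = {"A1": {"R1", "R2"}, "A2": {"H1"},
      "A3": {"F1", "F2"}, "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"}}
e3 = {"A1": {"R1", "R3", "R4", "R7"}, "A2": {"H2"},
      "A3": {"F1", "F2"}, "A4": {"zzz"}}

print(weighted_similarity(e1, e2))  # both boolean rules hold
print(weighted_similarity(e1, e3))  # neither rule holds, so at most 0.32
print(paired_distance(e1, e2))
```

With these weights the remainder alone can contribute at most 0.32, so a similarity of 0.34 or more is reached only when A2 or A4 matches. The pair form leaves the must-link decision to the clustering step while keeping the graded differences between elements.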