I need some help defining a custom similarity measure.
I have a dataset whose elements are defined by 4 attributes. As an example, consider the following two items:
Element 1:
A1: "R1", "R3", "R4", "R7"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb"
Element 2:
A1: "R1", "R2"
A2: "H1"
A3: "F1", "F2"
A4: "aaa", "bbb", "ccc", "ddd", "eee", "fff"
I have to implement a similarity measure that satisfies the following conditions:
1 - If the A2 value is the same, the two elements must belong to the same cluster.
2 - If two elements have at least one common value on A4, the two elements must belong to the same cluster.
I need to use a sort of weighted Jaccard measure. Is it mathematically correct to define a similarity measure that sums the Jaccard distance of each attribute and then adds a high weight when conditions 1 and 2 are satisfied for A2 and A4?
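To make the idea concrete, here is a minimal sketch of the measure I have in mind, applied to the two example elements. It reads the per-attribute term as the Jaccard similarity (intersection over union). The function names, the `bonus` value of 10, and the normalization by the maximum possible score are all assumptions made for illustration, not part of any library. Note that a large bonus only pushes matching elements close together; whether they actually end up in the same cluster still depends on the clustering algorithm.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(x, y, bonus=10.0):
    """Sum of per-attribute Jaccard similarities plus a bonus for each condition.

    `bonus` is a hypothetical weight; the result is scaled into [0, 1]
    by dividing by the largest score that can be reached.
    """
    attributes = ("A1", "A2", "A3", "A4")
    score = sum(jaccard(x[k], y[k]) for k in attributes)
    if x["A2"] == y["A2"]:   # condition 1: same A2 value
        score += bonus
    if x["A4"] & y["A4"]:    # condition 2: at least one shared A4 value
        score += bonus
    return score / (len(attributes) + 2 * bonus)


element_1 = {
    "A1": {"R1", "R3", "R4", "R7"},
    "A2": {"H1"},
    "A3": {"F1", "F2"},
    "A4": {"aaa", "bbb"},
}
element_2 = {
    "A1": {"R1", "R2"},
    "A2": {"H1"},
    "A3": {"F1", "F2"},
    "A4": {"aaa", "bbb", "ccc", "ddd", "eee", "fff"},
}

print(similarity(element_1, element_2))
```

Dividing by the maximum score keeps the result between 0 and 1, which makes a later conversion to a distance straightforward.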
If so, how can I transform the similarity matrix into a distance matrix?
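One common conversion, which I assume applies here, is d = 1 - s. It only makes sense if every similarity is already scaled to the range [0, 1] and an element has similarity 1 with itself. The sketch below uses a plain nested list as the similarity matrix; the function name and the sample values are hypothetical.

```python
def similarity_to_distance(sim):
    """Convert a square similarity matrix with values in [0, 1] to distances via d = 1 - s."""
    n = len(sim)
    for row in sim:
        if len(row) != n:
            raise ValueError("similarity matrix must be square")
        if any(s < 0.0 or s > 1.0 for s in row):
            raise ValueError("similarities must lie in [0, 1]")
    dist = [[1.0 - sim[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        dist[i][i] = 0.0  # an element is at distance 0 from itself
    return dist


# Hypothetical 3x3 similarity matrix, for illustration only.
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
dist = similarity_to_distance(sim)
```

A matrix of this form can then be passed to any clustering method that accepts precomputed distances. The 1 - s conversion does not by itself guarantee the triangle inequality for a weighted sum with bonuses, so methods that require a true metric may not be appropriate.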