I am trying to compute the similarity between n entities, each described by entity_id, type_of_order, and total_value.

An example of the data might look like:

NR  entity_id  type_of_order  total_value
1   1          A              10
2   1          B              90
3   1          C              70
4   2          B              20
5   2          C              40
6   3          A              10
7   3          B              50
8   3          C              20
9   4          B              50
10  4          C              80

My question is: what is a good way of measuring the similarity between, for example, entity_id 1 and 2 with regard to the type_of_order and the total_value for that type of order?

Would a simple KNN give satisfactory results or should I consider other algorithms?

Any suggestions would be much appreciated.

Marc Zaharescu
  • What distance function is best to use really depends on the application. Try out a few and see which gives you the best results. Common ones include the L1 and L2 norms. You would have to map the type_of_order to a number first. KNN is a classification scheme by the way, not a metric, so I don't know how that would be used for this. Or maybe I misunderstand the question. – Lidae Nov 10 '16 at 02:47
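
To make the L1/L2 suggestion concrete, here is a minimal sketch. It assumes one particular encoding that the question does not specify: each entity becomes a vector with one slot per type_of_order, and a missing type counts as 0. The helper names (`entity_vectors`, `l1`, `l2`) are made up for illustration.

```python
from collections import defaultdict

# Rows from the example table: (entity_id, type_of_order, total_value)
rows = [
    (1, "A", 10), (1, "B", 90), (1, "C", 70),
    (2, "B", 20), (2, "C", 40),
    (3, "A", 10), (3, "B", 50), (3, "C", 20),
    (4, "B", 50), (4, "C", 80),
]

ORDER_TYPES = ["A", "B", "C"]

def entity_vectors(rows):
    """Build one vector per entity, ordered as ORDER_TYPES.

    Assumption: an order type with no row for an entity is treated as 0.
    """
    totals = defaultdict(lambda: dict.fromkeys(ORDER_TYPES, 0.0))
    for entity_id, order_type, value in rows:
        totals[entity_id][order_type] += value
    return {e: [t[k] for k in ORDER_TYPES] for e, t in totals.items()}

def l1(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

vectors = entity_vectors(rows)
print(l1(vectors[1], vectors[2]))  # 110.0
print(l2(vectors[1], vectors[2]))  # sqrt(5900), about 76.81
```

Whether "missing means 0" is sensible depends on what the data means; if an absent order type should not count as a difference, a different encoding would be needed.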

1 Answer

The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.

You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...

  1. relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.

    Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.

  2. relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
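
The two points above can be sketched in code. This is only an illustration under made-up assumptions, not a recommended metric: the `TYPE_DISTANCE` table, the absolute-difference rule for total_value, and the weights `w_type` and `w_value` are all placeholders for decisions that only someone who knows the data can make.

```python
# Hypothetical within-feature distances for type_of_order.
# Here d(A, B) = 1 and d(B, C) = 1, and d(A, C) is taken as their sum;
# any of these values could reasonably be different.
TYPE_DISTANCE = {
    frozenset("AB"): 1.0,
    frozenset("BC"): 1.0,
    frozenset("AC"): 2.0,
}

def type_distance(a, b):
    """Distance between two order types, looked up in the assumed table."""
    return 0.0 if a == b else TYPE_DISTANCE[frozenset((a, b))]

def value_distance(x, y):
    """Assumed: plain absolute difference. A ratio or log scale may fit better."""
    return abs(x - y)

def row_distance(r, s, w_type=10.0, w_value=1.0):
    """Weighted sum of the per-feature distances for rows (type, value).

    The weights are arbitrary: w_type=10 says one step between order types
    counts as much as a difference of 10 in total_value.
    """
    return (w_type * type_distance(r[0], s[0])
            + w_value * value_distance(r[1], s[1]))

print(row_distance(("A", 10), ("B", 20)))  # 20.0
print(row_distance(("A", 10), ("C", 30)))  # 40.0
print(row_distance(("A", 10), ("A", 20)))  # 10.0
print(row_distance(("A", 10), ("B", 10)))  # 10.0
```

Changing the table or the weights changes which rows look "similar", which is exactly why these choices have to come from the meaning of the data.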

Prune