I have a list of key-value pairs. For each key, I want to measure how much its values vary. For example, for a particular key k1, all the values might be the same (best case). For a key k2, half of the values might be one value and the other half another. At the other extreme, for a key kx, none of the values match (worst case).

I want to assign a rank (or a percentage, or any similar score) to each key based on this and produce a final ordering, so that I can filter out the keys that have many different values (let's say those above a predefined threshold rank or percentage).

I think this is related to some concepts I learned in my data mining course, but I cannot recall which ones.

Thanks.

dreamer13134

2 Answers

In data mining terms (see http://en.wikipedia.org/wiki/Association_rule_learning), you could regard a key as a means of predicting its value, in which case you might be interested in the confidence: the fraction of that key's values accounted for by its most frequent value. You could also look at the probability that two randomly chosen values (drawn with replacement) are the same, which is the sum of the squares of the relative frequencies of the values, or at the http://en.wikipedia.org/wiki/Shannon_entropy, which has similar properties but involves taking logarithms.
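A minimal sketch of the first two measures in Python, assuming the values for one key are available as a plain list of hashable items. The function names `confidence` and `match_probability` are made up for illustration and do not come from any library.

```python
from collections import Counter


def confidence(values):
    """Fraction of the values taken up by the most frequent value.

    1.0 means every value is identical; 1/len(values) means all differ.
    """
    counts = Counter(values)
    return max(counts.values()) / len(values)


def match_probability(values):
    """Probability that two values drawn with replacement are equal.

    Computed as the sum of squared relative frequencies.
    """
    n = len(values)
    return sum((count / n) ** 2 for count in Counter(values).values())


# Hypothetical sample data for the three cases in the question.
k1_values = ["a", "a", "a", "a"]   # all values the same
k2_values = ["a", "a", "b", "b"]   # half and half
kx_values = ["a", "b", "c", "d"]   # no two values match

for name, vals in [("k1", k1_values), ("k2", k2_values), ("kx", kx_values)]:
    print(name, confidence(vals), match_probability(vals))
```

Both scores are highest when the values agree, so keys with many different values would be the ones scoring below whatever threshold you pick.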

mcdowella

You could perhaps use some information theory for this.

For each key, you could compute the entropy of the values. The higher the entropy, the more diverse the key's values are. You could use that to rank the keys.
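As a rough sketch of that idea in Python, assuming the input is a list of `(key, value)` tuples: `value_entropy`, `rank_keys_by_entropy`, the sample data, and the threshold of 1.5 bits are all invented here for illustration.

```python
import math
from collections import Counter, defaultdict


def value_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    # p * log2(1/p) summed over the distinct values, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def rank_keys_by_entropy(pairs):
    """Return (key, entropy) tuples sorted from least to most diverse."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    scores = [(key, value_entropy(vals)) for key, vals in grouped.items()]
    return sorted(scores, key=lambda item: item[1])


# Hypothetical sample data matching the three cases in the question.
sample_pairs = [
    ("k1", "a"), ("k1", "a"), ("k1", "a"), ("k1", "a"),
    ("k2", "a"), ("k2", "a"), ("k2", "b"), ("k2", "b"),
    ("kx", "a"), ("kx", "b"), ("kx", "c"), ("kx", "d"),
]

ranking = rank_keys_by_entropy(sample_pairs)

# Keep only keys whose entropy is at or below an arbitrary threshold.
threshold_bits = 1.5
kept_keys = [key for key, h in ranking if h <= threshold_bits]
print(ranking)
print(kept_keys)
```

Note that the maximum possible entropy grows with the number of values a key has (it is log2 of the count when all values differ), so if keys have very different numbers of values you may want to divide by that maximum before comparing against a single threshold.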

The following article discusses some related topics: Calculating Entropy for Data Mining.

NPE
  • One might have a look at Ueli Maurer's "Universal Test for Random Bit Generator", which can basically be used as a special kind of entropy calculator and, to the extent required here, is easily implemented. – JimmyB Jun 05 '12 at 10:27