Real world Algorithm - Measuring uniqueness of input values

Question

I have a list of key-value pairs. For each key, I want to see how unique the values are. For example, for a particular key k1, all the values might be the same (best case). For a key k2, half of the values are of one type and the other half are different. Similarly, for a key kx, none of the values match (worst case).

I want to assign a rank (or a percentage, whatever) to each of these keys based on the above and produce a final ordering, so that I can filter out the keys that have many different values (let's say, those above a predefined threshold rank or percentage).

I think this is related to some concepts I learned in my data mining course, but I cannot recall which ones.

Thanks.

Can you show us what you have tried, and a specific problem you are having? – Skip Head, May 18 '12 at 23:25
The only problem I am having is that I cannot recollect what category of problem this is. I don't really want a solution to this. – dreamer13134, May 18 '12 at 23:34
Um, does categorising it matter that much? What types of categories were you thinking of? – Neil Coffey, May 19 '12 at 00:01

score -1 · Answer 1 · answered May 19 '12 at 10:00

In data mining terms (see http://en.wikipedia.org/wiki/Association_rule_learning), you could regard a key as a means of predicting a value, in which case you might be interested in the confidence: the percentage of values for that key that equal its most frequent value. You could also look at the probability that two randomly chosen values are the same, which is the sum of the squares of the relative frequencies of the values (assuming the two values are drawn with replacement), or at the Shannon entropy (http://en.wikipedia.org/wiki/Shannon_entropy), which has similar properties but involves taking logarithms.
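As a rough illustration of the three measures mentioned above, here is a minimal Python sketch. The function name `diversity_measures` and the sample lists are made up for this example, and the collision probability assumes the two values are drawn with replacement.

```python
from collections import Counter
from math import log2

def diversity_measures(values):
    """Return (confidence, collision_probability, entropy_bits) for a non-empty list.

    Hypothetical helper, not taken from any library.
    """
    n = len(values)
    # Relative frequency of each distinct value.
    probs = [count / n for count in Counter(values).values()]
    confidence = max(probs)                       # share of the most frequent value
    collision = sum(p * p for p in probs)         # P(two draws with replacement match)
    entropy = sum(p * log2(1 / p) for p in probs) # Shannon entropy in bits
    return confidence, collision, entropy

print(diversity_measures(["a", "a", "a", "a"]))  # (1.0, 1.0, 0.0)
print(diversity_measures(["a", "a", "b", "b"]))  # (0.5, 0.5, 1.0)
print(diversity_measures(["a", "b", "c", "d"]))  # (0.25, 0.25, 2.0)
```

Confidence and collision probability go down as the values become more diverse, while entropy goes up, so a ranking built on entropy runs in the opposite direction from one built on the other two.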

NPE · Accepted Answer · 2012-05-19T10:16:06.067

-1

You could perhaps use some information theory for this.

For each key, you could compute the entropy of the values. The higher the entropy, the more diverse the key's values are. You could use that to rank the keys.
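One possible way to turn this into a ranking is sketched below in Python. The names `entropy_bits` and `rank_keys`, the sample pairs, and the threshold of 0.5 are all assumptions made for illustration. Dividing by log2(n) to get a score between 0 and 1 is also just one choice of normalization, not something the answer prescribes.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def rank_keys(pairs):
    """Return (key, score) tuples sorted from least to most diverse.

    The score is entropy divided by log2(n), the maximum possible entropy
    for n values, so 0.0 means all values equal and 1.0 means all distinct.
    This normalization is an assumption of this sketch.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    scores = {}
    for key, values in grouped.items():
        max_entropy = log2(len(values))
        # A key with a single value has no diversity to measure.
        scores[key] = entropy_bits(values) / max_entropy if max_entropy > 0 else 0.0
    return sorted(scores.items(), key=lambda item: item[1])

pairs = (
    [("k1", "x")] * 4
    + [("k2", "x"), ("k2", "x"), ("k2", "y"), ("k2", "y")]
    + [("kx", v) for v in "abcd"]
)
ranked = rank_keys(pairs)
print(ranked)  # [('k1', 0.0), ('k2', 0.5), ('kx', 1.0)]

# Keep only keys at or below a hypothetical diversity threshold.
threshold = 0.5
print([key for key, score in ranked if score <= threshold])  # ['k1', 'k2']
```

Normalizing makes keys with different numbers of values comparable on the same scale, which is convenient when filtering against a single predefined threshold.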

The following article discusses some related topics: Calculating Entropy for Data Mining.

edited May 19 '12 at 10:16

answered May 19 '12 at 10:08

NPE


One might have a look at Üli Maurer's "Universal Test for Random Bit Generator", which can basically be used as a special kind of entropy calculator and, to the extent required here, is easily implemented. – JimmyB Jun 05 '12 at 10:27
