
If I have a large list of items, each with a list of attributes that can contain multiple (or no) scores, what would be a good method for ranking these items fairly, taking into account the possibly unequal amount of information known about each item?

For example:

Item1

Attribute1 Values (70)
Attribute2 Values (90)
Attribute3 Values (null)

Item2

Attribute1 Values (50; 60; 70)
Attribute2 Values (90)
Attribute3 Values (10)

Here, simply averaging the values would rank Item1 higher than Item2, but in practice they *could* be identical; Item2 simply has more data known. Can anyone suggest a method for comparing and ranking data like this?
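
For concreteness, a quick sketch of the naive averaging described above (Python; the item layout just mirrors the example, and a null attribute is skipped entirely):

```python
# Naive averaging: pool every known score for an item, ignore null attributes.
def naive_average(attributes):
    values = [v for vs in attributes if vs for v in vs]
    return sum(values) / len(values)

item1 = [[70], [90], None]          # Attribute3 is null
item2 = [[50, 60, 70], [90], [10]]

print(naive_average(item1))  # 80.0
print(naive_average(item2))  # 56.0 -> Item1 ranks higher despite less data
```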

  • Did you already consider [Radix Sort](http://en.wikipedia.org/wiki/Radix_sort) ? – wesley.mesquita Jan 24 '14 at 16:17
  • For all we know, Item2 is higher than Item1 because it has a value of 1000 that you don't know about. Seems like any kind of ranking scheme could fail, with this amount of uncertainty. – Kevin Jan 24 '14 at 16:23
  • You have to do something about missing attributes. This is a common problem in machine learning. See this StackOverflow post for ideas: http://stackoverflow.com/questions/13425722/how-to-deal-with-missing-attribute-values-in-c4-5-j48-decision-tree – AndyG Jan 24 '14 at 17:01

1 Answer


You can do something like sum/(count + 1) for each attribute.

If the attribute is null, the sum is 0 and the count is 0, so the value is 0/(0 + 1) = 0.

For (70), you get 70/2 = 35.

For (50, 60, 70), you get 180/4 = 45.

A more advanced approach would be to use (sum + base)/(count + 1). You need to choose an appropriate base, though; with no data the value is just base, so it acts as a prior score that the observed values pull away from.
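
A minimal sketch of this scheme in Python; combining attributes by averaging their smoothed scores, and the default base of 0, are assumptions for illustration rather than part of the answer:

```python
def attribute_score(values, base=0):
    # (sum + base) / (count + 1); with base=0 this is the plain sum/(count+1) rule.
    values = values or []                      # a null attribute contributes no data
    return (sum(values) + base) / (len(values) + 1)

def item_score(attributes, base=0):
    # Assumed aggregation: average the smoothed scores of all attributes.
    scores = [attribute_score(vs, base) for vs in attributes]
    return sum(scores) / len(scores)

item1 = [[70], [90], None]
item2 = [[50, 60, 70], [90], [10]]

print(item_score(item1))  # (35 + 45 + 0) / 3 = 26.67
print(item_score(item2))  # (45 + 45 + 5) / 3 = 31.67
```

With base = 0 this reproduces the per-attribute numbers above, and Item2 now ranks ahead of Item1 because Item1's sparse, high scores are shrunk more heavily toward 0.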

ElKamina