11

I have a list of 6500 items that I would like to trade or invest in. (Not for real money, but for a certain game.) Each item has 5 numbers that will be used to rank it among the others.

Total quantity of item traded per day: The higher this number, the better.

The Donchian Channel of the item over the last 5 days: The higher this number, the better.

The median spread of the price: The lower this number, the better.

The spread of the 20 day moving average for the item: The lower this number, the better.

The spread of the 5 day moving average for the item: The higher this number, the better.

All 5 numbers have the same 'weight', or in other words, they should all affect the final number with the same worth or value.

At the moment, I just multiply all 5 numbers for each item, but it doesn't rank the items the way I would like them to be ranked. I just want to combine all 5 numbers into a weighted number that I can use to rank all 6500 items, but I'm unsure of how to do this correctly or mathematically.

Note: The total quantity of the item traded per day and the Donchian Channel are numbers that are much higher than the spreads, which are more like percentages. This is probably the reason why multiplying them all together didn't work for me: the quantity traded per day and the Donchian Channel had a much bigger role in the final number.

Krzysztof Czelusniak
  • This sounds a lot like a machine learning algorithm - you're trying to find some function of those five parameters that estimates the "goodness" of the stock. I'm no machine learning expert, but perhaps someone else here is? – templatetypedef Jan 05 '11 at 01:04
  • 1
    can you add a concrete example? in general you will want to normalize the numbers in your input data data then simply adding them together should produce a solid ranking. – madmik3 Jan 05 '11 at 01:09
  • For example, the quantity traded per day is 400. Donchian Channel is 30000. The median spread is .30. The spread of the 20 day average is .85, and the spread of the 5 day average is 1.2. Adding them together raises the same problem as multiplying them. – Krzysztof Czelusniak Jan 05 '11 at 01:27
  • In case you want to do something with machine learning (as templatetypedef suspects), I can give you some links and recommendations; however, I'll save myself the trouble if you do not. Please clarify. – foo Jan 05 '11 at 01:39
  • Hmm, I don't know if I need it. It sort of seems complicated, so maybe there's a simpler solution. – Krzysztof Czelusniak Jan 05 '11 at 01:59
  • No, either you need it or you don't - it depends on the problem you have whether it is useful at all. Machine learning is for a different sort of problem than those you can solve with straightforward algorithms. – foo Jan 05 '11 at 02:25
  • Is that "certain game" EVE Online, perchance? :) – foo May 01 '13 at 19:37

5 Answers

15

The reason people are having trouble answering this question is that we have no way of comparing two different "attributes". If there were just two attributes, say quantity traded and median price spread, would (20 million, 50%) be worse or better than (100, 1%)? Only you can decide this.

Converting everything onto the same scale could help; this is what is known as "normalisation". A good way of doing this is the z-score, which Prasad mentions. It is a statistical concept, looking at how the quantity varies. You need to make some assumptions about the statistical distributions of your numbers to use it.

Things like spreads are probably normally distributed - shaped like the classic bell curve. For these, as Prasad says, take z(spread) = (spread - mean(spreads)) / standardDeviation(spreads).

Things like the quantity traded might follow a power-law distribution. For these you might want to take the log() before calculating the mean and standard deviation. That is, the z-score is z(qty) = (log(qty) - mean(log(quantities))) / sd(log(quantities)).

Then just add up the z-score for each attribute.
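A minimal sketch of the z-scoring step in C++ (the zScores helper below is hypothetical, not from the question; the log branch, for heavy-tailed attributes like quantity traded, assumes every value is positive):

#include <cmath>
#include <numeric>
#include <vector>

// z-score one attribute vector: (x - mean) / sd.
// If takeLog is set (for heavy-tailed attributes such as quantity
// traded), z-score log(x) instead of x; assumes all values > 0.
std::vector<double> zScores(std::vector<double> x, bool takeLog)
{
    if (takeLog)
        for (double& v : x) v = std::log(v);
    const double mean =
        std::accumulate(x.begin(), x.end(), 0.0) / x.size();
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    const double sd = std::sqrt(ss / x.size());
    for (double& v : x) v = (v - mean) / sd;
    return x;
}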

To do this for each attribute you will need to have an idea of its distribution. You could guess, but the best way is to plot a graph and have a look. You might also want to plot graphs on log scales. See Wikipedia for a long list of distributions.

Nick Fortescue
  • +1 learned more about how to think about normalizing numbers with this answer than I had in the year prior. – orangepips Jan 05 '11 at 17:30
  • Hmm. So, I should graph the data for all 5 categories, then compare the graphs to the list of distributions on Wikipedia? I'm unsure of how to graph it, though. I can put the data on the y-axis; should I put the specific item on the x-axis? If I do that, I'm guessing I should sort the items on the x-axis by the corresponding value of the attribute, from low to high? – Krzysztof Czelusniak Jan 06 '11 at 01:27
  • You need to create a histogram (similar to a bar chart). The easiest way to do it is with Excel (or the OpenOffice equivalent). See http://www.ncsu.edu/labwrite/res/gt/gt-bar-home.html#ith – Nick Fortescue Jan 06 '11 at 07:14
5

You can replace each attribute-vector x (of length N = 6500) by the z-score of the vector Z(x), where

Z(x) = (x - mean(x))/sd(x).

This would transform them into the same "scale", and then you can add up the Z-scores (with equal weights) to get a final score, and rank the N=6500 items by this total score. If you can find in your problem some other attribute-vector that would be an indicator of "goodness" (say the 10-day return of the security?), then you could fit a regression model of this predicted attribute against these z-scored variables, to figure out the best non-uniform weights.
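As a sketch of that summing step in C++ (totalScore is a hypothetical helper; it assumes the five z-scored attribute vectors are already computed, and it flips the sign of the "lower is better" attributes so that higher always means better, an orientation taken from the question's criteria):

#include <vector>

// Combine per-attribute z-scores into one ranking score.
// zs[a][j]: z-score of attribute a for item j.
// sign[a]: +1 if higher is better, -1 if lower is better.
std::vector<double>
totalScore(const std::vector<std::vector<double>>& zs,
           const std::vector<int>& sign)
{
    std::vector<double> score(zs[0].size(), 0.0);
    for (std::size_t a = 0; a < zs.size(); ++a)
        for (std::size_t j = 0; j < zs[a].size(); ++j)
            score[j] += sign[a] * zs[a][j];
    return score;
}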

Prasad Chalasani
3

Start each item with a score of 0. For each of the 5 numbers, sort the list by that number and add each item's ranking in that sorting to its score. Then, just sort the items by the combined score.
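A quick sketch of this rank-sum idea in C++ (assuming each attribute vector has already been oriented so that larger values are better):

#include <algorithm>
#include <numeric>
#include <vector>

// Rank-sum scoring: for each attribute, sort the items by that
// attribute and add each item's rank to its running score.
std::vector<long>
rankSumScores(const std::vector<std::vector<double>>& attrs)
{
    const std::size_t n = attrs[0].size();
    std::vector<long> score(n, 0);
    for (const auto& attr : attrs) {
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1
        std::sort(idx.begin(), idx.end(),
                  [&](std::size_t a, std::size_t b) {
                      return attr[a] < attr[b];
                  });
        for (std::size_t rank = 0; rank < n; ++rank)
            score[idx[rank]] += static_cast<long>(rank);
    }
    return score;
}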

Null Set
  • Agreed, this would be the simplest way and do what the OP wants – Prasad Chalasani Jan 05 '11 at 01:26
  • This is what I was planning on doing. However, the range of the data varies greatly. For example, the quantity traded can range from 2 to 40 million. The 40 million is much higher than the next result, which is about 20 million, which is also significantly higher than the next item. Ranking by the quantity traded per day would only give the item with 40 million one ranking position over the item with 20 million, even though there is a much bigger difference of about 20 million. This is what I'll do if I don't find a better solution, though. – Krzysztof Czelusniak Jan 05 '11 at 01:32
  • So you do need normalization, after all. – foo Jan 05 '11 at 01:52
2

You would usually normalize your data entries to their respective ranges. Since there is no fixed range for them, you'll have to use a sliding range - or, to keep it simpler, normalize them to the daily ranges.

For each day, get all entries of a given type, find the highest and the lowest of them, and determine the difference between them. Let Bottom = the lowest value and Range = the difference between highest and lowest. Then, for each entry, calculate (value - Bottom)/Range, which will give something between 0.0 and 1.0. These are the numbers you can then continue to work with.

In C++, something like this:

#include <algorithm>
#include <array>
#include <vector>

// items: one row per item, holding its 5 attribute values
std::vector<std::array<double, 5>>
scaleItems(const std::vector<std::array<double, 5>>& items)
{
    // init min/max arrays with any item
    std::array<double, 5> minvalues = items[0];
    std::array<double, 5> maxvalues = items[0];

    // find minimum and maximum values of each attribute
    for (const auto& item : items)
        for (int i = 0; i < 5; i++) {
            minvalues[i] = std::min(minvalues[i], item[i]);
            maxvalues[i] = std::max(maxvalues[i], item[i]);
        }

    // now scale them - in this case, to the range of 0 to 1
    std::vector<std::array<double, 5>> scaledItems(items.size());
    for (int i = 0; i < 5; i++) {
        double delta = maxvalues[i] - minvalues[i];
        for (std::size_t j = 0; j < items.size(); j++)
            scaledItems[j][i] =
                (items[j][i] - minvalues[i]) / delta;  // linear normalization
    }
    return scaledItems;
}
Something like that. It would be more elegant with a good library (STL, Boost, whatever you have on the implementation platform), and the normalization should be in a separate function, so you can replace it with other variations like log() as the need arises.

foo
  • What I meant by combining the 5 numbers into a weighted number is that the final number will be weighted because each of the 5 factors would affect it equally. I have a set range, if by range you mean the highest number and the lowest number. However, the range goes from 1 (not 1 million, just 1) to 40 million. Most of the values are between 100 and 400. Wouldn't the 40 million skew the results? – Krzysztof Czelusniak Jan 05 '11 at 01:58
  • OK, I've adapted the posting. By fixed range I meant a range that the values never exceed - that's why I wrote about normalization. The linear normalization here will help against the scale differences between the five entries. You may want to improve on this by using something like logarithmic scaling on certain items, if their data is like that (distributed over several orders of magnitude). Can't tell from what you wrote, however. – foo Jan 05 '11 at 02:02
  • I agree with @foo that doing a log or other scaling would help – Prasad Chalasani Jan 05 '11 at 02:19
0

Total quantity of item traded per day: The higher this number, the better. (a)

The Donchian Channel of the item over the last 5 days: The higher this number, the better. (b)

The median spread of the price: The lower this number, the better. (c)

The spread of the 20 day moving average for the item: The lower this number, the better. (d)

The spread of the 5 day moving average for the item: The higher this number, the better. (e)

a + b - c - d + e = "score" (higher score = better score)

  • This misses that (a) and (b) are much larger than the spread, and will make those two dominate the result. – Teepeemm Jul 29 '16 at 21:04