1

Let's say I have a list of things with their frequencies (sorted by frequency) and the total number of items. I use a dict here for clarity, but they are actually objects with a frequency property:

items = {"bananas":12, "oranges":12, "apples":11, "pears":2}

Now I want to pick 10 items (max_results) out of my 37 (total_frequency) items, in proportion to their frequency, with a maximum of, say, 3 of any one item (max_proportion). In this example I'd end up with 3 each of bananas, oranges, and apples, and 1 pear.

import itertools as it
import math
from random import shuffle

def get_relative_quantities(total_frequency, items, max_results, max_proportion):
    results = {}
    num_added = 0
    for freq, the_group in it.groupby(items, lambda x: x.frequency):
        if num_added == max_results:
            break

        the_group_list = list(the_group)
        group_size = len(the_group_list)
        shuffle(the_group_list)

        for item in the_group_list:
            if num_added == max_results:
                break

            rel_freq = min(math.ceil((freq/total_frequency)*max_results), max_proportion)
            results[item] = rel_freq
            num_added += rel_freq

    return results

One thing I'm worried about is that with this approach, if there is only 1 item, I won't get enough results: I'll get just 3 (assuming a max_proportion of 3 out of 10). How can I approach that problem?
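
The arithmetic behind both cases can be checked with a minimal sketch. It assumes Python 3 (true division) and a plain dict in place of the objects with a frequency property; the helper name `capped_shares` is hypothetical and only reproduces the `min(ceil(...), max_proportion)` step from the function above.

```python
import math

items = {"bananas": 12, "oranges": 12, "apples": 11, "pears": 2}
max_results = 10
max_proportion = 3


def capped_shares(items, max_results, max_proportion):
    """Return ceil(proportional share) per item, clamped to max_proportion.

    Hypothetical helper: it mirrors only the rel_freq computation above.
    """
    total_frequency = sum(items.values())
    return {
        name: min(math.ceil(freq / total_frequency * max_results), max_proportion)
        for name, freq in items.items()
    }


four_fruit = capped_shares(items, max_results, max_proportion)
one_fruit = capped_shares({"bananas": 12}, max_results, max_proportion)
print(four_fruit, sum(four_fruit.values()))
print(one_fruit, sum(one_fruit.values()))
```

With four fruits the capped shares add up to 10, but a single item is clamped to 3 even though its uncapped share is the full 10, which is the shortfall described above.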

dougalg
  • `sum = 12 + 12 + 11 + 2; int(12. / sum * 10.)` ? `12/12 * 10` will be 10. – irrelephant Dec 27 '12 at 04:20
  • 1
    Why are you recalculating the frequency every time you add an item? Also, should the results be randomized or is this function's output supposed to be stable? – Blender Dec 27 '12 at 04:43
  • 4
    You have an overdetermined problem: there are too many constraints. In particular, if you say you want to pick 10 items out of the 37 in proportion to their frequencies, that alone is enough to determine how many of each item should be picked. If you then put in the additional requirement that no more than 3 of a kind get selected, you have to figure out how to reconcile that with the other conditions. There are many ways to do so, and which way you pick is something you have to figure out for yourself; it's not something Stack Overflow can tell you. – David Z Dec 27 '12 at 05:05
  • In other news, [this question](http://stackoverflow.com/questions/8685308/allocate-items-according-to-an-approximate-ratio-in-python), [this one](http://stackoverflow.com/questions/9088403/distributing-integers-using-weights-how-to-calculate), and [this one](http://stackoverflow.com/questions/792460/how-to-round-floats-to-integers-while-preserving-their-sum) (and others from the "Linked" section on the sidebar) might be useful to you. – David Z Dec 27 '12 at 05:06
  • 1
    Well, now I feel terrible that I missed those. Thank you very much for your comments. – dougalg Dec 27 '12 at 05:24
  • Also, thank you for pointing out "overdetermined problem", that makes a lot of sense. – dougalg Dec 27 '12 at 05:27

3 Answers

0

That will depend on which strategy makes more sense for your needs. Let's say your max_results is 10 and your max_proportion is 2. What should be returned? The first iteration will get 2 of each.

  • If you discard your results and redo everything with max_proportion increased to 3, the number of pears will drop to 1 (i.e. the result will be like your example);
  • If you keep the results and do a new iteration with max_results = 2 and max_proportion = 1, you'll add one banana and one orange;
    • And if max_proportion is kept at 2, you might get 2 bananas or 2 oranges, and none of the others.

Whatever your desired output is, my suggestion is the same: check whether there are enough results and, if necessary, call get_relative_quantities again, either reducing max_results (to get the remaining elements) or increasing max_proportion (discarding the initial results and accepting more and more of each item). Do this as many times as needed to reach the desired number or to exhaust the possibilities. (This is the same principle behind iterative deepening.)
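
The "discard and redo with a larger max_proportion" variant might look roughly like the sketch below. It assumes Python 3 and a plain dict of frequencies. `allocate` is a hypothetical stand-in for get_relative_quantities (it also stops at max_results and at each item's own frequency), and `allocate_with_retries` is likewise an illustrative name, not code from the question.

```python
import math


def allocate(items, max_results, max_proportion):
    """One pass: ceil of each proportional share, capped, most frequent first."""
    total_frequency = sum(items.values())
    results = {}
    remaining = max_results
    for name, freq in sorted(items.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        share = math.ceil(freq / total_frequency * max_results)
        # Never take more than the cap, the item's own frequency, or what is left.
        take = min(share, max_proportion, freq, remaining)
        results[name] = take
        remaining -= take
    return results


def allocate_with_retries(items, max_results, max_proportion):
    """Discard the results and redo with a larger cap until enough are picked."""
    target = min(max_results, sum(items.values()))  # cannot pick more than exists
    cap = max_proportion
    results = allocate(items, max_results, cap)
    while sum(results.values()) < target:
        cap += 1
        results = allocate(items, max_results, cap)
    return results, cap
```

For a single item with frequency 12, max_results = 10 and max_proportion = 3, this loop keeps raising the cap until all 10 picks go to that item. Whether relaxing the cap is acceptable is exactly the design decision the answer leaves to you.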

mgibsonbr
0

First, build up a list of items with proportional numbers of elements:

items = {"bananas":12, "oranges":12, "apples":11, "pears":2}

choices = []
for k, v in items.items():
    choices.extend([k] * v)

Next, set up the final results with the minimum numbers of each (one of each possible item):

selected = list(items.keys())

Finally, for the rest of the items you want to select, choose a random one from the list of items duplicated proportionally:

import random as rnd
for _ in range(10 - len(items)):
    selected.append(rnd.choice(choices))

All those snippets combined:

import random as rnd

items = {"bananas":12, "oranges":12, "apples":11, "pears":2}

choices = []
for k, v in items.items():
    choices.extend([k] * v)

selected = list(items.keys())
for _ in range(10 - len(items)):
    selected.append(rnd.choice(choices))

And the output from a run:

>>> import pprint as pp
>>> pp.pprint(selected)
['pears',
 'bananas',
 'oranges',
 'apples',
 'bananas',
 'bananas',
 'oranges',
 'apples',
 'apples',
 'apples']
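
Note that the run above ends up with four apples, so this approach does not enforce the question's max_proportion of 3. A possible variant that does is sketched below, assuming Python 3.6+ (for `random.choices`) and a plain dict of frequencies. The function name `pick_randomly` and its `rng` parameter are hypothetical, added only for illustration.

```python
import random
from collections import Counter


def pick_randomly(items, max_results, max_proportion, rng=random):
    """Start with one of each item, then draw weighted picks while honoring the cap."""
    # Seed with one of each (most frequent first, in case max_results < len(items)).
    names = sorted((n for n in items if items[n] > 0), key=items.get, reverse=True)
    selected = Counter(names[:max_results])
    while sum(selected.values()) < max_results:
        # Only items still below both the cap and their own frequency are eligible.
        eligible = [n for n in items if selected[n] < min(max_proportion, items[n])]
        if not eligible:
            break  # the cap makes it impossible to reach max_results
        weights = [items[n] for n in eligible]
        selected[rng.choices(eligible, weights=weights, k=1)[0]] += 1
    return selected
```

Calling `pick_randomly(items, 10, 3)` returns a Counter with 10 picks and at most 3 of any fruit. With a single item it stops at 3, so the shortfall from the question remains unless the cap is relaxed.
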
0

You can use the d'Hondt method (also known as the Jefferson method): divide each frequency by 1, 2, 3, and so on, and give the picks to the largest quotients.

import heapq, collections, itertools

def fruit_divided(fruit, weight, max_proportion):
    for div in range(1, min(weight, max_proportion) + 1):
        yield (- weight / div, fruit)

def pick(items, max_results, max_proportion):
    fruits = heapq.merge(*(fruit_divided(fruit, frequency, max_proportion)
                           for fruit, frequency in items.items()))
    fruits = itertools.islice(fruits, max_results)
    return collections.Counter(fruit for _, fruit in fruits)

Sample run:

>>> items = {"bananas":12, "oranges":12, "apples":11, "pears":2}
>>> max_results = 10
>>> max_proportion = 3
>>> print(pick(items, max_results, max_proportion))
Counter({'oranges': 3, 'bananas': 3, 'apples': 3, 'pears': 1})

If fewer than max_results fruit can be picked, the highest possible number will be returned. For example, with max_results raised to 20:

>>> max_results = 20
>>> print(pick(items, max_results, max_proportion))
Counter({'oranges': 3, 'bananas': 3, 'apples': 3, 'pears': 2})
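
To see where the counts come from, the quotients can be listed explicitly. This is only an illustrative sketch, assuming Python 3: it uses `sorted` on exact `Fraction` values instead of the lazy `heapq.merge` above, and the names `quotients` and `winners` are made up for the example.

```python
from fractions import Fraction

items = {"bananas": 12, "oranges": 12, "apples": 11, "pears": 2}
max_results = 10
max_proportion = 3

# Every (quotient, fruit) pair: frequency / 1, frequency / 2, ... up to the cap.
quotients = [
    (Fraction(freq, div), fruit)
    for fruit, freq in items.items()
    for div in range(1, min(freq, max_proportion) + 1)
]

# Largest quotients win; sorted() is stable, so ties keep dictionary order.
winners = sorted(quotients, key=lambda pair: pair[0], reverse=True)[:max_results]
for quotient, fruit in winners:
    print(f"{fruit:8} {float(quotient):6.2f}")
```

The tenth and last slot goes to the first pears quotient (2/1), while the second pears quotient (2/2) would be eleventh, which is why pears get 1 when max_results is 10 and 2 when the limit is higher.
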
Reinstate Monica