
Suppose I have a custom data structure Data that reveals two relevant properties: tag indicates which equivalence class this item belongs in, and rank indicates how good this item is.

I have an unordered set of Data objects, and want to retrieve the n objects with the highest rank—but with at most one object from each equivalence class.

(Objects in the same equivalence class don't necessarily compare equal, and don't necessarily have the same rank, but I don't want any two elements in my output to come from the same class. In other words, the relation that produces these equivalence classes isn't ==.)

My first approach looks something like this:

  • Sort the list by descending rank
  • Create an empty set s
  • For each element in the list:
    • Check if its tag is in s; if so, move on
    • Add its tag to s
    • Yield that element
    • If we've yielded n elements, stop
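
The steps above can be sketched as a generator. This is only an illustration of the outlined approach; the function name top_n_distinct is hypothetical, and the code assumes each item exposes tag and rank attributes as in the question.

```python
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))

def top_n_distinct(items, n):
    """Yield up to n items in descending rank order, at most one per tag.

    Hypothetical helper name; assumes every item has .tag and .rank.
    """
    if n <= 0:
        return
    seen = set()  # tags that have already been yielded
    for item in sorted(items, key=lambda d: d.rank, reverse=True):
        if item.tag in seen:
            continue  # a better-ranked item of this class was already taken
        seen.add(item.tag)
        yield item
        if len(seen) == n:
            return

algorithm_input = {Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5)}
result = set(top_n_distinct(algorithm_input, 3))
```

Since exactly one tag is recorded per yielded item, the size of the seen set doubles as the count of yielded elements.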

However, this feels awkward, like there should be some better way (potentially using itertools and higher-order functions). The order of the resulting n elements isn't important.

What's the Pythonic solution to this problem?

Toy example:

from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))
n = 3

algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
expected_output = { Data('a', 200), Data('b', 50), Data('c', 10) }
Draconis
  • Can you post sample input and expected output? Any code so far? – Andrej Kesely Jul 20 '19 at 21:13
  • @AndrejKesely Added an example. The code I have so far just implements the algorithm outlined in the question, and it works fine—I'm just looking for a better way, if one exists. – Draconis Jul 20 '19 at 21:25

4 Answers


You could use itertools.groupby. Because groupby only merges adjacent items that share a key, first sort the items by tag (and by descending rank within each tag), keep the first item from each group, and then take the n best of those by rank:

from itertools import groupby
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))

n = 3

algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }

# 1. sort the data by tag (ascending) and rank (descending) so that equal tags are adjacent
s = sorted(algorithm_input, key=lambda k: (k.tag, -k.rank))

# 2. group the data by tag and keep the first (highest-ranked) item from each group
best = [next(g) for _, g in groupby(s, key=lambda k: k.tag)]

# 3. order the per-tag winners by descending rank and keep the first 'n'
out = sorted(best, key=lambda k: -k.rank)[:n]

print(out)

Prints:

[Data(tag='a', rank=200), Data(tag='b', rank=50), Data(tag='c', rank=10)]

EDIT: Changed the sorting key.
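
Note that groupby only merges consecutive items with the same key, so which key the input is sorted on matters. A minimal sketch of that behaviour, using hypothetical (tag, rank) pairs in which tag 'a' sits on both sides of 'b':

```python
from itertools import groupby

# Hypothetical (tag, rank) pairs, already in descending rank order.
by_rank = [('a', 200), ('b', 150), ('a', 100)]

# Grouping rank-ordered data by tag: 'a' forms two separate groups.
tags_by_rank = [tag for tag, _ in groupby(by_rank, key=lambda p: p[0])]

# Sorting by tag first (rank descending within a tag) makes equal tags adjacent.
by_tag = sorted(by_rank, key=lambda p: (p[0], -p[1]))
tags_by_tag = [tag for tag, _ in groupby(by_tag, key=lambda p: p[0])]
first_per_tag = [next(group) for _, group in groupby(by_tag, key=lambda p: p[0])]
```

Here tags_by_rank contains 'a' twice, while tags_by_tag contains each tag once.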

Andrej Kesely

I think it would be faster to take the max element of each group, which is O(|elements|), and then get the n largest ranks with a heap of size n, which is O(|groups| · lg n), rather than to sort first, which is O(|elements| · lg |elements|), and then take n elements, which is O(|elements|):

Create a dict max_by_tag that stores the item with the max rank by tag:

>>> from collections import namedtuple
>>> Data = namedtuple('Data', ('tag', 'rank'))
>>> n = 3
>>> algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
>>> max_by_tag = {}
>>> for item in algorithm_input:
...     if item.tag not in max_by_tag or item.rank > max_by_tag[item.tag].rank:
...         max_by_tag[item.tag] = item
...
>>> max_by_tag
{'a': Data(tag='a', rank=200), 'b': Data(tag='b', rank=50), 'c': Data(tag='c', rank=10), 'd': Data(tag='d', rank=5)}

Then use the heapq module:

>>> import heapq
>>> heapq.nlargest(n, max_by_tag.values(), key=lambda data: data.rank)
[Data(tag='a', rank=200), Data(tag='b', rank=50), Data(tag='c', rank=10)]
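
The two steps can also be wrapped in one function. This is a minimal sketch, not a definitive implementation: the name top_n_by_tag is hypothetical, and the input is a made-up list in which tags are interleaved by rank.

```python
import heapq
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))

def top_n_by_tag(items, n):
    """Return the n highest-ranked items, at most one per tag (hypothetical helper)."""
    best = {}  # tag -> highest-ranked item seen so far for that tag
    for item in items:
        current = best.get(item.tag)
        if current is None or item.rank > current.rank:
            best[item.tag] = item
    # nlargest returns its result in descending order of the key
    return heapq.nlargest(n, best.values(), key=lambda d: d.rank)

interleaved = [Data('a', 200), Data('b', 150), Data('a', 100), Data('c', 10)]
top_two = top_n_by_tag(interleaved, 2)
```

With this input, top_two holds the 'a' item of rank 200 followed by the 'b' item of rank 150; the second 'a' item is never a candidate.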
jferard

Store the sorted input in an OrderedDict (with tag as the key and Data as the value). This results in only one Data from each equivalence class being stored in the OrderedDict:

>>> from collections import namedtuple, OrderedDict
>>> Data = namedtuple('Data', ('tag', 'rank'))
>>> n = 3
>>> algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
>>> 
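>>> # sort by ascending rank so the highest-ranked item of each tag is inserted last and wins
>>> best = OrderedDict((d.tag, d) for d in sorted(algorithm_input, key=lambda d: d.rank))
>>> set(sorted(best.values(), key=lambda d: d.rank, reverse=True)[:n])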
{Data(tag='b', rank=50), Data(tag='a', rank=200), Data(tag='c', rank=10)}
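
One detail worth checking here: when a key is inserted again, its value is overridden but the key keeps its original position, so the order of the stored values need not follow rank. A minimal sketch with hypothetical (tag, rank) pairs:

```python
from collections import OrderedDict

# Hypothetical pairs in ascending rank order; tag 'a' occurs twice.
pairs = [('a', 1), ('b', 50), ('a', 200)]

od = OrderedDict(pairs)
items = list(od.items())

# Re-sorting the surviving values by rank restores rank order.
by_rank = sorted(od.items(), key=lambda kv: kv[1], reverse=True)
```

Here 'a' ends up with the value 200 but still comes before 'b', which is why the surviving values are sorted by rank again before slicing.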
Sunitha
  • Very nice! Because if I understand right, a later entry will override an earlier one? – Draconis Jul 20 '19 at 22:45
  • Yes.. As we sort first, all entries with higher rank would appear later and the latest entry would override all the earlier ones – Sunitha Jul 20 '19 at 22:46

If it's a class definition you control, I believe the most Pythonic way would be this:

from random import shuffle

class Data:

    def __init__(self, order=1):
        self.order = order

    def __repr__(self):
        return "Order: " + str(self.order)

if __name__ == '__main__':
    # build ten objects with orders 0..9, then shuffle them
    d = [Data(order=i) for i in range(10)]
    shuffle(d)

    print(d)

    # sort by the 'order' attribute
    print(sorted(d, key=lambda data: data.order))

Output:

[Order: 5, Order: 2, Order: 6, Order: 0, Order: 4, Order: 7, Order: 3, Order: 9, Order: 1, Order: 8]
[Order: 0, Order: 1, Order: 2, Order: 3, Order: 4, Order: 5, Order: 6, Order: 7, Order: 8, Order: 9]

So essentially, add an attribute to sort by to the class. Define __repr__ (just to make it easier to see what's going on). Then call Python's sorted() on a list of those objects, with a lambda as the key to indicate which attribute each object should be sorted by.

Note: comparison must be defined for that attribute's type; here it's an int, which already supports it. If the attribute is an instance of a custom class, you would have to implement __gt__, __lt__, etc. on that class. See the docs for details.
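
As a rough illustration of the note above, here is a sketch in which the sort attribute is a custom type. The Priority class is hypothetical; it uses functools.total_ordering so that only __eq__ and __lt__ need to be written by hand.

```python
from functools import total_ordering

@total_ordering
class Priority:
    """Hypothetical attribute type, ordered by its numeric level."""

    def __init__(self, level):
        self.level = level

    def __eq__(self, other):
        if not isinstance(other, Priority):
            return NotImplemented
        return self.level == other.level

    def __lt__(self, other):
        if not isinstance(other, Priority):
            return NotImplemented
        return self.level < other.level

class Data:
    def __init__(self, order):
        self.order = order  # a Priority instance in this sketch

items = [Data(Priority(3)), Data(Priority(1)), Data(Priority(2))]
levels = [d.order.level for d in sorted(items, key=lambda d: d.order)]
```

sorted() only relies on __lt__, but total_ordering fills in the remaining comparison operators so that expressions such as Priority(2) >= Priority(1) also work.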

logicOnAbstractions