
I have a dictionary as follows:

{'abc':100,'xyz':200,'def':250 .............}

It is a dictionary whose keys are entity names and whose values are counts of those entities. I need to return the top 10 elements from the dictionary.

I can write a heap to do it, but I'm not sure how to do the value-to-key mapping, as some values will be equal.

Is there any other data structure to do this?

gizgok

8 Answers


Using heapq you probably want to do something like this:

import heapq

heap = [(-value, key) for key, value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]

Note that since heapq implements only a min-heap, it's easiest to negate the values so that the biggest values become the smallest.

How much this approach gains over plain sorting depends on the size of the dictionary; for example:

>>> import random
>>> import itertools as it
>>> def key_generator():
...     characters = [chr(random.randint(65, 90)) for x in range(100)]
...     for i in it.count():
...             yield ''.join(random.sample(characters, 3))
... 
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
...     items = [(-value, key) for key, value in the_dict.items()]
...     smallest = heapq.nsmallest(10, items)
...     return [(key, -value) for value, key in smallest]
... 
>>> def with_sorted(the_dict):
...     return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
... 
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902

With 3000 values it's just slightly faster than the sorted version, which is O(n log n), as opposed to roughly O(n + m log n) for the heap-based selection of the m largest. If we increase the size of the dict to 10000, the heapq version becomes even faster:

>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549

The timings also depend on the machine you are running on, so you should profile both solutions in your own setting. If efficiency is not critical, I'd suggest the sorted version because it's simpler.
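As an aside, heapq.nlargest accepts a key function, which avoids the sign inversion entirely; a minimal sketch with made-up counts:

```python
import heapq

the_dict = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}
# nlargest takes a key, so there is no need to negate the counts.
top = heapq.nlargest(3, the_dict.items(), key=lambda item: item[1])
# top == [('def', 250), ('xyz', 200), ('abc', 100)]
```

The result comes back as (key, value) pairs already sorted largest-first.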

Bakuriu

Using a heap is a good solution, with time complexity O(n log k), where n is the number of items and k is 10 here.

Now the trick with the key mapping is that we can create a wrapper class for comparisons and define the magic methods __lt__() and __gt__(), which override the < and > operators:

import heapq

class CompareWord:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):   # overrides the < operator
        return self.value < other.value

    def __gt__(self, other):   # overrides the > operator
        return self.value > other.value

    def getWord(self):
        return self.word

def findKGreaterValues(compare_dict, k):
    min_heap = []
    for word in compare_dict:
        heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    answer = []
    for compare_word_obj in min_heap:
        answer.append(compare_word_obj.getWord())
    return answer
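heapq compares tuples element-wise, so the same bounded min-heap works without a wrapper class if you push (count, word) tuples; on equal counts the comparison falls back to the words themselves, which addresses the original concern about equal values. A minimal sketch:

```python
import heapq

def top_k(counts, k):
    # Bounded min-heap of (count, word) tuples: the root is always the
    # smallest count kept so far, so push-then-pop retains the k largest.
    heap = []
    for word, count in counts.items():
        heapq.heappush(heap, (count, word))
        if len(heap) > k:
            heapq.heappop(heap)
    # Pop off in ascending order, then reverse for largest-first.
    return [heapq.heappop(heap)[1] for _ in range(len(heap))][::-1]

# top_k({'a': 1, 'b': 5, 'c': 3, 'd': 4}, 2) == ['b', 'd']
```

Ties on the count are broken by comparing the words, so the selection among equal counts is deterministic rather than arbitrary.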
Manish Chauhan

For getting the top 10 elements, assuming the count is the second element of each item:

from operator import itemgetter

topten = sorted(mydict.items(), key=itemgetter(1), reverse=True)[0:10]

If you want to sort by value and then by key, just change it to key=itemgetter(1, 0).

As for a data structure, a heap sounds like what you want. Just keep the items as tuples, and compare on the count term.

Stephen

If the dictionary remains constant, then instead of building a heap directly or via collections.Counter, you can sort the dictionary items by value in reverse order and take the first 10 elements. If you need a dictionary again afterwards, you have to recreate it from the resulting tuples.

>>> import operator, random, string
>>> some_dict = {string.ascii_lowercase[random.randint(0,23):][:3]:random.randint(100,300) for _ in range(100)}
>>> some_dict
{'cde': 262, 'vwx': 175, 'xyz': 163, 'uvw': 288, 'qrs': 121, 'mno': 192, 'ijk': 103, 'abc': 212, 'wxy': 206, 'efg': 256, 'opq': 255, 'tuv': 128, 'jkl': 158, 'pqr': 291, 'fgh': 191, 'lmn': 259, 'rst': 140, 'hij': 192, 'nop': 202, 'bcd': 258, 'klm': 145, 'stu': 293, 'ghi': 264, 'def': 260}
>>> sorted(some_dict.items(), key = operator.itemgetter(1), reverse = True)[:10]
[('stu', 293), ('pqr', 291), ('uvw', 288), ('ghi', 264), ('cde', 262), ('def', 260), ('lmn', 259), ('bcd', 258), ('efg', 256), ('opq', 255)]

If you are using heapq, creating the heap takes O(n log n) operations when you build it by inserting the elements one at a time, or O(n) if you heapify a list, followed by O(m log n) operations to fetch the top m elements.

If you are sorting the items, Python's sort is guaranteed to be O(n log n) in the worst case (see Timsort), and fetching the first 10 elements is a constant-time slice.
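collections.Counter, mentioned above, packages the heap-based selection for you: in CPython, most_common(m) delegates to heapq.nlargest. A sketch with made-up counts:

```python
from collections import Counter

counts = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}
# Counter accepts an existing mapping of counts directly.
top3 = Counter(counts).most_common(3)
# top3 == [('def', 250), ('xyz', 200), ('abc', 100)]
```

This keeps the code to one line while still avoiding a full sort when m is much smaller than n.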

Abhijit
  • Actually building the heap is `O(n)` and not `O(nlogn)`. From the documentation: "heapify(x) # transforms list into a heap, in-place, **in linear time**" – Bakuriu Feb 10 '13 at 06:57

You can implement the __lt__ method in your class, where you specify which attribute should be compared:

def __lt__(self, other):
    return self.attribute < other.attribute


Selina

Imagine a dict like so (mapping of a-z with a=1 and z=26):

>>> d={k:v for k,v in zip((chr(i+97) for i in range(26)),range(1,27))}
>>> d
{'g': 7, 'f': 6, 'e': 5, 'd': 4, 'c': 3, 'b': 2, 'a': 1, 'o': 15, 'n': 14, 'm': 13, 'l': 12, 'k': 11, 'j': 10, 'i': 9, 'h': 8, 'w': 23, 'v': 22, 'u': 21, 't': 20, 's': 19, 'r': 18, 'q': 17, 'p': 16, 'z': 26, 'y': 25, 'x': 24}

Now you can do this:

>>> v=list(d.values())
>>> k=list(d.keys())
>>> [k[v.index(i)] for i in sorted(d.values(),reverse=True)[0:10]]
['z', 'y', 'x', 'w', 'v', 'u', 't', 's', 'r', 'q']

You also stated that some values of the mapping will be equal. Now let's update d so it has the letters A-Z with the mapping 1-26:

>>> d.update({k:v for k,v in zip((chr(i+65) for i in range(26)),range(1,27))})

Now both A-Z and a-z map to 1-26:

>>> d
{'G': 7, 'F': 6, 'E': 5, 'D': 4, 'C': 3, 'B': 2, 'A': 1, 'O': 15, 'N': 14, 'M': 13, 'L': 12, 'K': 11, 'J': 10, 'I': 9, 'H': 8, 'W': 23, 'V': 22, 'U': 21, 'T': 20, 'S': 19, 'R': 18, 'Q': 17, 'P': 16, 'Z': 26, 'Y': 25, 'X': 24, 'g': 7, 'f': 6, 'e': 5, 'd': 4, 'c': 3, 'b': 2, 'a': 1, 'o': 15, 'n': 14, 'm': 13, 'l': 12, 'k': 11, 'j': 10, 'i': 9, 'h': 8, 'w': 23, 'v': 22, 'u': 21, 't': 20, 's': 19, 'r': 18, 'q': 17, 'p': 16, 'z': 26, 'y': 25, 'x': 24} 

So with duplicate mappings, the only sensible result is to return a list of keys that have the value:

>>> [[k[x] for x,z in enumerate(v) if z==i ] for i in sorted(d.values(),reverse=True)[0:10]]
[['Z', 'z'], ['Z', 'z'], ['Y', 'y'], ['Y', 'y'], ['X', 'x'], ['X', 'x'], ['W', 'w'], ['W', 'w'], ['V', 'v'], ['V', 'v']]

And you could use heapq here:

[[k[x] for x,z in enumerate(v) if z==i ] for i in heapq.nlargest(10,v)]

You did not state what you want to do with the duplicate results, so I assume you want those duplicates eliminated while the result list remains n long.

This does that:

def topn(d, n):
    res = []
    v = list(d.values())
    k = list(d.keys())
    sl = [[k[x] for x, z in enumerate(v) if z == i] for i in sorted(v)]
    while len(res) < n and sl:
        e = sl.pop()
        if e not in res:
            res.append(e)
    return res

>>> d={k:v for k,v in zip((chr(i+97) for i in range(26)),range(1,27))}
>>> d.update({k:v for k,v in zip((chr(i+65) for i in range(0,26,2)),range(1,27,2))})  
>>> topn(d,10)
[['z'], ['Y', 'y'], ['x'], ['W', 'w'], ['v'], ['U', 'u'], ['t'], ['S', 's'], ['r'], ['Q', 'q']]
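The grouping of equal values can also be done in one pass with collections.defaultdict, which avoids the repeated index scans of the list comprehensions above; a sketch on a tiny made-up dict:

```python
from collections import defaultdict

d = {'A': 1, 'a': 1, 'B': 2, 'c': 3}
# Invert the dict: collect the keys that share each value.
groups = defaultdict(list)
for key, value in d.items():
    groups[value].append(key)
# Walk the distinct values largest-first and emit each group of keys.
top = [sorted(groups[v]) for v in sorted(groups, reverse=True)[:2]]
# top == [['c'], ['B']]
```

Because each distinct value appears once in groups, the duplicate elimination comes for free.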
dawg

Bakuriu's answer is correct (use heapq.nlargest).

But if you're interested in the right algorithm to use, quickselect uses a similar principle to quicksort, and was invented by the same person: C.A.R. Hoare.

It differs, however, by not fully sorting the array: after finishing, if you asked for the top n elements then they're in the first n positions in the array, but not necessarily in sorted order.

Like quicksort, it starts by choosing a pivot element and pivoting the array so that all a[:j] are less than or equal to a[j] and all a[j+1:] are greater than a[j].

Next, if j == n, then the largest elements are a[:j]. If j > n, then quickselect is called recursively only on the elements left of the pivot. And if j < n then quickselect is called on the elements to the right of the pivot to extract the n - j - 1 largest elements from those.

Because quickselect recurses on only one side of the array (unlike quicksort, which recurses on both), it runs in linear time on average (if the input is randomly ordered and there are no repeated keys). This also makes it easy to turn the recursive call into a while loop.

Here's some code. To help understand it, the invariants in the outer while loop are that the elements xs[:lo] are guaranteed to be in list of n largest, and that the elements xs[hi:] are guaranteed not to be in the n largest.

import random

def largest_n(xs, n):
    lo, hi = 0, len(xs)
    while hi > n:
        i, j = lo, hi
        # Pivot the list on xs[lo]
        while True:
            while i < hi and xs[i] >= xs[lo]:
                i += 1
            j -= 1
            while j >= lo and xs[j] < xs[lo]:
                j -= 1
            if i > j:
                break
            xs[i], xs[j] = xs[j], xs[i]
        # Move the pivot to xs[j]
        if j > lo:
            xs[lo], xs[j] = xs[j], xs[lo]
        # Repeat on one side or the other based on the location of the pivot.
        if n <= j:
            hi = j
        else:
            lo = j + 1
    return xs[:n]


for k in range(100):
    xs = list(range(1000))
    random.shuffle(xs)
    xs = largest_n(xs, 10)
    assert sorted(xs) == list(range(990, 1000))
    print(xs)
Paul Hankin
  • In the worst case quickselect can go to O(n^2), though that rarely happens unless the pivot selected is the max element. I had median-of-medians and quickselect implemented; I also wanted a heap to do a comparison. To overcome this in quickselect you can take about 3 random numbers from the list and use their median as the pivot. – gizgok Feb 12 '13 at 04:24
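The randomized median-of-three pivot the comment describes can be sketched like this (median_of_three_pivot is a hypothetical helper; largest_n above would call it instead of always pivoting on xs[lo]):

```python
import random

def median_of_three_pivot(xs, lo, hi):
    # Sample three random indices in [lo, hi) and return the index whose
    # value is the median of the three sampled values.
    samples = [random.randrange(lo, hi) for _ in range(3)]
    samples.sort(key=lambda i: xs[i])
    return samples[1]
```

Swapping xs[lo] with this index before partitioning makes the adversarial O(n^2) case vanishingly unlikely, at the cost of a few extra comparisons per round.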

How about the following? It is O(n * len(xs)), which is linear in len(xs) for a fixed n.

You simply swap into each of the first n positions the largest of the remaining elements.

def largest_n(xs, n):
    for i in range(n):
        for j in range(i + 1, len(xs)):
            if xs[j] > xs[i]:
                xs[i], xs[j] = xs[j], xs[i]
    return xs[:n]
Lars Prins