I have a list of element, label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]

I have to count how many labels two elements have in common, i.e. in the list above e1 and e3 share the label l1 and thus have 1 label in common.

I have this Python implementation:

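from collections import defaultdict

def common_count(e_l_list):
    count = defaultdict(int)     # (larger element, smaller element) -> number of shared labels
    l_list = defaultdict(set)    # label -> elements already seen with that label

    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1,e2] += 1
            else:
                count[e2,e1] += 1

        l_list[l].add(e1)

    return count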

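It takes a list like the one above and computes a dictionary of element pairs and counts. For the list above the result should hold a single entry for the pair e1, e3 with a count of 1. The key is stored with the larger element first, so it is {(e3, e1): 1} if e3 > e1.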
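To make the expected output concrete, here is a small self-contained run of the function on made-up integer data. The element values 1, 2, 3 and the labels 10, 20 are assumptions chosen only for illustration.

```python
from collections import defaultdict

def common_count(e_l_list):
    """Count shared labels for every pair of distinct elements."""
    count = defaultdict(int)
    l_list = defaultdict(set)  # label -> elements seen so far with that label
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            # Store each pair once, larger element first.
            key = (e1, e2) if e1 > e2 else (e2, e1)
            count[key] += 1
        l_list[l].add(e1)
    return count

# Hypothetical input: element 3 shares label 10 with element 1
# and label 20 with element 2.
pairs = [(1, 10), (2, 20), (3, 10), (3, 20)]
result = dict(common_count(pairs))
print(result)
```

Elements 1 and 2 have no label in common, so no key is created for that pair.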

Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.

How would I implement the above in pure Cython?

It can be assumed that all elements and labels are unsigned integers.

Thanks in advance :-)

user28906
  • Would it make more sense to have the label as the key and the elements as the values for that label? If you set it up that way, then converting a list of 10 million elements and ~50000 labels to a dict only takes a few seconds. Switching keys and values while sorting the elements takes roughly half the time to make the dict. – SirParselot Feb 05 '16 at 15:19
  • I'm not sure I follow you... how would I be able to look up how many common labels 2 elements have? – user28906 Feb 05 '16 at 15:28
  • Ah, I misunderstood your question. I thought you were looking for all elements that shared a label. So your labels and elements will not necessarily be unique? – SirParselot Feb 05 '16 at 15:33
  • Nope, I need the specific count of labels in common between two elements :-) – user28906 Feb 05 '16 at 15:38
  • Well, I think you are overcomplicating it a bit then. You can create a dict keyed on element with the corresponding labels as values. Then, when you want to compare two elements, turn their lists into sets, intersect them, and take the length of the result. – SirParselot Feb 05 '16 at 15:43

1 Answer

I think you are overcomplicating this by creating pairs of elements and storing the number of common labels as the value. Instead, you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find the common labels of two elements, convert their lists to sets and intersect them; the resulting set holds the labels the two have in common. In my test over 10,000 random pairs (~20,000 lists), the script printed roughly 0.006 for this step, which is very fast.

I tested this with the following code:

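from collections import defaultdict
import random
import time

# Build 10,000,000 random (element, label) pairs.
l = []
for i in range(10000000):
    # With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0, 50000), random.randrange(0, 50000)))

start = time.perf_counter()
d = defaultdict(list)
for element, label in l:  # O(n) overall
    d[element].append(label)  # amortized O(1) per append

print(time.perf_counter() - start)  # dictionary creation time

times = []
for i in range(10000):
    start = time.perf_counter()
    tmp = set(d[random.randrange(0, 50000)])  # picks a random list of labels
    tmp2 = set(d[random.randrange(0, 50000)])  # not guaranteed to be a different list but more than likely
    # The timer stops here, so only the two set() conversions are timed.
    times.append(time.perf_counter() - start)
    common_elements = tmp.intersection(tmp2)
# Divides by 100.0 although 10000 samples were collected,
# so this prints 100x the per-iteration mean.
print(sum(times) / 100.0)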

18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection

Note: The times change slightly depending on the number of labels. Creating the dict might also take too long for your situation, but it is a one-time operation.
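The lookup itself can be sketched as two small functions. This is only a minimal illustration under some assumptions: the names build_label_index and common_labels are made up here, and the labels are stored as sets up front rather than as lists converted at query time, which also ignores duplicate (element, label) pairs.

```python
from collections import defaultdict

def build_label_index(pairs):
    """Map each element to the set of labels attached to it."""
    labels_of = defaultdict(set)
    for element, label in pairs:
        labels_of[element].add(label)
    return labels_of

def common_labels(labels_of, a, b):
    """Return how many labels elements a and b share."""
    # .get avoids creating empty entries for unknown elements.
    return len(labels_of.get(a, set()) & labels_of.get(b, set()))

# Hypothetical data: elements 1 and 3 share labels 10 and 20.
index = build_label_index([(1, 10), (2, 20), (3, 10), (3, 20), (1, 20)])
print(common_labels(index, 1, 3))
print(common_labels(index, 1, 2))
```

The index grows with the number of (element, label) pairs, not with the number of element pairs, and a count is only computed for the pairs you actually ask about.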

I would also strongly recommend against creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at about 1.25e+13 pairs or, more bluntly, 12.5 trillion. Storing them would take ~1700 terabytes, or ~1.7 petabytes.
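The arithmetic behind that estimate can be checked quickly. The pair count is n choose 2. The 136 bytes per stored pair used below is not a measured value; it is an assumption back-computed from the ~1.7 petabyte figure, and the real per-entry cost of a Python dict with tuple keys may differ.

```python
from math import comb

def pair_count(n_elements):
    """Number of unordered pairs of distinct elements: n choose 2."""
    return comb(n_elements, 2)

def storage_terabytes(n_pairs, bytes_per_pair):
    """Storage in terabytes (1 TB = 1e12 bytes) for a given per-pair cost."""
    return n_pairs * bytes_per_pair / 1e12

pairs = pair_count(5_000_000)
print(pairs)  # about 1.25e13

# Assumed per-pair cost, inferred from the estimate above, not measured.
ASSUMED_BYTES_PER_PAIR = 136
print(storage_terabytes(pairs, ASSUMED_BYTES_PER_PAIR))
```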

SirParselot
  • I'm sorry, but it's not really helpful to suggest something that's completely different. I'm interested in solving the problem as stated because it's important that it gets solved like that for the rest of the program flow. – user28906 Feb 10 '16 at 17:04
  • @user28906 Using a map is barely going to save you any time. The expensive part is creating all the pairs. If your program needs the pairs, then you probably need to rethink how you are doing this. The easiest way isn't always the best way. – SirParselot Feb 10 '16 at 18:06