
I have a big list of lists of tuples like

actions  = [ [('d', 'r'), ... ('c', 'e'),('', 'e')],
             [('r', 'e'), ... ('c', 'e'),('d', 'r')],
                                    ... , 
             [('a', 'b'), ... ('c', 'e'),('c', 'h')]
           ]

and I want to find the co-occurrences of the tuples.

I have tried the suggestions from this question, but the accepted answer is just too slow. For example, on a list of 1494 lists of tuples, the resulting dictionary had 18225703 entries and took hours to compute just for 2-tuple co-occurrences. So plain permutation-and-counting doesn't seem to be the answer, since my real list is even bigger.
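
Roughly, what I have implemented looks like this (a sketch; the function name is just illustrative):

from itertools import combinations

def brute_force_counts(actions):
    # Every distinct tuple seen in any inner list.
    unique_tuples = set(t for inner in actions for t in inner)
    # One entry per pair of distinct tuples (18225703 entries in my case).
    counts = {}
    for a, b in combinations(unique_tuples, 2):
        counts[(a, b)] = sum(1 for inner in actions if a in inner and b in inner)
    return counts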

I expect the output to extract the groups of tuples that co-occur most often, as pairs (2) or more (3, 4, 5 at most). Using the previous list as an example:

('c', 'e'),('d', 'r') 

would be a common co-occurrence when searching for pairs, since those two tuples appear together frequently. Is there an efficient method to achieve this?

jcosta
  • Have you tried the other answers using the `itertools` module? – Alexandre B. Jul 23 '19 at 16:51
  • This answer seems quite clean and efficient: https://stackoverflow.com/a/49079618/1467943 – ales_t Jul 23 '19 at 16:52
  • Have you tried `Counter`, i.e. `Counter(x for i in actions for x in i)`? – Henry Yik Jul 23 '19 at 17:13
  • @AlexandreB. yes, the solution I have implemented uses itertools to extract the unique tuples, but again it takes way too much time: it finds 6038 unique tuples, then I create the permutations between them and count the occurrences, resulting in a huge dictionary / sparse matrix. – jcosta Jul 23 '19 at 20:10
  • @HenryYik using Counter will only give the number of occurrences of every single tuple; I need the co-occurrences. – jcosta Jul 23 '19 at 20:14
  • What do you mean by co-occurrence? – Akaisteph7 Jul 25 '19 at 03:43

1 Answer


I think there is no hope for a faster algorithm: you have to compute the combinations in order to count them. However, if there is a threshold of co-occurrences below which you are not interested, you can try to reduce the complexity of the algorithm. In both cases, there is hope for lower space complexity.

Let's take a small example:

>>> actions  = [[('d', 'r'), ('c', 'e'),('', 'e')],
...             [('r', 'e'), ('c', 'e'),('d', 'r')],
...             [('a', 'b'), ('c', 'e'),('c', 'h')]]

General answer

This answer is probably the best for a large list of lists, but you can avoid creating intermediate lists. First, create an iterable over all pairs of elements that appear in the same list (the elements are pairs too in your case, but that doesn't matter):

>>> import itertools
>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)

If we want to see the result, we have to consume the iterable:

>>> list(it)
[(('d', 'r'), ('c', 'e')), (('d', 'r'), ('', 'e')), (('c', 'e'), ('', 'e')), (('r', 'e'), ('c', 'e')), (('r', 'e'), ('d', 'r')), (('c', 'e'), ('d', 'r')), (('a', 'b'), ('c', 'e')), (('a', 'b'), ('c', 'h')), (('c', 'e'), ('c', 'h'))]

Then count the sorted pairs (with a fresh it, since the previous one was consumed):

>>> it = itertools.chain.from_iterable(itertools.combinations(pair_list, 2) for pair_list in actions)
>>> from collections import Counter
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2, (('', 'e'), ('d', 'r')): 1, (('', 'e'), ('c', 'e')): 1, (('c', 'e'), ('r', 'e')): 1, (('d', 'r'), ('r', 'e')): 1, (('a', 'b'), ('c', 'e')): 1, (('a', 'b'), ('c', 'h')): 1, (('c', 'e'), ('c', 'h')): 1})
>>> c.most_common(2)
[((('c', 'e'), ('d', 'r')), 2), ((('', 'e'), ('d', 'r')), 1)]

At least in terms of space, this solution should be efficient since everything is lazy and the number of elements of the Counter is the number of combinations of elements that appear in the same list, that is at most N(N-1)/2 where N is the number of distinct elements in all the lists ("at most" because some elements never "meet" each other and therefore some combinations never happen).

The time complexity is O(M·L^2), where M is the number of lists and L is the size of the largest list.
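
If it helps, the general answer fits in a small reusable helper. This is just a sketch: the name cooccurrence_counter is mine, and sorting each inner list up front replaces the explicit (a,b)/(b,a) normalization while also allowing groups larger than 2:

import itertools
from collections import Counter

def cooccurrence_counter(lists, size=2):
    # Sorting each inner list means every emitted combination is already in
    # canonical order, so identical groups from different lists share one key.
    groups = itertools.chain.from_iterable(
        itertools.combinations(sorted(inner), size) for inner in lists)
    return Counter(groups)

>>> cooccurrence_counter(actions).most_common(1)
[((('c', 'e'), ('d', 'r')), 2)]
>>> cooccurrence_counter(actions, size=3)[(('', 'e'), ('c', 'e'), ('d', 'r'))]
1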

With a threshold on the number of co-occurrences

I assume that all elements within a list are distinct. The key idea is that if an element is present in only one list, then this element has strictly no chance to beat anyone at this game: it will have 1 co-occurrence with each of its neighbors, and 0 with the elements of the other lists. If there are a lot of "orphans", it might be useful to remove them before computing the combinations:

>>> d = Counter(itertools.chain.from_iterable(actions))
>>> d
Counter({('c', 'e'): 3, ('d', 'r'): 2, ('', 'e'): 1, ('r', 'e'): 1, ('a', 'b'): 1, ('c', 'h'): 1})
>>> orphans = set(e for e, c in d.items() if c <= 1)
>>> orphans
{('a', 'b'), ('r', 'e'), ('c', 'h'), ('', 'e')}

Now, try the same algorithm:

>>> it = itertools.chain.from_iterable(itertools.combinations((p for p in pair_list if p not in orphans), 2) for pair_list in actions)
>>> c = Counter((a,b) if a<=b else (b,a) for a,b in it)
>>> c
Counter({(('c', 'e'), ('d', 'r')): 2})

Note the comprehension: parentheses, not brackets, so it is a generator expression and no intermediate list is built.

If you have K orphans in a list of N elements, your number of combinations for that list falls from N(N-1)/2 to (N-K)(N-K-1)/2, that is (if I'm not mistaken!) K(2N-K-1)/2 combinations fewer; for instance, with N = 10 and K = 4 you go from 45 to 15 combinations, i.e. 30 fewer.

This can be generalized: if an element is present in two or fewer lists, then it will have at most 2 co-occurrences with any other element, and so on; a sketch follows.
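
A sketch of that generalization, under the same assumption that elements are distinct within each list (the threshold argument min_lists and the helper name are mine):

import itertools
from collections import Counter

def frequent_elements(lists, min_lists=2):
    # Elements are distinct within each list, so an element's total count
    # equals the number of lists it appears in.
    counts = Counter(itertools.chain.from_iterable(lists))
    return {e for e, c in counts.items() if c >= min_lists}

keep = frequent_elements(actions, min_lists=2)
it = itertools.chain.from_iterable(
    itertools.combinations([p for p in pair_list if p in keep], 2)
    for pair_list in actions)
c = Counter((a, b) if a <= b else (b, a) for a, b in it)

With min_lists=2 this reproduces the orphan filter above; any pair containing a removed element could have co-occurred at most min_lists - 1 times, so nothing that could reach the threshold is lost.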

If this is still too slow, then switch to a faster language.

jferard