Find max overlap in list of lists

Question

I have two lists of lists:

a = [[0, 1, 5], [2], [3], [4], [6, 7], [8, 9, 10, 11], [12], [13], [14], [15]]
b = [[0, 1], [2, 3], [4], [5], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

How can I find the maximum overlap between the values of the lists and build a new list of lists with this maximum overlap? In other words, I'm looking for a function f that maximizes the list sizes by merging lists that overlap.

The desired result of function f for this example would be:

f(a,b) = [[0, 1, 5], [2, 3], [4], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

Say `a` contains `[1,2],[3,4]` and `b` contains `[2,3]`: should the result contain `[1,2,3,4]`? — Willem Van Onsem, Mar 06 '17 at 13:56
If the case mentioned by @WillemVanOnsem in the comments is true, then why is `[6, 7], [8, 9, 10, 11]` not present as `[6, 7, 8, 9, 10, 11]` in the desired output? — Moinuddin Quadri, Mar 06 '17 at 13:58
@MoinuddinQuadri Because there is no *bridge* (`[7, 8]`), I assume. — Ma0, Mar 06 '17 at 13:59
@MoinuddinQuadri: because there is no list that contains elements of both lists. You only *unify* them if there is a list that has elements in both. — Willem Van Onsem, Mar 06 '17 at 13:59
@Chris_Rands I'm wondering if there is already a function for this. Before implementing it on my own, I wanted to check that. But I have no idea how to search for that... — elcombato, Mar 06 '17 at 13:59
My feeling tells me that this problem might have *non-unique* solutions — Ma0, Mar 06 '17 at 14:07
@StefanPochmann It is guaranteed that no item appears in multiple lists within `a` or `b` — elcombato, Mar 06 '17 at 14:09

score 8 · Accepted Answer · edited May 23 '17 at 12:09

You can use a variant of the disjoint-set structure to solve this problem: for each list [a,b,c] you unify a with b and a with c. You do this for both lists of lists and then group the elements by their resulting roots.

Here is a simple disjoint-set algorithm we can modify:

from collections import defaultdict

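def parent(u,mapping):
    # Find the root of u, compressing the path along the way.
    if mapping[u] == u:
        return u
    mapping[u] = parent(mapping[u],mapping)
    return mapping[u]

def relation(array,mapping=None):
    # Union all elements of each sublist into one set.
    if mapping is None:
        mapping = {}

    for e in array:
        if len(e) > 0:
            u = e[0]
            if u not in mapping:
                mapping[u] = u  # new element starts as its own root
            for v in e[1:]:
                if v not in mapping:
                    mapping[v] = v
                # Link the root of u to the root of v.
                mapping[parent(u,mapping)] = parent(v,mapping)
    return mapping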

def f(a,b):
    mapping = {}
    relation(a,mapping)
    relation(b,mapping)

    results = defaultdict(set)
    for u in mapping.keys():
        results[parent(u,mapping)].add(u)
    return [list(x) for x in results.values()]

(boldface added for the semantic differences from the original disjoint-set algorithm).

This produces:

>>> f(a,b)
[[2, 3], [4], [0, 1, 5], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

The result is not sorted, since we work with sets. Nevertheless, you can easily sort it on the first element of each list if you want by altering f to:

def f(a,b):
    mapping = {}
    relation(a,mapping)
    relation(b,mapping)

    results = defaultdict(set)
    for u in mapping.keys():
        results[parent(u,mapping)].add(u)
    return sorted([sorted(x) for x in results.values()],key=lambda t:t[0])

which produces:

>>> f(a,b)
[[0, 1, 5], [2, 3], [4], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

The nice thing about this solution is that it also works if there is overlap within a or b itself, and you can easily generalize it to work with an arbitrary number of lists (for instance a, b and c).
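
As a rough illustration of that generalization, here is a minimal, self-contained sketch that accepts any number of lists of lists. The names `find` and `merge_all` and the third list `c` are hypothetical and not part of the code above; the find step is written iteratively rather than recursively, which is a substitution on my part, and the output is sorted so that it is deterministic.

```python
from collections import defaultdict

def find(u, parent):
    # Hypothetical helper: iterative root lookup with path compression.
    root = u
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[u] != root:
        parent[u], u = root, parent[u]
    return root

def merge_all(*lists_of_lists):
    # Hypothetical generalization of f to an arbitrary number of inputs.
    parent = {}
    for groups in lists_of_lists:
        for group in groups:
            for v in group:
                parent.setdefault(v, v)  # each new element is its own root
            for v in group[1:]:
                # Link the root of the first element to the root of v.
                parent[find(group[0], parent)] = find(v, parent)
    components = defaultdict(list)
    for v in parent:
        components[find(v, parent)].append(v)
    # Sort inside each component and across components.
    return sorted(sorted(c) for c in components.values())

a = [[0, 1, 5], [2], [3], [4], [6, 7], [8, 9, 10, 11], [12], [13], [14], [15]]
b = [[0, 1], [2, 3], [4], [5], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]
c = [[3, 4], [11, 12]]  # made-up third input, for illustration only

print(merge_all(a, b))
print(merge_all(a, b, c))
```

With the made-up `c`, the groups `[2, 3]` and `[4]` are joined through `[3, 4]`, and `[8, 9, 10, 11]` and `[12]` through `[11, 12]`.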

score 0 · Answer 2 · answered Mar 06 '17 at 14:46

If I understood it correctly, the following will do it:

result = (
    [l for l in a if not any(all(x in l2 for x in l) for l2 in b)]
    + [l for l in b if not any(all(x in l2 for x in l) for l2 in a)]
    + [l for l in a if l in b]
)

The first term yields all lists in a which are not contained in any list in b; the second term yields all lists in b which are not contained in any list in a; the third term yields all lists which are in both a and b.

For your example this yields the following result:

[[0, 1, 5], [2, 3], [13, 14], [4], [6, 7], [8, 9, 10, 11], [12], [15]]

This will not always generate the transitive closure I think. Say you have `a = [[1,2],[3,4],[5,6]]` and `b = [[2,3],[4,5]]` the result is `[[1, 2], [3, 4], [5, 6], [2, 3], [4, 5]]` whereas the expected result is `[[1,2,3,4,5,6]]`... — Willem Van Onsem, Mar 06 '17 at 14:52
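
The counterexample from the comment can be checked directly. This is only a sketch: the wrapper name `containment_merge` is hypothetical, and it simply wraps the expression from this answer in a function.

```python
def containment_merge(a, b):
    # Hypothetical wrapper around the three-term expression from the answer.
    return (
        [l for l in a if not any(all(x in l2 for x in l) for l2 in b)]
        + [l for l in b if not any(all(x in l2 for x in l) for l2 in a)]
        + [l for l in a if l in b]
    )

a = [[1, 2], [3, 4], [5, 6]]
b = [[2, 3], [4, 5]]

# No sublist is fully contained in a sublist of the other input,
# so nothing is merged and all five sublists are returned unchanged.
print(containment_merge(a, b))
# [[1, 2], [3, 4], [5, 6], [2, 3], [4, 5]]
```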
