3

I have a list of tuples (each tuple consists of 2 numbers) like:

array = [(1, 2), (1, 3), (2, 4), (5, 8), (8, 10)]

Let's say these numbers are ids of some db objects (records), and each tuple holds the ids of duplicate objects. So 1 and 2 are duplicates, and 1 and 3 are duplicates, which means 2 and 3 are also duplicates.

if a == b and b == c then a == c

Now I want to merge all these duplicate object ids into a single tuple per group, like this:

output = [(1, 2, 3, 4), (5, 8, 10)]

I know I can do this using loops and redundant matches. I just want a better solution with less processing (if there is one).

Moinuddin Quadri
Zohaib Ijaz
  • 1
    You will - afaik - never get rid of at least one loop to do this. And probably multiple will be necessary. You can however use a disjoint-set datastructure to make it rather efficient: https://en.wikipedia.org/wiki/Disjoint-set_data_structure – Willem Van Onsem Feb 06 '17 at 13:42
  • Why don't you show your attempt with loops; this can be a good starting point for others to improve on – Chris_Rands Feb 06 '17 at 13:45

4 Answers

4

You can use a data structure that makes merging more efficient. Here you build a kind of inverted tree, in which every node points up to its parent. So in your example you would first create a node for each number listed:

1  2  3  4  5  8  10

Now if you iterate over the (1,2) tuple, you look up 1 and 2 in some sort of dictionary. You search their ancestors (there are none here) and then you create some sort of merge node:

1  2  3  4  5  8  10
 \/
 12

Next we merge (1,3) so we look up the ancestor of 1 (12) and 3 (3) and perform another merge:

1  2  3  4  5  8  10
 \/   |
 12  /
   \/
  123

Next we merge (2,4) and (5,8) and (8,10):

1  2  3  4  5  8  10
 \/   |  |   \/   |
 12  /   |   58  /
   \/   /      \/
  123  /      5810
     \/
    1234

You also keep a list of the "merge-heads" so you can easily return the elements.

Time to get our hands dirty

So now that we know how to construct such a datastructure, let's implement one. First we define a node:

class Merge:

    def __init__(self,value=None,parent=None,subs=()):
        self.value = value
        self.parent = parent
        self.subs = subs

    def get_ancestor(self):
        cur = self
        while cur.parent is not None:
            cur = cur.parent
        return cur

    def __iter__(self):
        if self.value is not None:
            yield self.value
        elif self.subs:
            for sub in self.subs:
                for val in sub:
                    yield val

First we collect the set of all values that appear in your list:

vals = set(x for tup in array for x in tup)

Then we create a dictionary that maps every element of vals to its own Merge node:

dic = {val:Merge(val) for val in vals}

and the merge_heads:

merge_heads = set(dic.values())

Now for each tuple in the array, we look up the Merge object that is the ancestor of each element, create a new Merge on top of the two, remove the two old heads from the merge_heads set and add the new merge to it:

for frm,to in array:
    mra = dic[frm].get_ancestor()
    mrb = dic[to].get_ancestor()
    if mra is mrb:
        # already in the same group; merging again would raise KeyError below
        continue
    mr = Merge(subs=(mra,mrb))
    mra.parent = mr
    mrb.parent = mr
    merge_heads.remove(mra)
    merge_heads.remove(mrb)
    merge_heads.add(mr)

Finally after we have done that we can simply construct a set for each Merge in merge_heads:

resulting_sets = [set(merge) for merge in merge_heads]

and resulting_sets will be (order may vary):

[{1, 2, 3, 4}, {8, 10, 5}]

Putting it all together (without class definition):

vals = set(x for tup in array for x in tup)
dic = {val:Merge(val) for val in vals}
merge_heads = set(dic.values())
for frm,to in array:
    mra = dic[frm].get_ancestor()
    mrb = dic[to].get_ancestor()
    if mra is mrb:
        # already in the same group; merging again would raise KeyError below
        continue
    mr = Merge(subs=(mra,mrb))
    mra.parent = mr
    mrb.parent = mr
    merge_heads.remove(mra)
    merge_heads.remove(mrb)
    merge_heads.add(mr)
resulting_sets = [set(merge) for merge in merge_heads]

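In the worst case this runs in O(n^2), because get_ancestor may have to climb a chain of up to n merge nodes. You can balance the tree (for example with union by size or rank) so that the ancestor is found in O(log n) instead, making it O(n log n) overall. Furthermore you can short-circuit the chain of ancestors (path compression), making it even faster.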
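The two optimizations mentioned above can be sketched as follows. This is a minimal illustration rather than a drop-in replacement for the Merge class: it uses plain dictionaries instead of node objects, and the names DisjointSet and merge_pairs are made up for this example.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with union by size and path compression (illustrative)."""

    def __init__(self):
        self.parent = {}  # value -> parent value (roots point to themselves)
        self.size = {}    # root value -> number of elements in its tree

    def find(self, x):
        # Register unseen values as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: locate the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point visited nodes at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return  # already in the same group
        # Union by size: hang the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def merge_pairs(pairs):
    ds = DisjointSet()
    for a, b in pairs:
        ds.union(a, b)
    groups = defaultdict(set)
    for x in list(ds.parent):
        groups[ds.find(x)].add(x)
    return list(groups.values())


print(merge_pairs([(1, 2), (1, 3), (2, 4), (5, 8), (8, 10)]))
```

With both optimizations the amortized cost per operation is close to constant, so the whole merge is nearly linear in the number of pairs. The order of the groups and of the elements inside each set is not guaranteed.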

Willem Van Onsem
2

You can use a disjoint-set (union-find) data structure.

A disjoint set is essentially a forest of trees. Treat each number as a tree node. Every time we read an edge (u, v), we join the trees containing u and v (creating a one-node tree first for any number not seen before) by pointing the root of one tree at the root of the other. At the end, we walk through the forest to collect the result.

from collections import defaultdict


def relation(array):

    mapping = {}

    def parent(u):
        if mapping[u] == u:
            return u
        mapping[u] = parent(mapping[u])
        return mapping[u]

    for u, v in array:
        if u not in mapping:
            mapping[u] = u
        if v not in mapping:
            mapping[v] = v
        mapping[parent(u)] = parent(v)

    results = defaultdict(set)

    for u in mapping.keys():
        results[parent(u)].add(u)

    return [tuple(x) for x in results.values()]

In the code above, mapping[u] stores an ancestor of node u (its parent, or the root once the path has been compressed). As a special case, the root of a tree maps to itself, so the node of a one-node tree is its own ancestor.
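To see what the mapping holds at each step, the sketch below records a snapshot of the parent map after every edge of the question's example. The names trace_union_find and find are made up for illustration, and find uses a loop instead of recursion, but the idea is the same as in the code above.

```python
def trace_union_find(pairs):
    """Return a list of (pair, snapshot of the parent map) after each union."""
    parent = {}

    def find(u):
        # Walk up to the root, then point every visited node straight at it.
        root = u
        while parent[root] != root:
            root = parent[root]
        while parent[u] != root:
            parent[u], u = root, parent[u]
        return root

    history = []
    for u, v in pairs:
        parent.setdefault(u, u)  # unseen numbers start as one-node trees
        parent.setdefault(v, v)
        parent[find(u)] = find(v)  # point the root of u's tree at v's root
        history.append(((u, v), dict(parent)))
    return history


history = trace_union_find([(1, 2), (1, 3), (2, 4), (5, 8), (8, 10)])
for pair, snapshot in history:
    print(pair, snapshot)
```

In the last snapshot 1 still points to 2 rather than directly to the root 4. The chain is only flattened the next time find is called on 1, which is why the result is collected through the find function rather than by reading the map directly.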

hsfzxjy
1

See my comment on Moinuddin's answer: the accepted answer does not pass the tests that you can find at http://rosettacode.org/wiki/Set_consolidation#Python. I did not dig into why, though.

I would like to propose another version, based on Willem's answer. The problem with that answer is the repeated climbing in the get_ancestor calls: why climb the whole tree every time we are asked for our ancestor, when we could remember the last root found (and climb further from that point in case it has changed)? Indeed, Willem's algorithm is not linear (something like n log n or n²), and this non-linearity is easy to remove.

Another problem comes from the iterator: if the tree is too deep (I had this problem in my use case), you get a Python exception (maximum recursion depth exceeded) inside the iterator. So instead of building a full binary tree, we merge the leaves: rather than branches with 2 leaves, we build branches with N leaves.

My version of the code is as follows:
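Another way to sidestep the recursion limit, without changing the shape of the tree, is to walk it with an explicit stack. A minimal sketch, assuming a node with the same value and subs fields as Merge (the Node class and the iter_values name are illustrative, not taken from the answers):

```python
class Node:
    """Minimal stand-in for the Merge node: a leaf has a value, a branch has subs."""

    def __init__(self, value=None, subs=()):
        self.value = value
        self.subs = subs


def iter_values(root):
    """Yield the leaf values under root using an explicit stack, not recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.value is not None:
            yield node.value
        else:
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(node.subs))


# Build a deliberately deep, one-sided tree, far deeper than the
# default recursion limit would allow for nested generators.
root = Node(0)
for i in range(1, 5000):
    root = Node(subs=(root, Node(i)))

print(len(list(iter_values(root))))
```

The stack grows with the depth of the tree but lives on the heap, so it is not bounded by the interpreter's recursion limit.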

class Merge:

    def __init__(self,value=None,parent=None,subs=None):
        self.value = value
        self.parent = parent
        self.subs = subs
        self.root = None
        if self.subs:
            subs_a,subs_b = self.subs
            if subs_a.subs:
                subs_a = subs_a.subs
            else:
                subs_a = [subs_a]
            if subs_b.subs:
                subs_b = subs_b.subs
            else:
                subs_b = [subs_b]
            self.subs = subs_a+subs_b

            for s in self.subs:
                s.parent = self
                s.root = None

    def get_ancestor(self):
        cur = self if self.root is None else self.root
        while cur.parent is not None:
            cur = cur.parent
        if cur != self:
            self.root = cur
        return cur

    def __iter__(self):
        if self.value is not None:
            yield self.value
        elif self.subs:
            for sub in self.subs:
                for val in sub:
                    yield val


def treeconsolidate(array):
    vals = set(x for tup in array for x in tup)
    dic = {val:Merge(val) for val in vals}
    merge_heads = set(dic.values())
    for settomerge in array:
        # copy to a list: tuples have no pop(), and the caller's data stays intact
        items = list(settomerge)
        frm = items[0]
        for to in items[1:]:
            mra = dic[frm].get_ancestor()
            mrb = dic[to].get_ancestor()
            if mra == mrb:
                continue
            mr = Merge(subs=[mra,mrb])
            merge_heads.remove(mra)
            merge_heads.remove(mrb)
            merge_heads.add(mr)
    resulting_sets = [set(merge) for merge in merge_heads]
    return resulting_sets

For small merges this will not change much, but in my experience climbing up the tree for huge sets with many elements can cost a lot: in my case, I have to deal with 100k sets, each containing between 2 and 1000 elements, and each element may appear in 1 to 1000 sets...

Goulou
-1

I think the most efficient way to achieve this is to use sets:

def transitive_cloure(array):
    new_list = [set(array.pop(0))]  # initialize first set with value of index `0`

    for item in array:
        for i, s in enumerate(new_list):
            if any(x in s for x in item):
                new_list[i] = new_list[i].union(item)
                break
        else:
            new_list.append(set(item))
    return new_list

Sample run:

>>> transitive_cloure([(1,2), (1,3), (2,4), (5,8), (8,10)])
[{1, 2, 3, 4}, {8, 10, 5}]

Comparison with other answers (on Python 3.4):

  • This answer: 6.238126921001822

    >>> timeit.timeit("moin()", setup="from __main__ import moin")
    6.238126921001822
    
  • Willem's solution: 29.115453064994654 (Time related to declaration of class is excluded)

    >>> timeit.timeit("willem()", setup="from __main__ import willem")
    29.115453064994654
    
  • hsfzxjy's solution: 10.049749890022213

    >>> timeit.timeit("hsfzxjy()", setup="from __main__ import hsfzxjy")
    10.049749890022213
    
Moinuddin Quadri
  • 1
    Thank you Moin. I think your code is short and fast. I also tried this with different values. – Zohaib Ijaz Feb 06 '17 at 15:03
  • This implementation seems wrong : See tests at http://rosettacode.org/wiki/Set_consolidation#Python In particular, the following test fails : transitive_cloure([{A,B}, {C,D}, {D,B}]) gives {{'C', 'D'}, {'A', 'B', 'D'}} instead of [{'A', 'C', 'B', 'D'}] – Goulou Mar 28 '20 at 08:08
  • This answer doesn't work for transitive_cloure([(1,2),(3,4),(2,3)]), and a lot of similar cases – amit thakur Jun 29 '21 at 18:57
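The failing cases from the comments have one thing in common: a new item can touch more than one existing group, while the loop stops at the first match. One possible fix is to fold every group the item overlaps into a single set. A minimal sketch (the name consolidate is hypothetical and not part of any answer above):

```python
def consolidate(pairs):
    """Merge overlapping items into disjoint sets (simple, not optimized)."""
    groups = []
    for item in pairs:
        merged = set(item)
        untouched = []
        for group in groups:
            if group & merged:
                # The item bridges this group: absorb it.
                merged |= group
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


print(consolidate([(1, 2), (3, 4), (2, 3)]))
```

This keeps the groups disjoint after every step, so the result does not depend on the order of the input. It still compares each item against every existing group, so for large inputs a disjoint-set structure is likely to be faster.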