
I have a list of lists, as in the code I attached. I want to link the sublists if they share any common values, and then replace the list of lists with a condensed version. Examples: if I have [[1,2,3],[3,4]] I want [1,2,3,4]. If I have [[4,3],[1,2,3]] I want [4,3,1,2]. If I have [[1,2,3],[a,b],[3,4],[b,c]] I want [[1,2,3,4],[a,b,c]] or [[a,b,c],[1,2,3,4]]; I don't care which.

I am almost there...

My problem is a case like [[1,2,3],[10,5],[3,8,5]]: I want [1,2,3,10,5,8], but with my current code I get [1,2,3,8,10,5].

Here is my code:

import itertools

a = [1,2,3]
b = [3,4]
i = [21,22]
c = [88,7,8]
e = [5,4]
d = [3, 50]
f = [8,9]
g = [9,10]
h = [20,21]

lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
# I have a lot of lists, but not very many different lists

def any_overlap(a, b):
    sb = set(b)
    return any(itertools.imap(sb.__contains__, a))

def find_uniq(lst):
    '''Return the unique parts of lst, preserving order.'''
    seen = set()
    seen_add = seen.add
    return [x for x in lst if x not in seen and not seen_add(x)]

def overlap_inlist(o_lst, lstoflst):
    '''
    Search for overlap, using any_overlap(), of a list (o_lst) in a list of lists (lstoflst).
    If there is overlap, add the unique part of the found list to the search list, and keep
    track of where that list was found.
    '''
    used_lst = []
    n_lst = []
    for lst_num, each_lst in enumerate(lstoflst):
        if any_overlap(o_lst, each_lst):
            n_lst.extend(each_lst)
            used_lst.append(lst_num)
    n_lst = find_uniq(n_lst)
    return n_lst, used_lst

def comb_list(lst):
    '''
    For each list in a list of lists, find all the overlaps using overlap_inlist().
    Update the list each time to delete the found lists. Return the final combined list.
    '''
    for updated_lst in lst:
        n_lst, used_lst = overlap_inlist(updated_lst, lst)
        lst[:] = [x for i, x in enumerate(lst) if i not in used_lst]
        lst.insert(0, n_lst)
    return lst

comb_lst = comb_list(lst)
print comb_lst

The output from this script is:

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 50, 5], [21, 22, 20]]

I want the keys to be in the original order, like:

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 5, 50], [21, 22, 20]]

The 5 and 50 are swapped in the second sublist.

I am somewhat new to Python, so I would appreciate any solutions to the problem or comments on my current code. I am not a computer scientist; I imagine there may be some kind of algorithm that does this quickly (maybe from set theory?). If there is such an algorithm, please point me in the right direction.

I may be making this way more complicated than it is... Thank you!!

Keith

7 Answers


Here's a brute-force approach (it might be easier to understand):

from itertools import chain

def condense(*lists):
    # remember original positions
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        if item not in positions:
            positions[item] = pos

    # condense disregarding order
    sets = condense_sets(map(set, lists))

    # restore order
    result = [sorted(s, key=positions.get) for s in sets]
    return result if len(result) != 1 else result[0]

def condense_sets(sets):
    result = []
    for candidate in sets:
        for current in result:
            if candidate & current:   # found overlap
                current |= candidate  # combine (merge sets)

                # new items from candidate may create an overlap
                # between current set and the remaining result sets
                result = condense_sets(result) # merge such sets
                break
        else:  # no common elements found (or result is empty)
            result.append(candidate)
    return result

Example

>>> condense([1,2,3], [10,5], [3,8,5])
[1, 2, 3, 10, 5, 8]
>>> a = [1,2,3]
>>> b = [3,4]
>>> i = [21,22]
>>> c = [88,7,8]
>>> e = [5,4]
>>> d = [3, 50]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
>>> condense([1], [2, 3, 2])
[[1], [2, 3]]

Performance comparison

The condense_*() functions are from the answers to this question. lst_OP is the input list from the question (at various sizes); lst_BK is the test list from @Blckknght's answer (at various sizes). See the source for details, including the lst_rnd input used in the third table.

Measurements show that solutions based on the "disjoint sets" and "connected components of an undirected graph" concepts perform similarly on both types of input.

name                       time ratio comment
condense_hynekcer     5.79 msec  1.00 lst_OP
condense_hynekcer2     7.4 msec  1.28 lst_OP
condense_pillmuncher2 11.5 msec  1.99 lst_OP
condense_blckknght    16.8 msec  2.91 lst_OP
condense_jfs            26 msec  4.49 lst_OP
condense_BK           30.5 msec  5.28 lst_OP
condense_blckknght2   30.9 msec  5.34 lst_OP
condense_blckknght3   32.5 msec  5.62 lst_OP


name                       time  ratio comment
condense_blckknght     964 usec   1.00 lst_BK
condense_blckknght2   1.41 msec   1.47 lst_BK
condense_blckknght3   1.46 msec   1.51 lst_BK
condense_hynekcer2    1.95 msec   2.03 lst_BK
condense_pillmuncher2  2.1 msec   2.18 lst_BK
condense_hynekcer     2.12 msec   2.20 lst_BK
condense_BK           3.39 msec   3.52 lst_BK
condense_jfs           544 msec 563.66 lst_BK


name                       time ratio comment
condense_hynekcer     6.86 msec  1.00 lst_rnd
condense_jfs          16.8 msec  2.44 lst_rnd
condense_blckknght    28.6 msec  4.16 lst_rnd
condense_blckknght2   56.1 msec  8.18 lst_rnd
condense_blckknght3   56.3 msec  8.21 lst_rnd
condense_BK           70.2 msec 10.23 lst_rnd
condense_pillmuncher2  324 msec 47.22 lst_rnd
condense_hynekcer2     334 msec 48.61 lst_rnd

To reproduce the results, clone the gist and run measure-performance-condense-lists.py

jfs
  • This should be the accepted answer as it is a much cleaner method. Using the positions dict to record the initial combined location for the final sorted position, along with set combining, is the way to go. +1. – sberry Dec 05 '12 at 03:47
  • @J.F. Sebastian I've studied your solution a lot and I still don't understand how it works. It's an incredible algorithm. Moreover, it is 2 to 3 times faster than my solution. I upvote!! – eyquem Dec 06 '12 at 00:38
  • @eyquem: It is bad if it is not clear. Simplicity is the only good thing about it. It is the only algorithm where I'm confident, just by looking at it, that it produces the expected results in all cases (eventually). It is slow because it does unnecessary work. For a good algorithm, look at [@Blckknght's answer](http://stackoverflow.com/a/13716804/4279). – jfs Dec 06 '12 at 01:48
  • Easy to follow and elegant; however, recursion may be problematic on large input. I guess the stack worst case is O(N) for N sublists? == O(sqrt(E)) for E total elements/vertices? – Dima Tisnek Dec 20 '12 at 10:01

Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then it joins them together, eliminating duplicates.

from collections import OrderedDict
def disjoint_set_find(djs, node): # disjoint set find, with path compression
    if node not in djs: # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)   # find root of first set
    second = disjoint_set_find(djs, second) # and of second set
    if first < second: # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict() # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist in enumerate(master_list):
        for v in sublist:
            if v not in values: # no overlap, so just store value
                values[v] = i
            else: # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = [] # results
    output_map = {} # map from original indexes to output indexes
    for v, i in values.items(): # iterate over values in insertion order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            # the first value seen for a component always comes from its
            # root (smallest-index) sublist, so i == root here
            output_map[root] = len(output)
            output.append([]) # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output

Sample output:

>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]

An explanation of the algorithm:

By request, here's an explanation for how my code works.

The first two functions implement the find and union operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root of a set. The find operation returns the root node of the set containing a given node. To help performance a bit, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union operation combines the sets containing its arguments first and second.

The main condense function has two parts. First, it sets up a couple of data structures, then it uses them to build the output.

values is an OrderedDict that maps each value to the index of the first sublist containing it. The order in which the values are added is used to produce the output in the correct order.

overlaps is the dictionary used for the disjoint set. Its nodes are the integer indexes of overlapping sublists.

The first pair of loops fills up those two data structures. It loops over the sublists and their contents. If a value has not been seen before, it is added to the values dictionary. If it has been seen, the current sublist overlaps a previous sublist containing that value.

To resolve the overlap, the code does a union of the disjoint sets that contain the two sublists.

The output is built in the output list. Because there are likely to be fewer output sublists than there were in the input, we need an additional dictionary to map from the old indexes (from the input) to the new indexes that apply to the output list.

To fill up the output list, we iterate over the values, which happens in the order they were added thanks to using the OrderedDict class. Using the disjoint set, it can combine the overlapping lists correctly.

This algorithm has very good performance when there are a lot of lists to be processed that don't overlap immediately. For instance, this set of 200 three-element lists ultimately all overlaps, but you only start seeing the overlaps appear after the first 100 have been processed:

lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
Blckknght
  • this preserves duplicates inside a list. Could you add a couple of comments to explain how the code works? – jfs Dec 05 '12 at 21:56
  • algorithmically it is a good code but it deserves some explanation – jfs Dec 05 '12 at 22:05
  • @J.F.Sebastian Hmm, I'll try. I cut out a few empty lines to keep it from getting a scrollbar, so I'm not sure if I can add too much inline with the code without undoing that. – Blckknght Dec 05 '12 at 22:08
  • e.g., you could extract the while loops into a separate function (code block) or just describe it separately from the main code; it would give a place to explain the while loop in detail and it might make the main code both shorter and more clear. – jfs Dec 05 '12 at 22:24
  • Good idea. I've refactored the disjoint set `find` operation which is used several times. Since it's now a function, I made it use recursion rather than a while loop, which allows it to do "path compression" which can improve performance. I added a bunch of comments and a longer text description of the algorithm at the end. – Blckknght Dec 06 '12 at 01:07
  • Thank you for the explanation. It is more clear what is a general union-find algorithm and what is specific to the current question. [I've modified the output loop to include only unique elements.](https://gist.github.com/a9743791863c8227e17c) – jfs Dec 06 '12 at 02:37
  • @J.F.Sebastian I just made another update. I factored out the union operation too, since it's logically separate from the rest. Then, inspired by @eyquem's answer, I found a way to use an `OrderedDict` to control the order of the output and avoid duplicates in the output all in one go. This algorithm is very fast for some kinds of input where the sublists don't overlap much (at first). I posted an example list that it handles much faster than your brute-force algorithm (about 16x). My algorithm's a little slower on the questioner's example list, but not by much (6% or so). – Blckknght Dec 06 '12 at 03:00
  • @J.F.Sebastian I found the exception "RuntimeError: maximum recursion depth exceeded" for the testcase `[[i] for i in range(1000)] + [[i, i - 1] for i in range(999, 0, -1)] + [[999]]`. This is because `sys.getrecursionlimit()` is usually 1000. – hynekcer Dec 20 '12 at 13:51
  • The exception was for the current version here, the one marked blckknght2 by J.F.Sebastian (and also for blckknght3). The original version blckknght is very slow (9 seconds) for the data `n = 5000` `[[i] for i in range(n)] + [[i, i - 1] for i in range(n - 1, 0, -1)] + [[n - 1, n - 2]] * n` because it is O(n^2). It gives incorrect output for `[[1, 1]]`. – hynekcer Dec 20 '12 at 15:31
  • @hynekcer: disjoint_set_find() is written for readability. It is a tail-recursive function with memoization. [It is easy to convert it to its iterative analog](https://gist.github.com/612a75a543fe6115cacc). If I'm not mistaken, the recursion is limited by `len(master_list)` in the worst case. – jfs Dec 20 '12 at 22:15
  • @J.F.Sebastian: Thanks for the fix. I ignored testing of blckknght* for a long time due to the possible high recursion level. There is also a very slow testcase for all blckknght*: `[[i] for i in range(n)] + [[i, i - 1] for i in range(n, 0, -1)] + [[i] for i in range(n, 0, -1)]`. It is 100 times slower than pillmuncher2 for `n = 10000` because blckknght's complexity is O(n^2) in the worst case. However, I understand that Blckknght's and your contribution was fundamental for the progress of this question. – hynekcer Dec 21 '12 at 11:34
  • @J.F.Sebastian: The simplified iterative version loses a bit of the benefit of the path compression, since it only saves the value found on the bottommost level, rather than on all levels of the recursion. It isn't quite tail-recursive, I guess. Also, if you're interested in the disjoint set data structure (AKA `UnionFind`), I just wrote up a more complete version in another answer [here](http://stackoverflow.com/a/14000584/1405065). – Blckknght Dec 22 '12 at 06:33

I am sure there is a cleaner way to do this, but I started down a certain path and did what I had to in order to make it work without any refactoring.

lookup = {}
out = []
index = 0
for grp in lst:
    keys = [lookup.get(num, None) for num in grp]
    keys = [key for key in keys if key is not None]
    if len(keys):
        if len(set(keys)) != 1:
            for num in grp:
                out[keys[0]].append(num)
            seen = set()
            keys = [key for key in keys if key not in seen and not seen.add(key)]
            for key in keys[1:]:
                out[keys[0]].extend(out[key])
                del out[key]
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            for num in grp:
                lookup[num] = keys[0]
            out[keys[0]].extend(grp)
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
    else:
        out.append(grp)
        for num in grp:
            lookup[num] = index
        index += 1

print out

Thanks to @Steven for the list reduction technique with the set.

OUTPUT

[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
sberry
  • This solution is also much, much faster than what I was doing. For a list of 50,000 lists (only 10 different kinds) your code takes 0.558 s; mine took 30.6 s (and did not give what I wanted :)). Thank you very much for your help. I clearly have some learning to do. – Keith Dec 05 '12 at 02:48
  • @keithchem: if `lst == [[1,2,3], [10,5], [3,8,5]]` then @sberry's code produces `out == [[1, 2, 3, 8, 5, 10], [10, 5]]` but it should be `[1, 2, 3, 8, 5, 10]` (or `[[1, 2, 3, 8, 5, 10]]`). – jfs Dec 05 '12 at 03:20
  • @J.F.Sebastian, fixed... I think. – sberry Dec 05 '12 at 03:42
  • @keithchem: J.F.Sebastian has provided a much cleaner answer. Read through it piece by piece (maybe add some print statements along the way) to see exactly what it does. It is a good solution and should be the accepted answer. – sberry Dec 05 '12 at 03:50
  • @sberry: It is better. But I've provided incorrect output in the previous comment. It should be [`[1, 2, 3, 10, 5, 8]`, not `[1, 2, 3, 8, 5, 10]`](http://ideone.com/QsyS5E) (your code) – jfs Dec 05 '12 at 03:52

Your problem is essentially a graph-theoretic one: the problem of finding connected components, with an added requirement regarding the order of the elements of each component.

In your program, the set of all lists forms an undirected graph, where each list is a node in the graph. We say two lists are connected directly if they have common elements, and connected indirectly, if there exists a third list to which both are connected, either directly or indirectly. Given e.g. three lists [a, b], [b, c] and [c, d], then [a, b] and [b, c] are connected directly, as well as [b, c] and [c, d], but [a, b] and [c, d] are connected indirectly, since while they don't share common elements, they both share elements with the same list [b, c].

A group of nodes is a connected component if all nodes in the group are connected (directly or indirectly) and no other node in the graph is connected to any node in the group.

There is a fairly simple linear time algorithm that generates all connected components in an undirected graph. Using that, we can define a function that generates all lists of condensed disjoint lists, while keeping the order of their elements:

from itertools import imap, combinations_with_replacement
from collections import defaultdict

def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield component(node)

def condense(lists):
    tuples = combinations_with_replacement(enumerate(imap(tuple, lists)), 2)
    overlapping = ((a, b) for a, b in tuples
                          if not set(a[1]).isdisjoint(b[1]))
    seen = set()
    see = seen.add
    for component in connected_components(overlapping):
        yield [item for each in sorted(component)
                    for item in each[1]
                    if not (item in seen or see(item))]

print list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]]))
print list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))

Result:

[[1, 2, 3, 10, 5, 8], [9]]
[[5, 6, 7], [1, 2, 3, 4]]

Time complexity of condense() is quadratic, since every list must be tested against every other list to find out if they have common elements. Therefore, the performance is awful. Can we improve it somehow? Yes.

Two lists are connected directly if they have common elements. We can turn this relationship around: two elements are connected directly if they belong to the same list, and connected indirectly if there exists a third element that is connected (directly or indirectly) to both of them. Given e.g. two lists [a, b] and [b, c], then a and b are connected directly, as well as b and c, and therefore a and c are connected indirectly. If we also adapt J.F. Sebastian's idea of storing the position of each element's first occurrence, we can implement it like so:

def condense(lists):
    neighbors = defaultdict(set)
    positions = {}
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node), key=positions.get)

It still uses the connected components algorithm, but this time we view elements as connected, not lists. The results are the same as before, but since time complexity is now linear, it runs much faster.

pillmuncher

I tried to write a fast and readable solution. It is never much slower than the other solutions, as far as I know, but it can sometimes be much faster, because it is additionally optimized for longer sublists and for many sublists that are a subset of an already existing group. (This is motivated by the text of the question: "I have a lot of lists but not very many different lists.") The code uses memory only for the condensed data, which can be much smaller than the original data. It can work e.g. with a generator collecting data from a real-time process. The estimate of complexity is O(n log n). I think that no algorithm that uses sorting can have linear complexity.

def condense(lists):
    groups = {}         # items divided into groups {id(the_set): the_set,...}
    members = {}        # mapping from item to group
    positions = {}      # mapping from item to sequential ordering
    iposition = 0       # counter for positions
    for sublist in lists:
        if not sublist or members.get(sublist[0], set()).issuperset(sublist):
            continue    # speed-up condition if all is from one group
        common = set([x for x in sublist if x in members])
        if common:      # any common group can be a destination for merge
            dst_group = members[common.pop()]
            common = common - dst_group   # are more groups common for sublist?
            while common:
                grp = members[common.pop()]
                if len(grp) > len(dst_group):   # merge shorter into longer grp
                    grp, dst_group = dst_group, grp
                dst_group.update(grp)
                for item in grp:
                    members[item] = dst_group
                del groups[id(grp)]
                common = common - dst_group
        else:           # or create a new group if nothing common
            dst_group = set()
            groups[id(dst_group)] = dst_group
        newitems = []
        for item in sublist:    # merge also new items
            if item not in positions:
                positions[item] = iposition
                iposition += 1
                newitems.append(item)
                members[item] = dst_group
        dst_group.update(newitems)
    return [sorted(x, key=positions.get) for x in groups.values()]

It is faster than pillmuncher2 for sublists longer than approximately 8 items, because it can work on more items together. It is also very fast for lists with many similar sublists or with many sublists that are a subset of an already existing group. It is 25% faster than pillmuncher2 for lst_OP, but 15% slower for lst_BK.

An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].

I intentionally wrote "common = common - dst_group" instead of using the -= operator or set.difference_update, because an in-place update is not efficient if the set on the right side is much bigger than the one on the left side.


Here is a modified version of pillmuncher's solution, for easier readability. The modification is a little slower than the original due to replacing a generator with list.append. It is probably the most readable fast solution.

# Modified pillmuncher's solution
from collections import defaultdict

def condense(lists):
    neighbors = defaultdict(set)  # mapping from items to sublists
    positions = {}                # mapping from items to sequential ordering
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    see = seen.add
    for node in neighbors:
        if node not in seen:
            unseen = set([node])      # this is a "todo" set
            next_unseen = unseen.pop  # method alias, not called now
            group = []                # collects the output
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                group.append(node)
            yield sorted(group, key=positions.get)
hynekcer
  • I've added your solution to [the comparison](http://stackoverflow.com/a/13715626/4279) – jfs Dec 20 '12 at 02:20
class List(list): pass

rv = dict()

def condense_step():
    """find and merge one overlapping pair in rv"""
    for i, iv in rv.items():
        for j, jv in rv.items():
            if i != j and i.intersection(j):
                m = i.union(j)
                del rv[i]
                del rv[j]
                rv.setdefault(m, [])
                rv[m] += iv
                rv[m] += jv
                return True

def unique(l):
    """flatten list-of-lists excluding duplicates"""
    seen = set()
    for i in sum(l, []):
        if i not in seen:
            seen.add(i)
            yield i

def condense(inp):
    rv.clear()
    inp = map(List, inp)

    for i in range(len(inp)):
        inp[i].order = i
        rv.setdefault(frozenset(inp[i]), [])
        rv[frozenset(inp[i])].append(inp[i])

    while condense_step():
        pass

    for v in rv.values():
        v.sort(key=lambda x: x.order)

    return [list(unique(i)) for i in rv.values()]
Dima Tisnek
  • Can you please write some text about your solution? What is its main advantage, or what were your preferences? I think you consider your work very good. – hynekcer Dec 19 '12 at 22:54
  • I started this as an exercise in Python / a self-explanatory algorithm; I guess it went a bit beyond that. Compared to other good solutions posted, mine is iterative; there is no recursion. It uses the pythonic frozenset as a dict key, it uses classes to tag the original order, and it can be converted to a streaming algorithm. Other than that, it follows the same idea as the others: save the order, condense as sets, restore the order. – Dima Tisnek Dec 20 '12 at 09:58
  • Yes, your algorithm is a nice, crystalline first incarnation of an idea. A disadvantage is the time complexity O(n^3). It takes 200 seconds for the testcase `[[i] for i in range(1000)] + [[i, -i] for i in range(1000)]`. – hynekcer Dec 20 '12 at 11:48

This solution uses only an OrderedDict.
deepcopy() is necessary if one wants the original list to remain unchanged.

from collections import OrderedDict
from copy import deepcopy

def treat(passed_list):
    L = deepcopy(passed_list)

    dic = OrderedDict()
    for subl in L:
        for x in subl:
            if x not in dic:
                dic[x] =  subl
    print 'dic at start'
    print '\n'.join('%-3s : %s' % (a,dic[a])
                    for a in dic) + '\n'

    for sublist in L:
        short = []
        short.extend(el for el in sublist
                     if el not in short)
        seen  = []
        for k,val in dic.iteritems():
            if val is sublist:
                break
            if k in short:
                if val not in seen:
                    seen.append(val)

        sumseen = []
        for elseen in seen:
            for y in elseen:
                sumseen.append(y)
                dic[y] = sumseen

        if seen:
            for el in sublist:
                if el not in sumseen:
                    sumseen.append(el)
                    dic[el] = sumseen

        sublist[:] = short

    cumul = []
    cumul.extend(lu for lu in dic.itervalues()
                 if lu not in cumul)
    return cumul


plus = [[1,2,3,2,1],[10,5,5,5,10],
        [8,5,3,3,5],[45,50,12,45,40,12]]

lst = [[1,2,3], [10,5], [3,8,5]]

for one_list in (plus,lst):
    print 'one_list before == %r\n' % one_list
    print 'treat(one_list) == %r\n' % treat(one_list)
    print 'one_list after  == %r\n' % one_list
    print '===================================='

result

one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

dic at start
1   : [1, 2, 3, 2, 1]
2   : [1, 2, 3, 2, 1]
3   : [1, 2, 3, 2, 1]
10  : [10, 5, 5, 5, 10]
5   : [10, 5, 5, 5, 10]
8   : [8, 5, 3, 3, 5]
45  : [45, 50, 12, 45, 40, 12]
50  : [45, 50, 12, 45, 40, 12]
12  : [45, 50, 12, 45, 40, 12]
40  : [45, 50, 12, 45, 40, 12]

treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]

one_list after  == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

====================================
one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]

dic at start
1   : [1, 2, 3]
2   : [1, 2, 3]
3   : [1, 2, 3]
10  : [10, 5]
5   : [10, 5]
8   : [3, 8, 5]

treat(one_list) == [[1, 2, 3, 10, 5, 8]]

one_list after  == [[1, 2, 3], [10, 5], [3, 8, 5]]

====================================

This solution has an inconvenience: it is 2 to 3 times slower than J.F. Sebastian's solution.

eyquem