Python - Removing overlapping lists

Question

Say I have a list of lists that has indexes [[start, end], [start1, end1], [start2, end2]].

Like for example :

[[0, 133], [78, 100], [25, 30]].

How would I check for overlap among the lists and remove the longer list each time? So:

>>> list = [[0, 133], [78, 100], [25, 30]]
>>> foo(list)
[[78, 100], [25, 30]]

This is what I tried to do so far:

def cleanup_list(list):
    i = 0
    c = 0
    x = list[:]
    end = len(x)
    while i < end-1:
        for n in range(x[i][0], x[i][1]):
            if n in range(x[i+1][0], x[i+1][1]):
                list.remove(max(x[i], x[i+1]))
        i +=1
    return list

But in addition to being kind of convoluted, it's not working properly:

>>> cleanup_list([[0,100],[9,10],[12,90]])
[[0, 100], [12, 90]]

Any help would be appreciated!

EDIT:

Other examples would be:

>>> a = [[0, 100], [4, 20], [30, 35], [30, 78]]
>>> foo(a)
[[4, 20], [30, 35]]

>>> b = [[30, 70], [25, 40]]
>>> foo(b)
[[25, 40]]

I'm basically trying to remove all of the longest lists that overlap with another list. In this case I don't have to worry about the lists being of equal length.

Thanks!!

`([[0,100],[9,10],[12,90]])` should go to `[[0,100]]` correct? — HennyH, May 01 '13 at 04:49
I fear this will be an ill-defined problem in case of three or more overlaps — wim, May 01 '13 at 04:50
I'm trying to actually remove [0, 100] and get [[9, 10], [12, 90]] — user2338068, May 01 '13 at 04:50
@user2338068: Can you show us some sample input / output combinations — Ayush, May 01 '13 at 04:55
What happens if there are multiple places where there are overlaps (also if they are the same length)? — Marc Laugharn, May 01 '13 at 04:55
None of the overlapping lists will be at the same length. If there are multiple overlaps I'm trying to just return the list with the shortest length. — user2338068, May 01 '13 at 05:07
To clarify: the result list contains no overlapping intervals and you want to keep as many intervals as possible ("longest overlapping intervals are removed") i.e., maximize `len(foo(a))` (many non-overlapping intervals in the result)? — jfs, May 01 '13 at 05:51

jfs · Accepted Answer · 2013-05-01T23:34:20.507

To remove a minimal number of intervals from the list such that the remaining intervals do not overlap, an O(n*log n) algorithm exists:

def maximize_nonoverlapping_count(intervals):
    # sort by the end-point, in descending order so that L.pop() yields the smallest
    L = sorted(intervals, key=lambda iv: (iv[1], iv[1] - iv[0]),
               reverse=True) # O(n*logn)
    iv = build_interval_tree(intervals) # O(n*log n)
    result = []
    while L: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(L.pop()) # O(1)
        # remove intervals that overlap with the popped interval
        overlapping_intervals = iv.pop(result[-1]) # O(log n + m)
        remove(overlapping_intervals, from_=L) 
    return result

It should produce the following results:

f = maximize_nonoverlapping_count
assert f([[0, 133], [78, 100], [25, 30]]) == [[25, 30], [78, 100]]
assert f([[0,100],[9,10],[12,90]]) == [[9,10], [12, 90]]
assert f([[0, 100], [4, 20], [30, 35], [30, 78]]) == [[4, 20], [30, 35]]
assert f([[30, 70], [25, 40]]) == [[25, 40]]

It requires a data structure that can find, in O(log n + m) time, all m intervals that overlap with a given interval, e.g., an IntervalTree. There are implementations that can be used from Python, e.g., quicksect.py; see Fast interval intersection methodologies for example usage.
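If intervals that merely touch, such as [0, 10] and [10, 20], are allowed to coexist (an assumption; the question does not say either way), the same results can be obtained without an interval tree by the classic sort-by-end-point greedy scan. This is a different technique from the tree-based outline above, shown only as a minimal sketch; the name `greedy_nonoverlapping` is made up for illustration.

```python
def greedy_nonoverlapping(intervals):
    """Keep a maximum-size set of mutually non-overlapping intervals.

    Assumptions: each interval is [start, end] with start < end, and two
    intervals that only touch (end == next start) do not overlap.
    """
    result = []
    last_end = None  # end-point of the most recently kept interval
    # Sort by end-point, breaking ties by length (shorter first).
    for start, end in sorted(intervals, key=lambda iv: (iv[1], iv[1] - iv[0])):
        # Keep the interval only if it starts at or after the last kept end.
        if last_end is None or start >= last_end:
            result.append([start, end])
            last_end = end
    return result


print(greedy_nonoverlapping([[0, 133], [78, 100], [25, 30]]))
# [[25, 30], [78, 100]]
```

Because the kept intervals are sorted by end-point, each candidate only needs to be compared with the last kept interval, so the scan after sorting is O(n).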

Here's a quicksect-based O(n**2) implementation of the above algorithm:

from functools import reduce # reduce() is a builtin in Python 2 only

from quicksect import IntervalNode

class Interval(object):
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.removed = False

def maximize_nonoverlapping_count(intervals):
    intervals = [Interval(start, end) for start, end in intervals]
    # sort by the end-point
    intervals.sort(key=lambda x: (x.end, (x.end - x.start)))   # O(n*log n)
    tree = build_interval_tree(intervals) # O(n*log n)
    result = []
    for smallest in intervals: # O(n) (without the loop body)
        # pop the interval with the smallest end-point, keep it in the result
        if smallest.removed:
            continue # skip removed nodes
        smallest.removed = True
        result.append([smallest.start, smallest.end]) # O(1)

        # remove (mark) intervals that overlap with the popped interval
        tree.intersect(smallest.start, smallest.end, # O(log n + m)
                       lambda x: setattr(x.other, 'removed', True))
    return result

def build_interval_tree(intervals):
    root = IntervalNode(intervals[0].start, intervals[0].end,
                        other=intervals[0])
    return reduce(lambda tree, x: tree.insert(x.start, x.end, other=x),
                  intervals[1:], root)

Note: the worst-case time complexity of this implementation is O(n**2) because intervals are only marked as removed and are never deleted from the tree. For example, imagine an input where len(result) == len(intervals) / 3 and len(intervals) / 2 intervals span the whole range. Then tree.intersect() is called n/3 times, and each call executes x.other.removed = True at least n/2 times, i.e., n*n/6 operations in total:

n = 6
intervals = [[0, 100], [0, 100], [0, 100], [0, 10], [10, 20], [15, 40]]
result = [[0, 10], [10, 20]]

Here's a banyan-based O(n log n) implementation:

from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan

def maximize_nonoverlapping_count(intervals):
    # sort by the end-point O(n log n)
    sorted_intervals = SortedSet(intervals,
                                 key=lambda iv: (iv[1], iv[1] - iv[0]))
    # build "interval" tree O(n log n)
    tree = SortedSet(intervals, updator=OverlappingIntervalsUpdator)
    result = []
    while sorted_intervals: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(sorted_intervals.pop()) # O(log n)

        # remove intervals that overlap with the popped interval
        overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
        tree -= overlapping_intervals # O(m log n)
        sorted_intervals -= overlapping_intervals # O(m log n)
    return result

Note: this implementation considers [0, 10] and [10, 20] intervals to be overlapping:

f = maximize_nonoverlapping_count
assert f([[0, 100], [0, 10], [11, 20], [15, 40]]) == [[0, 10], [11, 20]]
assert f([[0, 100], [0, 10], [10, 20], [15, 40]]) == [[0, 10], [15, 40]]
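Whether [0, 10] and [10, 20] overlap comes down to whether end-points are treated as inclusive. The two conventions can be sketched as follows; the helper names are hypothetical and are not part of banyan or quicksect.

```python
def overlaps_closed(a, b):
    """Closed intervals: [0, 10] and [10, 20] share the point 10."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a, b):
    """Half-open intervals [start, end): [0, 10) and [10, 20) are disjoint."""
    return a[0] < b[1] and b[0] < a[1]


print(overlaps_closed([0, 10], [10, 20]))     # True
print(overlaps_half_open([0, 10], [10, 20]))  # False
```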

sorted_intervals and tree can be merged:

from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan

def maximize_nonoverlapping_count(intervals):
    # build "interval" tree sorted by the end-point O(n log n)
    tree = SortedSet(intervals, key=lambda iv: (iv[1], iv[1] - iv[0]),
                     updator=OverlappingIntervalsUpdator)
    result = []
    while tree: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(tree.pop()) # O(log n)

        # remove intervals that overlap with the popped interval
        overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
        tree -= overlapping_intervals # O(m log n)
    return result

Thanks so much! I'm new to coding so some of this stuff is a tad confusing - but I'll read through your links. But to clarify, is build_interval_tree a function that I'll have to create that's similar to the code in quicksect.py? — user2338068, May 01 '13 at 08:50
@user2338068: I've added a quicksect-based implementation, to show how it could look like (you can run it if you download `quicksect.py` and put it in the same directory as the script). — jfs, May 01 '13 at 18:51
@user2338068: I've added `banyan`-based `O(n log n)` implementation (`banyan` is not pure Python (C++ is used to write C extension for Python) so it might make it harder to install). — jfs, May 01 '13 at 23:01
A lot of this is over my head right now, but I'll go through it as best as I can. Thanks man, I really appreciate it. — user2338068, May 02 '13 at 22:18

AlessandroEmm · Answer 2 · 2013-05-02T07:36:02.193

3

This may not be the fastest solution, but it is really verbose and clear, I think.

a = [[2,100], [4,10], [77,99], [38,39], [44,80], [69,70], [88, 90]]

# build ranges first
def expand(list):
    newList = []
    for r in list:
        newList.append(range(r[0], r[1] + 1))
    return newList


def compare(list):
    toBeDeleted = []
    for index1 in range(len(list)):
        for index2 in range(len(list)):
            if index1 == index2:
                # we don't want to compare an item with itself
                continue
            matches = [x for x in list[index1] if x in list[index2]]
            if len(matches) != 0: # do we have overlap?
                ## compare lengths and get rid of the longer one
                if   len(list[index1]) > len(list[index2]):
                    toBeDeleted.append(index1)
                    break
                elif len(list[index1]) < len(list[index2]):
                    toBeDeleted.append(index2)
    # distinct
    toBeDeleted = [ toBeDeleted[i] for i,x in enumerate(toBeDeleted) if x not in toBeDeleted[i+1:]] 
    print(len(list))
    # remove items
    for i in toBeDeleted[::-1]:
        del list[i] 
    return list


print(compare(expand(a)))

edited May 02 '13 at 07:36

answered May 01 '13 at 08:42


This solution might delete too much, i.e., it might return fewer non-overlapping intervals than it is possible to preserve. It is `O(n**4)` unnecessarily. It is not clear whether the `distinct` computation is correct; do you mean `toBeDeleted = set(toBeDeleted)` or something else? `list` is a builtin name, avoid using it as a variable. – jfs May 02 '13 at 00:53
You are right: it's not speedy (especially compared to yours) but it's more verbose and easier to maintain IMHO. toBeDeleted = set(toBeDeleted) does just what my for comprehension does, it's just more obvious if you're not aware of set(). Maybe I got the problem wrong, but I haven't noticed any non-overlapping removals. Thanks for your opinion! – AlessandroEmm May 02 '13 at 07:37
To be clear: 1. This solution doesn't work: [`[[0, 10], [9, 12], [11, 20]]` -> `[[9, 10, 11, 12]]`](http://ideone.com/y5PqER) instead of `[[0, 10], [11, 20]]` 2. If it were working, it would be *unnecessarily* slow, e.g., even for a small input such as `len(a) == 100`, this solution requires `~100**4` (`~100000000`) operations. It is easy to make it both `O(n**2)` (`~10000` operations) and more readable (though I find it debatable that my `O(n log n)` (`~500` operations) version (especially `banyan`-based solution that uses single `SortedSet`) is less readable than the above `O(n**4)` solution). – jfs May 03 '13 at 10:55

score 2 · Answer 3 · answered May 01 '13 at 06:08

I think one problem in your code is that it does not handle the situation where one list contains another. For example, [0,100] contains [9,10]. When you loop n over [0,100] and n enters [9,10], the condition if n in range(x[i+1][0], x[i+1][1]) is triggered. Then the builtin function max compares [0, 100] and [9, 10], and unfortunately max returns [9,10] because lists are compared element by element, starting with the first number. Thus you remove the wrong element.
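A short illustration of that comparison behaviour, and of one possible way to select the longer interval instead (the `length` helper is a hypothetical name, not something from the question):

```python
a, b = [0, 100], [9, 10]

# Lists compare element by element, so 9 > 0 decides the result
# before the second elements are even looked at.
print(max(a, b))  # [9, 10]


def length(interval):
    """Length of a [start, end] interval."""
    return interval[1] - interval[0]


# Passing a key function makes max() compare by length instead.
print(max(a, b, key=length))  # [0, 100]
```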

I'm trying another way to achieve the effect you want. Rather than manipulating the list itself, I create a new list and append each new element to it only if it meets our requirements.

def cleanup_list(lists):
    ranges = []
    for l in lists:
        to_insert = True
        for i in ranges[:]:  # iterate over a copy: ranges may be modified below
            r = range(i[0],i[1])
            # if l overlaps with i, but l does not contain i
            if l[0] in r or l[1] in r:
                if (l[1]-l[0]) < len(r):
                    ranges.remove(i)
                else:
                    to_insert = False
            # l contains i
            if l[0]<i[0] and l[1]>i[1]:
                to_insert = False
        if to_insert:
            ranges.append(l)
    return ranges

Maybe this problem can be addressed by heap. Since all lists do not overlap, they can be represented by a binary heap, with `start` and `end` as value. This should optimize the time complexity to log(n) — Herrington Darkholme, May 01 '13 at 06:39

score 1 · Answer 4 · answered May 01 '13 at 06:07

Sort all items by length in ascending order.
Add them to a segment tree one at a time, ignoring any item that overlaps an item already added.
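The two steps above can be sketched with a plain list standing in for the segment tree, which makes the overlap check O(n) per item rather than logarithmic. The function name is hypothetical, and touching intervals are assumed not to overlap.

```python
def keep_shortest_first(intervals):
    """Visit intervals from shortest to longest and keep each one that
    does not overlap anything already kept.

    Assumption: intervals that only touch (end == other start) do not overlap.
    """
    kept = []
    for start, end in sorted(intervals, key=lambda iv: iv[1] - iv[0]):
        # The candidate is disjoint from (s, e) if it ends before s or starts after e.
        if all(end <= s or e <= start for s, e in kept):
            kept.append([start, end])
    return kept


print(keep_shortest_first([[0, 100], [4, 20], [30, 35], [30, 78]]))
# [[30, 35], [4, 20]]
```

Note that shortest-first does not always keep as many intervals as possible: for [[0, 10], [9, 12], [11, 20]] this sketch keeps only [9, 12], although [0, 10] and [11, 20] could both be kept.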

richselian

So would this be an optimized way to go about it? I'll look into it - thanks! – user2338068 May 01 '13 at 06:39

score 1 · Answer 5 · answered May 01 '13 at 06:32

In general, two intervals are overlapping if:

min([upperBoundOfA, upperBoundOfB]) >= max([lowerBoundOfA, lowerBoundOfB])

If this is the case, the union of those intervals is:

(min([lowerBoundOfA, lowerBoundOfB]), max([upperBoundOfA, upperBoundOfB]))

Similarly, the intersection of those intervals will be:

(max([lowerBoundOfA, lowerBoundOfB]), min([upperBoundOfA, upperBoundOfB]))
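As a rough sketch, these three expressions can be wrapped in small functions. The names are illustrative only, intervals are taken to be (lower, upper) pairs with inclusive bounds, and each result is returned in (lower, upper) order.

```python
def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return min(a[1], b[1]) >= max(a[0], b[0])


def union(a, b):
    """Smallest interval covering both a and b (meaningful when they overlap)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def intersection(a, b):
    """Part common to a and b, as (lower, upper); assumes overlaps(a, b)."""
    return (max(a[0], b[0]), min(a[1], b[1]))


a, b = (25, 40), (30, 70)
print(overlaps(a, b))      # True
print(union(a, b))         # (25, 70)
print(intersection(a, b))  # (30, 40)
```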

Joel Cornett

I don't think finding the intersection or union helps in this particular case, but this is some very useful general advice I'll keep in mind. Thanks! – user2338068 May 01 '13 at 07:18
