2

When working on an AoC puzzle, I found I wanted to subtract lists (preserving ordering):

def bag_sub(list_big, sublist):
    result = list_big[:]
    for n in sublist:
        result.remove(n)
    return result
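
For example, duplicates are removed one-for-one and the original ordering is preserved:

bag_sub([1, 2, 2, 3], [2, 3])  # -> [1, 2]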

I didn't like the way the list.remove call (which is itself O(n)) is contained within the loop; that seems needlessly inefficient. So I tried to rewrite it to avoid that:

from collections import Counter

def bag_sub(list_big, sublist):
    c = Counter(sublist)
    result = []
    for k in list_big:
        if k in c:
            c -= Counter({k: 1})
        else:
            result.append(k)
    return result

  1. Is this now O(n), or does the Counter.__isub__ usage still screw things up?

  2. This approach requires that elements must be hashable, a restriction which the original didn't have. Is there an O(n) solution which avoids creating this additional restriction? Does Python have any better "bag" datatype than collections.Counter?

You can assume sublist is half the length of list_big.

wim
  • Do these lists have any particular order to them? You can do this in O(n) deterministic time if they're both sorted. – user2357112 Jan 06 '17 at 22:00
  • I'm not sure what you're doing with Counter there. You could get the same result more clearly by converting sublist to a set and just checking for membership. – Daniel Roseman Jan 06 '17 at 22:02
  • @DanielRoseman -- I think that the Counter is handling duplicates (`bag_sub([foo, foo], [foo]) -> [foo]`) – mgilson Jan 06 '17 at 22:03
  • @user2357112 No ordering. I know how to do it in O(n log n) by sorting first, and walking a pair of "pointers" down the lists. – wim Jan 06 '17 at 22:10

3 Answers

3

I'd use a Counter, but I'd probably do it slightly differently, decrementing the counts in place as I iterate:

from collections import Counter

def bag_sub(big_list, sublist):
    sublist_counts = Counter(sublist)
    result = []
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1  # consume one occurrence instead of keeping it
        else:
            result.append(item)
    return result

This is very similar to your solution, but it's probably not efficient to create an entire new counter every time you want to decrement the count on something.[1]

Also, if you don't need to return a list, then consider a generator function...
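
For instance, a minimal sketch of that variant (same counting logic, just yielding the surviving items lazily; the name bag_sub_gen is only for illustration):

from collections import Counter

def bag_sub_gen(big_list, sublist):
    """Lazily yield the items of big_list that survive bag subtraction."""
    sublist_counts = Counter(sublist)
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1  # consume one occurrence of item
        else:
            yield item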

This works as long as all of the elements in list_big and sublist can be hashed. This solution is O(N + M) where N and M are the lengths of list_big and sublist respectively.

If the elements cannot be hashed, you are out of luck unless you have other constraints (e.g. the inputs are sorted using the same criterion). If your inputs are sorted, you could do something similar to the merge stage of merge-sort to determine which elements from list_big are in sublist.

[1] Note that Counter also behaves a lot like a defaultdict(int), so it's perfectly fine to look up an item in a counter that isn't there already; the lookup just returns 0.
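
A quick demonstration of that:

from collections import Counter

c = Counter()
print(c['missing'])    # 0 -- a missing key reads as zero, no KeyError
print('missing' in c)  # False -- the lookup does not insert the key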

mgilson
  • This doesn't handle duplicates the way the original code does. – user2357112 Jan 06 '17 at 22:02
  • @user2357112 -- Ahh, I see now. I didn't understand what OP was going for. I've fixed up my solution. – mgilson Jan 06 '17 at 22:10
  • Better, but you've missed a few nuances of the original code. `c -= Counter({k: 1})` actually operates in-place (on 3.3 and up), and it discards key `k` if the count hits 0, in contrast to something like `c[k] -= 1`, which would preserve the key. – user2357112 Jan 06 '17 at 22:13
  • @wim -- Can you be more specific? You think that `c -= Counter({k: 1})` is more efficient than `c[k] -= 1`? – mgilson Jan 06 '17 at 22:14
  • @mgilson Oh, kevin [already explained](http://stackoverflow.com/questions/41514987/on-list-subtraction/41515152#comment70236406_41515045) what I was talking about. – wim Jan 06 '17 at 22:15
  • @user2357112 -- Huh ... I didn't realize that Counter would drop the key when it dropped to `0`. I always assumed it would decrement to `0`... Interesting. – mgilson Jan 06 '17 at 22:16
  • Yeah. That's why they call it the bag or multiset in the docs. But they should have implemented <, <=, >, >= for a proper multiset. – wim Jan 06 '17 at 22:18
  • I actually ended up with something pretty similar to yours anyway --> https://github.com/wimglenn/advent-of-code/blob/0402097678c46ec522dc736091acb4ae2b973f2f/aoc2015/q24.py#L32-L39 – wim Jan 06 '17 at 22:19
  • A collections.Counter can have explicit 0 values - they're not always dropped - but any nonpositive values will be dropped from the result when you use the multiset operators. – user2357112 Jan 06 '17 at 22:19
2

Is this now O(n), or does the Counter.__isub__ usage still screw things up?

This would be expected-case O(n), except that when Counter.__isub__ discards nonpositive values, it goes through every key to do so. You're better off just subtracting 1 from the count the "usual" way and checking c[k] instead of k in c. (c[k] is 0 for k not in c, so you don't need an in check.)

if c[k]:
    c[k] -= 1
else:
    result.append(k)

Is there an O(n) solution which avoids creating this additional restriction?

Only if the inputs are sorted, in which case a standard variant of a mergesort merge can do it.
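
For illustration, a sketch of that merge-style walk, assuming both lists are pre-sorted under the same ordering and sublist is a sub-bag of list_big (the helper name bag_sub_sorted is mine):

def bag_sub_sorted(list_big, sublist):
    """O(n) bag subtraction for two sorted lists; needs ordering, not hashing."""
    result = []
    i = 0  # next unconsumed position in sublist
    for item in list_big:
        if i < len(sublist) and sublist[i] == item:
            i += 1  # consume one matching occurrence instead of keeping item
        else:
            result.append(item)
    return result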

Does Python have any better "bag" datatype than collections.Counter?

collections.Counter is Python's bag.
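
A quick illustration of the multiset behaviour discussed in the comments above (the multiset operators drop nonpositive counts):

from collections import Counter

c = Counter({'a': 2, 'b': 1})
c -= Counter({'a': 1, 'b': 1})  # in-place multiset subtraction
print(c)                        # Counter({'a': 1}) -- 'b' dropped, not kept at 0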

user2357112

-1

  1. Removing an item from a list of length N is O(N) if the list is unordered, because you have to find it.
  2. Removing k items from a list of length N, therefore, is O(kN) if we focus on "reasonable" cases where k << N.

So I don't see how you could get it down to O(N).

A concise way to write this:

new_list = [x for x in list_big if x not in sublist]

But that's still O(kN).

cadolphs