
I have an iterator of ints coming from a table lookup, and I need to check whether their multiset is contained in a given fixed "multiset" ms. Currently, I sort ms once at the beginning, then sort the ints from my iterator and check multiset containment (on the sorted lists) as follows:

vals = sorted(my_vals)
for it in ... :
    test_vals = sorted(it)
    if is_sublist(test_vals,vals):
        # do stuff

where

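def is_sublist(L1, L2):
    """Return True if sorted list L1 is a sub-multiset of sorted list L2."""
    m = len(L1)
    n = len(L2)
    i = j = 0
    while j <= n:
        if i == m:
            # every element of L1 has been matched
            return True
        elif j == n:
            # L2 is exhausted but L1 still has unmatched elements
            return False
        a, b = L1[i], L2[j]
        if a == b:
            # match: consume one element from each list
            i += 1
            j += 1
        elif a > b:
            # b is not needed for L1; skip it
            j += 1
        else:
            # a < b: a cannot occur later in the sorted L2
            return False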
  • Usually, my lists are rather short (1 to 20 elements).
  • I tried to use Counter, but the time lost initializing it outweighs the time gained in the containment test.
  • I do this ~10^6 times, so maybe I should do it in Cython.
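For reference, the Counter-based test mentioned in the list above might look like the following minimal sketch. The helper name `counter_contains` and the sample data are hypothetical, not the code that was actually timed. The point is that a new `Counter` has to be built for every candidate, which is where the initialization cost comes from.

```python
from collections import Counter

def counter_contains(test_vals, ms_counter):
    """Return True if the multiset of test_vals is contained in ms_counter."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means no value occurs more often in test_vals than in ms_counter.
    return not (Counter(test_vals) - ms_counter)

# Hypothetical sample data: the fixed multiset is counted only once.
ms_counter = Counter([1, 2, 2, 5])

print(counter_contains([2, 2], ms_counter))     # True
print(counter_contains([2, 2, 2], ms_counter))  # False
```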

Any ideas or pointers would be appreciated. Thanks! (Sorry for clicking the post button too early the first time.)

Saullo G. P. Castro
Christian

1 Answer

# edit: second attempt in response to Bakuriu's comment
#
from collections import Counter
from itertools import groupby
multiset = Counter(sorted(vals)) # only create one Counter object
for it in ...:
    grouped = groupby(sorted(it))
    if all(len(list(g)) <= multiset[k] for k, g in grouped):
        # do stuff
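A self-contained sketch of the same idea, wrapped in a function so it can be tried on small inputs. The name `contained_in` and the sample lists are made up for illustration; run lengths are counted with `sum` rather than `len(list(g))`, which is only a stylistic variation.

```python
from collections import Counter
from itertools import groupby

def contained_in(it, multiset):
    """Check that every value in `it` occurs at most multiset[value] times."""
    # groupby on sorted input yields one (value, run) pair per distinct value;
    # a Counter returns 0 for missing keys, so unknown values fail the test.
    return all(sum(1 for _ in run) <= multiset[value]
               for value, run in groupby(sorted(it)))

# Hypothetical sample data
multiset = Counter([1, 2, 2, 5])
candidates = [[2, 2], [2, 5, 1], [2, 2, 2], [3]]

print([c for c in candidates if contained_in(c, multiset)])
# prints [[2, 2], [2, 5, 1]]
```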



from operator import eq
from itertools import starmap
try:
    # Python 2 only: use the lazy izip in place of zip
    from itertools import izip as zip
except ImportError:
    # Python 3: the built-in zip is already lazy
    pass

vals = sorted(my_vals)
for it in ...:
    test_vals = sorted(it)
    zipped = zip(test_vals, vals)
    # first attempt: compare the two sorted lists position by
    # position and bail out as soon as non-matching values are
    # found (not a full multiset containment test, see comments).
    if all(starmap(eq, zipped)):
        # do stuff
superjump
  • This is **not** equivalent to the OP's code. In `is_sublist`, the code may compare the same element from the first argument to multiple elements in the second argument, which is something you *cannot* emulate with a simple `map` without messing with the iterators. – Bakuriu Jul 14 '14 at 12:31
  • Once you do `sorted(it)`, I doubt there is anything faster than my `is_sublist`. Also, your `zip` discards necessary information from `vals`: e.g., if `test_vals` is `[2,2]` and `vals` is `[1,2,2]`, I have a multiset containment, but `zip([2,2],[1,2,2])` cannot determine that. – Christian Jul 14 '14 at 14:49
  • @ChristianStump I was able to compile the `is_sublist` function with numba's `@autojit` decorator, which would save you the bother of reaching for Cython. Might be worth a shot if you have access to that package. I'd also be minded to make `vals` a local variable of the function rather than a parameter, as this would also give a speed boost. – superjump Jul 14 '14 at 15:03
  • @superjump, I wasn't able to get numba running, but will try again later, thanks for that idea! – Christian Jul 15 '14 at 08:53
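The counterexample from the comments can be checked directly. The sketch below is only an illustration: its `is_sublist` is a compact rewrite of the merge-style scan from the question, not the original function, and it is compared against the position-by-position `zip` test on `[2,2]` versus `[1,2,2]`.

```python
def is_sublist(small, big):
    """Merge-style scan: is sorted `small` a sub-multiset of sorted `big`?"""
    j, n = 0, len(big)
    for a in small:
        # skip entries of big that are smaller than a
        while j < n and big[j] < a:
            j += 1
        if j == n or big[j] != a:
            return False
        j += 1  # consume the matched entry
    return True

test_vals, vals = [2, 2], [1, 2, 2]

# Position-by-position comparison: the pairs are (2, 1) and (2, 2).
pairwise = all(a == b for a, b in zip(test_vals, vals))

print(pairwise)                     # False
print(is_sublist(test_vals, vals))  # True
```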