'in' for two sorted lists with the lowest complexity

Question

I have two sorted lists, e.g.

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

I want to know for each item in a if it is in b. For the above example, I want to find

a_in_b = [True, True, False, False]

(or having the indices where a_in_b is True would be fine too).

Now, both a and b are very large, so complexity is an issue. Let M = len(a) and N = len(b). How can I do this with a complexity lower than O(M * N) by making use of the fact that both lists are sorted?

I'm not sure, but maybe `set.difference()` would be useful? And I think your complexity is `O(n*n)`. — Jonas Palačionis, Jan 19 '21 at 10:10
You can do it by iterating both in lockstep, but really the fastest is the usual "convert the second to a set". Either way is O(n+m). Is there a specific reason why you want to exploit that both are sorted? — MisterMiyagi, Jan 19 '21 at 10:21
Related: [How to create a binary list based on inclusion of list elements in another list](https://stackoverflow.com/q/16393681/7851470) — Georgy, Jan 28 '21 at 10:13

score 6 · Answer 1 · answered Jan 19 '21 at 10:19

You can iterate over your b list manually within a loop over a. You'll want to advance the b iteration when the latest value you've seen from it is less than the current value from a.

from math import inf

result = []
b_iter = iter(b)                           # create an iterator over b
b_val = -inf
for a_val in a:
    while b_val < a_val:
        b_val = next(b_iter, inf)          # manually iterate on it
    result.append(a_val == b_val)

This should have a running time of O(M+N), since each list item gets iterated over at most once (b may not even need to be fully iterated).

You could avoid using floating point infinities if you want to, but you'd need to do a bit of extra work to handle some edge cases (e.g. if b is empty).
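A minimal sketch of one way to drop the infinities, assuming an index into b is acceptable in place of the iterator. The function name `sorted_membership` is a hypothetical choice for illustration and not part of the code above. An empty b needs no special case here, because the bounds check fails immediately.

```python
def sorted_membership(a, b):
    """For each item of sorted a, report whether it occurs in sorted b."""
    result = []
    j = 0  # index of the first item of b not yet known to be too small
    for a_val in a:
        # advance through b; j never moves backwards, so b is scanned at most once
        while j < len(b) and b[j] < a_val:
            j += 1
        result.append(j < len(b) and b[j] == a_val)
    return result


print(sorted_membership([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))  # [True, True, False, False]
```

The running time stays O(M+N), since `j` only ever increases.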

Thanks!! Do I understand correctly that, depending on the properties of `b`, it might be good to use a binary search (using `bisect`) instead of the manual iteration? — Tom de Geus, Jan 19 '21 at 10:36
No, binary search is slower (m * log(n)). This algorithm is similar to the merging of two sorted lists. — hivert, Feb 20 '21 at 08:50

MisterMiyagi · Answer 2 · 2021-02-07T08:12:17.517

6

Exploiting sortedness is a red herring for time complexity: the ideal case is to iterate both in lockstep for O(n+m) complexity, where n = len(a) and m = len(b). This is the same as converting b to a set for O(m), then searching the elements of a in the set for O(n).

>>> a = [1, 4, 7, 8]
>>> b = [1, 2, 3, 4, 5, 6]
>>> bs = set(b)                 # create set for O(len(b))
>>> [item in bs for item in a]  # check O(len(a)) items "in set of b" for O(1) each
[True, True, False, False]

Since most of these operations are builtin, the only costly operation is the iteration over a which is needed in all solutions.

However, this will duplicate the references to the items in b. If b is treated as external to the algorithm, the space complexity is O(m+n) instead of the ideal case O(n) for just the answer.

edited Feb 07 '21 at 08:12

answered Jan 19 '21 at 10:29

MisterMiyagi


Thanks! I'm having trouble understanding though, not in the last place because the answer seems incomplete. Do you mean by `bs` the result of a binary search? – Tom de Geus Jan 19 '21 at 10:33
@TomdeGeus Ah, sorry. Missed a line. Code edited, you can run it directly now. – MisterMiyagi Jan 19 '21 at 10:35
It's not a red herring; the solution which exploits sortedness has the same time complexity, but uses O(1) auxiliary space compared to O(m) auxiliary space for the set-based solution. – kaya3 Jan 19 '21 at 10:44

Pedro Lobito · Answer 3 · 2021-02-16T20:21:23.230

3

Late answer, but a different approach to the problem, using set() uniqueness and the O(1) speed of len(), i.e.:

a_in_b = []
a = [1,4,7,8]
b = [1,2,3,4,5,6]
b_set = set(b) 
for v in a:
    l1 = len(b_set) 
    b_set.add(v) 
    a_in_b.append(l1 == len(b_set))
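One caveat is worth checking by hand: `add` mutates `b_set`, so a value of a that is missing from b is reported as present the second time it appears. A small sketch, assuming the loop above is wrapped in a hypothetical helper named `in_b_by_len` purely so it can be called twice:

```python
def in_b_by_len(a, b):
    """The len()-based check from above, wrapped in a function for testing."""
    a_in_b = []
    b_set = set(b)
    for v in a:
        l1 = len(b_set)
        b_set.add(v)  # grows the set only if v was not already in it
        a_in_b.append(l1 == len(b_set))
    return a_in_b


print(in_b_by_len([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))  # [True, True, False, False]
print(in_b_by_len([7, 7], [1, 2, 3]))                 # [False, True]: the second 7 is a false positive
```

So the approach is only safe when a contains no repeated values.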

Unfortunately, my approach isn't the fastest:

mistermiyagi: 0.387 ms
tomerikoo: 0.442 ms
blckknght: 0.729 ms
lobito: 1.043 ms
semisecure: 1.87 ms
notnotparas: too long
lucky6qi: too long

Benchmark

edited Feb 16 '21 at 20:21

answered Feb 07 '21 at 07:46

Pedro Lobito


So this basically tests whether v was unseen, i.e. not in the b set before being added? – MisterMiyagi Feb 07 '21 at 08:06
It tests if the length of the `b_set` is the same after append an element of `a`, if it's the same, then the element already existed (`true`) because a `set` cannot have duplicate values. – Pedro Lobito Feb 07 '21 at 08:44
This won't work if there are duplicates in ``a``, does it? – MisterMiyagi Feb 07 '21 at 09:45
@MisterMiyagi I worked with the example provided, but answering directly to your question, no, it won't work when there's dups in `a`. – Pedro Lobito Feb 07 '21 at 09:57
It's still nice to have. Took me a while to wrap my head around. ^^ – MisterMiyagi Feb 07 '21 at 10:45

score 2 · Answer 4 · answered Jan 19 '21 at 10:17

Use Binary Search here:

def bs(b, aele, start, end):
    # search for aele in the sorted range b[start..end] (both ends inclusive)
    if start > end:
        return False
    mid = (start + end) // 2
    if aele == b[mid]:
        return True

    if aele < b[mid]:
        return bs(b, aele, start, mid-1)
    else:
        return bs(b, aele, mid+1, end)

For each element in a check if it exists in b using this method. Time Complexity: O(m*log(n))
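The same O(m*log(n)) check can be sketched with the standard library's `bisect` module instead of a hand-written search. The helper name `in_sorted` is hypothetical; `bisect_left` is the real standard-library function.

```python
from bisect import bisect_left


def in_sorted(b, x):
    """Return True if x occurs in the sorted list b, using binary search."""
    i = bisect_left(b, x)  # leftmost position where x could be inserted
    return i < len(b) and b[i] == x


a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
a_in_b = [in_sorted(b, x) for x in a]
print(a_in_b)  # [True, True, False, False]
```

Note that this version, like the recursive one, does not use the fact that a is sorted.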

notnotparas


There's a binary search module in the standard library, so you don't need to write your own. Check out `bisect`. But you can do better in this specific case! – Blckknght Jan 19 '21 at 10:22
yes i've used `bisect` before. just thought it would be helpful to mention the code. – notnotparas Jan 19 '21 at 10:28
You are not using that a is sorted ! – hivert Feb 20 '21 at 08:52
@hivert using that fact, we can solve this problem by using some modified merge sort algorithm, but I think that would make the time complexity and space complexity O(m+n) – notnotparas Feb 20 '21 at 10:16

Tomerikoo · Answer 5 · 2021-01-19T10:43:38.963

2

Using sets, the order doesn't even matter.

Turn b to a set (O(N)). Then iterate a (O(M)), and for each element check if it's in set_b (O(1)). This will give a time complexity of O(max(M, N)):

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
res = []
for elem in a:
    res.append(elem in set_b)

This can of course be shortened to a nifty list-comp:

res = [elem in set_b for elem in a]

Both give:

[True, True, False, False]

For your parenthesized request, simply iterate with enumerate instead:

res = []  # start from an empty list so that only indices are collected
for i, elem in enumerate(a):
    if elem in set_b:
        res.append(i)

Which will give [0, 1].
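The index variant can be shortened in the same way as the boolean one. A minimal sketch, where the name `indices` is chosen here only for illustration:

```python
a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
# keep the position of every element of a that is also in b
indices = [i for i, elem in enumerate(a) if elem in set_b]
print(indices)  # [0, 1]
```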

edited Jan 19 '21 at 10:43

answered Jan 19 '21 at 10:29

Tomerikoo


I think the OP wants to keep the order with the `[True, False]` notation. If not then this works really fast. – Jonas Palačionis Jan 19 '21 at 10:32
@JonasPalačionis This indeed keeps the order of the `True`/`False`... – Tomerikoo Jan 19 '21 at 10:33
Ah, my bad, thought you made `a` a set too! – Jonas Palačionis Jan 19 '21 at 10:34

spinkus · Answer 6 · 2021-01-19T13:34:09.140

The obvious solution is actually O(M + N):

a = [1, 1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
c = [0] * len(a) # Or use a dict to stash hits ..

j = 0

for i in range(0, len(a)):
  while j < len(b) - 1 and b[j] < a[i]:
    j += 1
  if b[j] == a[i]:
    c[i] = 1

print(c)

For each i in 0 ... M, where M is the length of a, only a unique partition (sub-sequence) of b plus one border element is checked, making it O(M + N) altogether.
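To make the bound concrete, here is a sketch that runs the same loop while counting how often `j` is advanced. The function name `lockstep_count` and the `advances` counter are additions for illustration, as is the guard for an empty b.

```python
def lockstep_count(a, b):
    """Lockstep membership test that also counts how often j is advanced."""
    hits = [0] * len(a)
    j = 0
    advances = 0
    for i in range(len(a)):
        while j < len(b) - 1 and b[j] < a[i]:
            j += 1
            advances += 1
        if b and b[j] == a[i]:  # the guard on b is an addition for empty input
            hits[i] = 1
    return hits, advances


hits, advances = lockstep_count([1, 1, 4, 7, 8], [1, 2, 3, 4, 5, 6])
print(hits, advances)  # [1, 1, 1, 0, 0] 5
```

The counter can never exceed len(b) - 1, because `j` only grows and stops at the last index of b; the outer loop adds exactly len(a) iterations on top of that.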

score 1 · Answer 7 · edited Jan 19 '21 at 10:39

Go through a and b once:

a_in_b = []
bstart = 0
for ai in a:
    if bstart == len(b):
        a_in_b.append(False)
    else:
        for bi in b[bstart:]:  # note: the slice copies the rest of b on every pass
            if ai == bi:
                a_in_b.append(True)
                break
            elif ai > bi:
                bstart += 1
                if bstart == len(b):
                    a_in_b.append(False)
            else:  # ai < bi: b is sorted, so ai cannot occur later
                a_in_b.append(False)
                break

edited Jan 19 '21 at 10:39

Tomerikoo


answered Jan 19 '21 at 10:23

lucky6qi


The slice operation `b[bstart:]` is `O(N)`, which is going to kill your performance. – Blckknght Jan 19 '21 at 10:38
Thanks! Would `b[bstart:]` not lead to a lot of allocations (`b` being quite large)? Also, here it might be worth considering a binary search to advance (depending on the properties of `b`), I guess? – Tom de Geus Jan 19 '21 at 10:39

score 1 · Answer 8 · answered Jan 19 '21 at 10:27

You should use the binary search algorithm (read about it if you don't know what it is).

The modified bin_search function has to return the position right for which b[right] >= elem, i.e. the position of the first element in b that is greater than or equal to the searched element from a. This position is used as the left bound for the next bin_search call. bin_search also returns True as the second element of its result if it has found elem in b.

def bin_search(arr, elem, left):
    right = len(arr)
    while left < right:
        mid = (left+right)//2
        if arr[mid] == elem:
            return (mid, True)
        if arr[mid] < elem:
            left = mid + 1
        else:
            right = mid
    return (right, False)

def find_a_in_b(a, b):
    new_left = 0
    a_in_b = [False] * len(a)
    
    # enumerate is lazy, so it adds no memory overhead even for a large a
    for index, elem in enumerate(a):
        new_left, a_in_b[index] = bin_search(b, elem, new_left)
    return a_in_b

It's probably the best time

P.S. Forget it, i'm stupid and forgot about linear algorithm used in merge sort, so it's not the best
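For reference, the narrowing search above can also be sketched with the standard library's `bisect_left`, whose optional `lo` argument plays the role of `left`. The function name `find_a_in_b_bisect` is hypothetical. The worst case is still O(M*log(N)), so this does not beat the linear merge-style scan.

```python
from bisect import bisect_left


def find_a_in_b_bisect(a, b):
    """Binary-search each item of sorted a in sorted b, skipping the already-passed prefix of b."""
    a_in_b = []
    lo = 0  # every element of b[:lo] is smaller than the current item
    for x in a:
        lo = bisect_left(b, x, lo)  # search only b[lo:]
        a_in_b.append(lo < len(b) and b[lo] == x)
    return a_in_b


print(find_a_in_b_bisect([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))  # [True, True, False, False]
```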

panos · Answer 9 · 2021-01-19T10:38:22.057

0

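a_in_b = []
for el in a:
    try:
        # list.index raises ValueError when el is not in the remaining part of b
        b = b[b.index(el):]  # note: rebinds b to a shrinking copy
        a_in_b.append(True)
    except ValueError:
        a_in_b.append(False)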

edited Jan 19 '21 at 10:38

answered Jan 19 '21 at 10:26

panos


Thanks!! Wouldn't the slice keep performance here though? – Tom de Geus Jan 19 '21 at 10:42
`b.index` actually iterates the list so this is `O(M*N)`... – Tomerikoo Jan 19 '21 at 10:44
Not dramatically. Of course, it depends on the size of the list. I wouldn't use a hashable object, having a scenario like this in mind: a = [1,2,2,3,4] b=[1,2,2,2,2,2,1000,1001] – panos Jan 19 '21 at 10:49
Still, it's iterating the remaining `b` every time and creating new lists each iteration surely takes its toll... – Tomerikoo Jan 19 '21 at 10:55
@Tomerikoo slicing the list avoids the O(M*N), but I see your point! – panos Jan 19 '21 at 11:03

score 0 · Answer 10 · answered Jan 23 '21 at 03:28

A simple solution is to convert the lists to data frames and do an inner merge.

The inner join matches equal values on a specific column.

Golden Lion

