8

I have two sorted lists, e.g.

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

I want to know for each item in a if it is in b. For the above example, I want to find

a_in_b = [True, True, False, False]

(or having the indices where a_in_b is True would be fine too).

Now, both a and b are very large, so complexity is an issue. Let M = len(a) and N = len(b). How can I do this with a complexity lower than O(M * N) by making use of the fact that both lists are sorted?

Tom de Geus
  • I'm not sure, but maybe `set.difference()` would be useful? And I think your complexity is `O(n*n)`. – Jonas Palačionis Jan 19 '21 at 10:10
  • You can do it by iterating both in lockstep, but really the fastest is the usual "convert the second to a set". Either way is O(n+m). Is there a specific reason why you want to exploit that both are sorted? – MisterMiyagi Jan 19 '21 at 10:21
  • Related: [How to create a binary list based on inclusion of list elements in another list](https://stackoverflow.com/q/16393681/7851470) – Georgy Jan 28 '21 at 10:13

10 Answers

6

You can iterate over your b list manually within a loop over a. You'll want to advance the b iteration when the latest value you've seen from it is less than the current value from a.

from math import inf

result = []
b_iter = iter(b)                           # create an iterator over b
b_val = -inf
for a_val in a:
    while b_val < a_val:
        b_val = next(b_iter, inf)          # manually iterate on it
    result.append(a_val == b_val)

This should have a running time of O(M+N), since each list item gets iterated over at most once (b may not even need to be fully iterated).

You could avoid using floating point infinities if you want to, but you'd need to do a bit of extra work to handle some edge cases (e.g. if b is empty).
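As a rough illustration of that last remark, here is a minimal sketch that walks b by index instead of using an iterator with infinities as sentinels. The function name `contains_sorted` is made up for this example, and it assumes both inputs are sorted in ascending order and that their items are mutually comparable.

```python
def contains_sorted(a, b):
    """For each item of sorted a, report whether it occurs in sorted b."""
    result = []
    j = 0                # index of the current candidate in b
    n = len(b)
    for a_val in a:
        # Skip b values that are too small; j never moves backwards.
        while j < n and b[j] < a_val:
            j += 1
        # Here either b is exhausted (j == n) or b[j] >= a_val.
        result.append(j < n and b[j] == a_val)
    return result


print(contains_sorted([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))  # [True, True, False, False]
print(contains_sorted([1, 2], []))                        # [False, False]
```

Because no infinity is involved, this also works for items that cannot be compared to floats, such as strings, and an empty b needs no special case.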

Blckknght
  • Thanks!! Do I understand correctly that, depending on the properties of `b`, it might be good to use a binary search (using `bisect`) instead of the manual iteration? – Tom de Geus Jan 19 '21 at 10:36
  • No, binary search is slower (m * log(n)). This algorithm is similar to merging two sorted lists. – hivert Feb 20 '21 at 08:50
6

Exploiting sortedness is a red herring for time complexity: the ideal case is to iterate both in lockstep for O(n+m) complexity, where n = len(a) and m = len(b). This is the same as converting b to a set for O(m), then searching the elements of a in the set for O(n).

>>> a = [1, 4, 7, 8]
>>> b = [1, 2, 3, 4, 5, 6]
>>> bs = set(b)                 # create set for O(len(b))
>>> [item in bs for item in a]  # check O(len(a)) items "in set of b" for O(1) each
[True, True, False, False]

Since most of these operations are builtin, the only costly operation is the iteration over a which is needed in all solutions.

However, this will duplicate the references to the items in b. If b is treated as external to the algorithm, the space complexity is O(m+n) instead of the ideal case O(n) for just the answer.

MisterMiyagi
  • Thanks! I'm having trouble understanding though, not in the last place because the answer seems incomplete. Do you mean by `bs` the result of a binary search? – Tom de Geus Jan 19 '21 at 10:33
  • 1
    @TomdeGeus Ah, sorry. Missed a line. Code edited, you can run it directly now. – MisterMiyagi Jan 19 '21 at 10:35
  • 2
    It's not a red herring; the solution which exploits sortedness has the same time complexity, but uses O(1) auxiliary space compared to O(m) auxiliary space for the set-based solution. – kaya3 Jan 19 '21 at 10:44
3

Late answer, but here is a different approach to the problem using set() uniqueness and the O(1) speed of len(), i.e.:

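a_in_b = []
a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
b_set = set(b)
for v in a:
    l1 = len(b_set)                  # size before inserting v
    b_set.add(v)                     # no-op if v is already present
    a_in_b.append(l1 == len(b_set))  # unchanged size means v was already there
# Caveat: v stays in b_set, so a value that is repeated in a but missing
# from b is reported as True from its second occurrence onwards.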

Unfortunately, my approach isn't the fastest:

  • mistermiyagi: 0.387 ms
  • tomerikoo: 0.442 ms
  • blckknght: 0.729 ms
  • lobito: 1.043 ms
  • semisecure: 1.87 ms
  • notnotparas: too long
  • lucky6qi: too long

Benchmark
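The benchmark code itself is not reproduced here. As a hypothetical stand-in (not the harness that produced the numbers above), a comparison of two of the approaches from this page could be timed roughly like this. The input sizes, the seed, and the function names `with_set` and `lockstep` are arbitrary choices for illustration.

```python
import random
import timeit


def with_set(a, b):
    # Set-based membership test, as in the set answers on this page.
    bs = set(b)
    return [x in bs for x in a]


def lockstep(a, b):
    # Walk both sorted lists once, as in the iterator-based answer.
    result, j, n = [], 0, len(b)
    for x in a:
        while j < n and b[j] < x:
            j += 1
        result.append(j < n and b[j] == x)
    return result


# Arbitrary test data: two sorted samples without duplicates.
random.seed(0)
a = sorted(random.sample(range(200_000), 20_000))
b = sorted(random.sample(range(200_000), 50_000))

# Both approaches must agree before their timings mean anything.
assert with_set(a, b) == lockstep(a, b)

for fn in (with_set, lockstep):
    seconds = timeit.timeit(lambda: fn(a, b), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms")
```

Absolute timings depend on the machine, the Python version, and the input sizes, so only the relative order is worth reading from a run like this.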

Pedro Lobito
2

Use Binary Search here:

def bs(b, aele, start, end):
    """Recursive binary search for aele in the sorted range b[start..end]."""
    if start > end:
        return False
    mid = (start + end) // 2
    if aele == b[mid]:
        return True

    if aele < b[mid]:
        return bs(b, aele, start, mid-1)
    else:
        return bs(b, aele, mid+1, end)

For each element x in a, check if it exists in b using this method, e.g. bs(b, x, 0, len(b) - 1). Time complexity: O(m*log(n)), where m = len(a) and n = len(b).
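The same per-element lookup can be sketched with the standard library's `bisect` module instead of a hand-written search. The helper name `contains` is an assumption made for this example; `bisect_left` returns the leftmost position at which x could be inserted while keeping the list sorted.

```python
from bisect import bisect_left


def contains(sorted_list, x):
    """Binary-search membership test on an ascending sorted list."""
    i = bisect_left(sorted_list, x)
    # x is present exactly when the insertion point holds an equal item.
    return i < len(sorted_list) and sorted_list[i] == x


a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
a_in_b = [contains(b, x) for x in a]
print(a_in_b)  # [True, True, False, False]
```

Like the recursive version, this does not use the fact that a is sorted, so the total cost stays at O(m*log(n)).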

notnotparas
  • 1
    There's a binary search module in the standard library, so you don't need to write your own. Check out `bisect`. But you can do better in this specific case! – Blckknght Jan 19 '21 at 10:22
  • yes i've used `bisect` before. just thought it would be helpful to mention the code. – notnotparas Jan 19 '21 at 10:28
  • You are not using that a is sorted ! – hivert Feb 20 '21 at 08:52
  • @hivert using that fact, we can solve this problem by using some modified merge sort algorithm, but I think that would make the time complexity and space complexity O(m+n) – notnotparas Feb 20 '21 at 10:16
2

Using sets the order doesn't even matter.

Turn b to a set (O(N)). Then iterate a (O(M)), and for each element check if it's in set_b (O(1)). This will give a time complexity of O(max(M, N)):

a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
res = []
for elem in a:
    res.append(elem in set_b)

This can of course be shortened to a nifty list-comp:

res = [elem in set_b for elem in a]

Both give:

[True, True, False, False]

For your parenthesized request, simply iterate with enumerate instead:

res = []  # start from an empty list so only indices are collected
for i, elem in enumerate(a):
    if elem in set_b:
        res.append(i)

Which will give [0, 1].
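The index variant can be condensed into a list comprehension in the same way. This is a small sketch in which the name `indices` is chosen just for this example:

```python
a = [1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]

set_b = set(b)
# Keep the position i of every element of a that is found in set_b.
indices = [i for i, elem in enumerate(a) if elem in set_b]
print(indices)  # [0, 1]
```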

Tomerikoo
1

The obvious solution is actually O(M + N):

a = [1, 1, 4, 7, 8]
b = [1, 2, 3, 4, 5, 6]
c = [0] * len(a) # Or use a dict to stash hits ..

j = 0

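for i in range(len(a)):
  # advance j past every value of b that is smaller than a[i]
  while j < len(b) and b[j] < a[i]:
    j += 1
  # the bounds check also covers an empty or exhausted b
  if j < len(b) and b[j] == a[i]:
    c[i] = 1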

print(c)

For each i in 0 ... M - 1, where M is the length of a, only a unique partition / sub-sequence of b plus one border number is checked, making it O(M + N) altogether.

spinkus
1

Go through a and b once:

a_in_b = []
bstart = 0
for ai in a:
    if bstart == len(b):
        a_in_b.append(False)        # b is exhausted
    else:
        for bi in b[bstart:]:       # note: the slice copies the tail of b
            if ai == bi:
                a_in_b.append(True)
                break
            elif ai > bi:
                bstart += 1         # bi cannot match any later item of a either
                if bstart == len(b):
                    a_in_b.append(False)
            else:                   # ai < bi, so ai is not in b
                a_in_b.append(False)
                break
Tomerikoo
lucky6qi
  • 3
    The slice operation `b[bstart:]` is `O(N)`, which is going to kill your performance. – Blckknght Jan 19 '21 at 10:38
  • 1
    Thanks! Would `b[bstart:]` not lead to a lot of allocations (`b` being quite large)? Also, here it might be worth considering a binary search to advance (depending on the properties of `b`) I guess? – Tom de Geus Jan 19 '21 at 10:39
1

You should use the binary search algorithm (read about it if you don't know what it is).

The modified bin_search function returns the position right for which b[right] >= elem, i.e. the first element in b that is greater than or equal to the searched element from a. This position is used as the left bound for the next bin_search call. bin_search also returns True as its second return value if it has found elem in b.

def bin_search(arr, elem, left):
    right = len(arr)
    while left < right:
        mid = (left+right)//2
        if arr[mid] == elem:
            return (mid, True)
        if arr[mid] < elem:
            left = mid + 1
        else:
            right = mid
    return (right, False)

def find_a_in_b(a, b):
    new_left = 0
    a_in_b = [False] * len(a)

    # enumerate is lazy, so it is fine to use even when a is very large
    for index, elem in enumerate(a):
        # reuse the previous position as the left bound, since a is sorted
        new_left, a_in_b[index] = bin_search(b, elem, new_left)
    return a_in_b

This probably has the best running time.

P.S. Forget it, I'm stupid and forgot about the linear algorithm used in merge sort, so it's not the best.
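For comparison, the moving-left-bound idea can also be sketched with `bisect_left` from the standard library, which accepts a `lo` argument that restricts the search to positions at or after it. The function name `find_a_in_b_bisect` is hypothetical, and the sketch assumes both lists are sorted in ascending order.

```python
from bisect import bisect_left


def find_a_in_b_bisect(a, b):
    """Binary search in b for each item of a, never looking left of the last hit."""
    result = []
    lo = 0
    for x in a:
        # Since a is sorted, x cannot be located before the previous position.
        lo = bisect_left(b, x, lo)
        result.append(lo < len(b) and b[lo] == x)
    return result


print(find_a_in_b_bisect([1, 4, 7, 8], [1, 2, 3, 4, 5, 6]))  # [True, True, False, False]
```

This keeps the O(M*log(N)) worst case of repeated binary search, so it mainly pays off when a is much shorter than b.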

0
a_in_b = []
for el in a:
    try:
        # drop everything before el; index() raises ValueError if el is absent
        b = b[b.index(el):]
        a_in_b.append(True)
    except ValueError:
        a_in_b.append(False)
panos
0

A simple solution is to convert the lists to data frames and do an inner merge.

The inner join matches equal values on a specific column.

Golden Lion