1

I have the following list_A:

['0', '1', '2', '3', '4', '5', '6', '7']

and this other list_B:

['2','6','7']

For each element in "list_A", I would like to check whether it is one of the elements in "list_B".

So:

for 0 <-> are you one of these? ['2','6','7']
for 1 <-> are you one of these? ['2','6','7']
for 2 <-> are you one of these? ['2','6','7']

At the end, I would like to come up with a "list_C" that has the same number of elements as "list_A" but works more like a map, looking like this:

['-1', '-1', '2', '-1', '-1', '-1', '6', '7']

That is: "-1" for every non-matching element and the element itself for every matching one. Currently I am doing this with two nested for loops, and it works:

myStateMap = []

for a in list_A:
    # Default to the '-1' placeholder (a string, to match the desired output)
    elementString = '-1'
    for b in list_B:
        if a == b:
            # Update the elementString in case of a match
            elementString = a
            print("\tMatch")
        else:
            print("\tNO Match!")
    # Store the elementString
    myStateMap.append(elementString)

The question is: How would you optimize this? How would you make it shorter and more efficient?

Martijn Pieters
mbilyanov

4 Answers

4

You can use a list comprehension:

>>> [('-1' if item not in list_B else item) for item in list_A]
['-1', '-1', '2', '-1', '-1', '-1', '6', '7']
Simeon Visser
4

Use a list comprehension with a conditional expression:

[i if i in list_B else '-1' for i in list_A]

Demo:

>>> list_A = ['0', '1', '2', '3', '4', '5', '6', '7']
>>> list_B = ['2','6','7']
>>> [i if i in list_B else '-1' for i in list_A]
['-1', '-1', '2', '-1', '-1', '-1', '6', '7']

If list_B is large, you should make it a set instead:

set_B = set(list_B)

to speed up the membership testing. The in operator on a list has linear cost (the more elements need to be scanned, the longer it takes), while the same test against a set has constant cost on average (independent of the number of values in the set).

For your specific example, using a set is already faster:

>>> timeit.timeit("[i if i in list_B else '-1' for i in list_A]", "from __main__ import list_A, list_B")
1.8152308464050293
>>> timeit.timeit("set_B = set(list_B); [i if i in set_B else '-1' for i in list_A]", "from __main__ import list_A, list_B")
1.6512861251831055

But if the ratio of list_A to list_B is different and the sizes are small, the plain list can be faster:

>>> list_A = ['0', '1', '2', '3']
>>> list_B = ['2','6','8','10']
>>> timeit.timeit("[i if i in list_B else '-1' for i in list_A]", "from __main__ import list_A, list_B")
0.8118391036987305
>>> timeit.timeit("set_B = set(list_B); [i if i in set_B else '-1' for i in list_A]", "from __main__ import list_A, list_B")
0.9360401630401611

That said, in the general case it is worth your while using sets.
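As a minimal sketch of the set-based approach: the helper name build_state_map and its missing parameter are hypothetical, introduced here only for illustration. It assumes the elements are hashable (strings are), builds the set once, and reuses it for every lookup:

```python
def build_state_map(list_a, list_b, missing='-1'):
    """Return list_a with every element not found in list_b replaced by `missing`."""
    # Build the set once; each membership test is then O(1) on average.
    set_b = set(list_b)
    return [item if item in set_b else missing for item in list_a]


list_A = ['0', '1', '2', '3', '4', '5', '6', '7']
list_B = ['2', '6', '7']
list_C = build_state_map(list_A, list_B)
print(list_C)  # ['-1', '-1', '2', '-1', '-1', '-1', '6', '7']
```

Building the set inside the function keeps the call site simple; if the same list_B is checked against many different lists, it may be worth building the set once outside and passing it in.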

Martijn Pieters
  • Is that true? https://wiki.python.org/moin/TimeComplexity mentions O(1) on average but O(n) in worst case for set lookups. – Simeon Visser Nov 07 '13 at 14:00
  • @SimeonVisser: the worst case is in the extremely unlikely scenario that all values inserted hash to the same initial slot, resulting in constant hash collisions. Incidentally that is also the reason that Python 3.3 introduced a random hash seed; to prevent a malicious attacker from feeding your application keys guaranteed to create hash collisions and thus DOS your application. – Martijn Pieters Nov 07 '13 at 14:01
  • Thanks, that's interesting. All the more reason to start planning that elusive Python 3 upgrade. – Simeon Visser Nov 07 '13 at 14:05
  • Note that you can still mitigate that attack by limiting the amount of data you are willing to process from untrusted sources. – Martijn Pieters Nov 07 '13 at 14:08
  • You should present the set solution which is the best solution rather than using a list. Even in the worst case scenario, the set solution is faster than the list comprehension. – Samy Arous Nov 07 '13 at 14:12
  • @lcfseth: Take into account that *creating* the set also has a cost, as does hashing each element to test against the set. These two costs together *can* outweigh the cost of a few list membership tests. – Martijn Pieters Nov 07 '13 at 14:15
  • I was thinking about that, more in terms of memory cost than in terms of computing cost, as hashing is usually fast and creating a set out of a list is done in O(m), which keeps the overall complexity at O(n+m) in the average case. That is a lot better than the O(n*m) of the list solution. I still think that unless m is really small, the set solution is better. – Samy Arous Nov 07 '13 at 14:21
  • @lcfseth: I agree, but the specific cut-off point varies with the size of both lists. Without more detailed statistics on the expected sizes of both lists I cannot make more than vague recommendations. – Martijn Pieters Nov 07 '13 at 14:24
  • @Martijn Pieters :) fair enough! – Samy Arous Nov 07 '13 at 14:26
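Since the cut-off point discussed in these comments depends on the sizes of both lists, one way to find it is to measure. The following is a rough sketch, not a rigorous benchmark: the function names with_list, with_set and compare, as well as the chosen sizes, are assumptions made for illustration, and the timings will differ between machines and Python versions.

```python
import timeit


def with_list(list_a, list_b):
    # Membership test scans list_b linearly for every element of list_a.
    return [i if i in list_b else '-1' for i in list_a]


def with_set(list_a, list_b):
    # The cost of building the set is included, as in the timings above.
    set_b = set(list_b)
    return [i if i in set_b else '-1' for i in list_a]


def compare(size_a, size_b, repeats=1000):
    """Time both variants on synthetic string lists of the given sizes."""
    list_a = [str(n) for n in range(size_a)]
    list_b = [str(n) for n in range(0, size_b * 2, 2)]
    # Both variants must produce the same result before timing them.
    assert with_list(list_a, list_b) == with_set(list_a, list_b)
    t_list = timeit.timeit(lambda: with_list(list_a, list_b), number=repeats)
    t_set = timeit.timeit(lambda: with_set(list_a, list_b), number=repeats)
    return t_list, t_set


for size_a, size_b in [(4, 4), (8, 3), (100, 50)]:
    t_list, t_set = compare(size_a, size_b)
    print("len(A)=%d len(B)=%d list: %.4fs set: %.4fs"
          % (size_a, size_b, t_list, t_set))
```

Running this with sizes close to your real data should give a better indication of which variant to prefer than any general rule.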
0

The quickest way to optimize is to use if a in list_B: instead of your inner loop. So the new code would look like:

myStateMap = []

for a in list_A:
    if a in list_B:
        myStateMap.append(a)
        print('\tMatch')
    else:
        print('\tNO Match!')
        myStateMap.append('-1')
cforbish
  • If we were talking solely about algorithms, both solutions are equivalent: a in list_B does exactly the same thing as the inner loop in the OP's question. However, your solution does indeed speed things up, mainly because the **in** operator is built into Python and the lookup is done internally, which is much faster than running the equivalent loop as Python code. – Samy Arous Nov 07 '13 at 14:24
0

Here's another short list comprehension example that's a little different from the others:

a=[1,2,3,4,5,6,7]
b=[2,5,7]
c=[x * (x in b) for x in a]

This gives c = [0, 2, 0, 0, 5, 0, 7]. If your list elements are actually strings, as they seem to be, then you get either the empty string '' or the original string. This takes advantage of the fact that the boolean value (x in b) behaves as 0 or 1 when multiplied by the original value (which, in the case of strings, is "repeated concatenation").
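To illustrate the string case described above, here is a small sketch with made-up sample lists. The second comprehension, which uses or to substitute a '-1' placeholder for the empty string, is an addition for illustration and not part of the original answer:

```python
a = ['0', '1', '2', '3']
b = ['2', '6']

# bool is a subclass of int, so 'x' * True == 'x' and 'x' * False == ''.
c = [x * (x in b) for x in a]
print(c)  # ['', '', '2', '']

# An empty string is falsy, so `or` can substitute the '-1' placeholder.
# Caveat: an element that is itself '' would also be replaced by '-1'.
d = [x * (x in b) or '-1' for x in a]
print(d)  # ['-1', '-1', '2', '-1']
```

This is arguably less readable than the conditional expression used in the other answers, so it is probably best treated as a curiosity.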

twalberg