10

I have two Python lists:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

I need to remove from a every tuple whose first element appears in b. In this case, I should get:

c = [('why', 4), ('throw', 9), ('you', 1)]

What is the most efficient way to do this?

yo'
khan

7 Answers

11

A list comprehension will work.

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
filtered = [i for i in a if i[0] not in b]

>>> print(filtered)
[('why', 4), ('throw', 9), ('you', 1)]
Octipi
  • This is a much more elegant way of doing it while keeping the lists as lists, and not treating them as dicts. Thank you for the help. – khan Feb 23 '13 at 11:43
  • You should convert `b` to a `set` if you are using the `in` operator. It changes the lookup time from linear to constant, which will make a huge difference when `b` is a long list. So, `c = set(b)` and then `filtered = [i for i in a if not i[0] in c]`. Note that `b` became `c` in the last line. Even on this short list with 5 items, it results in a 25% speed improvement for me. With a longer list (100 items in `b`), it results in a 90% speed improvement. – Carl Apr 17 '20 at 11:44
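
The set conversion suggested in the comment above could look roughly like this. It is a minimal sketch using the question's data; the name `b_set` is only illustrative and does not come from the original posts.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# Build the set once. Each membership test is then a hash lookup
# (constant time on average) instead of a linear scan of the list.
b_set = set(b)  # hypothetical name, chosen so that `b` stays a list

filtered = [item for item in a if item[0] not in b_set]
print(filtered)
```

The order of `a` is preserved because only the lookup structure changes, not the iteration.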
5

A list comprehension should work:

c = [item for item in a if item[0] not in b]

Or with a dictionary comprehension:

d = dict(a)
c = {key: value for key, value in d.items() if key not in b}
Blender
2

in is fine, but you should use a set, at least for b. If you have numpy, you could also try np.in1d, though you would have to time it to see whether it is actually faster.

import numpy as np

# same comprehension as the other answers, but with a set for the lookups
b_set = set(b)
filtered = [i for i in a if i[0] not in b_set]

# with numpy (note: a fixed-width string dtype needs the maximum string
# length up front, here 10; otherwise use an object array, which is slower
# and likely not worth it, but safe)
a_arr = np.array(a, dtype=[('key', 'U10'), ('val', int)])
b_arr = np.asarray(b)

# np.isin(a_arr['key'], b_arr) is the newer spelling of the same test
mask = ~np.in1d(a_arr['key'], b_arr)
filtered = a_arr[mask]

Sets also have methods such as difference, which are probably not too useful here but are useful in general.
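
For illustration, here is one way `difference` could be applied to this problem. It is a sketch only; the names `keys` and `surviving` are made up for the example.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# Collect the first element of every tuple, then drop those present in b.
keys = {word for word, _ in a}
surviving = keys.difference(b)  # equivalent to keys - set(b)

# Sets are unordered and hold only the words, so go back to `a`
# to recover the original order and the counts.
filtered = [item for item in a if item[0] in surviving]
print(filtered)
```

This takes an extra pass over `a` compared with the plain comprehension, which is why it is probably not the best fit here.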

seberg
  • +1 for numpy. Didn't see your answer before posting mine. `in1d` is faster than the list comprehension for larger data sets by a factor of 2. – bmu Feb 24 '13 at 10:41
2

As this is tagged with numpy, here is a numpy solution using numpy.in1d benchmarked against the list comprehension:

In [1]: a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

In [2]: b = ['the', 'when', 'send', 'we', 'us']

In [3]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [4]: b_ar = np.array(b)

In [5]: %timeit filtered = [i for i in a if not i[0] in b]
1000000 loops, best of 3: 778 ns per loop

In [6]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
10000 loops, best of 3: 31.4 us per loop

So for 5 records the list comprehension is faster.

However, for large data sets the numpy solution is about twice as fast as the list comprehension:

In [7]: a = a * 1000

In [8]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [9]: %timeit filtered = [i for i in a if not i[0] in b]
1000 loops, best of 3: 647 us per loop

In [10]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
1000 loops, best of 3: 302 us per loop
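
The list-comprehension timings above keep b as a list. To repeat the measurement with a set as well, something like the following could be used. It is a rough sketch that relies only on the standard-library `timeit` module; the helper names `with_list` and `with_set` are hypothetical, and no timings are quoted because they depend on the machine.

```python
import timeit

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)] * 1000
b = ['the', 'when', 'send', 'we', 'us']
b_set = set(b)

def with_list():
    # membership test scans the list b for every tuple
    return [i for i in a if i[0] not in b]

def with_set():
    # membership test is a hash lookup in b_set
    return [i for i in a if i[0] not in b_set]

# Make sure both variants agree before comparing their speed.
assert with_list() == with_set()

runs = 100
for fn in (with_list, with_set):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} us per loop")
```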
bmu
0

Try this:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

c=[]

for x in a:
    if x[0] not in b:
        c.append(x)
print(c)

Demo: http://ideone.com/zW7mzY

Arpit
0

Easy way

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
c = []  # a list to store the required tuples
# compare the first element of each tuple in a with the elements of b
for i in a:
    if i[0] not in b:
        c.append(i)
print(c)
-1

Use filter:

c = filter(lambda (x, y): False if x in b else True, a)
Rahul Banerjee
  • **-1**: If you're using `False if .. else True` or `True if ... else False` then you're doing it wrong – Eric Feb 23 '13 at 09:36
  • Wrong according to a certain "Python style", or wrong due to some other reason? – Rahul Banerjee Feb 23 '13 at 11:22
  • `X in Y` itself is a boolean statement in python – thkang Feb 23 '13 at 11:35
  • Wrong in the same way as `if x > 1 == True:`, `if (x > 1 == True) == True:` or `if condition: b = True\n else: b = False` – Eric Feb 23 '13 at 11:36
  • @RahulBanerjee `False if ... else True` is needlessly complex and hard to read - just do `lambda (x, y): x not in b`. Also, this causes a syntax error in Python 3 - you would have to do `lambda x: x[0] not in b` because the form of argument unpacking you use is no longer part of the language. – lvc Feb 23 '13 at 11:39
  • @Eric: I think you are complaining that it is hard-to-read, but I find it easy to read. – Rahul Banerjee Feb 24 '13 at 00:44
  • @lvc: Your answer makes sense (i.e., this syntax is illegal in Python 3). Thanks! – Rahul Banerjee Feb 24 '13 at 00:45
  • Part of the problem here is that `filter(lambda:...` is inherently hard to read (vs, say, a filtered comprehension). Presumably, you prefer your notation because it includes an `if`. – Eric Feb 24 '13 at 01:26
  • Fair enough. And that might explain why I didn't come up with `lambda (x, y): x not in b` in the first place (which is illegal in 3.0 anyway). – Rahul Banerjee Feb 25 '13 at 06:34
  • @Eric Oh really, "filter(lambda" is hard to read? That is probably only your opinion; it really depends on experience. However, there is only one disadvantage of using filter with lambdas: its performance. Generators and a plain if work faster. – Reishin Aug 11 '16 at 16:09
  • @Reishin: It's more than just my opinion - Guido himself wanted to remove them. See a related post [here](http://stackoverflow.com/a/3013722/102441) – Eric Aug 13 '16 at 00:55
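
As the comments point out, the tuple-unpacking lambda in this answer is a syntax error in Python 3. A sketch of an equivalent that runs on Python 3, following lvc's suggestion, is shown below. The parameter name `pair` is just illustrative, and `filter` is wrapped in `list` because it returns an iterator in Python 3.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# `x not in b` is already a boolean, so no `False if ... else True` is needed.
c = list(filter(lambda pair: pair[0] not in b, a))
print(c)
```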