Why numpy .isin function gives incorrect output

Question

I have a large dataframe with millions of rows. I encoded all strings as numeric values so that I can use NumPy's vectorization to speed up processing.

So I was looking for a way to quickly check whether a number exists in a list held in another column. Previously I used a list comprehension on string values, but after converting to np.arrays I was looking for a similar function.

I stumbled across this link: check if values of a column are in values of another numpy array column in pandas

In order to try numpy.isin, I ran the code below:

import numpy as np
import pandas as pd

dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1,2,5,1,2],
                   'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt

    id  col_a   col_b
0   a   1       2
1   a   2       2
2   a   5       [2, 5, 4]
3   b   1       4
4   b   2       [1, 5, 6, 3, 2]

When I enter:

np.isin(dt['col_a'], dt['col_b'])

The output is:

array([False,  True, False, False,  True])

Which is incorrect, as the 3rd row has 5 in both col_a and col_b.

Whereas if I change the value to 4, as below:

dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1,2,4,1,2],
                   'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})

dt

    id  col_a   col_b
0   a   1       2
1   a   2       2
2   a   4       [2, 5, 4]
3   b   1       4
4   b   2       [1, 5, 6, 3, 2]

and execute the same code:

np.isin(dt['col_a'], dt['col_b'])

I get the correct result:

array([False,  True,  True, False,  True])

Can someone please let me know why it's giving different results?

'vectorized' isn't possible. `col_b` is object dtype containing a mix of numbers and lists that vary in length. – hpaulj, Nov 11 '22 at 15:51
@hpaulj if `col_b` were to be all list columns but of varying lengths, would vectorization be possible? — Karthik S, Nov 11 '22 at 16:28

score 1 · Answer 1 · answered Nov 11 '22 at 11:49

Since col_b contains integers as well as lists, you may need to use apply and handle the two cases separately:

# Lists: membership test; scalars: plain equality
dt.apply(lambda x: x['col_a'] in x['col_b'] if isinstance(x['col_b'], list)
                   else x['col_a'] == x['col_b'], axis=1)

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool

Yevhen Kuzmovych · Accepted Answer · 2022-11-11T11:48:37.677

0

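np.isin checks, for each element of dt['col_a'], whether it is present anywhere in the whole dt['col_b'] column, treating each list as a single element rather than looking inside it, i.e. roughly:

[
   1 in dt['col_b'].tolist(),
   2 in dt['col_b'].tolist(),
   5 in dt['col_b'].tolist(),
   ...
]

There is no top-level 5 in dt['col_b'] (it only appears inside the lists), but there is a top-level 4 (row 3).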

From the docs

isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
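
A small sketch of that equivalence, using the data from the question. The names `col_a` and `col_b` are plain Python lists introduced here for illustration, and `flat_b` is a hypothetical flat version of the column that keeps only its scalar entries:

```python
import numpy as np

col_a = [1, 2, 5, 1, 2]
col_b = [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]

# The list-comprehension form from the docs: each item of col_a is compared
# against the top-level elements of col_b; nested lists are not searched.
membership = [item in col_b for item in col_a]
print(membership)  # [False, True, False, False, True]

# With a flat, purely numeric second argument, np.isin agrees with that result.
flat_b = [2, 2, 4]  # only the scalar entries of col_b (illustrative assumption)
print(np.isin(col_a, flat_b).tolist())  # [False, True, False, False, True]
```

Neither form compares row by row, which is why the 5 in row 3 is never matched against the list sitting next to it.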

Also, your dt['col_b'] column is inconsistent (some values are numbers, some are lists). I think the easiest approach is to use apply:

def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool
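
If apply turns out to be too slow, one possible alternative is to flatten the lists with `explode` and compare whole columns at once. This is only a sketch, assuming every entry of col_b is either an integer or a list of integers; the names `exploded`, `hit` and `matches` are made up for illustration, and whether it is actually faster on millions of rows would need to be timed:

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})

# explode gives every list element its own row and repeats the original
# index label; scalar entries are passed through unchanged.
exploded = dt[['col_a', 'col_b']].explode('col_b')

# Compare the two flattened columns as plain integer arrays.
a = exploded['col_a'].to_numpy()
b = exploded['col_b'].to_numpy(dtype='int64')
hit = pd.Series(a == b, index=exploded.index)

# A row matches if any of its exploded elements matched.
matches = hit.groupby(level=0).any()
print(matches.tolist())  # [False, True, True, False, True]
```

The result is indexed like the original dataframe, so it can be assigned back as a new column.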

edited Nov 11 '22 at 11:48

answered Nov 11 '22 at 11:32

Yevhen Kuzmovych


Ah ok, can you please let me know if there's a vectorized approach to what I am trying to accomplish – Karthik S Nov 11 '22 at 11:37
I think [this](https://stackoverflow.com/questions/40175327/numpy-element-wise-in-operation) answers your question – Yevhen Kuzmovych Nov 11 '22 at 11:39
I am getting this error : `ValueError: ('Shapes must match', (5,), (5, 1))`, can you please add that to the answer, I am very new to numpy – Karthik S Nov 11 '22 at 11:42
Is that vectorized operation? I need vectorized approach to make my code faster – Karthik S Nov 11 '22 at 11:52
If the entire `col_b` were a list column, would it be possible to use any vectorized function? – Karthik S Nov 11 '22 at 16:25
@KarthikS Yes, see the link in my comment – Yevhen Kuzmovych Nov 11 '22 at 16:43
Getting the same `Shapes must match` error. Is it because in the link the `X` and `y` variables are fixed size and in my case the list length in `col_b` is varying? – Karthik S Nov 11 '22 at 17:07
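
On the varying-length point: a broadcast comparison of the form `col_a[:, None] == col_b_2d` needs a rectangular 2-D array, which lists of different lengths cannot form directly. One possible workaround is to pad the lists to a common width first. The sketch below assumes every entry has already been wrapped as a list of numbers (scalars as one-element lists); `padded`, `width` and `result` are hypothetical names, and the padding step itself is still a Python loop:

```python
import numpy as np

col_a = np.array([1, 2, 5, 1, 2])
col_b = [[2], [2], [2, 5, 4], [4], [1, 5, 6, 3, 2]]

# Pad every list to the longest length with NaN; NaN never compares
# equal to anything, so the padding cannot produce a false match.
width = max(len(row) for row in col_b)
padded = np.full((len(col_b), width), np.nan)
for i, row in enumerate(col_b):
    padded[i, :len(row)] = row

# Broadcast each col_a value across its own row and reduce per row.
result = (padded == col_a[:, None]).any(axis=1)
print(result.tolist())  # [False, True, True, False, True]
```

The padded array costs memory proportional to the number of rows times the longest list, so this may be a poor fit when a few lists are much longer than the rest.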
