My requirement is I have a large dataframe with millions of rows. I encoded all strings to numeric values in order to use numpys
vectorization to increase processing speed.
So I was looking at a way to quickly check if a number exists in another list column. Previously, I was using list comprehension with string values, but with after converting to np.arrays
was looking at similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to the numpy.isin
, I tried running below code:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,5,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
Which is incorrect as the 3rd row has 5 in both columns col_a
and col_b
.
Where as if I change the value to 4 as below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,4,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute same code:
np.isin(dt['col_a'], dt['col_b'])
I get correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results.