Finding entries containing a substring in a numpy array?

Question

I tried to find the entries of an array that contain a substring, using np.where with an in condition:

import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)

This only returns an empty array.
Why is that?
And is there a good alternative solution?

score 32 · Accepted Answer · edited Dec 03 '21 at 00:29

We can use np.core.defchararray.find to find the position of the string foo in each element of bar; it returns -1 where foo is not found. Comparing the output of find against -1 therefore tells us whether foo is present in each element. Finally, np.flatnonzero gives the indices of the matches. So, we would have an implementation like so -

np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)

Sample run -

In [91]: bar
Out[91]: 
array(['aaa', 'aab', 'aca'], 
      dtype='|S3')

In [92]: foo
Out[92]: 'aa'

In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])

In [94]: bar[2] = 'jaa'

In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
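
The same lookup can be written with the shorter alias np.char.find that is mentioned in the comments. This is a minimal sketch, assuming a Unicode string array as np.array produces on Python 3; the names positions, mask, idx and matches are only illustrative.

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# find() returns the index of the first match in each element, or -1 if absent
positions = np.char.find(bar, foo)

# Elements that contain foo are those where find() did not return -1
mask = positions != -1

# Indices of the matching elements, and the matching elements themselves
idx = np.flatnonzero(mask)
matches = bar[mask]
```

Keeping the boolean mask around is handy when the matching entries themselves are wanted rather than their indices.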

edited Dec 03 '21 at 00:29 by mathfux
answered Aug 16 '16 at 11:54 by Divakar

This works perfectly. Thank you very much! But out of curiosity, do you know why the in condition in np.where doesn't work? – SiOx Aug 16 '16 at 12:09
@SiOx AFAIK `foo` being a NumPy array doesn't work with `in`. That `in` is meant for Python lists, etc. if that makes sense? – Divakar Aug 16 '16 at 12:28
`in` does work with an array, that is `ndarray` has a `__contains__` method. But behavior is similar to that of list. – hpaulj Aug 16 '16 at 16:13
`np.char.find` is the shorthand for this function. – hpaulj Aug 16 '16 at 16:15
This only partly works, namely when there are no spaces. If the elements ' aaa' and ' aab' (with a space at the very front) are used in the case above, it would not work – Isaac Sim Dec 26 '18 at 04:49
@IsaacSim Yeah, because that `space` is a character in itself. – Divakar Dec 26 '18 at 07:08
In that case, how can I find them? – Isaac Sim Dec 26 '18 at 07:47
@IsaacSim Use `python strip function`? – Divakar Dec 26 '18 at 08:45
I had already used it before you gave me that advice, but thanks a lot! I am just not happy that I have to strip. That will also find wrong answers, because the substring can be found in the concatenated words = [ – Isaac Sim Dec 27 '18 at 01:48
@IsaacSim Maybe post a question on Stackoverflow, so that we all can have a better look at the issue. – Divakar Dec 27 '18 at 04:29

hpaulj · Answer 2 · 2016-08-16T16:15:08.417

Look at some examples of using in:

In [19]: bar = np.array(["aaa", "aab", "aca"])

In [20]: 'aa' in bar
Out[20]: False

In [21]: 'aaa' in bar
Out[21]: True

In [22]: 'aab' in bar
Out[22]: True

In [23]: 'aab' in list(bar)
Out[23]: True

It looks like in, when used with an array, works as though the array were a list: it tests whether any whole element equals the value. ndarray does have a __contains__ method, so in works, but the test it performs is a simple one.

But in any case, note that in on a list does not check for substrings. The str __contains__ method does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
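
To make the three behaviours concrete, here is a small sketch that puts them side by side. The variable names are only illustrative, and the equivalence with (bar == value).any() reflects how I understand ndarray's __contains__ rather than anything stated in the question.

```python
import numpy as np

bar = np.array(["aaa", "aab", "aca"])

# str: `in` is a substring test
in_str = "aa" in "aaa"

# list: `in` compares whole elements for equality
in_list = "aa" in ["aaa", "aab", "aca"]

# ndarray: `in` behaves like an element-wise equality check followed by any()
in_array = "aa" in bar
equivalent = bool((bar == "aa").any())

# A whole-element match is found, just as with a list
whole_match = "aaa" in bar
```

Only the str case looks inside the strings, which is why 'aa' in bar is False even though two elements contain 'aa'.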

As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.

In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0,  0, -1])

Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.

For operations like this I think the np.char speeds are about the same as with:

In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)

In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
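
Note that the frompyfunc results have dtype=object. An object array is generally not usable directly as a boolean index, so a cast is needed before selecting entries. A minimal sketch, with contains_aa, obj_mask and mask as illustrative names:

```python
import numpy as np

bar = np.array(["aaa", "aab", "aca"])

# frompyfunc always returns an object-dtype array
contains_aa = np.frompyfunc(lambda x: "aa" in x, 1, 1)
obj_mask = contains_aa(bar)

# Cast to bool so the result can be used as a boolean index
mask = obj_mask.astype(bool)
matches = bar[mask]
```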

Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
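
A quick way to check this is with a 2-D array. The array grid below is a made-up example, not taken from the question:

```python
import numpy as np

# Hypothetical 2x2 array of strings
grid = np.array([["aaa", "aab"], ["aca", "xyz"]])

# A whole-element match is found anywhere in the array
found_whole = "aab" in grid

# Still no substring test, regardless of shape
found_sub = "aa" in grid

# Same answers as for the flattened array
same_as_flat = (found_whole == ("aab" in grid.ravel())) and (found_sub == ("aa" in grid.ravel()))
```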

score 4 · Answer 3 · answered Sep 28 '21 at 13:27

If using pandas is acceptable, then the str.contains method can be used.

import numpy as np
entries = np.array(["aaa", "aab", "aca"])

import pandas as pd
pd.Series(entries).str.contains('aa') # <----

Results in:

0     True
1     True
2    False
dtype: bool

The method also accepts regular expressions for more complex patterns:

pd.Series(entries).str.contains(r'a.a')

Results in:

0     True
1    False
2     True
dtype: bool
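
Because the pattern is treated as a regular expression by default, a search string containing characters such as "." may match more than intended. Passing regex=False asks for a plain substring test instead. The sketch below uses a made-up array with a literal dot to show the difference, and converts the result back to indices with np.flatnonzero; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical entries, one of which contains a literal dot
entries = np.array(["a.a", "aaa", "aca"])

# Default: the pattern is a regex, so "." matches any character
as_regex = pd.Series(entries).str.contains("a.a")

# regex=False: plain substring test, "." only matches a literal dot
as_literal = pd.Series(entries).str.contains("a.a", regex=False)

# Convert the boolean Series back to positions in the original array
idx = np.flatnonzero(as_literal.to_numpy())
```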

Chris Mueller · Answer 4 · 2017-03-26T15:43:14.773

The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.

foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)

The problem is that numpy does not define the in operator as an element-wise boolean operation.

One way you could accomplish what you want is with a list comprehension.

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]

bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
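
If you would rather keep np.where, the comprehension can build the boolean array that np.where expects. A minimal sketch, where mask and idx are illustrative names:

```python
import numpy as np

foo = "aa"
bar = np.array(["aca", "bba", "baa", "aaf", "ccc"])

# One boolean per element: does it contain foo?
mask = np.array([foo in v for v in bar])

# np.where with a single boolean array returns a tuple of index arrays
(idx,) = np.where(mask)
```

The loop still runs in Python, so this is about convenience rather than speed.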

score 1 · Answer 5 · answered Aug 16 '18 at 23:07

You can also do something like this:

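# Build a boolean mask with one entry per element of bar
mask = np.array([foo in x for x in bar])
# Boolean indexing keeps only the entries that contain foo
matches = bar[mask]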

htran

Hi and welcome to Stack Overflow! While this answer may solve the problem it does not try to answer the question of **why** the original code wasn't working. Could you please edit your question to explain this too? Thanks! – Max von Hippel Aug 16 '18 at 23:32
