I have two arrays that look like this:

import numpy as np

x1 = np.array([['a','b','c'],['d','a','b'],['c','a,c','c']])
x2 = np.array(['d','c','d'])

I want to check whether each element of x2 exists in the corresponding column of x1. So I tried:

print((x1==x2).any(axis=0))
# [ True False False]

Note that x2[1] in x1[2,1] evaluates to True. The problem is that the element we're looking for is sometimes embedded in an element of x1, where it can be identified by splitting on the comma. So my desired output is:

array([ True,  True, False])

Is there a way to do this with a native numpy (or pandas) method?

  • Does a substring-contains check work instead of `==`? [Finding entries containing a substring in a numpy array?](https://stackoverflow.com/q/38974168/15497888). Like `(np.core.defchararray.find(x1, x2) != -1).any(axis=0)` Or does the comma-separated string need to be split into separate elements that are tested separately? – Henry Ecker Sep 19 '21 at 17:41
  • What do you expect to happen with this string: `'a,c'`? Is that a typo, or do you really want to consider that as two different characters? Because I would say neither `'a'` nor `'c'` exists in that column and you should try to clean your data up first. Also, why is your desired result for the third column `False`? It contains `'c'`, which is in `x2`. – Mark Sep 19 '21 at 17:42
  • @Mark, no it’s not a typo; I want to consider both ‘a’ and ‘c’ as two separate characters. –  Sep 19 '21 at 17:44
  • It makes little sense to me to have a string such as `'a,c'` in an array to represent the two separate characters `'a'` and `'c'`. I would suggest storing them as separate items in the array. If you run into array shape issues, you could pad the smaller arrays with `nan`s – Andre Sep 19 '21 at 18:23

1 Answer

You can use np.vectorize on a small function to broadcast the check x2 in x1.split(',') across both arrays:

@np.vectorize
def f(a, b):
    return b in a.split(',')

f(x1, x2).any(axis=0)
# array([ True,  True, False])

Note that "vectorize" is a misnomer. This isn't true vectorization, just a convenient way to broadcast a custom function.
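To make the "misnomer" point concrete, here is a minimal sketch of roughly what the decorated function amounts to: broadcast the two inputs, then call the Python function once per element pair. The names `contains_token`, `looped`, and `vectorized` are illustrative and not part of the original answer.

```python
import numpy as np

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

def contains_token(cell, target):
    """Return True if `target` is one of the comma-separated tokens in `cell`."""
    return target in cell.split(',')

# Pair up elements of x1 and x2 under NumPy's broadcasting rules,
# then run an ordinary Python loop over the pairs.
pairs = np.broadcast(x1, x2)
looped = np.array(
    [contains_token(cell, target) for cell, target in pairs],
    dtype=bool,
).reshape(pairs.shape)

# The np.vectorize version gives the same result; it is a convenience
# wrapper around a per-element Python call, not compiled array code.
vectorized = np.vectorize(contains_token)(x1, x2)

print(looped.any(axis=0))
```

Both versions call the Python function once per element, so performance should scale with the number of elements rather than benefit from compiled array operations.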


Since you mentioned pandas in parentheses, another option is to apply a splitting/membership function to the columns of df = pd.DataFrame(x1).

However, the numpy function is significantly faster on this sample data (roughly 38 times faster than the quickest pandas variant):

f(x1, x2).any(axis=0)         # 24.2 µs ± 2.8 µs
df.apply(list_comp).any()     # 913 µs ± 12.1 µs
df.apply(combine_in).any()    # 1.8 ms ± 104 µs
df.apply(expand_eq_any).any() # 3.28 ms ± 751 µs

The pandas functions timed above are defined as follows:

# use a list comprehension to do the splitting and membership checking:
def list_comp(col):
    return [x2[col.name] in val.split(',') for val in col]

# split the whole column and use `combine` to check `x2 in x1`
def combine_in(col):
    return col.str.split(',').combine(x2[col.name], lambda a, b: b in a)

# split the column into expanded columns and check the expanded rows for matches
def expand_eq_any(col):
    return col.str.split(',', expand=True).eq(x2[col.name]).any(axis=1)
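Another possibility, following the substring idea raised in the comments, is to stay within NumPy's string routines. The sketch below is not from the original answer and has not been timed against the options above. It assumes the delimiter is exactly `','` with no surrounding whitespace, and the names `padded_cells`, `padded_targets`, and `matches` are illustrative. Wrapping every string in commas turns a token match into a plain substring match.

```python
import numpy as np

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

# Wrap each string in commas: 'a,c' -> ',a,c,' and 'c' -> ',c,'.
# A whole-token match is then equivalent to a substring match,
# and 'c' can no longer match inside a longer token such as 'abc'.
padded_cells = np.char.add(np.char.add(',', x1), ',')
padded_targets = np.char.add(np.char.add(',', x2), ',')

# np.char.find broadcasts the (3, 3) cells against the (3,) targets
# and returns -1 where the substring is absent.
matches = np.char.find(padded_cells, padded_targets) != -1

print(matches.any(axis=0))
```

Without the comma padding, a bare substring search would also report a match when the target is only part of a longer token.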
tdy