
Python allows for a simple check if a string is contained in another string:

'ab' in 'abcd'

which evaluates to True.

Now take a numpy array of strings and you can do this:

import numpy as np
A0 = np.array(['z', 'u', 'w'],dtype=object)

A0[:,None] != A0

Resulting in a boolean array:

array([[False,  True,  True],
       [ True, False,  True],
       [ True,  True, False]], dtype=bool)

Let's now take another array:

A1 = np.array(['u_w', 'u_z', 'w_z'],dtype=object)

I want to check where a string of A0 is not contained in a string of A1, essentially creating unique combinations. However, the following yields only a single boolean instead of a boolean array, regardless of how I write the indices:

A0[:,None] not in A1

I also tried using numpy.in1d and np.ndarray.__contains__ but those methods don't seem to do the trick either.

Performance is an issue here so I want to make full use of numpy's optimizations.

How do I achieve this?

EDIT:

I found it can be done like this:

fv = np.vectorize(lambda x,y: x not in y)
fv(A0[:,None],A1)

But as the numpy docs state:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

So this is the same as just looping over the array, and it would be nice to solve this without an explicit or implicit for loop.

Divakar
Khris
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.core.defchararray.find.html#numpy.core.defchararray.find There is a library of functions that apply string methods to the elements of an array. – hpaulj May 18 '17 at 07:11

1 Answer


We can convert to string dtype and then use one of those NumPy based string functions.

Thus, using np.char.count, one solution would be -

np.char.count(A1.astype(str),A0.astype(str)[:,None])==0

Alternative using np.char.find -

np.char.find(A1.astype(str),A0.astype(str)[:,None])==-1

One more using np.char.rfind -

np.char.rfind(A1.astype(str),A0.astype(str)[:,None])==-1

If we convert one array to str dtype, we can skip the conversion for the other array, as it would be done internally anyway. So, the last method could be simplified to -

np.char.rfind(A1.astype(str),A0[:,None])==-1
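For reference, here is a minimal self-contained sketch of the np.char.find variant on the arrays from the question, checked against a plain Python loop. The names s0, s1, not_contained and reference are only illustrative, and both arrays are converted explicitly here so the sketch does not rely on any implicit conversion -

```python
import numpy as np

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z'], dtype=object)

# Convert both object arrays to a fixed-width string dtype explicitly
# (illustrative names, not part of the original answer).
s0 = A0.astype(str)
s1 = A1.astype(str)

# s0[:, None] has shape (3, 1) and s1 has shape (3,), so the substring
# search broadcasts to a (3, 3) result. find returns -1 when the
# substring is absent, so "== -1" marks "not contained".
not_contained = np.char.find(s1, s0[:, None]) == -1

# Pure-Python reference, used only to verify the broadcasted result.
reference = np.array([[x not in y for y in A1] for x in A0])
assert np.array_equal(not_contained, reference)
```

Row i, column j of not_contained is True where A0[i] does not occur in A1[j].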

Sample run -

In [97]: A0
Out[97]: array(['z', 'u', 'w'], dtype=object)

In [98]: A1
Out[98]: array(['u_w', 'u_z', 'w_z', 'zz'], dtype=object)

In [99]: np.char.rfind(A1.astype(str),A0[:,None])==-1
Out[99]: 
array([[ True, False, False, False],
       [False, False,  True,  True],
       [False,  True, False,  True]], dtype=bool)

# Loopy solution using np.vectorize for verification
In [100]: fv = np.vectorize(lambda x,y: x not in y)

In [102]: fv(A0[:,None],A1)
Out[102]: 
array([[ True, False, False, False],
       [False, False,  True,  True],
       [False,  True, False,  True]], dtype=bool)
Divakar
  • Huh. The `sub` parameter allowing an array of strings isn't documented for any of those functions. Good to know that works. – Daniel F May 18 '17 at 07:22
  • @DanielF Yeah, even I didn't know until recently that those `string` funcs allow broadcasting when fed with arrays. – Divakar May 18 '17 at 07:25
  • That's a great solution, I'm stumbling over one more problem here though. After all this I need to join the strings and I'm trying out `np.core.defchararray.join` as described [here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.core.defchararray.join.html), but when e.g. doing `np.core.defchararray.join(['a','b'],['c','d'])` the output is `array(['c', 'd'], dtype='|S1')`. Can you tell me why? – Khris May 18 '17 at 07:31
  • @Khris What's the expected output there? – Divakar May 18 '17 at 07:36
  • The doc says "Calls str.join element-wise." and the method expects two sequences, so I'm expecting the output `array(['ac', 'bd'])`, that would make the most sense to me. – Khris May 18 '17 at 07:39
  • @Khris Doesn't look like we can solve it with that, as it's meant to join the characters of each string element with the `sep`. For example: with `np.core.defchararray.join('-',['abcd','xyz'])`, we get `array(['a-b-c-d', 'x-y-z'])`. I am guessing the vanilla Python code with a list comprehension might be the way here. – Divakar May 18 '17 at 08:17
  • Yeah, I guess that's the best way then. Thanks. – Khris May 18 '17 at 08:35
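
As a possible alternative to the list comprehension mentioned in the comments, element-wise concatenation of two string arrays can be sketched with np.char.add, which pairs up elements instead of joining the characters within each string. This was not proposed in the thread, and the names left, right, joined and sep_joined are only illustrative -

```python
import numpy as np

left = np.array(['a', 'b'])
right = np.array(['c', 'd'])

# np.char.add concatenates element-wise: left[i] + right[i].
joined = np.char.add(left, right)

# A separator can be added by chaining two calls; the scalar '_'
# is broadcast against every element of left.
sep_joined = np.char.add(np.char.add(left, '_'), right)

print(joined.tolist())      # ['ac', 'bd']
print(sep_joined.tolist())  # ['a_c', 'b_d']
```

Whether this is faster than a list comprehension for a given array size would need to be measured.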