What numpy structures are expected as inputs to use numpy.char functions?

Question

Consider a 2-D numpy array of strings (at least, this is my closest attempt at one):

import numpy as np

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
print(ff.dtype)
<U4

But these apparently cannot be used with the numpy.char methods?

ffc = ff.astype('S5')
fff = np.char.split(ffc,':')[1]


Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "/usr/local/lib/python3.7/site-packages/numpy/core/defchararray.py", line 1447, in split
    a, object_, 'split', [sep] + _clean_args(maxsplit))
TypeError: a bytes-like object is required, not 'numpy.str_'

What is the difference between the dtype <U4 and numpy.str_, and how can the strings shown be parsed by the np.char.* functions?

If the dtype is `U` use unicode parameters. If a bytestring dtype `S` use a bytestring parameter, `b':'`. — hpaulj, Apr 28 '19 at 14:19

score 1 · Accepted Answer · answered Apr 28 '19 at 12:47

First, the np.char functions are meant to work on chararrays, which should be constructed with np.char.array or np.char.asarray (see the docs).

Accordingly, your given code would work like this:

import numpy as np

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
ffc = np.char.asarray(ff)
fff = np.char.split(ffc, ':')[1]

print(fff)

Output:

[list(['g', 'hi']) list(['j', 'kl'])]

This conversion is implicitly performed, so this, in fact, would also work:

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
fff = np.char.split(ff, ':')[1]

For completeness, your subsidiary question about <U4 vs S5:

A numpy dtype with U signifies a unicode string, which is the recommended way of representing strings. On the other hand, S represents a null-terminated byte array.

My suspicion is that the string methods are performed upon Python objects, and therefore you need a Python string-like type (knows its own length, etc.) rather than a "dumb" C string-like byte array.
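To make the two dtypes concrete, here is a small sketch that inspects them. The variable names are illustrative only, and the item sizes assume NumPy's usual 4 bytes per unicode character.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])
ffc = ff.astype('S5')          # same data, stored as fixed-width bytes

# dtype.kind is 'U' for unicode strings and 'S' for bytestrings
print(ff.dtype, ff.dtype.kind, ff.dtype.itemsize)
print(ffc.dtype, ffc.dtype.kind, ffc.dtype.itemsize)

# Indexing a single element returns a numpy scalar:
# np.str_ for 'U' arrays, np.bytes_ for 'S' arrays
u_elem = ff[0, 0]
s_elem = ffc[0, 0]
print(type(u_elem), type(s_elem))
```

So `<U4` describes how the whole array is stored, while `numpy.str_` is the type of one element taken out of it.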

gmds

Hmm, I thought I had _started_ with using the `np.array` directly (without invoking the `astype('S5')`) and had received the same error as shown in the OP. But the code (as shown in your third snippet) does work. – WestCoastProjects Apr 28 '19 at 12:52
@javadba Could be a `numpy` version issue, or a stale variable. Would you happen to have been working in a Jupyter notebook or equivalent? – gmds Apr 28 '19 at 12:55
No, I'm using `pycharm` – WestCoastProjects Apr 28 '19 at 12:56
See the note in `np.char` about the current use of `np.chararrays`. – hpaulj Apr 28 '19 at 15:21

hpaulj · Answer 2 · 2019-04-28T15:18:52.873

The string type in the parameter must match the type in the array:

In [44]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])                            
In [45]: ff                                                                          
Out[45]: 
array([['a:bc', 'd:ef'],
       ['g:hi', 'j:kl']], dtype='<U4')
In [46]: np.char.split(ff,':')                                                       
Out[46]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl'])]], dtype=object)
In [47]: np.char.split(ff.astype('S5'),b':')                                         
Out[47]: 
array([[list([b'a', b'bc']), list([b'd', b'ef'])],
       [list([b'g', b'hi']), list([b'j', b'kl'])]], dtype=object)

'U4' is unicode, the default string type for Py3. 'S4' is bytestring, the default type for Py2. b':' is a bytestring, u':' is unicode.

This np.char.split is a bit awkward to use, since the result is object dtype, with lists of the split strings.

To get 2 separate arrays I'd use frompyfunc to apply an unpacking:

In [50]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_46)                         
Out[50]: 
(array([['a', 'd'],
        ['g', 'j']], dtype=object), array([['bc', 'ef'],
        ['hi', 'kl']], dtype=object))
In [51]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_47)                         
Out[51]: 
(array([[b'a', b'd'],
        [b'g', b'j']], dtype=object), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype=object))

though to get string dtype arrays I'd still have to use astype:

In [52]: _50[0].astype('U4')                                                         
Out[52]: 
array([['a', 'd'],
       ['g', 'j']], dtype='<U4')
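The IPython history references (`_46`, `_50`) used above do not carry over to a plain script. A minimal self-contained sketch of the same split, unpack, and astype steps might look like this; the variable names (`pieces`, `unpack`, `left`, `right`) are chosen here for illustration only.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])

# Step 1: split each element; the result is an object array of lists
pieces = np.char.split(ff, ':')

# Step 2: unpack each 2-item list into two separate object arrays
# (1 input, 2 outputs)
unpack = np.frompyfunc(lambda alist: tuple(alist), 1, 2)
left_obj, right_obj = unpack(pieces)

# Step 3: convert the object arrays back to a string dtype
left = left_obj.astype('U4')
right = right_obj.astype('U4')

print(left)
print(right)
```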

I could combine the unpacking and astype with np.vectorize by providing otypes (even a mix of dtypes!):

In [53]: np.vectorize(lambda alist:tuple(alist), otypes=['U4','S4'])(_46)            
Out[53]: 
(array([['a', 'd'],
        ['g', 'j']], dtype='<U1'), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype='|S2'))

Usually frompyfunc is faster than vectorize.

This unpacking won't work if the split creates different length lists:

In [54]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl:xyz']])                        
In [55]: np.char.split(ff,':')                                                       
Out[55]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)
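If only the text before and after the first separator is needed, one possible workaround (a sketch, not the only option) is `np.char.partition`. It always yields exactly three pieces per element (before, separator, after), so the result is a regular array even for ragged input. The names `parts`, `head`, and `tail` are illustrative.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl:xyz']])

# partition splits on the FIRST ':' only and adds a trailing axis of length 3
parts = np.char.partition(ff, ':')
print(parts.shape)

head = parts[..., 0]   # text before the first ':'
tail = parts[..., 2]   # everything after the first ':'
print(head)
print(tail)
```

Note that `'j:kl:xyz'` ends up with the tail `'kl:xyz'`, so this differs from a full split whenever an element contains more than one separator.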

===

With a chararray, all these np.char functions are available as methods.

In [59]: np.char.asarray(ff)                                                         
Out[59]: 
chararray([['a:bc', 'd:ef'],
           ['g:hi', 'j:kl:xyz']], dtype='<U8')
In [60]: np.char.asarray(ff).split(':')                                              
Out[60]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)

See the note in the np.char docs:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.
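In line with that note, here is a short sketch that uses a few of the free functions directly on a plain string array, with no chararray involved. The choice of `upper`, `find`, and `replace` is only an example.

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])

# Each free function is applied elementwise to the ordinary ndarray
upper = np.char.upper(ff)               # uppercase every element
colon_at = np.char.find(ff, ':')        # index of ':' in every element
replaced = np.char.replace(ff, ':', '-')

print(upper)
print(colon_at)
print(replaced)
```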

Heavy duty treatise. I'll come back to this when not under the gun to properly absorb — WestCoastProjects, Apr 28 '19 at 17:31
