
Consider a 2-D numpy array of strings (at least, this is my closest attempt at creating one):

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
print(ff.dtype)
<U4

But apparently these cannot be used with the numpy.char functions?

ffc = ff.astype('S5')
fff = np.char.split(ffc,':')[1]


Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "/usr/local/lib/python3.7/site-packages/numpy/core/defchararray.py", line 1447, in split
    a, object_, 'split', [sep] + _clean_args(maxsplit))
TypeError: a bytes-like object is required, not 'numpy.str_'

What is the difference between the dtype <U4 and numpy.str_, and how can the strings shown be parsed by the np.char.* functions?

kmario23
WestCoastProjects
  • If the dtype is `U` use unicode parameters. If a bytestring dtype `S` use a bytestring parameter, `b':'`. – hpaulj Apr 28 '19 at 14:19

2 Answers


First, the np.char functions are meant to work on chararrays, which should be constructed with np.char.array or np.char.asarray (see the docs).

Accordingly, your given code would work like this:

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
ffc = np.char.asarray(ff)
fff = np.char.split(ffc, ':')[1]

print(fff)

Output:

[list(['g', 'hi']) list(['j', 'kl'])]

This conversion is implicitly performed, so this, in fact, would also work:

ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])
fff = np.char.split(ff, ':')[1]

For completeness, your subsidiary question about <U4 vs S5:

A numpy dtype with U signifies a unicode string, which is the recommended way of representing strings. On the other hand, S represents a fixed-width byte string, padded with null bytes up to the declared width.

My suspicion is that the string methods are performed upon Python objects, and therefore you need a Python string-like type (knows its own length, etc.) rather than a "dumb" C string-like byte array.
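To make the dtype versus scalar-type distinction concrete, here is a minimal sketch (the names u_item and b_item are just illustrative). It shows that <U4 describes the array's storage, while numpy.str_ is the type of a single element pulled out of such an array:

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])
ffc = ff.astype('S5')

# The dtype describes how the whole array is stored:
# kind 'U' is unicode, kind 'S' is a fixed-width byte string.
print(ff.dtype.kind, ff.dtype.itemsize)    # 4 characters, 4 bytes each
print(ffc.dtype.kind, ffc.dtype.itemsize)  # 5 bytes per element

# Indexing a single element gives a numpy scalar, which subclasses
# the corresponding Python type.
u_item = ff[0, 0]
b_item = ffc[0, 0]
print(type(u_item), isinstance(u_item, str))
print(type(b_item), isinstance(b_item, bytes))
```

So numpy.str_ in the traceback refers to a unicode scalar, not to a different kind of array.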

gmds

The string type in the parameter must match the type in the array:

In [44]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl']])                            
In [45]: ff                                                                          
Out[45]: 
array([['a:bc', 'd:ef'],
       ['g:hi', 'j:kl']], dtype='<U4')
In [46]: np.char.split(ff,':')                                                       
Out[46]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl'])]], dtype=object)
In [47]: np.char.split(ff.astype('S5'),b':')                                         
Out[47]: 
array([[list([b'a', b'bc']), list([b'd', b'ef'])],
       [list([b'g', b'hi']), list([b'j', b'kl'])]], dtype=object)

'U4' is unicode, the default string type for Py3. 'S4' is bytestring, the default type for Py2. b':' is a bytestring, u':' is unicode.

This np.char.split is a bit awkward to use, since the result is object dtype, with lists of the split strings.

To get 2 separate arrays I'd use frompyfunc to apply an unpacking:

In [50]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_46)                         
Out[50]: 
(array([['a', 'd'],
        ['g', 'j']], dtype=object), array([['bc', 'ef'],
        ['hi', 'kl']], dtype=object))
In [51]: np.frompyfunc(lambda alist: tuple(alist), 1,2)(_47)                         
Out[51]: 
(array([[b'a', b'd'],
        [b'g', b'j']], dtype=object), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype=object))

though to get string dtype arrays I'd still have to use astype:

In [52]: _50[0].astype('U4')                                                         
Out[52]: 
array([['a', 'd'],
       ['g', 'j']], dtype='<U4')
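Since the session above relies on IPython's _46 and _50 history variables, here is the same split, unpack and astype sequence as a self-contained sketch (the names pieces, unpack, left and right are mine):

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl']])

# Object array holding one 2-element list per input string
pieces = np.char.split(ff, ':')

# frompyfunc with 1 input and 2 outputs unpacks each list into two arrays
unpack = np.frompyfunc(lambda alist: tuple(alist), 1, 2)
left_obj, right_obj = unpack(pieces)

# The results are object dtype; convert to a string dtype explicitly
left = left_obj.astype('U4')
right = right_obj.astype('U4')

print(left)
print(right)
```

This only works when every string splits into exactly two parts.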

I could combine the unpacking and astype with np.vectorize by providing otypes (even a mix of dtypes!):

In [53]: np.vectorize(lambda alist:tuple(alist), otypes=['U4','S4'])(_46)            
Out[53]: 
(array([['a', 'd'],
        ['g', 'j']], dtype='<U1'), array([[b'bc', b'ef'],
        [b'hi', b'kl']], dtype='|S2'))

Usually frompyfunc is faster than vectorize.

This unpacking won't work if the split creates different length lists:

In [54]: ff = np.array([['a:bc','d:ef'],['g:hi','j:kl:xyz']])                        
In [55]: np.char.split(ff,':')                                                       
Out[55]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)
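One possible workaround for the ragged case, which is not what I used above, is np.char.partition. It splits only at the first separator, so every element yields exactly three pieces (before, separator, after) and the result is a regular string array with one extra axis. A rough sketch, with variable names of my own choosing:

```python
import numpy as np

ff = np.array([['a:bc', 'd:ef'], ['g:hi', 'j:kl:xyz']])

# Shape (2, 2, 3): the last axis holds (before, separator, after)
parts = np.char.partition(ff, ':')

before = parts[..., 0]
after = parts[..., 2]   # anything past the first ':' stays together

print(parts.shape)
print(before)
print(after)
```

The trade-off is that 'j:kl:xyz' keeps 'kl:xyz' as a single piece rather than being split further.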

===

With a chararray, all these np.char functions are available as methods.

In [59]: np.char.asarray(ff)                                                         
Out[59]: 
chararray([['a:bc', 'd:ef'],
           ['g:hi', 'j:kl:xyz']], dtype='<U8')
In [60]: np.char.asarray(ff).split(':')                                              
Out[60]: 
array([[list(['a', 'bc']), list(['d', 'ef'])],
       [list(['g', 'hi']), list(['j', 'kl', 'xyz'])]], dtype=object)

See the note in the np.char docs:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

hpaulj