
This has bugged me for some time now, but I haven't really found a satisfying solution.

If you declare a structured array with a field that contains strings, how can you set the dtype of that field to something so that you don't have to worry about the length of the strings in that field?

With floats and ints it is so much easier. So far, I've always used 'i4' or 'f4' as the respective dtypes and never had any issues (although I am not sure if this is bad practice, feel free to point it out). And in the unlikely case that numbers are actually too long for these dtypes, Python tells me so by raising an OverflowError. But if a string is too long, it is just silently cut off.

Is there any way to declare the string dtype so that you don't have to know exactly how long your strings are (going to be) that you want to store in the structured array prior to creating it? I mean you could always guesstimate and assume that, say, 'U30' is probably going to be enough and hope for the best, but I don't really like that. So far, my workaround has always been to use the object dtype 'O' because it just takes whatever, but I never really liked that either.

I think in the case of ints or floats, you could just as well use int and float as dtypes, without having to worry about the number of bits necessary to store the data. Why is it not implemented the same way for strings when using str as the dtype? I followed this chain of posts, and the GitHub issue explains that the str dtype defaults to an empty string, if I am not mistaken.

According to the numpy documentation on data type objects:

To use actual strings in Python 3 use U or np.unicode_.

So I thought I'd give a couple of things a try in the example below, but (as expected) none of them work.

import numpy as np


array = np.array(
    [
        ('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)], dtype=[
        ('fruit', np.str_), ('color', np.unicode_), ('shape', np.dtype(str)),
        ('state', str), ('taste', 'U2'), ('weight', 'i4'), ('radius', float)
    ]
)

# this causes OverflowError: Python int too large to convert to C long
# array[0]['weight'] = 10e10

# this is just 'ignored'
array[0]['color'] = 'red'

print(array)
mapf

1 Answer


All the variants that you tried do the same thing: they define a 'U0', a unicode field with zero length. This isn't just a structured array issue.

dtype=[('fruit', '<U'), ('color', '<U'), ('shape', '<U'), ('state', '<U'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
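To see this for yourself, here is a minimal sketch that inspects the dtypes directly. It leaves out `np.unicode_` because, as far as I know, that alias was removed in NumPy 2.0; the variable names `fruit_width` and `taste_width` are only for illustration.

```python
import numpy as np

# Every "bare" string spec resolves to the same unicode dtype with no width.
for spec in (str, np.str_, "U", np.dtype(str)):
    print(repr(spec), "->", np.dtype(spec).itemsize, "bytes")

# Inside a structured dtype the field keeps that zero width, while 'U2'
# reserves room for 2 characters at 4 bytes each.
dt = np.dtype([("fruit", str), ("taste", "U2")])
fruit_width = dt["fruit"].itemsize
taste_width = dt["taste"].itemsize
print(fruit_width, taste_width)
```

A field with an itemsize of 0 has no room for any characters, which is why the values you assigned came back empty.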

Either specify an explicit length such as 'U10', or use 'O', the object dtype:

In [239]: arr = np.array( 
     ...:     [ 
     ...:         ('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)], dtype=[ 
     ...:         ('fruit', 'U10'), ('color', 'O'), ('shape', 'O'), 
     ...:         ('state', 'S10'), ('taste', 'U2'), ('weight', 'i4'), ('radius', float) 
     ...:     ] 
     ...: )                                                                                            
In [240]: arr                                                                                          
Out[240]: 
array([('Apple', 'green', 'round', b'fresh', 'go', 100000, 3.14159265)],
      dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
In [241]: arr['color']                                                                                 
Out[241]: array(['green'], dtype=object)
In [242]: arr['color']='yellow_green'                                                                  
In [243]: arr['fruit']                                                                                 
Out[243]: array(['Apple'], dtype='<U10')
In [244]: arr['fruit']='pineapple'                                                                     
In [245]: arr                                                                                          
Out[245]: 
array([('pineapple', 'yellow_green', 'round', b'fresh', 'go', 100000, 3.14159265)],
      dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])

pandas opts for using object dtype for all of its strings. The numpy fixed string length is ok when the strings tend to be all the same size and are known ahead of time, e.g. np.array(['one','two','three', 'four', 'five'])
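A short sketch of the trade-off, under the assumption that you only need one string field (the names `words`, `fixed` and `flexible` are made up for the example):

```python
import numpy as np

# Without an explicit dtype, np.array scans the data and sizes the
# string dtype to the longest element.
words = np.array(["one", "two", "three", "four", "five"])
print(words.dtype)

# A fixed-width field never grows: longer values are cut to the declared width.
fixed = np.zeros(1, dtype=[("fruit", "U5")])
fixed["fruit"] = "pineapple"

# An object field stores a reference to the Python str, so nothing is cut.
flexible = np.zeros(1, dtype=[("fruit", "O")])
flexible["fruit"] = "pineapple"

print(fixed["fruit"][0], flexible["fruit"][0])
```

The object field gives up the compact fixed-size layout in exchange for never truncating, which is the same choice pandas makes.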

hpaulj
  • Thank you! I forgot to mention though that when you create a normal array that contains only strings, numpy not only automatically assumes the correct dtype `str` but also chooses a length that fits all the characters. Why would that logic be abandoned when it comes to structured arrays? – mapf May 02 '20 at 08:55
  • Maybe in a structured array, the dtype is initialized first and then the data is stored accordingly, while in a regular array it is the opposite way? So in a structured array, the dtype dictates the data, whereas in a regular array the data dictates the dtype? – mapf May 02 '20 at 08:59
  • I didn't write the `np.array` code, so can't tell you about the `why`. One guess is that the default behavior has existed for a long time, and that the compound dtype case was a later addition, and they didn't try to copy everything. It makes sense to allocate the return array based on dimensions and dtype, and then fill it. Analyzing all the data first, as done when dtype is unspecified, could take more work and time. Keep in mind that a structured array requires a list of tuples, not a simple nested list. @mapf – hpaulj May 02 '20 at 18:38
  • Another way to look at this. Is specifying the compound dtype more like specifying `np.str` or 'U3'? Longer data strings don't expand the 'U3' specification, while the bare `str` is an open ended specification. – hpaulj May 02 '20 at 20:02
  • Thank you for your time and effort! Sorry for asking follow-up questions, but what does the fact that a structured array requires a list of tuples, not a simple nested list have to do with the dtypes? I am not sure I understand what you are saying in your second comment. You said yourself that even using the bare `str` itself as the dtype defaults to `'U0'`. Also I'm sorry but I don't know what the difference is between `np.str`, `str`, `U3` or `np.unicode` for that matter. I don't expect you to explain it to me though. I guess I should read it up at some point. – mapf May 02 '20 at 21:06
  • `str` is a Python function; `np.str` and `np.str_` are numpy functions with the same docs (probably aliases). `np.unicode` is also a function. They all have the same effect when used as a `dtype` parameter. `np.dtype(str).descr`, `np.dtype('U3').descr`, `np.dtype('str,U10').descr` etc may help show the differences, if any. – hpaulj May 02 '20 at 22:15
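If you want the "data dictates the dtype" behaviour discussed in the comments for a structured array, one possible workaround is to scan the records yourself and size each string field before calling `np.array`. The sketch below is only an illustration: `infer_structured_dtype` is a hypothetical helper, not a numpy function, and it assumes every column holds only str, only int, or otherwise float values.

```python
import numpy as np

def infer_structured_dtype(rows, names):
    """Hypothetical helper: size each string field to its longest value."""
    fields = []
    # zip(*rows) transposes the list of record tuples into columns.
    for name, column in zip(names, zip(*rows)):
        if all(isinstance(v, str) for v in column):
            width = max(len(v) for v in column)
            fields.append((name, f"U{max(width, 1)}"))
        elif all(isinstance(v, int) for v in column):
            fields.append((name, "i8"))
        else:
            # Assumption: anything else is treated as a float column.
            fields.append((name, "f8"))
    return np.dtype(fields)

rows = [
    ("Apple", "green", 100000, np.pi),
    ("pineapple", "yellow_green", 900, 0.1),
]
names = ("fruit", "color", "weight", "radius")

dt = infer_structured_dtype(rows, names)
arr = np.array(rows, dtype=dt)
print(dt.descr)
print(arr)
```

This costs an extra pass over the data, and the widths are still fixed once the array exists, so a longer string assigned later is truncated just as before.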