
I am using numpy 1.16.2.

In brief, I am wondering how to add an object-type field to a structured array. The standard way via the recfunctions module throws an error and I suppose there is a reason for this. Therefore, I wonder whether there is anything wrong with my workaround. Furthermore, I would like to understand why this workaround is necessary and whether I need to use extra caution when accessing the newly created array.

Now here come the details:

I have a numpy structured array:

import numpy as np
a = np.zeros(3, dtype={'names':['A','B','C'], 'formats':['int','int','float']})
for i in range(len(a)):
    a[i] = i

I want to add another field "test" of type object to the array a. The standard way for doing this is using numpy's recfunctions module:

import numpy.lib.recfunctions as rf
b = rf.append_fields(a, "test", [None]*len(a)) 

This code throws an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-4a7be4f94686> in <module>
----> 1 rf.append_fields(a, "test", [None]*len(a))

D:\_Programme\Anaconda3\lib\site-packages\numpy\lib\recfunctions.py in append_fields(base, names, data, dtypes, fill_value, usemask, asrecarray)
    718     if dtypes is None:
    719         data = [np.array(a, copy=False, subok=True) for a in data]
--> 720         data = [a.view([(name, a.dtype)]) for (name, a) in zip(names, data)]
    721     else:
    722         if not isinstance(dtypes, (tuple, list)):

D:\_Programme\Anaconda3\lib\site-packages\numpy\lib\recfunctions.py in <listcomp>(.0)
    718     if dtypes is None:
    719         data = [np.array(a, copy=False, subok=True) for a in data]
--> 720         data = [a.view([(name, a.dtype)]) for (name, a) in zip(names, data)]
    721     else:
    722         if not isinstance(dtypes, (tuple, list)):

D:\_Programme\Anaconda3\lib\site-packages\numpy\core\_internal.py in _view_is_safe(oldtype, newtype)
    492 
    493     if newtype.hasobject or oldtype.hasobject:
--> 494         raise TypeError("Cannot change data-type for object array.")
    495     return
    496 

TypeError: Cannot change data-type for object array.

A similar error has been discussed here, though the issue is old and I do not know whether the behaviour I am observing is actually a bug. Here I am informed that views of structured arrays containing general objects are not supported.

I therefore built a workaround:

b = np.empty(len(a), dtype=a.dtype.descr+[("test", object)])
b[list(a.dtype.names)] = a

This works. Nonetheless, I have the following questions:

Questions

  • Why is this workaround necessary? Is this just a bug?
  • Working with the new array b seems to be no different from working with a. The variable c = b[["A", "test"]] is clearly a view to the data of b. So why would they say that views on the array b are not supported? Do I have to treat c with extra caution?
Samufi
  • I don't regard the `recfunctions` as *standard*. They are utilities that can be used, but you don't have to use them. They aren't compiled, and thus don't do anything that you can't do without them. – hpaulj Apr 17 '19 at 00:41
  • https://stackoverflow.com/questions/42364725/numpy-recarray-append-fields-cant-append-numpy-array-of-datetimes - basically the same issue, trying to use `append_fields` with an object dtype. An alternative there was to append a `datetime64` field instead. – hpaulj May 01 '19 at 00:37

1 Answer

In [161]: a = np.zeros(3, dtype={'names':['A','B','C'], 'formats':['int','int','float']})
     ...: for i in range(len(a)):
     ...:     a[i] = i
     ...:
In [162]: a                                                                     
Out[162]: 
array([(0, 0, 0.), (1, 1, 1.), (2, 2, 2.)],
      dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<f8')])

Define the new dtype:

In [164]: a.dtype.descr                                                         
Out[164]: [('A', '<i8'), ('B', '<i8'), ('C', '<f8')]
In [165]: a.dtype.descr+[('test','O')]                                          
Out[165]: [('A', '<i8'), ('B', '<i8'), ('C', '<f8'), ('test', 'O')]
In [166]: dt= a.dtype.descr+[('test','O')]                                      

Create a new array of the right size and dtype:

In [167]: b = np.empty(a.shape, dt)                                             

Copy values from a to b by field name:

In [168]: for name in a.dtype.names: 
     ...:     b[name] = a[name] 
     ...:                                                                       
In [169]: b                                                                     
Out[169]: 
array([(0, 0, 0., None), (1, 1, 1., None), (2, 2, 2., None)],
      dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<f8'), ('test', 'O')])
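
The three steps above can be wrapped in a small function. This is only a sketch: `append_object_field` is a hypothetical helper name, not something NumPy provides, and it assumes a 1-D structured array without nested fields.

```python
import numpy as np

def append_object_field(arr, name, values=None):
    """Return a copy of `arr` with one extra object field (hypothetical helper)."""
    # New dtype: the existing fields plus one object field.
    new_dtype = [(n, arr.dtype[n]) for n in arr.dtype.names] + [(name, object)]
    out = np.empty(arr.shape, dtype=new_dtype)
    # Copy field by field, as in the session above.
    for n in arr.dtype.names:
        out[n] = arr[n]
    # Fill the new field explicitly instead of relying on np.empty.
    if values is None:
        out[name] = None
    else:
        for i, v in enumerate(values):
            out[name][i] = v
    return out

a = np.zeros(3, dtype={'names': ['A', 'B', 'C'], 'formats': ['int', 'int', 'float']})
for i in range(len(a)):
    a[i] = (i, i, i)

b = append_object_field(a, "test")
b2 = append_object_field(a, "test", [[1, 2], "x", None])
```

Copying by field name avoids any dependence on how multifield indexing behaves in a given NumPy version.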

Many of the rf functions do this field-by-field copy:

rf.recursive_fill_fields(a,b)

rf.append_fields uses this after it initializes its output array.
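
A minimal, self-contained sketch of that call, assuming the same field layout as in the question. The fields of `a` are copied into the matching fields of `b`, and the extra object field is left alone:

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.zeros(3, dtype=[('A', int), ('B', int), ('C', float)])
a['A'] = [0, 1, 2]

# Target array: same fields as a, plus an object field.
b = np.empty(a.shape, dtype=a.dtype.descr + [('test', object)])

# Copies a's fields into the fields of the same name in b and returns b.
rf.recursive_fill_fields(a, b)
```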

In earlier versions a multifield index produced a copy, so expressions like b[list(a.dtype.names)] = a would not work.


I don't know if it's worth trying to figure out what rf.append_fields is doing. Those functions are somewhat old and not heavily used (note the special import), so it's entirely likely that they have bugs or edge cases that don't work. The functions that I've examined work much as I demonstrated: make a new dtype and result array, then copy data by field name.

In recent releases there have been changes in how multiple fields are accessed. There are some new functions in recfunctions to facilitate working with structured arrays, such as repack_fields.

https://docs.scipy.org/doc/numpy/user/basics.rec.html#accessing-multiple-fields

I don't know if any of that applies to the append_fields problem. I see there's also a section about structured arrays with objects, but I haven't studied that:

https://docs.scipy.org/doc/numpy/user/basics.rec.html#viewing-structured-arrays-containing-objects

In order to prevent clobbering object pointers in fields of numpy.object type, numpy currently does not allow views of structured arrays containing objects.

This line apparently refers to the use of the view method. Views created by field indexing, whether by a single name or a multifield list, are not affected.
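
The difference between the two kinds of view can be checked directly. A small sketch, assuming NumPy 1.16 or later (where a multifield index returns a view rather than a copy); the variable names are only illustrative:

```python
import numpy as np

b = np.zeros(3, dtype=[('A', int), ('B', int), ('C', float), ('test', object)])

# Field indexing: c is a view onto b's memory, object field included.
c = b[['A', 'test']]
c['A'][0] = 42                      # writes through to b
shares = np.shares_memory(c, b)

# The view method with a different dtype is what gets rejected.
try:
    b.view([('AA', int), ('BB', int), ('cc', float), ('d', object)])
    view_rejected = False
except TypeError:
    view_rejected = True

print(b['A'][0], shares, view_rejected)
```

So `c` needs no special care beyond the usual rule that writing to a view also changes the original array.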


The error in append_fields comes from this operation:

In [183]: data = np.array([None,None,None])                                          
In [184]: data                                                                       
Out[184]: array([None, None, None], dtype=object)
In [185]: data.view([('test',object)])                                               
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-185-c46c4464b53c> in <module>
----> 1 data.view([('test',object)])

/usr/local/lib/python3.6/dist-packages/numpy/core/_internal.py in _view_is_safe(oldtype, newtype)
    492 
    493     if newtype.hasobject or oldtype.hasobject:
--> 494         raise TypeError("Cannot change data-type for object array.")
    495     return
    496 

TypeError: Cannot change data-type for object array.

There's no problem creating a compound dtype with object fields:

In [186]: np.array([None,None,None], dtype=[('test',object)])                        
Out[186]: array([(None,), (None,), (None,)], dtype=[('test', 'O')])

But I don't see any recfunctions that are capable of joining a and data.


view can be used to change the field names of a:

In [219]: a.view([('AA',int),('BB',int),('cc',float)])                               
Out[219]: 
array([(0, 0, 0.), (1, 1, 1.), (2, 2, 2.)],
      dtype=[('AA', '<i8'), ('BB', '<i8'), ('cc', '<f8')])

but trying to do so for b fails for the same reason:

In [220]: b.view([('AA',int),('BB',int),('cc',float),('d',object)])                  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-220-ab0a6e4dd57f> in <module>
----> 1 b.view([('AA',int),('BB',int),('cc',float),('d',object)])

/usr/local/lib/python3.6/dist-packages/numpy/core/_internal.py in _view_is_safe(oldtype, newtype)
    492 
    493     if newtype.hasobject or oldtype.hasobject:
--> 494         raise TypeError("Cannot change data-type for object array.")
    495     return
    496 

TypeError: Cannot change data-type for object array.

If I start with an object dtype array and try to view it as i8 (a dtype of the same size), I get the same error. So the restriction on views of an object dtype isn't limited to structured arrays. The restriction makes sense when an object pointer would be reinterpreted as i8. It is less obviously necessary when the object pointer is merely embedded in a compound dtype; there it might be overkill, or just a case of playing it safe and simple.

In [267]: x.dtype                                                                    
Out[267]: dtype('O')
In [268]: x.shape                                                                    
Out[268]: (3,)
In [269]: x.dtype.itemsize                                                           
Out[269]: 8
In [270]: x.view('i8')                                                               
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-270-30c78b13cd10> in <module>
----> 1 x.view('i8')

/usr/local/lib/python3.6/dist-packages/numpy/core/_internal.py in _view_is_safe(oldtype, newtype)
    492 
    493     if newtype.hasobject or oldtype.hasobject:
--> 494         raise TypeError("Cannot change data-type for object array.")
    495     return
    496 

TypeError: Cannot change data-type for object array.

Note that the test on line 493 checks the hasobject property of both the new and old dtypes. A more nuanced test might check whether both have objects, but I suspect the logic could get quite complex. Sometimes a simple prohibition is safer (and easier) than a complex set of tests.
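
The hasobject property can be inspected on any dtype. In this sketch, `would_be_rejected` is a hypothetical helper that only mirrors the condition shown on line 493 of the traceback; it is not NumPy's own function:

```python
import numpy as np

plain = np.dtype([('A', 'i8'), ('B', 'i8'), ('C', 'f8')])
with_obj = np.dtype(plain.descr + [('test', object)])

def would_be_rejected(oldtype, newtype):
    # Hypothetical mirror of the check on line 493: either side having
    # an object field is enough to trigger the TypeError.
    return bool(newtype.hasobject or oldtype.hasobject)

print(plain.hasobject, with_obj.hasobject)
print(would_be_rejected(np.dtype(object), np.dtype('i8')))
print(would_be_rejected(plain, plain))
```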


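In further testing, structured_to_unstructured works on a (whose values had evidently been modified earlier in the session):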

In [283]: rf.structured_to_unstructured(a)                                           
Out[283]: 
array([[ 3.,  3.,  0.],
       [12., 10.,  1.],
       [ 2.,  2.,  2.]])

but trying to do the same on b, or even on a subset of its fields, produces the familiar error:

rf.structured_to_unstructured(b)
rf.structured_to_unstructured(b[['A','B','C']]) 

I first have to use repack_fields to make an object-less copy:

rf.structured_to_unstructured(rf.repack_fields(b[['A','B','C']])) 
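
As a self-contained sketch of that last step (the array is rebuilt here with fresh values, so the numbers differ from the session output above):

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.zeros(3, dtype=[('A', 'i8'), ('B', 'i8'), ('C', 'f8'), ('test', object)])
b['A'] = [0, 1, 2]
b['B'] = [0, 1, 2]
b['C'] = [0.0, 1.0, 2.0]

# Select the numeric fields, then repack into a contiguous copy
# whose dtype no longer carries the object field's space.
numeric = rf.repack_fields(b[['A', 'B', 'C']])

# The packed copy converts to a plain 2-D array.
flat = rf.structured_to_unstructured(numeric)
```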
hpaulj
  • Thanks for your answer. Could you point out how your solution is different from my initial one? The recent changes to numpy are exactly why I have to ask this question. Would you mind addressing the other questions I have asked? So far, the only information new to me is that the `recfunctions` are old and non-standard. – Samufi Apr 17 '19 at 01:02
  • May not be any real difference between your solution and mine. I just like to think through the problem myself. – hpaulj Apr 17 '19 at 01:27
  • Thanks for updating your answer. I know where the error is raised; I put the error message in my question. (I really appreciate your efforts, but have you actually read the question?) Would the resulting array behave differently than standard arrays (due to the mentioned "view" issue)? – Samufi Apr 17 '19 at 01:33
  • Are you seeking some deeper answer? `b.view(b.dtype)` seems to be the only `.view` expression that works. With `a.view(...)` I can change field names. I can't do that with `b`. It raises the same `_internal` error. The `views` created by field indexing of `b` are a different matter. – hpaulj Apr 17 '19 at 01:44
  • Yes, a deeper answer would be great! This is an interesting observation you are mentioning. Is that because `b` contains a column of type `object`? – Samufi Apr 17 '19 at 01:47
  • I just added a demonstration of how trying to do a view change of a simple object dtype produces the same error. – hpaulj Apr 17 '19 at 04:20
  • Your use of `b[list(a.dtype.names)] = a` depends on the latest version, where a multifield index produces a view. Previous versions made a copy. I iterated on field names, which worked in the earlier code. See `rf.recursive_fill_fields(a,b)` – hpaulj Apr 17 '19 at 15:21