
I have a numpy structured array a and create a view b on it:

import numpy as np
a = np.zeros(3, dtype={'names':['A','B','C'], 'formats':['int','int','float']})
b = a[['A', 'C']]

The descr attribute of the dtype of b indicates that the data are stored in a "scattered" way:

>>> b.dtype.descr
[('A', '<i4'), ('', '|V4'), ('C', '<f8')]

(After reading the documentation, I believe the component ('', '|V4') indicates a "gap" in the data, since b is just a view on a.)
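To illustrate: b indeed shares memory with a, and writes through b show up in a (checked with np.shares_memory):

```python
import numpy as np

a = np.zeros(3, dtype={'names': ['A', 'B', 'C'], 'formats': ['int', 'int', 'float']})
b = a[['A', 'C']]

b['A'] = 1                     # writing through the view...
print(a['A'])                  # ...changes a: [1 1 1]
print(np.shares_memory(a, b))  # True
```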

If this bothers me, I can copy the data:

import numpy.lib.recfunctions as rf
c = rf.repack_fields(b)

and

>>> c.dtype.descr
[('A', '<i4'), ('C', '<f8')]

as desired.

This step requires me to copy the data. Now, sometimes I would like to apply an operation to the view. Often, these operations would return a copy of the array anyway. For example,

d = np.concatenate((b,b))

returns a copy of the data in b (and hence of parts of a). Nonetheless,

>>> d.dtype.descr
[('A', '<i4'), ('', '|V4'), ('C', '<f8')]

indicates that the data are not stored efficiently.

So is there a way to work with views without producing "scattered" results? Would I always have to create a copy beforehand? Or is there no efficiency issue at all, just a weird way in which descr describes the data type? (If so, how can I avoid that?)

This question becomes particularly relevant if I want to skip intermediate steps:

d = np.concatenate((a[['A', 'C']], a[['A', 'C']]))
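One workaround I can think of is to repack the result afterwards, at the cost of yet another copy (a sketch; repack_fields also accepts the concatenated array directly):

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.zeros(3, dtype={'names': ['A', 'B', 'C'], 'formats': ['int', 'int', 'float']})
# repack after concatenating: the result has no gap fields, but is copied twice
d = rf.repack_fields(np.concatenate((a[['A', 'C']], a[['A', 'C']])))
print(d.dtype.names)  # ('A', 'C')
```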

I am working with numpy 1.16 and python 3.7.


2 Answers


Multifield indexing has been in a state of flux for some time now. With 1.16 they seem to have settled on this 'offset' form of 'views', requiring an explicit repacking if you want a 'clean' copy.

In [231]: np.__version__                                                             
Out[231]: '1.16.1'
In [232]: a.dtype                                                                    
Out[232]: dtype([('A', '<i8'), ('B', '<i8'), ('C', '<f8')])
In [233]: a[['A','C']].dtype                                                         
Out[233]: dtype({'names':['A','C'], 'formats':['<i8','<f8'], 'offsets':[0,16], 'itemsize':24})

In this view, the values for 'B' are still present (at offset 8). Think of the databuffer as having:

[a0, b0, c0, a1, b1, c1, a2, b2, c2, ....]

The [233] 'view' looks at the same databuffer, but only gives us access to the A and C fields. repack_fields creates a new databuffer with:

[a0, c0, a1, c1, ....]

If a had been a regular (n,3) array, a[:, [0,2]] would be a copy. We could not skip a[:,1] and still have a view.
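We can verify the copy-versus-view distinction with np.shares_memory (a quick sketch):

```python
import numpy as np

a  = np.zeros(3, dtype=[('A', 'i8'), ('B', 'i8'), ('C', 'f8')])
a2 = np.arange(9.).reshape(3, 3)

print(np.shares_memory(a2, a2[:, [0, 2]]))  # False: fancy indexing must copy
print(np.shares_memory(a, a[['A', 'C']]))   # True: multifield indexing is a view
```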

In [234]: np.concatenate((a[['A','C']],a[['A','C']]))                                
Out[234]: 
array([(0, 0.), (1, 1.), (2, 2.), (0, 0.), (1, 1.), (2, 2.)],
      dtype={'names':['A','C'], 'formats':['<i8','<f8'], 'offsets':[0,16], 'itemsize':24})

Playing around with the view I find that the field at offset 8 (the 'B' field in a) still exists, but is uninitialized (as in an np.empty array).

Different ways of displaying this 'scattered' dtype (a1 here is the concatenated array from [234]):

In [238]: a1.dtype                                                                   
Out[238]: dtype({'names':['A','C'], 'formats':['<i8','<f8'], 'offsets':[0,16], 'itemsize':24})

In [239]: a1.dtype.descr                                                             
Out[239]: [('A', '<i8'), ('', '|V8'), ('C', '<f8')]

In [241]: a1.dtype.fields                                                            
Out[241]: mappingproxy({'A': (dtype('int64'), 0), 'C': (dtype('float64'), 16)})
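If you just want to detect whether a dtype contains such gaps, comparing itemsize with the sum of the field sizes works; has_padding is a hypothetical helper, not a numpy function:

```python
import numpy as np

def has_padding(dt):
    # True if the struct's itemsize exceeds the sum of its fields' sizes,
    # i.e. the layout contains gaps (it does not detect mere reordering)
    return dt.itemsize != sum(dt[name].itemsize for name in dt.names)

view_dt = np.dtype({'names': ['A', 'C'], 'formats': ['<i8', '<f8'],
                    'offsets': [0, 16], 'itemsize': 24})
print(has_padding(view_dt))                               # True
print(has_padding(np.dtype([('A', 'i8'), ('C', 'f8')])))  # False
```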

I can reorder the fields as well:

In [248]: a[['B','C','A']].dtype                                                     
Out[248]: dtype({'names':['B','C','A'], 'formats':['<i8','<f8','<i8'], 'offsets':[8,16,0], 'itemsize':24})
In [249]: a[['B','C','A']].dtype.descr                                               
...
ValueError: dtype.descr is not defined for types with overlapping or out-of-order fields
  • Thanks, you are describing well what I have sloppily called "scattered data", and you show nicely that the issue actually exists (which was one of my questions). Do you also have an idea how I could avoid ending up with such a "scattered" data type (without copying everything again)? If there is no built-in way to do that, that is an answer as well. Then I would be interested in the most elegant workaround. – Samufi Apr 17 '19 at 03:29
  • The whole point with the 1.16 changes is that multifield indexing produces a view, not a copy. The repack function is provided if you want the older copy behavior. The field copy that you did in the other question utilizes the new view behavior. – hpaulj Apr 17 '19 at 17:41

For concatenate only, you can simply do:

a     = np.array([(1,2,3),(4,5,6)], 'f,f,f')
view  = a[['f0','f2']]

b     = np.empty(4, 'f,f')
b[:2] = view
b[2:] = view

print(b)

output:

array([(1., 3.), (4., 6.), (1., 3.), (4., 6.)],
      dtype=[('f0', '<f4'), ('f1', '<f4')])
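The same preallocate-and-assign idea could be generalized; packed_concatenate below is a hypothetical helper (not part of numpy) that builds a packed dtype with repack_fields and fills it view by view:

```python
import numpy as np
import numpy.lib.recfunctions as rf

def packed_concatenate(views):
    # hypothetical helper: concatenate structured views into one packed array
    out = np.empty(sum(len(v) for v in views),
                   dtype=rf.repack_fields(views[0].dtype))
    i = 0
    for v in views:
        out[i:i + len(v)] = v   # fields are matched by position and cast
        i += len(v)
    return out

a    = np.array([(1, 2, 3), (4, 5, 6)], 'f,f,f')
view = a[['f0', 'f2']]
print(packed_concatenate([view, view]).dtype.descr)  # no '|V' gap fields
```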

EDIT: Forget what I said about np.add; it isn't supposed to work anyway.

  • Thanks for your answer. I would indeed be interested in a more general answer, though. I am aware that I can build workarounds. I just thought that the numpy developers have thought of this problem and introduced a nice way to deal with it... – Samufi Apr 17 '19 at 17:46