
The main reason I am asking this question is that I do not know exactly how structured arrays work compared to normal arrays, and I could not find suitable examples online for my case. Furthermore, I am probably filling my structured array wrongly in the first place.

So, here I want to present the 'normal' numpy-array version (and what I need to do with it) and the new 'structured' array version. My (largest) datasets contain around 200e6 objects/rows with up to 40-50 properties/columns. They all have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. These are ID numbers or flags, and I have to keep them with the rest of the data because I identify my objects with them.

data set name:

data_array: ndarray shape: (42648, 10)

data type:

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), 
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), 
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]

Reading data from .hdf5-file format to array

Most of the data is stored in hdf5 files (2000 of them correspond to one snapshot I have to process at once), which should be read into a single array:

import numpy as np
import h5py as hdf5

mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows     = 200000                               # approximated
nr_files    = 100                                  # up to 2200
nr_entries  = 10                                   # up to 50   
size        = 0
size_before = 0
new_size    = 0

# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)

i=0
while i<nr_files:
    size_before=new_size

    f = hdf5.File(path, "r")        # 'path' points to the i-th file
    size=f[mydict['name0']].size

    new_size+=size                

    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array: 
        data_array[size_before:new_size, a] = f[name] 
        # structured array:
        data_array[name][size_before:new_size] = f[name]                 
        a+=1                
    f.close()
    i+=1
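To check the fill pattern itself without the h5py part, here is a minimal, self-contained sketch; the two fake 'files' are plain dicts standing in for hdf5 datasets, and all names and values are made up:

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8')]
names = [n for n, _ in dt]

# two fake "files": dicts mapping dataset name -> 1d array
fake_files = [
    {'haloid': np.array([1, 2], 'u8'), 'hostid': np.array([1, 1], 'u8'),
     'type': np.array([0, 1], 'i1'),   'mstar': np.array([1e8, 2e8])},
    {'haloid': np.array([3], 'u8'),    'hostid': np.array([3], 'u8'),
     'type': np.array([0], 'i1'),      'mstar': np.array([3e8])},
]

data = np.zeros(4, dtype=dt)          # preallocate, oversized
new_size = 0
for f in fake_files:
    size_before, new_size = new_size, new_size + f[names[0]].size
    for name in names:                # field-by-field copy, as in the loop above
        data[name][size_before:new_size] = f[name]

data = data[:new_size]                # trim the unused tail
print(data['haloid'])                 # -> [1 2 3]
```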

EDIT: I edited the code above because hpaulj fortunately commented the following:

First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.

This was an 'I-simplify-the-code' copy/paste mistake, and I have corrected it!

Question 1: Is that the right way to fill a structured array?

data_array[name][size_before:new_size] = f[name]

Question 2: How to address a column in a structured array?

data_array[name] #--> column with a certain name

Question 3: How to address an entire row in a structured array?

data_array[0] #--> first row

Question 4: How to address 3 rows and all columns?

# normal array:
print data_array[0:3,:]
[[  1.21080866e+10   1.21080866e+10   0.00000000e+00   5.69363234e+08
    1.28992369e+03   1.28894614e+03   1.32171442e+03  -1.08210000e+02
    4.92900000e+02   6.50400000e+01]
 [  1.21080711e+10   1.21080711e+10   0.00000000e+00   4.76329837e+06
    1.29058079e+03   1.28741361e+03   1.32358059e+03  -4.23130000e+02
    5.08720000e+02  -6.74800000e+01]
 [  1.21080700e+10   1.21080700e+10   0.00000000e+00   2.22978043e+10
    1.28750287e+03   1.28864306e+03   1.32270418e+03  -6.13760000e+02
    2.19530000e+02  -2.28980000e+02]]

# structured array:    
print data_array[0:3]
#it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876, ..., -108.21, 492.9, 65.04)
   (12108071103L, 12108071103L, 0, 0.0, ... more data ...)
   ... more data ...
   (8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, ..., 812.0612163877823, -541.61, 544.44, 421.08)]]

Question 5: Why does data_array[0:3] not simply return the first 3 rows with the 10 columns?

Question 6: How to address the first two elements in the first column?

# normal array:
print data_array[0:2,0]
[  1.21080866e+10   1.21080711e+10]
# structured array:  
print data_array['haloid'][0][0:2]  
[12108086595 12108071103]

OK! I got that!

Question 7: How to address three specific columns by name, and the first 3 rows of those columns?

# normal array: 
print data_array[0:3, [0,2,1]]
[[  1.21080866e+10   0.00000000e+00   1.21080866e+10]
 [  1.21080711e+10   0.00000000e+00   1.21080711e+10]
 [  1.21080700e+10   0.00000000e+00   1.21080700e+10]]

# structured array:  
print data_array[['haloid','type','hostid']][0][0:3]  
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L)
 (12108069992L, 0, 12108069992L)]

OK, the last example seems to work!!!

Question 8: What is the difference between:

(a) data_array['haloid'][0][0:3] and (b) data_array['haloid'][0:3]

where (a) really returns the first three haloids, while (b) returns a lot of haloids (10x3):

[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340
   9248632230 12108066342 10878169355 10077026070]
 [ 6093565531 10077025463  8046772253  7871669276  5558161476  5558161473
  12108068704 12108068708 12108077435 12108066338]
 [ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312
  12108075900  9248643751  6630111058 12108074389]]

Question 9: What is data_array['haloid'][0:3] actually returning?

Question 10: How to mask a structured array with np.where()

# NOTE: col0,1,2 are some integer values of the column I want to address 
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid

# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]

mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]

#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]

mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]

This seems to work, but I am not sure!

Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?

Question 12: How to sort a structured array?

# normal array: 
data_sorted = data[np.argsort(data[:,col2])]
# structured array: 
data_sorted = data[np.argsort(data['mstar'][:,col3])]

Thanks, I appreciate any help or advice!

firefly2517
  • X[0][0:3] returns first three columns from row zero. X[0:3] returns all columns from the first three rows – valentin Mar 09 '17 at 14:12
  • @valentin: Hi, thanks for commenting! I am not sure: when I print data_array['haloid'][0:3] I get a (3,10) result, i.e. all columns, but containing only 'haloid's (no idea where they come from!). Please see the question again, I added the output for data_array['haloid'][0:3]. – firefly2517 Mar 09 '17 at 14:20
  • I've only skimmed your question, but I wonder -why can't the id values be kept separate arrays of the same length? It is just as easy to search a 1d array as a field. Would it help to experiment with a small example, and without the `h5py` part? If I write answer it will be illustrated with a very small example. – hpaulj Mar 09 '17 at 14:58
  • @hpaulj: I hoped that you would see the question, you seem to be the structured-array guru here ;-). So, the combination of haloid and hostid is important. There are cases where object A's hostid is the same as object B's haloid. So they are linked. I have to track down the haloid of B and set e.g. the position x,y,z of A to the position x,y,z of B! But before that I am arg-sorting them, let's say by 'mstar'. So I need to keep track of the haloid, which has to be in the same structure as mstar and the positions. However, it could be that I am imagining the problem to be much bigger than it is. – firefly2517 Mar 09 '17 at 15:12

3 Answers


First point of confusion. You show a dt definition with names like dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),.... But the h5 load is

data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] 

In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.

You can iterate over the fields of an array defined with dt using

for name in arr.dtype.names:
    data[name] = ...

e.g.

In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), 
    ...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), 
    ...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]: 
array([(0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.),
       (0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.),
       (0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.)], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
    ...:     print(name)
    ...:     arr[name] = 1
    ...:     
haloid
hostid
 ....
In [24]: arr
Out[24]: 
array([(1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.),
       (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.),
       (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.)], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0]     # get one record
Out[25]: (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.)
In [26]: arr[0]['hostid']     # get one field, one record
Out[26]: 1
In [27]: arr['hostid']       # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2]    # subset of records
Out[28]: array([1, 1], dtype=uint64)

So filling a structured array by field name should work fine:

arr[name][n1:n2] = file[dataset_name]

Prints like this:

structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]

and

[[ (12108086595L, 12108086595L, 0,

look to me like the structured data_array is actually 2d, created with something like (see Question 8)

data_array = np.zeros((10, nr_rows), dtype=dt)

That's the only way the [0][0:3] indexing would work.

For the 2d array:

mask = np.where(data[:,col2] > data[:,col1])

compares 2 columns. When in doubt, look first at the boolean data[:,col2] > data[:,col1]. where just returns the indices where that boolean array is True.

Simple example of masked indexing:

In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False,  True,  True,  True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]: 
array([[3, 2],
       [4, 1],
       [5, 0]])
In [38]: x[idx,:]
Out[38]: 
array([[[3, 2],
        [4, 1],
        [5, 0]]])

In this structured example, data['x_pos'] selects the field. The [0] is required to select the 1st row of that 2d array (the size-10 dimension). The rest of the comparison and where should work as with a 2d array.

mask = np.where(data['x_pos'][0] > data['y_pos'][0])

mask[:][0] on a where tuple is probably not needed. mask is a tuple, [:] makes a copy, and [0] selects the 1st element, which is an array. Sometimes an arr[idx[0],:] might be needed instead of arr[idx,:], but don't do that routinely.

My first comment suggested separate arrays

 dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
 data_id = np.zeros((n,), dtype=dt1)

 data = np.zeros((n,m), dtype=float)    # m float columns

Or even

 haloid = np.zeros((n,), '<u8')
 hostid = np.zeros((n,), '<u8')
 type = np.zeros((n,), 'i1')

With these arrays, data_array['hostid'][0], data_id['hostid'] and hostid should all return the same 1d array, and be equally usable in mask expressions.
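A quick check of that equivalence on toy data (all values hypothetical): a structured field and a standalone array are both plain 1d uint64 arrays, so masks built from them are identical.

```python
import numpy as np

n = 5
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data_id['hostid'] = [3, 1, 4, 1, 5]

hostid = np.array([3, 1, 4, 1, 5], dtype='<u8')   # the separate-array variant

# masks from the field and from the standalone array are interchangeable
mask1 = data_id['hostid'] > 2
mask2 = hostid > 2
print(np.array_equal(mask1, mask2))   # -> True
print(data_id[mask1]['hostid'])       # -> [3 4 5]
```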

Sometimes it is convenient to keep the ids and data in one structure. That's especially true when writing/reading csv files. But for masked selection it doesn't help much, and for calculations across data fields it can be a pain.

I could also suggest a compound dtype, one with

dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]

In [41]: np.zeros((4,), dtype=dt2)
Out[41]: 
array([(0, 0, 0, [ 0.,  0.,  0.]), (0, 0, 0, [ 0.,  0.,  0.]),
       (0, 0, 0, [ 0.,  0.,  0.]), (0, 0, 0, [ 0.,  0.,  0.])], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
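One advantage of this compound layout: the float columns live in one contiguous 2d sub-array, so whole-row vector math needs no per-column loop. A sketch with a hypothetical m=3 float columns:

```python
import numpy as np

m = 3
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]
arr = np.zeros((4,), dtype=dt2)
arr['data'] = np.arange(12).reshape(4, m)   # fill all float columns at once

# vector math across the float block, one call for all rows
dist = np.sqrt((arr['data']**2).sum(axis=1))
print(dist.shape)                 # -> (4,)
print(arr['data'][:, 0])          # a single "column" is just a 2d slice
```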

Is it better to access the float data by column number or by a name like 'x_coor'? Do you need to do calculations with several float columns at once, or will you always access them individually?

hpaulj
  • Thanks a lot! Amazing, you really saved me, now I start to apply your comments to my programme! – firefly2517 Mar 10 '17 at 10:39
  • Your comment "In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names." This is a typo, it should only say 'name' as you commented. I will edit that in the question! – firefly2517 Mar 10 '17 at 10:42

From your description, I think the naive way is to read only the useful data into arrays with different names (one per type, maybe?). If you want all the data read into one array, maybe Pandas is your choice: http://pandas.pydata.org http://pandas.pydata.org/pandas-docs/stable/ I haven't tried that myself, but have fun giving it a try.
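A minimal sketch of this pandas route (column names and values are made up): a DataFrame holds a different dtype per column naturally, and column access, masking, and sorting mirror the structured-array operations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'haloid': np.array([12108086595, 12108071103], dtype='u8'),
    'type':   np.array([0, 2], dtype='i1'),
    'mstar':  np.array([5.69e8, 4.76e6]),
})

# mask by one column, select another; sort by 'mstar'
print(df.loc[df['type'] == 2, 'haloid'].tolist())   # -> [12108071103]
print(df.sort_values('mstar')['haloid'].tolist())   # -> [12108071103, 12108086595]
```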

frog
  • Hi, thanks for commenting: I am only reading in the data that I need. I have to filter it and store it together in a more handy sub-file. However, the ID number is crucial and I have to store it with the rest of the data ... – firefly2517 Mar 09 '17 at 14:11
  • There is no need to save the ID numbers with the other arrays. Sort orders or slices can be applied to the different arrays, such as: IDs = halo IDs; arrays = your arrays; sorted_a1 = np.argsort(arrays[:,1]); then IDs[sorted_a1] has the same order as arrays[sorted_a1,1]. The same applies to slices from np.where() – frog Mar 09 '17 at 14:25
  • And what about sorting by 'mstar' and then setting 'x_pos' of the objects A to 'x_pos' of object B where 'hostid' from object A is 'haloid' from object B? I need to identify them by the haloid, or do I miss something here? – firefly2517 Mar 09 '17 at 14:37
  • I don't understand this: "'hostid' from object A is 'haloid' from object B". Does hostid A have the same value as haloid B? Or is hostid A a pointer, pointing to haloid B? – frog Mar 09 '17 at 15:57

ANSWER TO QUESTION 11:

Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?

If I resize my array like this, I add 10 more columns to each field in dt. So I get the 'weird' result of Question 8b: a (10x3) structure of haloids.

The right way to trim my array, because I want to keep only the filled part of it (I designed the array to be big enough to contain a varying number of data blocks that I read in subsequently ...), is:

data_array = data_array[:newsize]

print np.info(data_array)

class:  ndarray
shape:  (42648,)
strides:  (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'), 
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), 
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
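A small demonstration of the difference on toy sizes (field names hypothetical): np.resize on a 1d structured array with a 2d target shape recycles records into a 2d structured array, while slicing just drops the unused tail.

```python
import numpy as np

dt = [('haloid', '<u8'), ('mstar', '<f8')]
data = np.zeros((5,), dtype=dt)
data['haloid'] = [1, 2, 3, 0, 0]      # only 3 rows actually filled

# np.resize to a 2d shape: each field becomes 2d and records get recycled
resized = np.resize(data, (3, 2))
print(resized['haloid'].shape)        # -> (3, 2)

# slicing keeps the 1d layout and simply trims
trimmed = data[:3]
print(trimmed.shape)                  # -> (3,)
print(trimmed['haloid'])              # -> [1 2 3]
```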
firefly2517