Python Numpy: Structured Arrays vs Same Datatype Array Operation Cost

Question

I want to create an array of arrays of the structure:

[line_number,count,temperature,humidity,sensor1_on,sensor2_on]

Where the first two need to be uint32, while temperature and humidity can be uint8, and the sensor_ons can be of type bool.

I later need to sort the 2d array based on the combination of line_number and then count. I also need to perform averages and other statistical computation on lists of all the temperature and humidity data (separately).

I found structured arrays which are convenient for data storage and retrieval:

import numpy as np

np_data = np.zeros([num_lines],
                   dtype='uint32,'  # line number
                         'uint32,'  # count
                         'uint8,'   # temperature
                         'uint8,'   # humidity
                         'bool,'    # sensor 1 on
                         'bool'     # sensor 2 on
                   )

I am weighing this against a plain array with a single dtype:

np_data=np.zeros([num_lines,5],dtype='uint32') 
# I would pack my bools into the last uint32 and then unpack later 
# but it seems like a waste of space
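
For reference, the bool packing mentioned in the comment could be sketched as follows. The row count, the flag values, and the bit layout (bit 0 for sensor1_on, bit 1 for sensor2_on) are assumptions made for illustration, not part of the original code.

```python
import numpy as np

num_lines = 4  # hypothetical size, for illustration only
np_data = np.zeros([num_lines, 5], dtype='uint32')

# Hypothetical flag values for the two sensors.
s1_on = np.array([True, False, True, False])
s2_on = np.array([True, True, False, False])

# Pack: bit 0 holds sensor1_on, bit 1 holds sensor2_on (assumed layout).
np_data[:, 4] = s1_on.astype(np.uint32) | (s2_on.astype(np.uint32) << 1)

# Unpack with shifts and bit masks.
s1_back = (np_data[:, 4] & 1).astype(bool)
s2_back = ((np_data[:, 4] >> 1) & 1).astype(bool)

# Bytes per row: 5 uint32 columns versus the packed structured dtype.
plain_row_bytes = np_data.shape[1] * np_data.itemsize
struct_row_bytes = np.dtype('uint32,uint32,uint8,uint8,bool,bool').itemsize
print(plain_row_bytes, struct_row_bytes)
```

With the default (unaligned) layout the structured dtype stores each row more compactly than five uint32 columns, which is the space concern raised in the comment.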

Do I lose anything (NumPy performance, vectorized processing, sorting speed, etc.) by using the structured array instead of the array with a single data type? Is there another solution you would recommend?

I think you just need to do some timings on realistic data. We can make guesses from experience, but they'll be just that - guesses. — hpaulj, Nov 15 '18 at 00:34

kcw78 · Answer 1 · 2018-11-15T16:46:53.983

I did some performance testing on several array types. My test results are posted as an answer in this topic:
is ndarray faster than recarray access?
(Ignore the downvote on my question. Apparently someone didn't like how I asked it.)

The short version: extracting data from a masked array was much slower than the same operation on an ndarray. Access to a structured array or a recarray was also slower than access to a plain ndarray, but every case finished in a fraction of a second. Masked arrays clearly carry overhead (maybe record arrays have something similar?). There is a good discussion of the differences between array types here:
numpy-discussion:structured-arrays-recarrays-and-record-arrays
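
A minimal way to time this on your own data might look like the sketch below. The field names, the row count, and the random temperatures are assumptions made for illustration. The numbers depend on the machine and the NumPy version, so none are quoted here.

```python
import timeit
import numpy as np

n = 100_000  # hypothetical row count
rng = np.random.default_rng(0)

# Structured layout with assumed field names.
dt = np.dtype([('line_number', 'u4'), ('count', 'u4'),
               ('temperature', 'u1'), ('humidity', 'u1'),
               ('sensor1_on', '?'), ('sensor2_on', '?')])
structured = np.zeros(n, dtype=dt)
structured['temperature'] = rng.integers(0, 100, n)

# Homogeneous layout holding the same temperatures in column 2.
plain = np.zeros((n, 5), dtype='uint32')
plain[:, 2] = structured['temperature']

# Time the same statistic on both layouts.
t_struct = timeit.timeit(lambda: structured['temperature'].mean(), number=100)
t_plain = timeit.timeit(lambda: plain[:, 2].mean(), number=100)
print(f"structured field mean: {t_struct:.4f} s, plain column mean: {t_plain:.4f} s")
```

Swapping the lambdas for the operations you actually need (sorting, slicing, other statistics) gives a like-for-like comparison on realistic data.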

There are other limitations. For example, many (most/all) of the NumPy matrix and math operations require a single homogeneous data type, so they cannot be applied to a whole structured array, only to one field at a time. I don't think this matters in your case, since you are using the structured array like a table.
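
As a sketch of the table-style use described in the question, the example below sorts by line_number and then count, and computes statistics on a single field. The field names and the three sample rows are assumptions made for illustration.

```python
import numpy as np

# Assumed field names; the question's dtype string would produce f0..f5 instead.
dt = np.dtype([('line_number', 'u4'), ('count', 'u4'),
               ('temperature', 'u1'), ('humidity', 'u1'),
               ('sensor1_on', '?'), ('sensor2_on', '?')])

# Hypothetical sample rows.
data = np.array([(2, 1, 20, 40, True, False),
                 (1, 2, 22, 42, False, True),
                 (1, 1, 21, 41, True, True)], dtype=dt)

# Sort by line_number first, then by count.
sorted_data = np.sort(data, order=['line_number', 'count'])

# Each field is an ordinary 1-D array, so the usual reductions apply.
mean_temp = data['temperature'].mean()
std_humid = data['humidity'].std()

# Equivalent sort for a homogeneous 2-D array: lexsort treats the LAST key as primary.
plain = np.array([[2, 1, 20, 40], [1, 2, 22, 42], [1, 1, 21, 41]], dtype='uint32')
plain_sorted = plain[np.lexsort((plain[:, 1], plain[:, 0]))]
```

Both layouts support the two-key sort. The structured version names its keys, while the plain version relies on column positions.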
