
I have a .npy output that I am trying to parse. It is a nested structured ndarray with a compound dtype. I would like to "unpack" this array and convert it to a nice pandas DataFrame, ideally while preserving the specified dtypes. Here is a chunk of the imported data:

array([([(0,    11371952, [nan, nan, nan], [nan, nan, nan], 0, 0, nan, nan, 4096, -3.05175781e-05, nan, [2.51214953e-07, 2.51214953e-07, 2.51214953e-07], 2.23795245e-06, 2.37582674e-06, 0., 0., 0., 0., 0., 0.,    0.        ), (1,    11508720, [nan, nan, nan], [nan, nan, nan], 0, 0, nan, nan, 4096, -3.05175781e-05, nan, [2.51214953e-07, 2.51214953e-07, 2.51214953e-07], 2.23795245e-06, 2.37582674e-06, 0., 0., 0., 0., 0., 0.,    0.        )], 0,  0, 2.87718000e-01,      0, False, False, 255, 0),
      ...
       ([(0, 50474899154, [nan, nan, nan], [nan, nan, nan], 0, 0, nan, nan, 4096, -3.05175781e-05, nan, [2.51214953e-07, 2.51214953e-07, 2.51214953e-07], 2.63046954e-06, 2.51181307e-06, 0., 0., 0., 0., 0., 0.,    0.        ), (1, 50475035925, [nan, nan, nan], [nan, nan, nan], 0, 0, nan, nan, 4096, -3.05175781e-05, nan, [2.51214953e-07, 2.51214953e-07, 2.51214953e-07], 2.63046954e-06, 2.51181307e-06, 0., 0., 0., 0., 0., 0.,  500.06250781)], 0, 11, 1.26187590e+03, 133155, False, False, 255, 0)],
      dtype=[('itr', [('itr', '<i4'), ('tic', '<u8'), ('loc', '<f8', (3,)), ('lnc', '<f8', (3,)), ('eco', '<i4'), ('ecc', '<i4'), ('efo', '<f8'), ('efc', '<f8'), ('sta', '<i4'), ('cfr', '<f8'), ('dcr', '<f8'), ('ext', '<f8', (3,)), ('gvy', '<f8'), ('gvx', '<f8'), ('eoy', '<f8'), ('eox', '<f8'), ('dmz', '<f8'), ('lcy', '<f8'), ('lcx', '<f8'), ('lcz', '<f8'), ('fbg', '<f8')], (2,)), ('sqi', '<u4'), ('gri', '<u4'), ('tim', '<f8'), ('tid', '<i4'), ('vld', '?'), ('act', '?'), ('dos', '<i4'), ('sky', '<i4')])

The output should be a DataFrame with 62 columns: itr, tic, locX, locY, locZ, lncX, lncY, lncZ, eco, ecc, and so on.

Any help would be greatly appreciated. Thank you!

The closest I've gotten is using the following to "unwrap" the array:

from numpy.lib import recfunctions as rfn
unstdata = rfn.structured_to_unstructured(data) #dtype=data.dtype

But then the output does not preserve the dtypes; for example, unstdata[-1] comes back as all float64. (The commented-out argument above does not work either: it applies the entire compound dtype to each entry.)

array([ 0.00000000e+00,  5.04748992e+10,             nan,             nan,
                   nan,             nan,             nan,             nan,
        0.00000000e+00,  0.00000000e+00,             nan,             nan,
        4.09600000e+03, -3.05175781e-05,             nan,  2.51214953e-07,
        2.51214953e-07,  2.51214953e-07,  2.63046954e-06,  2.51181307e-06,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
        5.04750359e+10,             nan,             nan,             nan,
                   nan,             nan,             nan,  0.00000000e+00,
        0.00000000e+00,             nan,             nan,  4.09600000e+03,
       -3.05175781e-05,             nan,  2.51214953e-07,  2.51214953e-07,
        2.51214953e-07,  2.63046954e-06,  2.51181307e-06,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  5.00062508e+02,  0.00000000e+00,  1.10000000e+01,
        1.26187590e+03,  1.33155000e+05,  0.00000000e+00,  0.00000000e+00,
        2.55000000e+02,  0.00000000e+00])

1 Answer


Okay, it seems like I've found a work-around answer to my own question.

First, I copy/paste or type out the column names and dtype tuples:

itr_str = 'itr tic locX locY locZ lncX lncY lncZ eco ecc efo efc sta cfr dcr extX extY extZ gvy gvx eoy eox dmz lcy lcx lcz fbg'
gbl_str = 'sqi gri tim tid vld act dos sky'
itr_dtypes = [('itr', '<i4'), ('tic', '<u8'), ('locX', '<f8'), ('locY', '<f8'), ('locZ', '<f8'), ('lncX', '<f8'), 
              ('lncY', '<f8'), ('lncZ', '<f8'), ('eco', '<i4'), ('ecc', '<i4'), ('efo', '<f8'), ('efc', '<f8'), 
              ('sta', '<i4'), ('cfr', '<f8'), ('dcr', '<f8'), ('extX', '<f8'), ('extY', '<f8'), ('extZ', '<f8'), 
              ('gvy', '<f8'), ('gvx', '<f8'), ('eoy', '<f8'), ('eox', '<f8'), ('dmz', '<f8'), ('lcy', '<f8'), 
              ('lcx', '<f8'), ('lcz', '<f8'), ('fbg', '<f8')]
gbl_dtypes = [('sqi', '<u4'), ('gri', '<u4'), ('tim', '<f8'), ('tid', '<i4'), ('vld', 'bool'), ('act', 'bool'), 
              ('dos', '<i4'), ('sky', '<i4')]
colnames = itr_str.split() * 2 + gbl_str.split()
all_dtypes = itr_dtypes + gbl_dtypes
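
As a quick sanity check against the 62 columns mentioned in the question (27 per-iteration fields times 2 iterations, plus 8 global fields):

assert len(colnames) == 62   # 27 per-iteration fields * 2 iterations + 8 global fields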

Next, I create a dictionary from the (name, dtype) tuples:

d = {}
for v, k in all_dtypes:
    d.update({v:k})
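
Incidentally, since dict() accepts a sequence of (key, value) tuples, the loop can also be written as a one-liner:

d = dict(all_dtypes)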

Lastly, I create the dataframe:

df = pd.DataFrame(unstdata, columns=colnames).astype(d)

If anyone knows a more elegant way of extracting the column names and dtypes directly from the original ndarray, without having to retype or copy/paste everything, I would greatly appreciate it if you'd share. Thank you!
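
One possibility (a sketch only, untested against the real file, and assuming the array has been loaded into data with np.load) would be to walk data.dtype recursively and build the column names and dtypes from it. The 'XYZ' suffixing below is just an assumption to match the locX/locY/locZ naming used above, and 'your_file.npy' is a placeholder file name:

import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

def flatten_dtype(dt):
    """Recursively walk a structured dtype and return (column_name, base_dtype)
    pairs in field order, expanding sub-array fields such as ('loc', '<f8', (3,))
    into locX/locY/locZ."""
    cols = []
    for name in dt.names:
        field = dt[name]                   # dtype of this field (may be a sub-array)
        base, shape = field.base, field.shape
        if base.names is not None:         # nested structured field, e.g. 'itr' with shape (2,)
            reps = int(np.prod(shape)) if shape else 1
            cols.extend(flatten_dtype(base) * reps)
        elif shape:                        # plain sub-array field, e.g. ('loc', '<f8', (3,))
            for i in range(int(np.prod(shape))):
                suffix = 'XYZ'[i] if shape == (3,) else str(i)
                cols.append((name + suffix, base))
        else:                              # scalar field
            cols.append((name, base))
    return cols

data = np.load('your_file.npy')            # placeholder file name
cols = flatten_dtype(data.dtype)           # 62 (name, dtype) pairs for this dtype
colnames = [n for n, _ in cols]            # itr block repeated twice, as in the manual version
unstdata = rfn.structured_to_unstructured(data)
df = pd.DataFrame(unstdata, columns=colnames).astype(dict(cols))

This mirrors the same DataFrame construction as above, just with the names and dtypes derived from data.dtype instead of typed out by hand.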
