I have many millions of rows of structured arrays, with a couple of different schemas, in my dataset. Some schemas have nested columns, but none are deeply nested: nesting occurs only at the first level of the schema.
The data comes in this format because that is how a native Python extension provides it.
I'm trying to convert those arrays to pandas DataFrames. Each array's schema has 10+ fields, so manually typing out every field name (as in the example below) doesn't scale, especially since different arrays have different schemas.
An example:
import numpy as np
import pandas as pd
data = np.array(
    [((1, 2, 3), True), ((4, 5, 6), False)],
    dtype=[("nums", ("u4", 3)), ("v", "?")],
)
print(repr(data))
# array([([1, 2, 3],  True), ([4, 5, 6], False)],
#       dtype=[('nums', '<u4', (3,)), ('v', '?')])
df = pd.DataFrame(data)
This is the error I get from pandas:
ValueError: Data must be 1-dimensional
Here is a manual way to do it:
In [14]: pd.DataFrame({
    ...:     "nums0": data["nums"][:, 0],
    ...:     "nums1": data["nums"][:, 1],
    ...:     "nums2": data["nums"][:, 2],
    ...:     "v": data["v"],
    ...: })
Out[14]:
   nums0  nums1  nums2      v
0      1      2      3   True
1      4      5      6  False
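For reference, the manual approach can be generalized by iterating over the dtype's fields. This is just a rough sketch (the helper name flatten_structured and the name0, name1, ... column-naming scheme are my own choices, and it ignores the field-offset complication mentioned in the note at the end):

import numpy as np
import pandas as pd

def flatten_structured(arr):
    # Expand one level of subarray fields into flat columns.
    # Assumes nesting occurs only at the first level, as in my dataset.
    columns = {}
    for name in arr.dtype.names:
        field = arr[name]
        if field.ndim == 2:  # subarray field, e.g. ('nums', '<u4', (3,))
            for i in range(field.shape[1]):
                columns[f"{name}{i}"] = field[:, i]
        else:
            columns[name] = field
    return pd.DataFrame(columns)

print(flatten_structured(data))
#    nums0  nums1  nums2      v
# 0      1      2      3   True
# 1      4      5      6  False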
How can I flatten the data array so that it can be used in a pandas DataFrame? Is there a canonical way to do this, rather than hand-rolling something like the sketch above?
Note: The reason I don't simply generalize the solution above is that the real dataset is a bit more complicated than this simple example; the dtypes also involve field offsets.
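In case it's relevant to an answer: my understanding is that numpy.lib.recfunctions.repack_fields can copy an array whose dtype has explicit field offsets (padding) into a packed, contiguous layout, after which a flattening approach like the sketch above might apply. A minimal sketch, using a made-up padded dtype for illustration:

import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical dtype with explicit offsets and padding, for illustration only.
padded = np.dtype({"names": ["a", "b"],
                   "formats": ["u1", "<u4"],
                   "offsets": [0, 4],
                   "itemsize": 8})
arr = np.zeros(2, dtype=padded)
packed = rfn.repack_fields(arr)  # copies into a packed dtype without offsets
print(packed.dtype)
# dtype([('a', 'u1'), ('b', '<u4')])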