
My dataset consists of many millions of rows of structured arrays with a couple of different schemas. Some schemas have multiple nested columns, but none are deeply nested (i.e. nesting only occurs at the very first level of the schema).

The data is in this format because that is how a native Python extension provides it.

I'm trying to convert these arrays to pandas DataFrames. The schema of each array has 10+ fields, so manually typing out each field name (as in the example below) doesn't scale, especially since the arrays have different schemas.

An example:

import numpy as np
import pandas as pd

data = np.array(
    [((1, 2, 3), True), ((4, 5, 6), False)],
    dtype=[("nums", ("u4", 3)), ("v", "?")],
)

print(repr(data))
# array([([1, 2, 3],  True), ([4, 5, 6], False)],
#       dtype=[('nums', '<u4', (3,)), ('v', '?')])

df = pd.DataFrame(data)

This is the error I get from pandas:

ValueError: Data must be 1-dimensional
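
The nested "nums" field seems to be the culprit; on its own it is two-dimensional, and pandas only accepts 1-D columns:

print(data["nums"].shape)  # (2, 3): a 2-D column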

Here is a manual way to do it:

In [14]: pd.DataFrame({"nums0": data["nums"][:, 0], "nums1": data["nums"][:, 1], "nums2": data["nums"][:, 2], "v": data["v"]})
Out[14]:
   nums0  nums1  nums2      v
0      1      2      3   True
1      4      5      6  False

How can I flatten the data array so that it can be used in a pandas DataFrame? Is there a canonical way to do this?

Note: The reason I don't simply generalize the solution above is that the real dataset is a bit more complicated than this simple example; it also involves field offsets.

ofo

4 Answers


This'll get you what you need:

df = pd.DataFrame.from_records(data.tolist(), columns=data.dtype.names)
# expand the "nums" tuples into their own columns, then re-attach "v"
df = pd.concat([df.nums.apply(pd.Series), df.v], axis=1)
# apply(pd.Series) yields integer column labels; rename them to nums0, nums1, ...
df.columns = [f"nums{i}" if isinstance(i, int) else i for i in df.columns]

   nums0  nums1  nums2      v
0      1      2      3   True
1      4      5      6  False

I'm sure there's a nicer way to do this, including a better method for column names, but alas this is what I was able to come up with.
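
For instance, something like this (a sketch that reads the column layout straight from the dtype, assuming only one level of nesting as in the question) avoids both the apply and the renaming step:

import pandas as pd

# reusing `data` from the question
cols = {}
for name in data.dtype.names:
    field = data[name]
    if field.ndim > 1:  # subarray field, e.g. shape (n_rows, 3)
        for i in range(field.shape[1]):
            cols[f"{name}{i}"] = field[:, i]
    else:
        cols[name] = field

df = pd.DataFrame(cols)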

ddejohn

A bit of help from NumPy's recfunctions:

from numpy.lib import recfunctions as rfn

flat = rfn.structured_to_unstructured(data)

fields = data.dtype.fields

arr_cols = []
dtypes = {}
for key, value in fields.items():
    content = value[0]
    # check for a nested data type
    # thankfully it is just one layer deep
    if content.subdtype:
        dtype, counts = content.subdtype
        cols = [f"{key}_{num}" for num in range(counts[0])]
        arr_cols.extend(cols)
        d_types = {col: dtype for col in cols}
        dtypes.update(d_types)
    else:
        arr_cols.append(key)
        dtypes[key] = content

pd.DataFrame(flat, columns=arr_cols).astype(dtypes)

   nums_0  nums_1  nums_2      v
0       1       2       3   True
1       4       5       6  False
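
Note that structured_to_unstructured casts every field to a common dtype before stacking, which is why the trailing astype is needed to recover the boolean column:

print(flat.dtype)  # uint32: the common type chosen for 'u4' and '?'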
sammywemmy

If all records in your dataset have the same structure, you can build a separate DataFrame per field and then concatenate them:

pd.concat([pd.DataFrame([e[i] for e in data]) for i in range(len(data[0]))], axis=1)

   0  1  2      0
0  1  2  3   True
1  4  5  6  False

In the end, you can change column names as you wish.
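
For example (a sketch; the names are hard-coded to match the manual version in the question):

flat = pd.concat([pd.DataFrame([e[i] for e in data]) for i in range(len(data[0]))], axis=1)
flat.columns = ["nums0", "nums1", "nums2", "v"]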

Allen Qin

I ended up with the following solution, based on @sammywemmy's answer, with the difference that it doesn't create an intermediate copy and also accounts for field offsets:

import numpy as np

def flatten_recarray(arr_like) -> np.ndarray:
    # asarray instead of array: avoids copying when the input is already an ndarray
    arr = np.asarray(arr_like)

    dtypes = {
        "names": [],
        "formats": [],
        "offsets": [],
        "itemsize": arr.itemsize,
    }

    def append(name: str, format: str, offset: int):
        dtypes["names"].append(name)
        dtypes["formats"].append(format)
        dtypes["offsets"].append(offset)

    for name, (dtype, offset) in arr.dtype.fields.items():
        if dtype.subdtype is None:
            # plain field: keep its dtype and offset as-is
            append(name, dtype, offset)
        else:
            # subarray field (one level deep): emit one numbered field per
            # element, spacing the offsets by the element itemsize
            dtype, (size,) = dtype.subdtype
            for i in range(size):
                append(f"{name}_{i}", dtype.str, offset + dtype.itemsize * i)

    # a view reinterprets the same buffer with the flat dtype: no copy is made
    return arr.view(dtypes)  # type: ignore

Example:

import pandas as pd

dt = np.dtype(
    [
        ("integer", "i4"),
        ("tuple2", ("i8", 2)),
        ("tuple3", ("i4", 3)),
        ("boolean", "?"),
    ],
    align=True,
)

data = np.array(
    [
        (1, [2, 3], [4, 5, 6], True),
        (7, [8, 9], [10, 11, 12], False),
    ],
    dtype=dt,
)

flattened = flatten_recarray(data)
df = pd.DataFrame(flattened)

print(data)
print(flattened)
print(df)

Output:

[(1, [2, 3], [ 4,  5,  6],  True) (7, [8, 9], [10, 11, 12], False)]
[(1, 2, 3,  4,  5,  6,  True) (7, 8, 9, 10, 11, 12, False)]
   integer  tuple2_0  tuple2_1  tuple3_0  tuple3_1  tuple3_2  boolean
0        1         2         3         4         5         6     True
1        7         8         9        10        11        12    False
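
As a quick sanity check, np.shares_memory confirms that the result is a view over the original buffer rather than a copy:

print(np.shares_memory(data, flattened))  # True: nothing was copied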
ofo