The other answers taught me a lot. I encountered a new situation where I wanted to easily be able to get each column labeled with all the structs it came from. i.e. for
pl.col("my").struct.field("test").struct.field("thing")
I wanted to recover
my.test.thing
as a string which I could easily use when reading a subset of columns with pyarrow via
pq.ParquetDataset(path).read(columns = ["my.test.thing"])
Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth first search on the schema, extract the columns in that pyarrow friendly format, then I can use those to select each column unnested all in one go.
First, I worked with the pyarrow schema because I couldn't figure out how to drill into the structs in the polars schema:
schema = df.to_arrow().schema
navigating structs in that schema is quirky, at the top level the structure behaves differently from deeper in. I ended up writing two functions, the first to navigate the top level structure and the second to continue the search below:
def schema_top_level_DFS(pa_schema):
top_level_stack = list(range(len(pa_schema)))
while top_level_stack:
working_top_level_index = top_level_stack.pop()
working_element_name = pa_schema.names[working_top_level_index]
if type(pa_schema.types[working_top_level_index]) == pa.lib.StructType:
second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
while second_level_stack:
working_second_level_index = second_level_stack.pop()
schema_DFS(pa_schema.types[working_top_level_index][working_second_level_index],working_element_name)
else:
column_paths.append(working_element_name)
def schema_DFS(incoming_element,upstream_names):
current_name = incoming_element.name
combined_names = ".".join([upstream_names,current_name])
if type(incoming_element.type) == pa.lib.StructType:
stack = list(range(len(incoming_element.type)))
while stack:
working_index = stack.pop()
working_element = incoming_element.type[working_index]
schema_DFS(working_element,combined_names)
else:
column_paths.append(combined_names)
So that running
column_paths = []
schema_top_level_DFS(schema)
gives me column paths like
['struct_name_1.inner_struct_name_2.thing1','struct_name_1.inner_struct_name_2.thing2]
to actually do the unnesting, I wasn't sure how to do better than a function with a case statement:
def return_pl_formatting(col_string):
col_list = col_string.split(".")
match len(col_list):
case 1:
return pl.col(col_list[0]).alias(col_string)
case 2:
return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
case 3:
return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
case 4:
return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
case 5:
return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
case 6:
return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)
Then get my unnested and nicely named df with:
df.select([return_pl_formatting(x) for x in column_paths])
To show the output on the example from @Dean MacGregor
test = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select([return_pl_formatting(x) for x in column_paths]))
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5 ┆ 4 ┆ 3 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ 3 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 6 ┆ 5 ┆ 4 ┆ 3 │
└─────┴───────────┴───────────┴───────────┴───────────┘