I have 2 columns with similar Structs (same field names, field types, etc.).

import polars as pl

nest = pl.DataFrame({
    'a': [{'x': 1, 'y': 10}, {'x': 2, 'y': 20}],
    'b': [{'x': 3, 'y': 30}, {'x': 4, 'y': 40}],
})
print(nest)

shape: (2, 2)
┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╡
│ {1,10}    ┆ {3,30}    │
│ {2,20}    ┆ {4,40}    │
└───────────┴───────────┘

print(nest.schema)
{'a': Struct([Field('x', Int64), Field('y', Int64)]), 
 'b': Struct([Field('x', Int64), Field('y', Int64)])}

I want to unnest both those columns and get a flat data frame, with the fields suffixed to disambiguate them:

shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ x_a ┆ y_a ┆ x_b ┆ y_b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 10  ┆ 3   ┆ 30  │
│ 2   ┆ 20  ┆ 4   ┆ 40  │
└─────┴─────┴─────┴─────┘

I tried:

nest.unnest('a','b')

but (of course) got DuplicateError for the names x and y.

Ideally I'd like something that recursively flattens nested structs and disambiguates the names using the field paths.

Des1303

2 Answers

Maybe relevant: Polars: Unnesting columns algorithmically without a for loop

There are also a few issues on the tracker around prefix/suffix customization:

You can manually loop over the schema:

nest.with_columns(
   [
      pl.col(col).struct.rename_fields(
         [f'{field.name}_{col}' for field in nest.schema[col].fields]
      )
      for col in nest.schema
   ]
).unnest('a', 'b')

shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ x_a ┆ y_a ┆ x_b ┆ y_b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 10  ┆ 3   ┆ 30  │
│ 2   ┆ 20  ┆ 4   ┆ 40  │
└─────┴─────┴─────┴─────┘

But it's not a true, recursive solution.

jqurious

unnest raises an error when it encounters duplicate column names, and there is no built-in way to do what you're asking. I wrote this alternative unnest, which adds prefix/suffix options. You can monkey patch it onto pl.DataFrame so that you can call it directly:

import polars as pl

# Keep a reference to the built-in method (stored only once, so re-running this is safe).
if not hasattr(pl.DataFrame, "_builtin_unnest"):
    pl.DataFrame._builtin_unnest = pl.DataFrame.unnest

def unnest(self, columns, *more_columns, prefix=None, suffix=None, col_prefix=False, col_suffix=False, drop_existing=False):
    if isinstance(columns, str):
        columns = [columns]
    columns = list(columns)
    columns.extend(more_columns)
    # If none of the new parameters are used, fall back to the built-in behavior.
    if not drop_existing and not (prefix or suffix or col_prefix or col_suffix):
        return self._builtin_unnest(columns)
    final_prefix = ""
    final_suffix = ""

    for col in columns:
        # Note: prefix/suffix only take effect together with col_prefix/col_suffix.
        if col_prefix:
            final_prefix = col + "_" + prefix if prefix else col + "_"
        if col_suffix:
            final_suffix = "_" + col + suffix if suffix else "_" + col
        # Read the struct's field names from the schema and build the new names.
        innercols = [field.name for field in self.schema[col].fields]
        newcols = [final_prefix + innercol + final_suffix for innercol in innercols]
        # Rename the fields and drop existing columns that would collide with the new names.
        self = (
            self
                .with_columns(pl.col(col).struct.rename_fields(newcols))
                .drop([drop_col for drop_col in newcols if drop_col in self.columns])
        )
    return self._builtin_unnest(columns)

# Monkey patch: replace the method on the DataFrame class.
pl.DataFrame.unnest = unnest

After patching, a single parameter on this unnest gives the result you want:

nest.unnest('a', 'b', col_suffix=True)

shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ x_a ┆ y_a ┆ x_b ┆ y_b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 10  ┆ 3   ┆ 30  │
│ 2   ┆ 20  ┆ 4   ┆ 40  │
└─────┴─────┴─────┴─────┘
Dean MacGregor