2

I would like to add multiple DataFrames with different shapes together.

Before adding the DataFrames, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).

Here is an example of the inputs:

import polars as pl

a = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)

b = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)

c = pl.DataFrame(
    data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)

The expected result would be:

shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

The order of the columns is not a concern.

Here is a solution but it seems a little bit clunky:

from functools import reduce

columns = set()

for df in [a, b, c]:
    for column in df.columns:
        columns.add(column)

reshaped_df = []

for df in [a, b, c]:
    for column in columns:
        if column not in df.columns:
            df = df.with_columns(pl.lit(0).alias(column))
            reshaped_df.append(df)

reshaped_df = pl.align_frames(*reshaped_df, on="index", select=columns)

index = reshaped_df[0].select("index").to_series()

result = reduce(
    lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
    reshaped_df).hstack([index])
thoera
  • 23
  • 3

2 Answers2

4

There's also pl.concat(how="diagonal")

pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ null  │
│ 3     ┆ 1     ┆ 2     ┆ null  │
│ 4     ┆ 10    ┆ null  ┆ 1     │
│ 5     ┆ 10    ┆ null  ┆ 1     │
└───────┴───────┴───────┴───────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • Didn't know about the `how="diagonal"` option, nice. – Timus May 14 '23 at 12:05
  • 1
    IMHO **if** `index` should be ordered in the result then you have to add a `.sort(by="index")`: the `maintain_order=True` isn't enough for that? E.g. if the first dataframe `a` lacks the first `index` it will come later in the result? – Timus May 14 '23 at 13:10
  • Nice! It's on a par with my initial solution when I benchmark it with ~10M rows but it's a lot cleaner and it works with a `LazyFrame` too! – thoera May 14 '23 at 13:11
  • @Timus Yes, that's right. It was just to keep the "seen order", not to sort by index. – jqurious May 14 '23 at 14:09
2

I'm not sure this is optimal, but with your dataframes a, b and c you could do

for i, df in enumerate((b, c)):
    mapping = {c: f"{c}_{i}" for c in df.columns if c != "index"}
    a = a.join(df.rename(mapping), on="index", how="outer")
a = a.fill_null(0).select([pl.col("index")] + [
    pl.sum(pl.col(f"^col_{i}.*$")).alias(f"col_{i}") for i in (1, 2, 3)
]).sort(by="index")

to get

┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

So first outer-join the dataframes on index with modified column names. Then sum over the corresponding columns, identified by the start of the column names.

Timus
  • 10,974
  • 5
  • 14
  • 28