0

I want to groupby some columns, but one of the columns is either 'Fire' or 'Water'. I need to count the occurences of 'Fire' and 'Water' in seperate columns and also have a 'total' column which counts the sum of 'Fire' and 'Water'

Example:

df = pl.DataFrame({'ID': [0,0,1, 1, 1,0], 'Type': ['Fire', 'Fire', 'Fire', 'Water', 'Water', 'Water'], })
>>>df 
    shape: (6, 2)
┌─────┬───────┐
│ ID  ┆ Tye
│ i64 ┆ str   │
│ 0   ┆ Fire  │
│ 0   ┆ Fire  │
│ 1   ┆ Fire  │
│ 1   ┆ Water │
│ 1   ┆ Water │
│ 0   ┆ Water │

df = df.groupby(['ID', 'Type']).agg(pl.count())
shape: (4, 3)
┌─────┬───────┬───────┐
│ ID  ┆ Type  ┆ count │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ u32   │
╞═════╪═══════╪═══════╡
│ 1   ┆ Fire  ┆ 1     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ Water ┆ 2     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0   ┆ Fire  ┆ 2     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0   ┆ Water ┆ 1     │
└─────┴───────┴───────┘

What i would rather love to have is:

┌─────┬───────┬──────┬───────┐
│ ID  ┆ water ┆ fire ┆ total │
│ --- ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ u32   ┆ u32  ┆ u32   │
╞═════╪═══════╪══════╪═══════╡
│ 1   ┆ 2     ┆ 1    ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0   ┆ 1     ┆ 2    ┆ 3     │
└─────┴───────┴──────┴───────┘

How would i do that?

supersick
  • 261
  • 2
  • 14

1 Answers1

2

We can use the pivot function on a DataFrame to transform the DataFrame from "long format" to "wide format".

df = (
    df
    .pivot(
        values=['Type'],
        index=['ID'],
        columns=['Type'],
        aggregate_fn='count'
    )
)
df

shape: (2, 3)
┌─────┬──────┬───────┐
│ ID  ┆ Fire ┆ Water │
│ --- ┆ ---  ┆ ---   │
│ i64 ┆ u32  ┆ u32   │
╞═════╪══════╪═══════╡
│ 0   ┆ 2    ┆ 1     │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ 1    ┆ 2     │
└─────┴──────┴───────┘

From here, we can create the total row by using the polars.sum expression. When used with a list of expressions, the polars.sum expression will sum horizontally, across rows.

df.with_column(
    pl.sum([col_nm for col_nm
            in df.columns
            if col_nm not in ["ID"]]).alias('total')
)
shape: (2, 4)
┌─────┬──────┬───────┬───────┐
│ ID  ┆ Fire ┆ Water ┆ total │
│ --- ┆ ---  ┆ ---   ┆ ---   │
│ i64 ┆ u32  ┆ u32   ┆ u32   │
╞═════╪══════╪═══════╪═══════╡
│ 0   ┆ 2    ┆ 1     ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ 1    ┆ 2     ┆ 3     │
└─────┴──────┴───────┴───────┘

This Stack Overflow question explains more about how melt and pivot work together to convert between "long format" and "wide format" DataFrames.

One note: pivot only works with DataFrames, not LazyFrames.

  • 1
    I've edited the answer so that you don't necessarily have to list each "type" in the `pl.sum`. Instead, you can simply exclude the `ID` column (or the list of columns used in the `index` keyword of the `pivot` function). Note however, that you'll have to run them as separate steps, so that the `df` variable has the updated column names after the `pivot`. –  Jul 13 '22 at 16:11
  • 1
    Note that a `pivot` will never be added to `LazyFrame` because it is impossible to know the schema before running the query. – ritchie46 Jul 14 '22 at 04:09
  • Hi @cbilot, great advice!! Thank you !!! It didn't work out in the end completely because in my code i have one additional column which is a list of strings. Sadly adding the column which contains a list of str to pivots index param didn't work. Casting the column to str with pl.col().cast(pl.Utf8) also didn't work. I'll try to see if i can make it work today. – supersick Jul 14 '22 at 10:21
  • Edit: I just made it work. I used pl.col('mylist').arr.join(',') and i added it to the pivot index param – supersick Jul 14 '22 at 10:27