0

Now I have a dataframe like this:

df = pd.DataFrame({"asset":["a","b","c","a","b","c","b","c"],"v":[1,2,3,4,5,6,7,8],"date":["2017","2011","2012","2013","2014","2015","2016","2010"]})

I can calculate the pct_change by groupby and my function like this:

def fun(df):
    df = df.sort_values(by="date")
    df["pct_change"] = df["v"].pct_change()
    return df
df = df.groupby("asset",as_index=False).apply(fun)

Now I want to know how can I get the same result by polars?

1 Answers1

3

Here are two options. One using window functions, and one using groupby + explode.

You should benchmark and see which is faster on your use case.

preparing data

df = pl.DataFrame({
    "asset":["a","b","c","a","b","c","b","c"],
    "v":[1,2,3,4,5,6,7,8],
    "date":["2017","2011","2012","2013","2014","2015","2016","2010"]
})

using window functions


(  
  df.sort(["asset", "date"])
    .with_columns([
       pl.col("v").pct_change().over("asset").alias("pct_change")
    ])
)

using groupby + explode

(df.groupby("asset")
   .agg([
       pl.all().first(),
       pl.col("v").sort_by("date").pct_change().alias("pct_change")
   ]).explode("pct_change")
)

Result

Both output:

shape: (8, 4)
┌───────┬─────┬──────┬────────────┐
│ asset ┆ v   ┆ date ┆ pct_change │
│ ---   ┆ --- ┆ ---  ┆ ---        │
│ str   ┆ i64 ┆ str  ┆ f64        │
╞═══════╪═════╪══════╪════════════╡
│ a     ┆ 4   ┆ 2013 ┆ null       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a     ┆ 1   ┆ 2017 ┆ -0.75      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 2   ┆ 2011 ┆ null       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 5   ┆ 2014 ┆ 1.5        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 7   ┆ 2016 ┆ 0.4        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c     ┆ 8   ┆ 2010 ┆ null       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c     ┆ 3   ┆ 2012 ┆ -0.625     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c     ┆ 6   ┆ 2015 ┆ 1.0        │
└───────┴─────┴──────┴────────────┘
ritchie46
  • 10,405
  • 1
  • 24
  • 43
  • I use your code "groupby + explode",I find the column "date" is not show as you. it's just "2011","2012","2017" etc – user3105812 Jul 19 '22 at 06:13
  • I don't understand what you mean. – ritchie46 Jul 19 '22 at 06:57
  • you can run the code "groupby + explode" ,it's not same as " window functions" – user3105812 Jul 19 '22 at 08:16
  • The output is correct per group. If you want to maintain the order of the groups, you can add `maintain_order=True` to the `groupby`, but this will be slower. – ritchie46 Jul 19 '22 at 08:34
  • For `groupby+explode` , `(df1.sort(['asset','date']).groupby("asset",maintain_order=True) .agg([ pl.col('v').list().alias('v'), pl.col('date').list().alias('date'), pl.col("v").pct_change().alias("pct_change") ]).explode(['v',"date","pct_change"]) )` – Aung Jan 13 '23 at 15:20