I'm trying to port a pandas script to polars. I have a dataset that looks like this:
sid,roi,endpoint,value,std,voxel_count
4213-a3_bl,AF_L,afd_along,0.40,0.21,57334
4213-a3_bl,AF_L,radfODF,0.08,0.045,57334
4213-a3_bl,AF_R,afd_along,0.42,0.22,53916
4213-a3_bl,AF_R,radfODF,0.08,0.04,53916
4213-a3_bl,CC_1,afd_along,,,
4213-a3_bl,CC_1,radfODF,,,
4213-a3_bl,CC_2a,afd_along,0.54,0.30,3264
4225-a3_bl,CC_2a,radfODF,0.06,0.04,3264
4225-a3_bl,CC_2b,afd_along,0.47,0.24,18833
... thousands of rows ...
I want to add a column based on a groupby:
df.filter(pl.col('roi') == 'wm_mask').groupby('sid').first()
roi endpoint value std voxel_count
sid
4213-a3_bl wm_mask ad 0.001074 0.000237 602620
4225-a3_bl wm_mask ad 0.001071 0.000242 718758
4229-a3_bl wm_mask ad 0.001045 0.000243 579756
4473-a3_bl wm_mask ad 0.001059 0.000259 662894
4654-a3_bl wm_mask ad 0.001083 0.000234 562841
... ... ... ... ... ...
Now I want to add these voxel_count values as a new column, each matched to the right sid, which should give something like:
sid,roi,endpoint,value,std,voxel_count, wm_mask__count
4213-a3_bl,AF_L,afd_along,0.40,0.21,57334, 602620
4213-a3_bl,AF_L,radfODF,0.08,0.045,57334, 602620
4213-a3_bl,AF_R,afd_along,0.42,0.22,53916, 602620
4213-a3_bl,AF_R,radfODF,0.08,0.04,53916, 602620
4213-a3_bl,CC_1,afd_along,,,, 602620
4213-a3_bl,CC_1,radfODF,,,, 602620
4213-a3_bl,CC_2a,afd_along,0.54,0.30,3264, 602620
4225-a3_bl,CC_2a,radfODF,0.06,0.04,3264, 718758
4225-a3_bl,CC_2b,afd_along,0.47,0.24,18833, 718758
... thousands of rows ...
I tried various things but I always end up with AttributeError: _s. Can you please tell me how to express that in polars?
If it helps, the associated pandas lines are:
df = df.set_index("sid", drop=True)
df_wm_volumes = df[df.roi == "wm_mask"].groupby("sid", as_index=True).first()
df["wm_mask__volume"] = df_wm_volumes["voxel_count"]
df = df.reset_index(drop=False)