I have a Pandas dataframe with thousands of rows that I want to combine into a smaller number of rows.
In the new dataframe, I want:
- Certain columns to be the summation of the grouped rows,
- Some columns to be the average of the grouped rows, and
- Some other columns to be the weighted average of the same column, with the weight factor being another column (from the original dataframe).
My search suggests I might use the agg function to achieve the above (not too sure). But the hardest part is writing the criterion in Pandas that selects how many rows need to be merged. Here is an example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
                   'col2': [10, 20, 30, 40, 50, 60, 70],
                   'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   'col4': [2, 1, 4, 5, 8, 10, 3]})
My criterion for how many rows to merge is that the sum of their col1 values must be above a certain threshold. Assume the threshold is 3 here:
- Rows 1, 2 and 3 are grouped (sum is 4 > 3).
- Rows 4 and 5 are grouped together (sum is 5 > 3).
- Row 6 is not grouped (4 > 3 on its own); it stays the same in the new dataframe.
- Row 7 is not grouped since no other rows remain; it stays the same in the new dataframe.
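To make the criterion concrete, here is a rough sketch of how I imagine the group labels could be built: a running sum over col1 that resets once it exceeds the threshold (the `group` column name is just my placeholder):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
                   'col2': [10, 20, 30, 40, 50, 60, 70],
                   'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   'col4': [2, 1, 4, 5, 8, 10, 3]})
threshold = 3

# Walk the rows in order, accumulating col1; once the running sum
# exceeds the threshold, close the current group and start a new one.
labels = []
group = 0
running = 0
for v in df['col1']:
    labels.append(group)
    running += v
    if running > threshold:
        group += 1
        running = 0
df['group'] = labels
print(labels)  # → [0, 0, 0, 1, 1, 2, 3]
```

That gives rows 1-3 label 0, rows 4-5 label 1, and rows 6 and 7 their own labels, matching the bullets above.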
The final dataframe would look like this. Columns 1 and 2 are the sums of the grouped rows, column 3 is the average of the grouped rows, and column 4 is the weighted average of the grouped rows with the weight factor being column 1:
newdf = pd.DataFrame({'col1': [4, 5, 4, 2],
                      'col2': [60, 90, 60, 70],
                      'col3': [0.2, 0.45, 0.6, 0.7],
                      'col4': [2.75, 6.8, 10, 3]})
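As a sanity check on the expected values, the first group's col4 entry works out by hand like this (col1 values act as the weights):

```python
# First group is rows 1-3: col1 = [1, 1, 2], col4 = [2, 1, 4]
w = [1, 1, 2]  # weights (col1)
x = [2, 1, 4]  # values (col4)
wavg = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
print(wavg)  # → 2.75, i.e. (1*2 + 1*1 + 2*4) / (1 + 1 + 2)
```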
Can someone help me?
I looked into something like this, but two issues remain: the grouping criterion doesn't work, and I'm not sure how to express the weighted average:
agg_funcs = {'col1': 'sum', 'col2': 'sum', 'col3': 'mean',
             'col4': lambda x: (x['col1'] * x['col4']).sum() / x['col1'].sum()}
grouped_df = df.groupby('col1').agg(agg_funcs)
The code failed with an error on the weighted-average piece, and the grouping criterion isn't correct either!
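From what I can tell, `agg` hands each lambda a single column at a time, so the col4 function can never see col1. Something like `apply` over the whole sub-dataframe might work instead. Here is a sketch, assuming the group labels from my criterion are already in a `group` column (the labels shown are the ones for threshold 3):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
                   'col2': [10, 20, 30, 40, 50, 60, 70],
                   'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   'col4': [2, 1, 4, 5, 8, 10, 3],
                   # group labels for threshold 3, per the criterion above
                   'group': [0, 0, 0, 1, 1, 2, 3]})

# apply receives the whole sub-dataframe for each group, so the
# weighted average can combine col1 (weights) and col4 (values).
newdf = df.groupby('group')[['col1', 'col2', 'col3', 'col4']].apply(
    lambda g: pd.Series({
        'col1': g['col1'].sum(),
        'col2': g['col2'].sum(),
        'col3': g['col3'].mean(),
        'col4': (g['col1'] * g['col4']).sum() / g['col1'].sum(),
    })
).reset_index(drop=True)
print(newdf)
```

This reproduces the newdf from my example, but I'd still like to know whether there is a cleaner, more idiomatic way to do both the threshold grouping and the weighted average.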