My goal is to aggregate a pandas DataFrameGroupBy object using the agg function.
To do that, I build a dictionary and unpack it into keyword arguments with **dict. Each key is the new column name and each value is a tuple. The first element of the tuple is the name of the column to aggregate; that column is passed as a Series to the second element, a function of the form lambda series: ...
agg_dict = {
    f"{cat_name}_count": ('category_column', lambda series: series.value_counts()[cat_name])
    for cat_name in ml_data['category_column'].cat.categories
}
# Aggregating
agg_ml_data = ml_data.groupby(['col1', 'col2']).agg(**agg_dict)
The result is not what I expect.
Assume the categories are:
ml_data['category_column'].cat.categories
Index(['cat1', 'cat2', 'cat3'], dtype='object')
The correct value counts for one group are:
one_group['category_column'].value_counts()
 | category_column |
---|---|
cat1 | 2 |
cat2 | 9 |
cat3 | 6 |
Expected output for one group:
cat1_count | cat2_count | cat3_count |
---|---|---|
2 | 9 | 6 |
Actual output for one group:
cat1_count | cat2_count | cat3_count |
---|---|---|
6 | 6 | 6 |
Somehow, Python does not evaluate the lambdas in the dict comprehension the way I expected: every lambda uses only the last category value, cat3, when indexing series.value_counts()[cat_name]. I would expect each lambda to capture the value of cat_name at the moment its dictionary entry is created. Any idea how to resolve this?
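The behavior can be reproduced without pandas, which suggests it comes from Python's late-binding closures: a lambda looks up cat_name when it is called, not when it is defined. The sketch below is a minimal illustration under that assumption. The names counts, categories, late, and bound are made up for the example, and the plain dict stands in for the result of series.value_counts() on one group.

```python
# Stand-in for series.value_counts() on one group (numbers from the example above).
counts = {"cat1": 2, "cat2": 9, "cat3": 6}
categories = ["cat1", "cat2", "cat3"]

# Late binding: each lambda looks up cat_name only when it is called,
# by which time the comprehension has finished and cat_name is "cat3".
late = {f"{cat_name}_count": (lambda c: c[cat_name]) for cat_name in categories}

# Early binding: a default argument is evaluated when the lambda is defined,
# so each lambda keeps the category it was created with.
bound = {
    f"{cat_name}_count": (lambda c, cat_name=cat_name: c[cat_name])
    for cat_name in categories
}

late_result = {key: func(counts) for key, func in late.items()}
bound_result = {key: func(counts) for key, func in bound.items()}

print(late_result)   # {'cat1_count': 6, 'cat2_count': 6, 'cat3_count': 6}
print(bound_result)  # {'cat1_count': 2, 'cat2_count': 9, 'cat3_count': 6}
```

Applied to the aggregation above, the same idea would mean writing the function as lambda series, cat_name=cat_name: series.value_counts()[cat_name]. That has only been checked in plain Python here, not against a specific pandas version.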