My goal is to aggregate a pandas DataFrameGroupBy Object using the agg function.

To do that, I build a dictionary and unpack it into keyword arguments with **dict. Each key is a new column name and each value is a tuple. The first element of the tuple is the name of the column that is passed to the aggregation as a Series, and the second element is the function applied to that Series, here lambda series: ....

agg_dict = {
   f"{cat_name}_count": ('category_column', lambda series: series.value_counts()[cat_name])
   for cat_name in ml_data['category_column'].cat.categories
}

# Aggregating
agg_ml_data = ml_data.groupby(['col1', 'col2']).agg(**agg_dict)

What actually happens seems odd to me.

Assuming:

ml_data['category_column'].cat.categories
Index(['cat1', 'cat2', 'cat3'], dtype='object')

The correct value counts for one group are:

one_group['category_column'].value_counts()
     | category_column
cat1 | 2
cat2 | 9
cat3 | 6

Expected output for one group:

cat1_count cat2_count cat3_count
2 9 6

Actual output for one group:

cat1_count cat2_count cat3_count
6 6 6

Somehow, Python does not evaluate the lambda functions in the dict comprehension the way I expected: every lambda uses the last category value, cat3, when indexing series.value_counts()[cat_name]. I would have expected each lambda to be created with the value cat_name had at that iteration, just as the dictionary keys are. Any idea how to resolve this?

Robin

1 Answer

This is a classic Python trap.

When you use a free variable (cat_name, in this case) in a lambda expression, the lambda captures which variable the name refers to, not the value of that variable. So in this case, the lambda "remembers" that cat_name was "the loop variable of that dict comprehension". When the lambda is called, it looks up the value of "the loop variable of that dict comprehension", which, because the dict comprehension has already finished, is the last value it iterated over.
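
The late lookup described above can be reproduced without pandas. The following is a minimal sketch; the names getters and late_bound are made up for illustration and do not come from the question.

```python
# Each lambda closes over the comprehension variable `name`, not its current value.
getters = {name: (lambda: name) for name in ["cat1", "cat2", "cat3"]}

# The lambdas only run here, after the comprehension has finished,
# so every one of them sees the final value of `name`.
late_bound = {key: fn() for key, fn in getters.items()}
print(late_bound)  # {'cat1': 'cat3', 'cat2': 'cat3', 'cat3': 'cat3'}
```

The keys differ because they are evaluated during each iteration, while the lambda bodies are evaluated only when called.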

The usual way of working around this is to use a default argument to "freeze" the value, something like

lambda series, cat=cat_name: series.value_counts()[cat]

effectively using one trap (Python computing default arguments at function definition time) to climb out of another. :-)
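
As a rough illustration of the default-argument workaround, here is a pandas-free sketch. It assumes a plain list (one_group) as a stand-in for one group's category values, with the counts taken from the question, and uses list.count in place of value_counts(); the name fixed is hypothetical.

```python
categories = ["cat1", "cat2", "cat3"]

# Stand-in for one group's category values: 2 x cat1, 9 x cat2, 6 x cat3.
one_group = ["cat1"] * 2 + ["cat2"] * 9 + ["cat3"] * 6

# `cat=cat_name` is evaluated once per iteration, when the lambda is defined,
# so each lambda keeps its own category instead of looking it up later.
fixed = {
    f"{cat_name}_count": (lambda values, cat=cat_name: values.count(cat))
    for cat_name in categories
}

fixed_counts = {key: fn(one_group) for key, fn in fixed.items()}
print(fixed_counts)  # {'cat1_count': 2, 'cat2_count': 9, 'cat3_count': 6}
```

In the question's comprehension, the same change would presumably make the second tuple element lambda series, cat=cat_name: series.value_counts()[cat]. Since agg calls the function with the Series only, the default value of cat is the one that gets used.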

Ture Pålsson