I have a dataframe containing 500k+ records, and I would like to group by multiple columns (string and date types) and then pick only a few records within each group based on a custom condition.
Basically, I need to group the records (by `first_roll_up`, `date`, `granular_timestamp`) and check whether the group contains any value in the `top` column. If it does, keep only the record with the `top` value; if the group has no record with a `top` value, keep all of its records.
Input:
| first_roll_up | sub | top | date | granular_timestamp | values |
|---|---|---|---|---|---|
| ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | . |
| ABC | SUB_A_1 | | 2/10/2022 | 2/10/2022 10:00:00:000 | . |
| ABC | SUB_A_2 | | 2/10/2022 | 2/10/2022 10:00:00:000 | . |
| ABC | SUB_A_3 | | 2/10/2022 | 2/10/2022 10:00:00:000 | . |
| XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | . |
| XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | . |
| XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | . |
Output:
| first_roll_up | sub | top | date | granular_timestamp | values |
|---|---|---|---|---|---|
| ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | . |
| XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | . |
| XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | . |
| XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | . |
I tried the below, but the function takes 10+ minutes to complete. I also tried `transform` instead of `apply` by adding a new boolean column to identify the groups, but that didn't help either.

```python
df.groupby(['first_roll_up', 'sub', 'top', 'date', 'granular_timestamp'], sort=False) \
  .apply(custom_function_to_filter_each_group_records)
```
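One vectorized sketch of the filtering rule described above, avoiding a per-group Python `apply` entirely: compute a boolean "row has a `top` value" mask, broadcast an "any row in this group has a `top` value" flag back to every row with `transform("any")`, and select rows with a single mask. The sample frame below is a hypothetical reconstruction of the question's data; column names follow the question.

```python
import pandas as pd

# Hypothetical sample mirroring the question's input table.
df = pd.DataFrame({
    "first_roll_up": ["ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ"],
    "sub": [None, "SUB_A_1", "SUB_A_2", "SUB_A_3",
            "SUB_X_1", "SUB_X_2", "SUB_Y_1"],
    "top": ["T1", None, None, None, None, None, None],
    "date": ["2/10/2022"] * 4 + ["2/12/2022"] * 3,
    "granular_timestamp": ["2/10/2022 10:00:00:000"] * 4
                          + ["2/10/2022 11:00:00:000"] * 2
                          + ["2/10/2022 12:00:00:000"],
})

# Group only by the keys from the problem statement, not by `sub`/`top`.
keys = ["first_roll_up", "date", "granular_timestamp"]

# True where the row itself carries a `top` value.
has_top = df["top"].notna()

# True for every row whose group contains at least one `top` value.
group_has_top = has_top.groupby([df[k] for k in keys]).transform("any")

# Keep `top` rows from groups that have one; keep whole groups that don't.
result = df[has_top | ~group_has_top]
```

Since `transform("any")` runs as a grouped reduction inside pandas rather than calling Python code once per group, it should scale to 500k+ rows far better than `apply` with a custom function.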