I have a dataset of around 4.2 million observations. My code is below:
library(dplyr)

# Collapse multiple payments per user/due-date into a single mean delay
new_dataframe <- original_dataframe %>%
  group_by(user_id, date) %>%
  summarise(delay = mean(delay, na.rm = TRUE))
This pipeline takes a dataframe of 4.2 million rows and three columns (user_id, date, delay) and should output a dataframe with the same three columns but fewer rows: one per unique user_id and date combination.
A little background on why I'm doing this: the problem involves users making payments on given due dates. Sometimes a user makes multiple payments for the same due date, each with a different delay (e.g. a partial payment on the due date, with the rest completed a few days later). I want a single delay measure (the mean delay) associated with each unique user and due date combination.
For most due dates a user makes a single payment, so the mean should essentially just copy a single number from the original dataframe to the new one. In all other cases there are at most three distinct delay values associated with a given due date.
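For concreteness, here's a toy version of the input and expected output (the data are made up purely for illustration; the real dataframe has the same three columns):

library(dplyr)

toy <- tibble(
  user_id = c(1, 1, 1, 2),
  date    = as.Date(c("2024-01-01", "2024-01-01", "2024-02-01", "2024-01-01")),
  delay   = c(0, 4, 2, 1)  # user 1 split the 2024-01-01 payment across two instalments
)

toy %>%
  group_by(user_id, date) %>%
  summarise(delay = mean(delay, na.rm = TRUE), .groups = "drop")
# user 1 / 2024-01-01 collapses to mean(c(0, 4)) = 2; the single-payment
# rows carry their delay over unchanged, so 4 rows in, 3 rows out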
My understanding is that the time complexity should be around O(2n), i.e. linear: roughly one pass to group and one to aggregate. But this has been running for more than 24 hours on a powerful VM. Can anyone help me understand what I'm missing? I'm beginning to wonder whether the pipeline is instead O(n^2), perhaps because it sorts user IDs and dates simultaneously rather than sequentially.
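In case it's useful, this is the scaling check I'm planning to run next (a rough sketch; the slice sizes are arbitrary and it assumes original_dataframe is already loaded in the session):

library(dplyr)

# Time the same pipeline on increasingly large slices of the data.
# If elapsed time roughly quadruples each time the slice doubles, the
# pipeline is behaving like O(n^2); if it merely doubles, it's linear
# and the bottleneck is presumably elsewhere.
for (n in c(1e5, 2e5, 4e5, 8e5)) {
  t <- system.time(
    original_dataframe %>%
      slice_head(n = n) %>%
      group_by(user_id, date) %>%
      summarise(delay = mean(delay, na.rm = TRUE), .groups = "drop")
  )
  cat(format(n, scientific = FALSE), "rows:", t[["elapsed"]], "seconds\n")
}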