I have a dataset of around 4.2 million observations. My code is below:

library(dplyr)

new_dataframe <- original_dataframe %>%
  group_by(user_id, date) %>%
  summarise(delay = mean(delay, na.rm = TRUE))

This pipeline takes a 4.2-million-row dataframe with 3 columns (user_id, date, delay) and should output a dataframe with the same 3 columns but fewer than 4.2 million rows.

A little bit about why I'm doing this: the problem involves users making payments on a given due date. Sometimes a user makes multiple payments for the same due date with different delay times (e.g. made a partial payment on the due date but completed the rest a few days later). I would like to have a single delay measure (the mean delay) associated with each unique user and due date combination.

For most due dates, users make a single payment, so the mean function should essentially just copy a single number from the original dataframe to the new one. In all other cases there are at most 3 different delay values associated with a given due date.
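
For illustration, here is a toy sketch of the intended aggregation (the `toy` data frame and its values are made up):

library(dplyr)

# Made-up example: three partial payments for one user/due date, and a
# single payment for another user
toy <- data.frame(
  user_id = c(1, 1, 1, 2),
  date    = as.Date("2021-01-01"),
  delay   = c(0, 3, 6, 2)
)

toy %>%
  group_by(user_id, date) %>%
  summarise(delay = mean(delay, na.rm = TRUE))
# user_id 1 -> delay 3 (mean of 0, 3, 6); user_id 2 -> delay 2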

My understanding is that the time complexity of this should be around O(2n), but it has been running for more than 24 hours on a powerful VM. Can anyone help me understand what I'm missing here? I'm beginning to wonder if this pipeline is instead O(n^2), sorting user IDs and dates simultaneously instead of sequentially.

tvbc
  • I'm not sure I understand the way you're using "O(n^2)". If there are n rows in your table, then there are at most n distinct combinations of user_id and date. – Tim Goodman Jul 07 '21 at 22:37
  • I think the long run time is more likely due to some specific issue with your data, rather than any problem with how dplyr implements those operations. I've definitely done dplyr group_by and summarize on tables as large as what you described and had it finish in a far more reasonable time, even just running on my PC and not a large VM. Have you tried filtering to a smaller number of rows and seeing how long it takes to run on that data set, and if it produces any errors in the output? – Tim Goodman Jul 07 '21 at 22:45
  • That does seem long! How many groups do you have? If I remember correctly, having many groups can start to slow things down. I don't know if it will help at all, but for **dplyr** syntax with **data.table** speed see package [**tidytable**](https://cran.r-project.org/web/packages/tidytable/index.html). – aosmith Jul 07 '21 at 22:55
  • @TimGoodman I was thinking that maybe the group_by command sorts all possible date and user ID combinations, making it n^2 for 2 columns of n each – tvbc Jul 07 '21 at 22:56
  • @aosmith there should be around 3.7 million groups, each representing a unique user-due date combination – tvbc Jul 07 '21 at 22:59
  • @TimGoodman Doing 100000 rows takes around a minute on a much less powerful local machine and it runs fine with no errors, doing 400000 on the same machine takes way more than 4 minutes though (haven't done that recently but I'm sure it's more than 30 minutes) – tvbc Jul 07 '21 at 23:04
  • @tvbc OK, I see what you're saying about the n^2 now, but I'm pretty sure it doesn't do any work for (user_id, date) pairs that don't actually occur paired together in your data. But one check would be to try doing the group_by on user_id alone... if that *still* takes a long time, then that rules out the problem being due to the interplay between the two grouping columns. – Tim Goodman Jul 08 '21 at 00:50
  • This currently-open issue sounds like it could be related: https://github.com/tidyverse/dplyr/issues/5017 – Tim Goodman Jul 08 '21 at 01:45
  • Actually, if `delay` is a `difftime`, it could be this issue: https://github.com/tidyverse/dplyr/issues/5113 (It's listed as closed because the fix is being tracked in the repo for one of dplyr's dependencies, here: https://github.com/r-lib/vctrs/issues/1293). If that's the problem, you might get much better performance by first converting `delay` to numeric. – Tim Goodman Jul 08 '21 at 02:01
  • @TimGoodman `delay` is `difftime`. I'll try your suggestion and update the post. Thank you – tvbc Jul 09 '21 at 14:52
  • @TimGoodman Converting `delay` from `difftime` to `numeric` fixed the issue. Thank you! If you would post this as the answer I'll accept it and mark this as closed. – tvbc Jul 12 '21 at 22:51
  • @tvbc Answer added. I'm glad it helped you! – Tim Goodman Jul 14 '21 at 16:11

2 Answers

This is due to this issue: https://github.com/tidyverse/dplyr/issues/5113

The poor performance results from the fact that delay is a difftime (as confirmed by the OP in the comments above), and difftime isn't (yet) supported by dplyr's fast native C code path. As a workaround, convert the difftime to numeric before calling summarize.
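
For example, a minimal sketch of the workaround (converting to days here; use whatever unit is appropriate for your delay measure):

library(dplyr)

new_dataframe <- original_dataframe %>%
  # convert the difftime column to plain numeric so summarise() can use
  # the fast aggregation path
  mutate(delay = as.numeric(delay, units = "days")) %>%
  group_by(user_id, date) %>%
  summarise(delay = mean(delay, na.rm = TRUE))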

Note: The above issue in the dplyr github repo is marked as closed only because it's now being tracked in the vctrs repo here: https://github.com/r-lib/vctrs/issues/1293

Tim Goodman

We can use data.table methods:

library(data.table)
# convert to a data.table by reference, then take the mean delay
# within each user_id/date group
setDT(original_dataframe)[, .(delay = mean(delay, na.rm = TRUE)), by = .(user_id, date)]

Or use collapse:

library(collapse)
# collap() aggregates delay for each user_id/date combination using the
# fast grouped mean, fmean()
collap(original_dataframe, delay ~ user_id + date, fmean)
akrun
  • So are you saying that the code I have first copies each user ID and due date combination into memory, which is what's causing the slowdown? – tvbc Jul 07 '21 at 22:30