1

This question was asked before, but I want to extend on it. Because I do not have enough experience points I could not comment on the question so I am reposting the link below followed by my comments:

Grouping by date range with pandas

I believe asker of this question wants to group items together within a specified timedelta of each other (3 days is specified in the question). However the answers, including the one marked correct, relate to grouping items in frequencies of 3 days using Grouper. This eventually suits the asker because he only wants to group at most two items together, but what happens if this extends to three, four, five or more items?

Continuing the askers example code (which very closely relates to my own problem):

user_id     date       val
1           1-1-17     1
2           1-1-17     1
3           1-1-17     1
1           1-1-17     1
1           1-2-17     1
2           1-2-17     1
2           1-10-17    1
3           2-1-17     1
3           2-2-17     1
3           2-3-17     2
3           2-4-17     3
3           2-5-17     1

If the grouping would group by user_id and dates +/- 3 days from each other the group by summing val should look like:

user_id     date       sum(val)
1           1-2-17     3
2           1-2-17     2
2           1-10-17    1
3           1-1-17     1
3           2-1-17     8

I'm not sure the last date will actually show as 2-1-17, but the idea is to group all dates within a 3-day timedelta of each other together.

Is this possible in an elegant way using Grouper, resample or other Pandas or Python date functions?

Mario111
  • 35
  • 1
  • 10

1 Answers1

1

You can use a groupby with a custom group:

# convert to datetime
s = pd.to_datetime(df['date'], dayfirst=False)
# set up groups of consecutive dates within ± 3 days
group = (s.groupby(df['user_id'])
          .apply(lambda s: s.diff().abs().gt('3days').cumsum())
         )

# group by ID and new group and aggregate
out = (df.groupby(['user_id', group], as_index=False)
         .agg({'date': 'last', 'val': 'sum'})
      )

output:

   user_id     date  val
0        1   1-2-17    3
1        2   1-2-17    2
2        2  1-10-17    1
3        3   1-1-17    1
4        3   2-5-17    8

intermediates (sorted by user_id for clarity):

    user_id     date  val   datetime    diff     abs  >3days  cumsum
0         1   1-1-17    1 2017-01-01     NaT     NaT   False       0
3         1   1-1-17    1 2017-01-01  0 days  0 days   False       0
4         1   1-2-17    1 2017-01-02  1 days  1 days   False       0
1         2   1-1-17    1 2017-01-01     NaT     NaT   False       0
5         2   1-2-17    1 2017-01-02  1 days  1 days   False       0
6         2  1-10-17    1 2017-01-10  8 days  8 days    True       1
2         3   1-1-17    1 2017-01-01     NaT     NaT   False       0
7         3   2-1-17    1 2017-02-01 31 days 31 days    True       1
8         3   2-2-17    1 2017-02-02  1 days  1 days   False       1
9         3   2-3-17    2 2017-02-03  1 days  1 days   False       1
10        3   2-4-17    3 2017-02-04  1 days  1 days   False       1
11        3   2-5-17    1 2017-02-05  1 days  1 days   False       1
mozway
  • 194,879
  • 13
  • 39
  • 75
  • 1
    Thanks @mozway, this seems to do the trick! I had to do some unpacking to understand it, but there is a bit I don't get. So 1) we isolating the dates, 2) group the isolated dates by user_id (and this is possible because they share the same index I presume?), then 3) apply the diff function (which assumes all dates are sorted) and then 4) chaining on abs to get the absolute values. What I don't understand is from gt and cumsum in the chain onwards. It seems we are creating a boolean filter, but with cumsum many values (including NAT) are converted to 0, so why does it work when used? – Mario111 Jun 29 '22 at 13:26
  • Yes (2) is possible due to index alignment, for (3), we get the absolute difference and convert to boolean, if greater than 3days we'll have a True, else a False. Using `cumsum` we create a new group for each True (as True is equivalent to 1, this increases by 1) and make the subsequent False have the same number/is (as False are equivalent to 0). Thus all dates after a gap of more than ±3 days are grouped together. I added the intermediate steps for clarity – mozway Jun 29 '22 at 14:29