Pure python or itertools group a list of dates by days difference between each date

Question

Having a list of dates ordered:

[
datetime.date(2006, 8, 15),
datetime.date(2006, 9, 12),
datetime.date(2007, 8, 10),
datetime.date(2021, 4, 6),
datetime.date(2021, 4, 16),
datetime.date(2021, 4, 19)
...
]

I would like to have groups that contain dates that are a maximum of 30 days between all dates (the distance between first element of the group and the last of these group will be <= 30 days)

For example, using the previous list, I will get:

group_1 = [datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)] (dates are <= 30 days between)
group_2 = [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)] <=30
group_3 = [datetime.date(2007, 8, 10)] (no other dates related <= 30 days)

I tried to use iter-tools groupby but the key function doesn't allow 2 dates comparation like "lambda x,y: (x-y).days <= 30...." I don't know if I can use groupby for solve this problem or I need other itertools function. I know that I could build a python algorithm for it but I think that will exist a simple option to solve this but I didn't found it :(

Thanks!

2021-05-07 is within 30 days of 2021-04-19. Shouldn't group 4 and group 2 be merged? — Shaido, Oct 26 '21 at 06:19
@Shaido My interpretation of the verbiage of "groups that contain dates that are a maximum of 30 days between all dates" is that the last date of a group must be no more than 30 days later than the first date of the group, so 2021-05-07 does not belong to the second group because it is 31 days later than 2021-04-06. — blhsing, Oct 26 '21 at 06:43
@blhsing your interpretation is correct. Every first element of the group and the last element of these group will be <= 30 days between. So the objective is build groups of dates in a specific range — lsgrep, Oct 26 '21 at 06:48

RJ Adriaansen · Answer 1 · 2021-10-26T08:38:25.117

1

Here is a pandas solution that compares the dates with the next date and checks whether there are 30 days in between. It then assigns a group number with cumsum:

import pandas as pd
import datetime

data = [ datetime.date(2006, 8, 15), datetime.date(2006, 9, 12), datetime.date(2007, 8, 10), datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)]

df = pd.DataFrame(data, columns=['date'])
df['groups'] = (df['date'].diff() > pd.Timedelta(30, unit='D')).cumsum()

Output:

	date	groups
0	2006-08-15	0
1	2006-09-12	0
2	2007-08-10	1
3	2021-04-06	2
4	2021-04-16	2
5	2021-04-19	2

Or if you just want a list of lists as output: df.groupby((df['date'].diff() > pd.Timedelta(30, unit='D')).cumsum()).agg(list)['date'].to_list()

edited Oct 26 '21 at 08:38

answered Oct 26 '21 at 07:11

RJ Adriaansen

9,131
2
12
26

1

Also maybe see `pd.Series.diff`, which might tidy up the calculation involving `shift`. – TMBailey Oct 26 '21 at 07:27
@TMBailey, thanks. Much cleaner code like this. – RJ Adriaansen Oct 26 '21 at 08:39

score 1 · Answer 2 · answered Oct 26 '21 at 07:27

itertools.groupby is meant for grouping items that are not dependent on each other. In your case it is both easier and clearer to simply iterate through the list of dates to build list of groups by inserting a new group when the current date is more than 30 days later than the first date of the current group:

dates = [
    datetime.date(2006, 8, 15),
    datetime.date(2006, 9, 12),
    datetime.date(2007, 8, 10),
    datetime.date(2021, 4, 6),
    datetime.date(2021, 4, 16),
    datetime.date(2021, 4, 19),
    datetime.date(2021, 5, 7)
]
threshold = datetime.timedelta(30)
groups = []
for date in dates:
    if not groups or date - group[0] > threshold:
        group = []
        groups.append(group)
    group.append(date)

groups would become:

[[datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)],
 [datetime.date(2007, 8, 10)],
 [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)],
 [datetime.date(2021, 5, 7)]]

"itertools.groupby is meant for grouping items that are not dependent on each other" thank you for this clarification. Sometimes do you think that you can accomplish a task with a tool and you only focus in this tool. My mistake. thanks — lsgrep, Oct 26 '21 at 07:49

Yinchi Luo · Accepted Answer · 2021-10-26T07:07:38.297

An iterative solution with plain old for-loop is quite straightforward in this case.

I don't think it will be easy or efficient to use itertools to solve this problem since the grouping in this case depends on the context of the data, which will probably yield a O(N^2) solution, whereas iterative approach is O(N).

dts = [
datetime.date(2006, 8, 15),
datetime.date(2006, 9, 12),
datetime.date(2007, 8, 10),
datetime.date(2021, 4, 6),
datetime.date(2021, 4, 16),
datetime.date(2021, 4, 19)
]

def groupDateTimes(dts):
    i = 0
    ans = []
    group = []
    delta30days = datetime.timedelta(days=30)
    while i < len(dts):
        cur = dts[i]
        if not group:
            group.append(cur)
        elif cur - group[0] <= delta30days:
            group.append(cur)
        else:
            ans.append(group)
            group = [cur]
        i += 1
    if group:
        ans.append(group)
    return ans

print(groupDateTimes(dts)) // [[datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)], [datetime.date(2007, 8, 10)], [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)]]

Pure python or itertools group a list of dates by days difference between each date

3 Answers3