Pythonic way to filter data with overlapping dates

Question

I have this data structure where each team has list of issues with start/end dates.

For each team, I would like to merge issues with same key and overlapping dates, where in result issue the start date will be smaller date and end date will be bigger date.

I am trying to do it with few for loops but I was wondering what would be the best Pythonic way to do this.

Update

I want to merge only issues with same key within same team and with overlapping dates.

Issues are not in chronological order.

Input:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '10/Oct/13 5:16 PM',
        'issue': 'KEY-12678'
    }, {
        'start': '3/Oct/13 10:40 AM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],

    'Team B': [{
        'start': '5/Sep/13 3:35 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '19/Aug/13 5:05 PM',
        'end': '10/Sep/13 5:16 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

Output:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],
    'Team B': [{
        'start': '19/Aug/13 5:05 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

To parse date, here is date format (to save you couple of minutes):

date_format = "%d/%b/%y %H:%M %p"

Update, new test data

Input

d = {
"N/A": [
  {'start': '23/Jun/14 8:48 PM', 'end': '01/Aug/14 11:00 PM', 'issue': 'KEY-12157'}
  ,{'start': '09/Jul/13 1:57 PM',  'end': '29/Jul/13 1:57 PM', 'issue': 'KEY-12173'}
  ,{'start': '21/Aug/13 12:29 PM', 'end': '02/Dec/13 6:06 PM', 'issue': 'KEY-12173'}
  ,{'start': '17/Feb/14 3:17 PM', 'end': '18/Feb/14 5:51 PM', 'issue': 'KEY-12173'}
  ,{'start': '12/May/14 4:42 PM', 'end': '02/Jun/14 4:42 PM', 'issue': 'KEY-12173'}
  ,{'start': '24/Jun/14 11:33 AM',  'end': '01/Aug/14 11:49 AM', 'issue': 'KEY-12173'}
  ,{'start': '07/Oct/14 1:17 PM',  'end': '17/Nov/14 10:30 AM', 'issue': 'KEY-12173'}
  ,{'start': '31/Mar/15 1:58 PM', 'end': '12/May/15 4:26 PM', 'issue': 'KEY-12173'}
  ,{'start': '15/Jul/14 10:06 AM',  'end': '15/Sep/14 5:25 PM', 'issue': 'KEY-12173'}
  ,{'start': '06/Jan/15 10:46 AM',  'end': '26/Jan/15 10:46 AM', 'issue': 'KEY-20628'}
  ,{'start': '18/Nov/14 5:08 PM',  'end': '16/Feb/15 1:31 PM', 'issue': 'KEY-20628'}
  ,{'start': '02/Oct/13 12:32 PM', 'end': '21/Oct/13 5:32 PM', 'issue': 'KEY-12146'}
  ,{'start': '11/Mar/14 12:08 PM', 'end': '31/Mar/14 12:08 PM', 'issue': 'KEY-12681'}
  ]}

Output

{'start': '18/Nov/14 05:08 AM', 'issue': 'KEY-20628', 'end': '16/Feb/15 01:31 AM'}
{'start': '09/Jul/13 1:57 PM', 'issue': 'KEY-12173', 'end': '29/Jul/13 1:57 PM'}
{'start': '21/Aug/13 12:29 PM', 'issue': 'KEY-12173', 'end': '02/Dec/13 6:06 PM'}
{'start': '17/Feb/14 3:17 PM', 'issue': 'KEY-12173', 'end': '18/Feb/14 5:51 PM'}
{'start': '12/May/14 4:42 PM', 'issue': 'KEY-12173', 'end': '02/Jun/14 4:42 PM'}
{'start': '24/Jun/14 11:33 AM', 'issue': 'KEY-12173', 'end': '15/Sep/14 05:25 AM'}
{'start': '07/Oct/14 1:17 PM', 'issue': 'KEY-12173', 'end': '17/Nov/14 10:30 AM'}
{'start': '31/Mar/15 1:58 PM', 'issue': 'KEY-12173', 'end': '12/May/15 4:26 PM'}
{'start': '11/Mar/14 12:08 PM', 'issue': 'KEY-12681', 'end': '31/Mar/14 12:08 PM'}
{'start': '23/Jun/14 8:48 PM', 'issue': 'KEY-12157', 'end': '01/Aug/14 11:00 PM'}
{'start': '02/Oct/13 12:32 PM', 'issue': 'KEY-12146', 'end': '21/Oct/13 5:32 PM'}

Have you had a look at pandas? Specifically: http://pandas.pydata.org/pandas-docs/stable/timeseries.html — aquavitae, May 06 '15 at 09:35
@aquavitae first time I am seeing that, but it seems like something that could be handy, thanks — ThePavolC, May 06 '15 at 09:38

Mazdak · Answer 1 · 2015-05-05T19:35:45.493

You can convert your string dates to valid python datetime objects with '%d/%b/%y %H:%M %p' format and datetime.strftime function and use itertools.groupby to group the sub dictionaries based on issue key and you can loop over the ziped groups and extract that max and min with max and min functions with a proper key function :

from datetime import datetime
from itertools import groupby
from operator import itemgetter
new={}

for key in d:
  for dic in [ zip(*[i.items() for i in g]) for _,g in groupby(d[key],itemgetter('issue'))] :
    temp={}
    for p,t in [zip(*tup) for tup in dic]:
      val=p[0]
      if val=='start':
        temp[val]=min(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p'))
      elif val=='end':
        temp[val]=max(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p'))
      else:
        temp[val]=t[0]
    new.setdefault(key,[]).append(temp) 


print new

Result :

{'Team A': [{'start': '11/Jul/13 1:49 PM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678'}], 
 'Team B': [{'start': '19/Aug/13 5:05 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679'}, 
            {'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680'}]}

Are you sure you've passed the correct result ? Two issues for Team A are overlapping so in result there should be only one. — ThePavolC, May 05 '15 at 16:50
@ThePavolC Yep, it just was a miss understanding! pls check out the edit! — Mazdak, May 05 '15 at 19:37
Can you explain to me how you check if two issues with same key have overlapping start/end dates ? — ThePavolC, May 06 '15 at 09:00
I am sorry for not making it clear. I'll give another sample data. — ThePavolC, May 06 '15 at 09:25

score 0 · Answer 2 · answered May 06 '15 at 09:36

Here is my current code, which seems to be working (it's bit tricky to check).

In my code I am using names as epic and mr, where each line in sample data is epic but issue key refers to mr.

from datetime import datetime
date_format = "%d/%b/%y %H:%M %p"

d = {"team" : [... sample data ...]}

def get_list_of_mrs(epics):
    mrs = set()
    for epic in epics:
        mrs.add(epic['issue'])
    return mrs

def is_overlap(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return ((start1 <= end2) and (end1 >= start2))

def get_overlapping_dates(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return (min(start1, start2), max(end1, end2))

def remove_overlaps(epics):
    filtered_epics = []
    for epic in epics:
        for temp_epic in epics:
            if temp_epic == epic:
                continue
            if epic.has_key('overlap'):
                continue
            if is_overlap(epic, temp_epic):
                temp_epic['overlap'] = True
                new_start, new_end = get_overlapping_dates(epic, temp_epic)
                epic['start'] = new_start.strftime(date_format)
                epic['end'] = new_end.strftime(date_format)

        filtered_epics.append(epic)

    filtered_epics = filter(lambda x: not x.has_key('overlap'), filtered_epics)

    return filtered_epics

for team in d:
    epics = d[team]
    epics.sort(key=lambda x: datetime.strptime(x['start'], date_format))
    uniq_mrs_in_team = get_list_of_mrs(epics)

    filtered_mrs = []
    for mr in uniq_mrs_in_team:
        mr_epics = filter(lambda x: x['issue'] == mr, epics)
        filtered = remove_overlaps(mr_epics)
        #print team, mr, len(mr_epics), len(filtered)
        for x_mr in mr_epics:
            #print " -",x_mr
            pass
        for x_mr in filtered:
            #print " +",x_mr
            pass
        filtered_mrs.extend(filtered)
    d[team] = filtered_mrs

If you think this is now answered, do you want to close this out? — Joe Ferndz, Jul 22 '20 at 08:49

Condla · Accepted Answer · 2015-06-13T23:13:43.180

I am presenting a pandas solution as hinted by aquavitae in a comment, which contains the following steps:

Read in the data from the dictionary d, that you provided into a DataFrame.
Convert the start and end column into datetime objects.
Sort data by keys and by start date and reset the index
~~Loop over the data frame (inefficient, but I couldn't come up with something better until now) and compare end time of current line with start time of next line, as well as if the keys are equal.~~
Query pandas data frame to get the lines that overlap
Loop over the to-be-dropped-out rows and merge the data into the corresponding overlapping row.
Drop these rows.
Convert back to the dictionary format.

This looks like:

import pandas as pd
import numpy as np
df = pd.DataFrame(d['N/A'])
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df.sort(['issue', 'start'], inplace=True)
df.index = range(len(df))
time_overlaps = df[:-1]['end'] > df[1:]['start']
same_issue = df[:-1]['issue'] == df[1:]['issue']
rows_to_drop = np.logical_and(time_overlaps, same_issue)
rows_to_drop_indices = [i+1 for i, j in enumerate(rows_to_drop) if j]
for i in rows_to_drop_indices:
    df.loc[i-1, 'end'] = df.loc[i, 'end']
df.drop(rows_to_drop_indices, inplace=True)

If you don't want to keep the DataFrame object and do further calculations in the format you specified in your question do:

df.to_dict('records')

EDIT: Found an efficient way to do it!

Pythonic way to filter data with overlapping dates

Update

Update, new test data

3 Answers3