6

I have this data structure where each team has list of issues with start/end dates.

For each team, I would like to merge issues with same key and overlapping dates, where in result issue the start date will be smaller date and end date will be bigger date.

I am trying to do it with few for loops but I was wondering what would be the best Pythonic way to do this.

Update

I want to merge only issues with same key within same team and with overlapping dates.

Issues are not in chronological order.

Input:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '10/Oct/13 5:16 PM',
        'issue': 'KEY-12678'
    }, {
        'start': '3/Oct/13 10:40 AM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],

    'Team B': [{
        'start': '5/Sep/13 3:35 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '19/Aug/13 5:05 PM',
        'end': '10/Sep/13 5:16 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

Output:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],
    'Team B': [{
        'start': '19/Aug/13 5:05 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

To parse date, here is date format (to save you couple of minutes):

date_format = "%d/%b/%y %H:%M %p"

Update, new test data

Input

d = {
"N/A": [
  {'start': '23/Jun/14 8:48 PM', 'end': '01/Aug/14 11:00 PM', 'issue': 'KEY-12157'}
  ,{'start': '09/Jul/13 1:57 PM',  'end': '29/Jul/13 1:57 PM', 'issue': 'KEY-12173'}
  ,{'start': '21/Aug/13 12:29 PM', 'end': '02/Dec/13 6:06 PM', 'issue': 'KEY-12173'}
  ,{'start': '17/Feb/14 3:17 PM', 'end': '18/Feb/14 5:51 PM', 'issue': 'KEY-12173'}
  ,{'start': '12/May/14 4:42 PM', 'end': '02/Jun/14 4:42 PM', 'issue': 'KEY-12173'}
  ,{'start': '24/Jun/14 11:33 AM',  'end': '01/Aug/14 11:49 AM', 'issue': 'KEY-12173'}
  ,{'start': '07/Oct/14 1:17 PM',  'end': '17/Nov/14 10:30 AM', 'issue': 'KEY-12173'}
  ,{'start': '31/Mar/15 1:58 PM', 'end': '12/May/15 4:26 PM', 'issue': 'KEY-12173'}
  ,{'start': '15/Jul/14 10:06 AM',  'end': '15/Sep/14 5:25 PM', 'issue': 'KEY-12173'}
  ,{'start': '06/Jan/15 10:46 AM',  'end': '26/Jan/15 10:46 AM', 'issue': 'KEY-20628'}
  ,{'start': '18/Nov/14 5:08 PM',  'end': '16/Feb/15 1:31 PM', 'issue': 'KEY-20628'}
  ,{'start': '02/Oct/13 12:32 PM', 'end': '21/Oct/13 5:32 PM', 'issue': 'KEY-12146'}
  ,{'start': '11/Mar/14 12:08 PM', 'end': '31/Mar/14 12:08 PM', 'issue': 'KEY-12681'}
  ]}

Output

{'start': '18/Nov/14 05:08 AM', 'issue': 'KEY-20628', 'end': '16/Feb/15 01:31 AM'}
{'start': '09/Jul/13 1:57 PM', 'issue': 'KEY-12173', 'end': '29/Jul/13 1:57 PM'}
{'start': '21/Aug/13 12:29 PM', 'issue': 'KEY-12173', 'end': '02/Dec/13 6:06 PM'}
{'start': '17/Feb/14 3:17 PM', 'issue': 'KEY-12173', 'end': '18/Feb/14 5:51 PM'}
{'start': '12/May/14 4:42 PM', 'issue': 'KEY-12173', 'end': '02/Jun/14 4:42 PM'}
{'start': '24/Jun/14 11:33 AM', 'issue': 'KEY-12173', 'end': '15/Sep/14 05:25 AM'}
{'start': '07/Oct/14 1:17 PM', 'issue': 'KEY-12173', 'end': '17/Nov/14 10:30 AM'}
{'start': '31/Mar/15 1:58 PM', 'issue': 'KEY-12173', 'end': '12/May/15 4:26 PM'}
{'start': '11/Mar/14 12:08 PM', 'issue': 'KEY-12681', 'end': '31/Mar/14 12:08 PM'}
{'start': '23/Jun/14 8:48 PM', 'issue': 'KEY-12157', 'end': '01/Aug/14 11:00 PM'}
{'start': '02/Oct/13 12:32 PM', 'issue': 'KEY-12146', 'end': '21/Oct/13 5:32 PM'}
Community
  • 1
  • 1
ThePavolC
  • 1,693
  • 1
  • 16
  • 26

3 Answers3

0

You can convert your string dates to valid python datetime objects with '%d/%b/%y %H:%M %p' format and datetime.strftime function and use itertools.groupby to group the sub dictionaries based on issue key and you can loop over the ziped groups and extract that max and min with max and min functions with a proper key function :

from datetime import datetime
from itertools import groupby
from operator import itemgetter
new={}

for key in d:
  for dic in [ zip(*[i.items() for i in g]) for _,g in groupby(d[key],itemgetter('issue'))] :
    temp={}
    for p,t in [zip(*tup) for tup in dic]:
      val=p[0]
      if val=='start':
        temp[val]=min(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p'))
      elif val=='end':
        temp[val]=max(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p'))
      else:
        temp[val]=t[0]
    new.setdefault(key,[]).append(temp) 


print new

Result :

{'Team A': [{'start': '11/Jul/13 1:49 PM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678'}], 
 'Team B': [{'start': '19/Aug/13 5:05 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679'}, 
            {'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680'}]}
Mazdak
  • 105,000
  • 18
  • 159
  • 188
0

Here is my current code, which seems to be working (it's bit tricky to check).

In my code I am using names as epic and mr, where each line in sample data is epic but issue key refers to mr.

from datetime import datetime
date_format = "%d/%b/%y %H:%M %p"

d = {"team" : [... sample data ...]}

def get_list_of_mrs(epics):
    mrs = set()
    for epic in epics:
        mrs.add(epic['issue'])
    return mrs

def is_overlap(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return ((start1 <= end2) and (end1 >= start2))

def get_overlapping_dates(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return (min(start1, start2), max(end1, end2))

def remove_overlaps(epics):
    filtered_epics = []
    for epic in epics:
        for temp_epic in epics:
            if temp_epic == epic:
                continue
            if epic.has_key('overlap'):
                continue
            if is_overlap(epic, temp_epic):
                temp_epic['overlap'] = True
                new_start, new_end = get_overlapping_dates(epic, temp_epic)
                epic['start'] = new_start.strftime(date_format)
                epic['end'] = new_end.strftime(date_format)

        filtered_epics.append(epic)

    filtered_epics = filter(lambda x: not x.has_key('overlap'), filtered_epics)

    return filtered_epics

for team in d:
    epics = d[team]
    epics.sort(key=lambda x: datetime.strptime(x['start'], date_format))
    uniq_mrs_in_team = get_list_of_mrs(epics)

    filtered_mrs = []
    for mr in uniq_mrs_in_team:
        mr_epics = filter(lambda x: x['issue'] == mr, epics)
        filtered = remove_overlaps(mr_epics)
        #print team, mr, len(mr_epics), len(filtered)
        for x_mr in mr_epics:
            #print " -",x_mr
            pass
        for x_mr in filtered:
            #print " +",x_mr
            pass
        filtered_mrs.extend(filtered)
    d[team] = filtered_mrs
ThePavolC
  • 1,693
  • 1
  • 16
  • 26
0

I am presenting a pandas solution as hinted by aquavitae in a comment, which contains the following steps:

  • Read in the data from the dictionary d, that you provided into a DataFrame.
  • Convert the start and end column into datetime objects.
  • Sort data by keys and by start date and reset the index
  • Loop over the data frame (inefficient, but I couldn't come up with something better until now) and compare end time of current line with start time of next line, as well as if the keys are equal.
  • Query pandas data frame to get the lines that overlap
  • Loop over the to-be-dropped-out rows and merge the data into the corresponding overlapping row.
  • Drop these rows.
  • Convert back to the dictionary format.

This looks like:

import pandas as pd
import numpy as np
df = pd.DataFrame(d['N/A'])
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df.sort(['issue', 'start'], inplace=True)
df.index = range(len(df))
time_overlaps = df[:-1]['end'] > df[1:]['start']
same_issue = df[:-1]['issue'] == df[1:]['issue']
rows_to_drop = np.logical_and(time_overlaps, same_issue)
rows_to_drop_indices = [i+1 for i, j in enumerate(rows_to_drop) if j]
for i in rows_to_drop_indices:
    df.loc[i-1, 'end'] = df.loc[i, 'end']
df.drop(rows_to_drop_indices, inplace=True)

If you don't want to keep the DataFrame object and do further calculations in the format you specified in your question do:

df.to_dict('records')

EDIT: Found an efficient way to do it!

Condla
  • 195
  • 1
  • 10