0

I'm working with a data frame in which people can appear with multiple roles and I need to devise a test to see for a given person, do they have any dates that overlap:

import pandas as pd
records = pd.DataFrame({'name': ['Tom','Harry','Jack','Matt','Harry','Matt'], 
'job code': [101,101,301,101,401,102], 'start date': ['1/1/20','1/1/20','1/1/20','1/1/20','5/1/20','6/15/20'], 'end date':['12/31/20','4/30/20','12/31/20','11/30/20','12/31/20','12/31/20']})

From this dataset you can see at a glance that everyone is fine except for Matt - he has job dates that overlap which is not allowed. How can I test for this in pandas, checking that each unique name does not have any overlap and flagging the entries that do?

Thanks!

Sean R
  • 173
  • 1
  • 8
  • If either of our answers helped you solve your problem, consider accepting the answer. If there's a reason that you aren't satisfied, let us know what's missing. – amquack Jan 26 '21 at 18:47

2 Answers2

2

The criteria for overlapping would be

max(start_dates) < min(end_date)

So you could merge and query:

(records.merge(records, on='name')
     .loc[lambda x: x['job code_x'] != x['job code_y']]
     .loc[lambda x: x.filter(like='start date').max(1) <= x.filter(like='end date').min(1)]
     ['name'].unique()
)

Output:

['Matt']
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
2

The first thing I would do is convert to datetime objects:

records['start date']=pd.to_datetime(records['start date'])
records['end date']=pd.to_datetime(records['end date'])

Then I can work with these rather than strings:

import datetime as dt
# I sort based on name and start date:
records2=records.sort_values(['name', 'start date'])

I then create a new column, which compares the start date to the end date, and returns True if a job overlaps with the subsequent job (False otherwise). This is more specific than what you asked, as it gets to the job level, but you could change this to be True if any jobs overlap for a person.

records2['overlap']=(records2['end date']-records2['start date'].shift(-1).where(records2['name'].eq(records2['name'].shift(-1))))>dt.timedelta(0)
records2

Which returns:

    name    job code    start date  end date    overlap
1   Harry   101 2020-01-01  2020-04-30  False
4   Harry   401 2020-05-01  2020-12-31  False
2   Jack    301 2020-01-01  2020-12-31  False
3   Matt    101 2020-01-01  2020-11-30  True
5   Matt    102 2020-06-15  2020-12-31  False
0   Tom 101 2020-01-01  2020-12-31  False

This is a helpful question for using shift in conjunction with groups, and there are some nice and different ways to do this. I pulled from the second answer.

If you're interested in how many times each person has an overlap, you can use the following code to create a dataframe with that information:

df=records2.groupby('name').sum('overlap')
df

Which returns:

    job code    overlap
name        
Harry   502 0
Jack    301 0
Matt    203 1
Tom 101 0
amquack
  • 837
  • 10
  • 24
  • very helpful, thanks! What would I change to look at any overlap (not just subsequent job overlap?) would I remove the shift function? – Sean R Jan 20 '21 at 20:44
  • 1
    @SeanR the shift function allows you to compare the current row to nearby rows, so this is not something you want to remove. I've added a way to determine how many times each individual triggered 'True' in the 'overlap' column. – amquack Jan 20 '21 at 21:01