4

I'm trying to find overlap between two members to see if they know each other. I also have a minimum overlap required(i.e. they need to know each-other for at least two months to form a group).

Example Input DF

time_together = 5184000 (60 days)

person_name  start_date  end_date    cut_off (start + time_together)
sally        1540627200  1545638400  1545811200
john         1543046400  1548316800  1548230400
edgar        1548316800  1553414400  1553500800

I currently have start date and end date in unix timestamps in a pandas data frame. I've calculated a cut off time that is the start time + minimum duration. I then check every persons attendance against the cutoff, if it is less than I say they will form a valid group(see code below)

df_new = pd.DataFrame()
for i in range(len(df.index)):
    start_range = (df.loc[i,'cutoff'] - df['start_timestamp'] > 0)
    end_range = (df.loc[i,'cutoff'] < df['end_timestamp'])
    df_new['%s%s' % (df.loc[i,'Soldier_SSN'],i)] = start_range & end_range

The problem is I now have a matrix of bools, and I need to generate an output that has the groups name. (see below for ideal output).

Current Output DF:

  sally  john  edgar
0 True   True  False
1 True   True  False
2 False  False False

Because sally and john have been together for the minimum time. They would form a group, but edgar hasn't.

The output would ideally be a list of lists [[person1, person2, person5], [person3, person4]]

It's also hella slow, so any suggestions on how to speed this up would be great.

1 Answers1

2

I think there is a lot going on in what you are trying to achieve, but it can be broken down into two steps. (and I'm not sure if any of this is the most performant way to achieve the goal)

  1. Find all the pairs of people who overlap with each other for the minimum period of time
  2. "Condense" the list of pairs into groups

For the first task a simple method is to just iterate through every person and check whether any other person has a sufficient overlap.

Starting with a test DataFrame (pseudo-random times and arbitrary names):

index  person_name  start_date  end_date
0   Angelina    1510568169  1523357075
1   Na  1555533506  1568322412
2   Twyla   1558758901  1571547807
3   Wilfredo    1551369432  1564158338
4   Estefana    1515025466  1527814372

We can find the pairs with:

pairs = []
for i in range(len(test.index)):
    for j in range(len(test.index)-i-1):
        if (min(test.loc[i]['end_date'], test.loc[i+j+1]['end_date']) 
        - max(test.loc[i]['start_date'], test.loc[i+j+1]['start_date']) 
        >= (min_time_together)):
            pairs.append([test.loc[i]['person_name'], test.loc[i+j+1]['person_name']])

This will generate the output:

[['Angelina', 'Estefana'],
 ['Na', 'Twyla'],
 ['Na', 'Wilfredo'],
 ['Twyla', 'Wilfredo']]

To "condense" this list of pairs involves a bunch of graph theory which to be honest I'm not an expert on BUT here is an amazing answer to a related StackOverflow question (Very interesting topic and lots of good information on that page). If we use the condenseBK function from that answer on our list of lists we get this final output:

#condenseBK(*pairs)
[['Angelina', 'Estefana'], ['Na', 'Twyla', 'Wilfredo']]
Ani
  • 566
  • 3
  • 4
  • This is really good solution! But I'm actually trying to avoid O(n^2) run time because I'm working with a huge dataset – Courtney Sprouse Oct 26 '18 at 15:10
  • Have you looked into using [Interval trees](https://en.wikipedia.org/wiki/Interval_tree)? I think you can build the tree in something like O(n log n) and query it in O(log n+number of matching intervals) – Ani Oct 26 '18 at 16:25