I want to compute similarity between items (0,1,2,3..) based on their temporal information. Temporal information may be time instant (startdate), time interval (startdate,enddate) or null (NaT); see an example of dataframe (df_for) below.
- {instant,instant} : if equal sim = 1, else sim =0
- {instant,null} or vice versa, sim =0
- {instant, interval}: if instant within interval, sim =1 or if an interval contains an instant, sim = 1
- {interval,interval} : if intervals overlaps, sim = intersection of both intervals / union of both intervals
- {interval,interval} : if an interval contains another, then sim = 1
The following python code obtains temporal information from the dataframe and performs the conditions above (1-5). The code is verbose, i wonder if there is a smart way/lib to calculate similarity between time periods and time instants using python.
m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
for j in range(m):
if i != j:
st1 = data[i][0]
ed1 = data[i][1]
st2 = data[j][0]
ed2 = data[j][1]
#both items are null values
if pd.isnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
sim[i][j] = 0.
# {instant, instant} => equal, not equal
if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
if st1 == st2:
sim[i][j] = 1.
else:
sim[i][j] = 0.
# {instant, null} => sim is 0
if pd.notnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
sim[i][j] = 0.
# {instant, interval} => meets, during
if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
if(st2 <= st1 <= ed2):
sim[i][j] = 1. #a time is between two other times
else:
sim[i][j] = 0.
# {interval, instant} => meets, contains
if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
if(st1 <= st2 <= ed1):
sim[i][j] = 1. #a time is between two other times
else:
sim[i][j] = 0.
# {interval, interval} => equal, overlaps, not overlaps
if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
if (st1 <= st2 <= ed1) or (st2 <= st1 <= ed2):
intersect = min(ed1,ed2)- max(st1,st2) # earliestend-lateststart
union = max(st1,st2,ed1,ed2) - min(ed1,ed2,st1,st2)
overlaps = intersect/union
#print(intersect/np.timedelta64(1, 'D'),union/np.timedelta64(1, 'D'))
if (st1 > st2 and ed1 < ed2) or (st1 < st2 and ed1 > ed2): # contains, during
overlaps = 1.0
sim[i][j]=overlaps
else:
sim[i][j] = 0.
else:
sim[i][j] = 1.