Python - Compute similarities between items based on temporal information (instant, interval)

Question

I want to compute similarity between items (0,1,2,3..) based on their temporal information. Temporal information may be time instant (startdate), time interval (startdate,enddate) or null (NaT); see an example of dataframe (df_for) below.

{instant,instant} : if equal sim = 1, else sim =0
{instant,null} or vice versa, sim =0
{instant, interval}: if instant within interval, sim =1 or if an interval contains an instant, sim = 1
{interval,interval} : if intervals overlaps, sim = intersection of both intervals / union of both intervals
{interval,interval} : if an interval contains another, then sim = 1

The following python code obtains temporal information from the dataframe and performs the conditions above (1-5). The code is verbose, i wonder if there is a smart way/lib to calculate similarity between time periods and time instants using python.

m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
    for j in range(m):
        if i != j:
            st1 = data[i][0]
            ed1 = data[i][1]
            st2 = data[j][0]
            ed2 = data[j][1]
            #both items are null values
            if pd.isnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, instant} => equal, not equal
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                if st1 == st2:
                    sim[i][j] = 1.
                else:
                    sim[i][j] = 0.
            # {instant, null} => sim is 0
            if pd.notnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, interval} => meets, during
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
                    if(st2 <= st1 <= ed2):
                        sim[i][j] = 1. #a time is between two other times
                    else:
                        sim[i][j] = 0.
            # {interval, instant} => meets, contains
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                    if(st1 <= st2 <= ed1):
                        sim[i][j] = 1. #a time is between two other times
                    else:
                        sim[i][j] = 0.
            # {interval, interval} => equal, overlaps, not overlaps
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.notnull(ed2): 
                if (st1 <= st2 <= ed1) or (st2 <= st1 <= ed2):
                    intersect = min(ed1,ed2)- max(st1,st2) # earliestend-lateststart
                    union = max(st1,st2,ed1,ed2) - min(ed1,ed2,st1,st2)
                    overlaps = intersect/union
                    #print(intersect/np.timedelta64(1, 'D'),union/np.timedelta64(1, 'D'))
                    if (st1 > st2 and ed1 < ed2) or (st1 < st2 and ed1 > ed2): # contains, during
                        overlaps = 1.0
                    sim[i][j]=overlaps  
                else:
                    sim[i][j] = 0.  
        else:
            sim[i][j] = 1.

Blckknght · Accepted Answer · 2017-01-09T22:57:10.977

I can see several ways to simplify the code.

First, I'd suggest moving the comparison code to a separate function that takes st1, ed1, st2, and ed2 as arguments. (As a side note, why st (start time) but ed (end date)? Consistent names might be nice.) You'd be able to return your result to the calling code, which would be responsible for doing the assignment into the array of results.
Speaking of that calling code... It doesn't need to call the comparison function for every pair of time-ranges. The results should always be symmetrical (e.g. compare(data[i][0], data[i][1], data[j][0], data[j][1]) is going to return the same thing as compare(data[j][0], data[j][1], data[i][0], data[i][1])). So just call one of those, and assign the results to both sim[i][j] and sim[j][i].
Rather than lots of if some_test_expression: return 1; else: return 0 blocks, just return the results of the comparison: return some_test_expression. Python's bool type is a subclass of int, so these should convert to numbers automatically when you assign them into a numpy array (you could also cast to float explicitly if you want the function to always return the same type).
Handle all cases with nulls at the top. They're an easy case, and if you deal with them first (and return, so the rest of the function doesn't run), you don't need to check as many things for null later. You don't need to handle null/null, null/instantand null/interval separately, just return zero if either start value is null: if isnull(st1) or isnull(st2): return 0.
Instant/interval comparisons are symmetrical, so just write the logic one way and flip the parameters around if they're initially in the wrong order: if isnull(ed2): st1, ed1, st2, ed2 = st2, ed2, st1, ed1
I don't see too much that can be improved with the specific implementations of your comparisons between instants and intervals or between two intervals (other than the general things mentioned already above). However, I would like to say that your formula for overlapping intervals will have a big discontinuity in similarity as a small interval slowly overlaps more with a larger one, then gets fully contained. For instance, a one second interval that starts just a millisecond before a one hour interval will result in a similarity value between the two of 0.000277 (0.999 / (3600 + 0.001)), but one that starts at the same time as the larger interval does will have similarity of 1.0 (a big difference). A more continuously changing measure of similarity would come from dividing the overlap amount by the size of the smaller interval (rather than the size of the union). This formula doesn't need a special case for one interval fully containing another, since the default calculation will give a similarity of 1.0 already (the overlap will be the size of the smaller interval, so you'll be dividing it by itself).

So, putting all that together, here's how I'd rewrite things:

def compare_intervals(st1, et1, st2, et2): # note I've renamed the ed's to et
    # all nulls
    if pd.isnull(st1) or pd.isnull(st2):
        return 0.

    # {instant, instant} => equal, not equal
    if pd.isnull(et1) and pd.isnull(et2):
        return float(st1 == st2)

    # {instant, interval} => flip params (to be handled in next block)
    if pd.isnull(et1):
        st1, et1, st2, et2 = st2, et2, st1, et1

    # {interval, instant} => meets, contains
    if pd.isnull(et2):
        return float(st1 <= st2 <= et1)

    # {interval, interval} => equal, overlaps, not overlaps
    if (st1 <= st2 <= et1) or (st2 <= st1 <= et2):
        intersect = min(et1, et2) - max(st1, st2)
        min_interval = min(et1 - st1, et2 - st2) # using minimum instead of union
        return intersect / min_interval

    return 0. # intervals didn't overlap

m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
    for j in range(i, m): # we only iterate on j values >= i
        if i == j:
            sim[i,j] = 1.
        else:
            sim[i,j] = sim[j,i] = compare_intervals(data[i][0], data[i][1],
                                                    data[j][0], data[j][1])

A few notes that I couldn't fit into comments in the code:

No test for the values being intervals is needed at the last step of the compare_interval function, since all other cases (involving instants and nulls) have already been handled.
I've changed the assignments to the sim array to use tuple indexes, as that's more natural than nested indexes for multidimensional numpy arrays (slices don't work if you use nested indexes). You might be able to do the same thing with the data lookups, but I've not changed them since I don't know pandas nearly as well as I know numpy.
Speaking of my lack of pandas knowledge, there is probably a way to apply the comparison function directly to a "join" of the dataframe with itself. I don't know how to do that though. If it exists, it will probably be faster than the nested loops I left mostly unchanged in the calling code (even if it doesn't optimize out the symmetric cases).

thanks for improving the code. is the min_interval referring to 'size of the smaller interval (rather than the size of the union)'? i updated ed1,ed2 in your code to et1,et2.. — kitchenprinzessin, Jan 09 '17 at 09:54
Yeah, that's what that part of the code is intended to do. I missed a few `ed` variables, so I've edited to fix those. — Blckknght, Jan 09 '17 at 22:58
thanks. i wonder if this case (discontinuity in similarity) might apply to {instant,instant} too. — kitchenprinzessin, Jan 10 '17 at 03:25
Sure, any comparison involving an instant is going to be all or nothing, since there's no partial overlapping that can happen with an instant. Two instants are either equal or they are not. One instant falls within an interval, or it falls outside (there's no part way). — Blckknght, Jan 10 '17 at 04:01
yes, instant is atomic..unless i allow like a buffer (+-range) to determine the sim between two instants. this depends on the use case. thanks again. — kitchenprinzessin, Jan 10 '17 at 04:19

score 0 · Answer 2 · answered Jan 09 '17 at 05:23

Not trying to give a complete answer here. I'd create a class to recieve the start and end time and then do all the null not null checks on the constructor. Later on that class I'll overload the comparison operators. Like eq for == and ne for != On the overloads you can try returning integer values instead of True and False. And with all that set I'll rewrite maybe like Instant( data[i][0], data[i][1] ) == Instant( data[j][0], data[j][1])

And derive what you need from there

Python - Compute similarities between items based on temporal information (instant, interval)

2 Answers2