Matching overlapping Intervals (Dictionary) in Python

Question

So, I was trying to solve some random problem via python and got stuck on the logic. Here's the problem : I have multiple videos along with their run time or running length. Now, I want to maintain 2 lists. 1 list if "synced" and the other one is "non synced". We decide "synced", if the difference of run time of the streams is less than or equal to 2 seconds. Otherwise, they're not synced. If we have multiple streams that match, then we take the streams with the highest match count/number.

I was able to come up with a very simple/slow method to part and pair these files. However, my logic failed when I got a different data set.

This is what I have written :

from datetime import datetime


# same_streams_old = {
  "Stream_1": "0:24:08.925167",
  "Stream_2": "0:24:08.990644",
  "Stream_3": "0:24:08.990644",
  "Stream_4": "0:24:12.118778",
  "Stream_5": "0:24:12.118778",
  "stream_6": "0:24:10.075066"
}
same_streams = {
  "Stream_1": "0:24:08.925167",
  "Stream_2": "0:24:12.118778",
  "Stream_3": "0:23:11.057711",
  "Stream_4": "0:24:12.118778",
  "Stream_5": "0:24:10.075066",
  "Stream_6": "0:24:08.990644"
}

keys = []
values = []
final_synced_video_files = []
final_non_synced_video_files = []

def get_time_diff(episode_run_time, episode_time):
    prev_episode_time = datetime.strptime(episode_run_time, '%H:%M:%S.%f')
    current_episode_time = datetime.strptime(episode_time, '%H:%M:%S.%f')

    time_diff = prev_episode_time - current_episode_time
    if current_episode_time > prev_episode_time:
        time_diff = current_episode_time - prev_episode_time

    return float(time_diff.seconds)

for key, value in same_streams.items():
    keys.append(key)
    values.append(value)
for key in keys:
    for _key in keys:
        if key != _key:
            diff = get_time_diff(same_streams[key], same_streams[_key])
            if diff <= 1.5:
                final_synced_video_files.append(key)
            else:
                pass

final_synced_video_files = list(set(final_synced_video_files))
final_non_synced_video_files = list(set(keys) - set(final_synced_video_files))

print("Synced Files : {0}".format(final_synced_video_files))
print("Non Synced Files : {0}".format(final_non_synced_video_files))

REPL Link

As you can see that the streams with most matches are stream_1, stream_2, stream_3 and stream_6.

What I've written doesn't compare the maximum counts yet. However, as I work on this, I feel like this is not really effective and a good way to solve this. Any inputs anyone?

I tried some approach on overlapping intervals and then got this : REPL LINK

But, if you run see both the same_streams dictionary, you'll see results are not what I'm trying to achieve. Any help with this would be great.

EDIT: I need to get the streams that have difference of 2 seconds with each other. For example : In same_streams_old, desired result would be streams 1,2,3 & 6. However, in the dictionary same_streams, desired result is streams 2,4 & 5.

Basically, I need to see which streams can be "muxed" together and which ones can't be muxed.

Can you clarify what it means to be *synced*? If a video is in the synced list, does it have to be within 2 second of **all** other vids in the list, or just at least one? Or do you mean within 2 seconds of the current (real) time? Please, also explain what the **match count** is. — Wololo, Dec 18 '19 at 07:54
I just need to check which streams fall into each other's 2 second mark. I'll explain a little more in the question. — Xonshiz, Dec 18 '19 at 08:24

Wololo · Accepted Answer · 2020-01-14T14:38:18.357

Ok, so the print routines below might be a little messy, but bear with me, they are only for debugging and you might not need them when done. I know this is a long answer, but please read it carefully ..

SHORT ANSWER:

Your problem might come from the fact that you lose the decimal precision when you convert from string to datetime, which treats seconds as integers. However, the timedelta object has a method named total_seconds() which does provide sub-second resolution. See this or the general docs for details. Simply change the return statement of get_time_diff() to,

 return float(time_diff.total_seconds())

MOTIVATION FOR LONG ANSWER:

I'm not sure what you're trying to achieve with the (non- and) synced list(s): You may have a situation that stream a is synced with stream b, while stream c is synced with d, but c and d are not synced with a and b. Should they all be in your synced_list? Depending on what you want to do with the lists, I would consider using the synced matrix described below, instead of your list(s) as they lose much information.

LONG ANSWER:

Let me introduce the concept of a Synced Matrix. It will give a complete description of which of your streams are synced with each other:

THE SYNC MATRIX: A symmetric matrix; Cell (i,j) in the matrix is TRUE if, and only if, stream 'i' and 'j' are in sync. Else, the cell value is FALSE. Hence, the diagonal (.) is entirely TRUE because a stream is always in sync with itself.
     1 2 3 4
    ________
 1 | . T T F
 2 |   . T F
 3 |     . F
 4 |       .
"T" is true, and "F" is false: obviously from the example drawing above, stream 1 is in sync with stream 2, but not in sync with stream 4.

The creation of such a Synced Matrix is quite simple for your example:

def is_synced(key_1, key_2):    
    max_allowed_desync = 1.5
    return max_allowed_desync > get_time_diff(same_streams[key_1], same_streams[key_2])

keys = same_streams.keys()
keys.sort() # VERY IMPORTANT, for the synced matrix to be constructed correctly; also make 's' uppercase for "stream_6" in OP.

# The complete matrix ..
full_sync_matrix = [[is_synced(k1,k2) for k2 in keys] for k1 in keys]

# We can optimize (memory usage) to only get the half matrix, since it's symmetric anyway; also excluding the diagonal.
half_sync_matrix = [[is_synced(k1,k2) for k2 in keys[curr+1:]] for curr,k1 in enumerate(keys)]

Now, let's implement two functions for printing/displaying the synced matrix:

# Print a HALFED sync matrix
def print_sync_half_matrix(sm):
    string = ""
    for i,row in enumerate(sm):
        string += "\n" + " "*i*2
        for col in row:
            string += " " + ("T" if col else "F")
    print(string)

# Print a COMPLETE sync_matrix
def print_sync_full_matrix(sm):
    string = ""
    for row in sm:
        string += "\n"
        for col in row:
            string += " " + ("T" if col else "F")
    print(string)

Then, for the data set you provided I get:

same_streams = {
  "Stream_1": "0:24:08.925167",
  "Stream_2": "0:24:08.990644",
  "Stream_3": "0:24:08.990644",
  "Stream_4": "0:24:12.118778",
  "Stream_5": "0:24:12.118778",
  "Stream_6": "0:24:10.075066"
} # note that "Stream_6" previously had a lower case 's'!

print_sync_half_matrix(half_sync_matrix)
#   1 2 3 4 5 6
# 1   T T F F T
# 2     T F F T
# 3       F F T
# 4         T F
# 5           F

Remember that the diagonal is not included in the matrix/print! The result here is correct and as expected from the input. Let us print out some time differences to get more intelligence,

for stream_key in same_stream:
    print("Stream_1 ~ "+stream_key+": "+str(get_time_diff(same_streams["Stream_1"], same_streams[stream_key])))

.. which quickly reveals that your time stamps has lost its decimal precisions:

Stream_1 ~ Stream_5: 3.0
Stream_1 ~ Stream_4: 3.0
# ...

If we look into the documentation of datetime we see that it treats time as an integer of seconds. Hence, the microseconds precision is lost when you request seconds from the datetime object in the get_time_diff function. Simply solved by requesting seconds from the deltatime method .total_seconds() instead..

I can live by dropping the microseconds. I'm not at home right now. Will try to understand this and implement in my code later to get the final synced streams.Thanks, I'll check it out in a while. — Xonshiz, Dec 18 '19 at 10:51
@Xonshiz Actually, you get the fraction of the second part, if you use the `totals_seconds()` method, instead of the `seconds` member variable. I've updated the answer .. — Wololo, Dec 18 '19 at 11:35
Hey! Just went through your code and it's more or less what I wanted to achieve. So, I'll take your code and get the data I'll be needing to make "sync pairs" and then check it. Will accept the answer soon and also will add a complete solution to what I wanted to achieve. Thanks a ton! This gave me a direction to work in. — Xonshiz, Dec 18 '19 at 14:52
Thanks for the direction. I took your approach and some of the code and changed it to suit my requirements and did what I wanted. You can check it here : https://repl.it/@Xonshiz/DecentWarpedActiveserverpages . I'll put up a new answer with the final code so I can mark it as answer. If you have any suggestions to make my final code any better, please update your answer and I'll mark it as answer. THANKS AGAIN! — Xonshiz, Dec 18 '19 at 17:34
Happy I could help. I recommend signing up for [StackExchange Code Review](https://codereview.stackexchange.com/) ... where you can post code that runs correctly, but get help on optimizing time and memory consumption, improve readability of your code, and more. — Wololo, Dec 19 '19 at 07:06
.. I'm still unable to figure out your definition of synchronization, though. In the last *repl* that you posted you have streams with (approx) times `[8.9, 12.1, 11.1, 12.1, 10.1, 8.9] ]`, and your script concludes that stream 1, 5 and 6 are in sync, while 3, 2 and 4 is not. Can explain why? As far as I can see in this example, stream 2 is in sync with stream 4. Regarding the code itself, I would just recommend writing a separate function for building the `sync_list`: not do it inside the print_function. — Wololo, Dec 19 '19 at 07:41

Matching overlapping Intervals (Dictionary) in Python

1 Answers1