
I am using the audio-analysis tools from the Spotify API (via the spotipy wrapper, as `sp`) to process tracks, with the following code:

def loudness_drops(track_ids):

    names = set()
    tids = set()
    tracks_with_drop_name = set()
    tracks_with_drop_id = set()

    for id_ in track_ids:
        track_id = sp.track(id_)['uri']
        tids.add(track_id)
        track_name = sp.track(id_)['name']
        names.add(track_name)
        # get audio features
        features = sp.audio_features(tids)
        # and then the audio analysis urls
        urls = {x['analysis_url'] for x in features if x}
        print len(urls)
        # fetch analysis data
        for url in urls:
            # print len(urls)
            analysis = sp._get(url)
            # extract segment start times and loudness peaks from the analysis
            x = [_['start'] for _ in analysis['segments']]
            print len(x)
            l = [_['loudness_max'] for _ in analysis['segments']]
            print len(l)
            # get max and min values
            min_l = min(l)
            max_l = max(l)
            # normalize stream
            norm_l = [(_ - min_l) / (max_l - min_l) for _ in l]
            # define silence as a value below 0.1
            silence = [l[i] for i in range(len(l)) if norm_l[i] < .1]
        # more than one silence means one of them happens in the middle of the track
        if len(silence) > 1:
            tracks_with_drop_name.add(track_name)
            tracks_with_drop_id.add(track_id)
    return tracks_with_drop_id

The code works, but if the number of songs I search for is set to, say, `limit=20`, the time it takes to process all the audio segments `x` and `l` makes the process too expensive, e.g.:

Timing with `time.time()` gives 452.175742149 seconds.
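A minimal sketch of how such a measurement could look (the search query is a placeholder; only the `time.time()` bracketing around `loudness_drops` matters):

import time

# hypothetical source of track ids; any search returning ~20 tracks will do
results = sp.search(q='some query', type='track', limit=20)
track_ids = [t['id'] for t in results['tracks']['items']]

start = time.time()
drops = loudness_drops(track_ids)
print(time.time() - start)   # e.g. 452.175742149 for 20 tracks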

QUESTION:

How can I drastically reduce the complexity here?

I've tried to use sets instead of lists, but working with set objects prohibits indexing.
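For illustration (generic Python, not specific to this code): sets are unordered and cannot be indexed, but converting to a list restores positional access.

loudness = {-23.1, -5.4, -60.0}          # illustrative values only
# loudness[0]                            # TypeError: 'set' object does not support indexing
loudness_list = sorted(loudness)         # list(loudness) also works, but has no defined order
print(loudness_list[0])                  # positional access is possible again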

EDIT: 10 URLs:

[u'https://api.spotify.com/v1/audio-analysis/5H40slc7OnTLMbXV6E780Z', u'https://api.spotify.com/v1/audio-analysis/72G49GsqYeWV6QVAqp4vl0', u'https://api.spotify.com/v1/audio-analysis/6jvFK4v3oLMPfm6g030H0g', u'https://api.spotify.com/v1/audio-analysis/351LyEn9dxRxgkl28GwQtl', u'https://api.spotify.com/v1/audio-analysis/4cRnjBH13wSYMOfOF17Ddn', u'https://api.spotify.com/v1/audio-analysis/2To3PTOTGJUtRsK3nQemP4', u'https://api.spotify.com/v1/audio-analysis/4xPRxqV9qCVeKLQ31NxhYz', u'https://api.spotify.com/v1/audio-analysis/1G1MtHxrVngvGWSQ7Fj4Oj', u'https://api.spotify.com/v1/audio-analysis/3du9aoP5vPGW1h70mIoicK', u'https://api.spotify.com/v1/audio-analysis/6VIIBKYJAKMBNQreG33lBF']
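
For a quick sense of where the time goes, a single analysis fetch from the list above can be timed in isolation (`sp._get` is the same internal spotipy helper already used in the code):

import time

url = 'https://api.spotify.com/v1/audio-analysis/5H40slc7OnTLMbXV6E780Z'

start = time.time()
analysis = sp._get(url)                  # one audio-analysis request
print(time.time() - start)               # time spent on the network call alone
print(len(analysis['segments']))         # segment count for this track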
8-Bit Borges
  • `_` is a terrible name. Why would you use that? – juanpa.arrivillaga Oct 19 '16 at 17:07
  • is that relevant to the issue? – 8-Bit Borges Oct 19 '16 at 17:08
  • Can you save a set of tuples of the form `(l,norml)` ? Or better yet, a `dict` of the form `{l:norm}` – juanpa.arrivillaga Oct 19 '16 at 17:10
  • Have you run a profiler? –  Oct 19 '16 at 17:10
  • @data_garden You come here and ask us to help you with your work. It's in your interest to make it easy for people to read your code. – juanpa.arrivillaga Oct 19 '16 at 17:12
  • I think this would benefit from being made more generic. Several things you're referring to seem only applicable to `spotify` and restrict people helping. – roganjosh Oct 19 '16 at 17:13
  • @JETM no, how could I do that? – 8-Bit Borges Oct 19 '16 at 17:14
  • @juanpa.arrivillaga sorry, the snippet is not mine, I'm only using it in the context of something else. – 8-Bit Borges Oct 19 '16 at 17:17
  • How long does it take for that 452 seconds example if you remove all code after `analysis = sp._get(url)`, i.e., if you only do **that**? – Stefan Pochmann Oct 19 '16 at 17:21
  • How many segments do you have? Honestly, it's hard to say if the bottleneck is `list` vs `set`. Indeed. Given 20 songs, I would guess that isn't your issue. You should run a profiler, as @JETM suggested. In other words, your issue isn't likely to be *asymptotic time complexity*, rather, you are in the weeds of constant factors. – juanpa.arrivillaga Oct 19 '16 at 17:21
  • @juanpa.arrivillaga I don't know how to do that... I have printed `x` and `l`, and the number of segments is **HUGE**. That's where it gets stuck. – 8-Bit Borges Oct 19 '16 at 17:23
  • @StefanPochmann read my mind. I think the issue might be related to the `get` because if you're only getting 20 values back then the list comprehensions should be done super quick. If the bottleneck is the `get` then you could try using `requests` library, open a `Session` and try get the response into something that `sp` can understand. – roganjosh Oct 19 '16 at 17:23
  • @data_garden Then **don't** print `x` and `l` but print `len(l)` and tell us that. – Stefan Pochmann Oct 19 '16 at 17:26
  • Can you provide an example of a search that will take this long? I'm looking at the API docs and it seems pretty compact; enough so that we can replicate without tonnes of code. – roganjosh Oct 19 '16 at 17:28
  • @StefanPochmann `len` 2073 2073 2501 2501 2073 2073 2098 2098 2501 2501 2073 2073 2098 2098 2501 2501 2008 2008 2073 2073 2098 2098 2071 2071 2501 2501 2008 2008 2073 2073 2071 2071 3731 3731 2073 2073 2501 2501 2098 2098 2008 2008 1707 1707 2071 2071 3731 3731 2073 2073 2501 2501 2098 2098 2008 2008 1707 1707 2177 2177 2073 2073 3731 3731 2071 2071 2501 2501 2098 2098 2008 2008 2608 2608` and counting – 8-Bit Borges Oct 19 '16 at 17:30
  • Please see my previous comment. And also, how many URLs are you searching? – roganjosh Oct 19 '16 at 17:32
  • @data_garden Huh? Those are way more than 20 values, and they're not really huge. Also, what about my earlier question about testing just that `sp._get(url)` line? – Stefan Pochmann Oct 19 '16 at 17:33
  • @roganjosh `len(url)` gives me 64. – 8-Bit Borges Oct 19 '16 at 17:33
  • @StefanPochmann I think it might be because the whole thing is encased in `for url in urls:` and there could be any number of lookups there. – roganjosh Oct 19 '16 at 17:34
  • @data_garden wrong person, there's a few of us looking at this :) Please can you just give a repeatable example so that we can test ourselves? If `urls` is a gigantic list, give a selection of 10 and tell us how long your real list is. Then we can gauge roughly how long our changes take to execute overall. – roganjosh Oct 19 '16 at 17:35
  • @roganjosh I deeply appreciate it! would you like me to edit it and display the whole code here? – 8-Bit Borges Oct 19 '16 at 17:39
  • @data_garden please. The API looks compact so it shouldn't be hard to make a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) with 10 URLs so we can actually see the bottleneck ourselves. – roganjosh Oct 19 '16 at 17:43
  • @roganjosh Judging by their [previous question](http://stackoverflow.com/q/39323185/1672429) that they "forgot" to tell us about as well as the answer there (where the code is from) and [this documentation](https://developer.spotify.com/web-api/get-audio-features/) I get the feeling that it's one URL per song. – Stefan Pochmann Oct 19 '16 at 17:44
  • @roganjosh there you go, help yourself, thank you – 8-Bit Borges Oct 19 '16 at 17:46
  • This isn't an MCVE because I asked twice for 10 URLs and something that we can copy/paste. You don't show how you call these functions. @StefanPochmann well at least we know where all the `_` comes from in the code I guess. – roganjosh Oct 19 '16 at 17:48
  • @roganjosh I don't see what you mean. This is what you need to run it... sorry, I'm a noob. – 8-Bit Borges Oct 19 '16 at 17:50
  • @data_garden so if I copy/paste this into my code editor and run it, I'll suddenly start searching spotify for songs? I suspect what will happen is that it will do nothing at all. I'm asking for some example `track_ids` that get passed to `loudness_drops`. At this point I'm thinking it's going to be near impossible to help you out on this. – roganjosh Oct 19 '16 at 17:52
  • @roganjosh ah ok, 1 sec – 8-Bit Borges Oct 19 '16 at 17:53
  • @data_garden That's what an MCVE is. I should be able to copy/paste and just click run to see the issue myself. – roganjosh Oct 19 '16 at 17:54
  • @roganjosh there you go again, refer to the edit. I hope this helps. Thank you for your kindness and patience. – 8-Bit Borges Oct 19 '16 at 18:04

1 Answer


This is what I see, not knowing much about spotify:

for id_ in track_ids:
    # this runs N times, where N = len(track_ids)
    ...
    tids.add(track_id)  # tids contains all track_ids processed until now
    # in the end: len(tids) == N
    ...
    features = sp.audio_features(tids)
    # features contains features of all tracks processed until now
    # in the end, I guess: len(features) == N * num_features_per_track

    urls = {x['analysis_url'] for x in features if x}
    # very probably: len(urls) == len(features)

    for url in urls:
        # for the first track, this processes features of the first track only
        # for the second track, this processes features of 1st and 2nd
        # etc.
        # in the end, this loop repeats N * N * num_features_per_track times

You should not process any url twice. And you do, because you keep all tracks in `tids` and then, for each track, you process everything in `tids`, which turns the complexity of this into O(n²).

In general, always look for loops inside loops when trying to reduce complexity.

I believe in this case this should work, if audio_features expects a set of ids:

# replace this: features = sp.audio_features(tids)
# with:
features = sp.audio_features({track_id})
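
For illustration, a minimal sketch of the whole loop restructured along those lines, so that each track's features and analysis are fetched exactly once (assuming `sp` is the authenticated spotipy client from the question; `audio_features` also accepts a list of ids, so the calls could be batched further):

def loudness_drops(track_ids):
    tracks_with_drop_id = set()
    for id_ in track_ids:
        track = sp.track(id_)                             # one metadata call per track
        features = sp.audio_features([id_])               # features for this track only
        if not features or not features[0]:
            continue
        analysis = sp._get(features[0]['analysis_url'])   # one analysis fetch per track
        loud = [seg['loudness_max'] for seg in analysis['segments']]
        min_l, max_l = min(loud), max(loud)
        if max_l == min_l:                                # avoid division by zero on flat tracks
            continue
        norm = [(v - min_l) / (max_l - min_l) for v in loud]
        silence = [v for v in norm if v < .1]             # same "silence below 0.1" definition
        if len(silence) > 1:
            tracks_with_drop_id.add(track['uri'])
    return tracks_with_drop_id

This keeps the work per track constant, so the total cost grows linearly with the number of tracks instead of quadratically.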
zvone
  • Great. Now it takes `96.9759500027` seconds, about `4.66x` faster. – 8-Bit Borges Oct 19 '16 at 18:30
  • @data_garden The next thing is to try running the profiler, as others suggested. See the example in [`profile.Profile`](https://docs.python.org/2/library/profile.html#profile.Profile). – zvone Oct 19 '16 at 18:33
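
A minimal cProfile sketch along the lines of that suggestion (wrapping the `loudness_drops` call; `track_ids` is whatever list is being tested):

import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
loudness_drops(track_ids)                     # the function from the question
pr.disable()

stats = pstats.Stats(pr).sort_stats('cumulative')
stats.print_stats(10)                         # top 10 calls by cumulative time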