how to make this function more (time) efficient?

Question

i ve got a dataframe series which contain sentences. (some are kind of long)

i ve also got 2 dictionaries which contain words as keys and ints as count.

Not all words from strings are present in both dictionaries. Some are in only one, some are in neither.

Dataframe is 124011 units long. function is taking me about 0.4 per string. which is waaaay to long.

W is just a reference value for the dictionary (weights = {}, weights[W] = {})

here is the function:

def match_share(string, W, weights, rel_weight):

    words = string.split()

    words_counts = Counter(words)

    ratios = []

    for word in words:

        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):

            if (weights[W][word]!=0):

                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])

        else:

            ratios.append(0)

    if len(words)>0:

        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)

    return ratio

thx

I suggest removing the double spacing and the duplicate code. Then add one or more example function calls, including some edge cases, with timings. Can you replace use of `np` to make code self-contained? — Terry Jan Reedy, Apr 02 '17 at 23:31

score 1 · Answer 1 · edited May 23 '17 at 12:17

1

I think that your time inefficiency may be coming from the fact you are using Counter instead of the dict. Some discussion here suggests that the dict class has parts written in pure c, while counter is written in python.

I suggest altering your code to using a dict and test to see if that provides a faster time

Also why is this code duplicated?:

words = string.split()

words_counts = Counter(words)

words = string.split()

words_counts = Counter(words)

ratios = []

edited May 23 '17 at 12:17

Community

1
1

answered Apr 02 '17 at 23:57

kopo222

119
11

ty. bad copy and paste. kind of sleepy. but it doesnt affect function process time at all – epattaro Apr 03 '17 at 00:05

score 1 · Accepted Answer · answered Apr 03 '17 at 00:09

Let's clean it up a bit:

def match_share(string, W, weights, rel_weight):

    words = string.split()

    words_counts = Counter(words)

    words = string.split()

    words_counts = Counter(words)

That's redundant! Replace 4 statements with 2:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)

    ratios = []

    for word in words:    

        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):

            if (weights[W][word]!=0):

                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])

        else:

            ratios.append(0)

I don't know what you think that code does. I hope you're not being tricky. But .keys returns an iterable, and X in <iterable> is WAY slower than X in <dict>. Also, note: you don't append anything if the innermost (weights[W][word] != 0) condition fails. That might be a bug, since you try to append 0 in another else condition. (I don't know what you're doing, so I'm just pointing it out.) And this is Python, not Perl or C or Java. So no parens required around if <test>:

Let's go with that:

    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if len(words)>0:

        ratios = np.divide(ratios, float(len(words)))

You're trying to prevent dividing by zero. But you can use the truthiness of a list to check this, and avoid the comparison:

    if words:
        ratios = np.divide(ratios, float(len(words)))

The rest is fine, but you don't need the variable.

    ratio = np.sum(ratios)

    return ratio

With those mods applied, your function looks like this:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Looking at it at little harder, I see you're doing this:

word_counts = Counter(words)

for word in words:
    append(   word_counts[word] * ...)

According to me, that means if "apple" appears 6 times, you are going to append 6*... to the list, once for each word. So you'll have 6 different occurrences of 6*... in your list. Are you sure that's what you want? Or should it be for word in word_counts to just iterate over the distinct words?

Another optimization is to remove lookups from inside your loop. You keep looking up weights[W] and rel_weight[W], even though the value of W never changes. Let's cache those values outside the loop. Also, let's cache a pointer to the ratios.append method.

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]

    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:    
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])

        else:
            ratios_append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Try that, see how it works. Please look at the bold note above, and the questions. There might be bugs, there might be more ways to speed up.

you sir, are a blessing. ty alot for the solution and the great explanation. ps: nice catch on "for word in word_counts" — epattaro, Apr 03 '17 at 00:12
Did it work? And did you answer the questions (bugs, append, counts)? — aghast, Apr 03 '17 at 00:13
yes. it worked great. went from 10 hours function to 10 seconds. the ratio.append pointer was a new one for me, nice trick. — epattaro, Apr 03 '17 at 00:18
Looking at it, you can probably distribute the division. Just compute the np.sum and then return ratio/len(words). — aghast, Apr 03 '17 at 00:21
Or, if `np` is short for `numpy`, you could just return `numpy.mean` of ratios. — aghast, Apr 03 '17 at 00:23

score 1 · Answer 3 · answered Apr 03 '17 at 00:12

It would be good if you had a profile of that function execution, but here are some generic ideas:

You needlessly get some elements on every iteration. You can extract these before the loop

For example

weights_W = weights[W]
rel_weights_W = rel_weights[W]

You don't need to call .keys() on dicts.

These are equivalent:

word in weights_W.keys()
word in weights_W

Attempt to get values without looking them up first. This will save you one lookup.

For example instead of:

if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
        if (weights[W][word]!=0):

you can do:

word_weight = weights_W.get(word)
if word_weight is not None:
    word_rel_weight = rel_weights_W.get(word)
    if word_rel_weight is not None:
        if word_weight != 0:  # lookup saved here

how to make this function more (time) efficient?

3 Answers3