Let's clean it up a bit:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
That's redundant! Replace 4 statements with 2:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
Next:
ratios = []
for word in words:
if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
if (weights[W][word]!=0):
ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
else:
ratios.append(0)
I don't know what you think that code does. I hope you're not being tricky. But .keys
returns an iterable, and X in <iterable>
is WAY slower than X in <dict>
. Also, note: you don't append anything if the innermost (weights[W][word] != 0
) condition fails. That might be a bug, since you try to append 0 in another else condition. (I don't know what you're doing, so I'm just pointing it out.) And this is Python, not Perl or C or Java. So no parens required around if <test>:
Let's go with that:
ratios = []
for word in words:
if word in weights[W] and word in rel_weight[W]:
if weights[W][word] != 0:
ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
else:
ratios.append(0)
Next:
if len(words)>0:
ratios = np.divide(ratios, float(len(words)))
You're trying to prevent dividing by zero. But you can use the truthiness of a list to check this, and avoid the comparison:
if words:
ratios = np.divide(ratios, float(len(words)))
The rest is fine, but you don't need the variable.
ratio = np.sum(ratios)
return ratio
With those mods applied, your function looks like this:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
ratios = []
for word in words:
if word in weights[W] and word in rel_weight[W]:
if weights[W][word] != 0:
ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
else:
ratios.append(0)
if words:
ratios = np.divide(ratios, float(len(words)))
ratio = np.sum(ratios)
return ratio
Looking at it at little harder, I see you're doing this:
word_counts = Counter(words)
for word in words:
append( word_counts[word] * ...)
According to me, that means if "apple" appears 6 times, you are going to append 6*... to the list, once for each word. So you'll have 6 different occurrences of 6*... in your list. Are you sure that's what you want? Or should it be for word in word_counts
to just iterate over the distinct words?
Another optimization is to remove lookups from inside your loop. You keep looking up weights[W]
and rel_weight[W]
, even though the value of W
never changes. Let's cache those values outside the loop. Also, let's cache a pointer to the ratios.append
method.
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
ratios = []
# Cache these values for speed in loop
ratios_append = ratios.append
weights_W = weights[W]
rel_W = rel_weight[W]
for word in words:
if word in weights_W and word in rel_W:
if weights_W[word] != 0:
ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
else:
ratios_append(0)
if words:
ratios = np.divide(ratios, float(len(words)))
ratio = np.sum(ratios)
return ratio
Try that, see how it works. Please look at the bold note above, and the questions. There might be bugs, there might be more ways to speed up.