My input is a string in this (spintax) format,
"The {PC|Personal Computer|Desktop} is in {good|great|fine|excellent} condition"
Then using itertools, I generate all possible combinations. e.g.
"The PC is in good condition"
"The PC is in great condition"
.
.
.
"The Desktop is in excellent condition"
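For reference, the expansion step can be sketched with itertools.product like this (expand_spintax is a hypothetical helper name for illustration, not my actual code):

```python
import itertools
import re

def expand_spintax(template):
    # Split into literal text and {a|b|c} option groups, keeping the groups
    parts = re.split(r'(\{[^{}]*\})', template)
    # Each {…} group becomes its list of options; literal text is a 1-option list
    options = [p[1:-1].split('|') if p.startswith('{') else [p] for p in parts]
    # Cartesian product over all option lists yields every combination
    return [''.join(combo) for combo in itertools.product(*options)]
```

With the example template above, this yields 3 x 4 = 12 strings.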
Out of these strings, I only want to keep the most unique ones, based on a similarity threshold, e.g. only keep strings with a similarity of less than 60%. I used difflib's SequenceMatcher, but it does not scale to large data sets (250K+ items) because of the pairwise looping. Here is the current implementation:
from difflib import SequenceMatcher

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions
I am shortening the list on (almost) every iteration to speed up the process, but I definitely need a faster algorithm to tackle this. I tried cosine similarity too, but ran into scaling issues there: it worked fine for about 10K items, but above that it froze my machine. Here's the implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)
val = cosine_similarity(tfidf_matrix[:10000], tfidf_matrix[:10000])
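To make concrete what I mean by "pick the most unique strings", here is a rough sketch of a greedy variant I am considering: each candidate is compared only against the strings kept so far, so the full n x n similarity matrix is never materialized (filter_unique is a hypothetical name, not working code from my project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_unique(descriptions, max_similar=0.6):
    """Greedy uniqueness filter: keep a string only if its cosine
    similarity to every already-kept string is <= max_similar."""
    tfidf = TfidfVectorizer().fit_transform(descriptions)
    keep = []  # indices of descriptions we decide to keep
    for i in range(len(descriptions)):
        if keep:
            # Compare candidate i against the kept rows only
            sims = cosine_similarity(tfidf[i], tfidf[keep]).ravel()
            if sims.max() > max_similar:
                continue  # too similar to something already kept
        keep.append(i)
    return [descriptions[k] for k in keep]
```

This is O(n * k) in comparisons (k = number of kept strings) rather than a full O(n^2) matrix, but I suspect it is still too slow at 250K items, which is why I am asking.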
Any optimized solution for this? All I want is to pick the n most unique strings from the list.