My input is a string in this (spintax) format,
"The {PC|Personal Computer|Desktop} is in {good|great|fine|excellent} condition"
Then using itertools, I generate all possible combinations. e.g.
"The PC is in good condition"
"The PC is in great condition"
.
.
.
"The Desktop is in excellent condition"
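For reference, the expansion step can be sketched with itertools.product like this (expand_spintax is a hypothetical helper name for illustration, not my actual code):

```python
import itertools
import re

def expand_spintax(template):
    # Split into literal text and {a|b|c} option groups, keeping the groups
    parts = re.split(r'(\{[^{}]*\})', template)
    # Each {…} group becomes its list of options; literal text is a 1-option list
    options = [p[1:-1].split('|') if p.startswith('{') else [p] for p in parts]
    # Cartesian product over all option lists yields every combination
    return [''.join(combo) for combo in itertools.product(*options)]
```

With the example template above, this yields 3 x 4 = 12 strings.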
Out of these strings, I only want to keep the most unique ones, based on a similarity threshold, e.g. only keep strings with a similarity of less than 60%. I used difflib's SequenceMatcher, but it does not scale to large data sets (250K+ items) because of the pairwise looping. Here is the current implementation:
from difflib import SequenceMatcher

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  # 40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions
I am shortening the list on (almost) every iteration to speed up the process, but I definitely need a faster algorithm to tackle this. I tried cosine similarity too, but ran into scaling issues there: it worked fine for about 10K items, but above that it froze my machine. Here's the implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)
val = cosine_similarity(tfidf_matrix[:10000], tfidf_matrix[:10000])
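To make concrete what I mean by "pick the most unique strings", here is a rough sketch of a greedy variant I am considering: each candidate is compared only against the strings kept so far, so the full n x n similarity matrix is never materialized (filter_unique is a hypothetical name, not working code from my project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_unique(descriptions, max_similar=0.6):
    """Greedy uniqueness filter: keep a string only if its cosine
    similarity to every already-kept string is <= max_similar."""
    tfidf = TfidfVectorizer().fit_transform(descriptions)
    keep = []  # indices of descriptions we decide to keep
    for i in range(len(descriptions)):
        if keep:
            # Compare candidate i against the kept rows only
            sims = cosine_similarity(tfidf[i], tfidf[keep]).ravel()
            if sims.max() > max_similar:
                continue  # too similar to something already kept
        keep.append(i)
    return [descriptions[k] for k in keep]
```

This is O(n * k) in comparisons (k = number of kept strings) rather than a full O(n^2) matrix, but I suspect it is still too slow at 250K items, which is why I am asking.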
Any optimized solution for this? All I want is to pick the n most unique strings from the list.