I have this big string with products and some data:
big_string ="""Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective
housing with its corresponding CE marking and manufacturer's certificates group of
recyclable plastic material with 5 extraction outlets."""
I want to calculate a matching score for product names such as:
'OZEO FLAT AUTO 2V' or 'OZEO FLAT H 2'
I've computed a similarity score by counting matching words, but the two products get the same score, because three of the four words in each name appear in the text.
Actual output:
score('OZEO FLAT AUTO 2V', big_string)
[0.75]
score('OZEO FLAT H 2', big_string)
[0.75]
Expected output:
score('OZEO FLAT AUTO 2V', big_string)
[0.9]
score('OZEO FLAT H 2', big_string)
[0.75]
I found some string similarity measures such as Levenshtein or Jaro distance, but they only seem to work well when the two strings have similar lengths. Furthermore, my word counting doesn't work properly, because it sometimes counts words that are not next to each other in the text.
Any thoughts?
My current word-counting score:
import re

big_string = """Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective
housing with its corresponding CE marking and manufacturer's certificates group of
recyclable plastic material with 5 extraction outlets."""
product = 'OZEO FLAT AUTO 2V'
words = product.split()
score = 0.0
for word in words:
    # Count the word only if it appears as a whole whitespace-delimited token.
    # re.escape protects against regex metacharacters in product names.
    if re.search(r"(?<!\S)" + re.escape(word) + r"(?!\S)", big_string):
        score += 1
# Fraction of the product's words found anywhere in the text.
score = round(score / len(words), 2)
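One possible direction for both problems described above (the length mismatch and the non-adjacent words) is to compare the product name only against runs of consecutive words in the text, and keep the best match. The sketch below is a minimal illustration, not a definitive solution: `window_score` is a hypothetical helper name, and it swaps in the standard library's `difflib.SequenceMatcher` as the similarity measure instead of Levenshtein or Jaro.

```python
from difflib import SequenceMatcher

big_string = """Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective
housing with its corresponding CE marking and manufacturer's certificates group of
recyclable plastic material with 5 extraction outlets."""


def window_score(product: str, text: str) -> float:
    """Best similarity between `product` and any run of consecutive words in `text`.

    Hypothetical helper: the window is as many words wide as the product name,
    so the two strings being compared always have comparable lengths.
    """
    product_words = product.split()
    text_words = text.split()
    n = len(product_words)
    best = 0.0
    # Slide a window of n consecutive words across the text.
    for i in range(max(len(text_words) - n + 1, 1)):
        window = " ".join(text_words[i:i + n])
        # Case-insensitive character-level similarity in [0, 1].
        ratio = SequenceMatcher(None, product.lower(), window.lower()).ratio()
        best = max(best, ratio)
    return round(best, 2)


for name in ('OZEO FLAT AUTO 2V', 'OZEO FLAT H 2'):
    print(name, window_score(name, big_string))
```

With this text, 'OZEO FLAT AUTO 2V' scores higher than 'OZEO FLAT H 2', because the window 'OZEO FLAT AUTO 2' differs from the first name by a single character. The exact values depend on the similarity measure chosen, so they will not necessarily equal the 0.9 and 0.75 given as the expected output.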