I have this big string with products and some data:

big_string ="""Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective 
housing with its corresponding CE marking and manufacturer's certificates group of 
recyclable plastic material with 5 extraction outlets."""

I want to calculate a matching score between this string and product names such as:

'OZEO FLAT AUTO 2V' or 'OZEO FLAT H 2' 

I've built a similarity score by counting matching words, but the 2 products get the same score.

Actual output:

score('OZEO FLAT AUTO 2V', big_string)
[0.75]

score('OZEO FLAT H 2', big_string)
[0.75]

Expected output:

score('OZEO FLAT AUTO 2V', big_string)
[0.9]

score('OZEO FLAT H 2', big_string)
[0.75]

I found some string similarity measures such as Levenshtein or Jaro distance, but they only seem to work well when the strings have similar lengths. Furthermore, my word counting doesn't work properly, because it sometimes counts words that aren't next to each other in the text.

Any thoughts?

My current word-counting score:

big_string ="""Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective 
housing with its corresponding CE marking and manufacturer's certificates group of 
recyclable plastic material with 5 extraction outlets."""

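import re

product = 'OZEO FLAT AUTO 2V'
words = product.split(" ")
score = 0.0
for word in words:
    # Count the word only if it appears as a whole whitespace-delimited token.
    # re.escape keeps special characters in product names from being read as regex syntax.
    if re.search(r"(?<!\S)" + re.escape(word) + r"(?!\S)", big_string):
        score += 1
score = round(score / len(words), 2)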
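
To make the tie visible, here is a small sketch that wraps the same check in two functions and reports which words were found. The names `token_hits` and `word_score` are only illustrative; they are not part of the code above.

```python
import re

big_string = """Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective 
housing with its corresponding CE marking and manufacturer's certificates group of 
recyclable plastic material with 5 extraction outlets."""


def token_hits(product, text):
    """Map each word of `product` to whether it occurs in `text` as a whole token.

    Note: repeated words in `product` collapse into one dictionary key.
    """
    return {
        word: re.search(r"(?<!\S)" + re.escape(word) + r"(?!\S)", text) is not None
        for word in product.split()
    }


def word_score(product, text):
    """Fraction of the product's words that were found, rounded to 2 decimals."""
    hits = token_hits(product, text)
    return round(sum(hits.values()) / len(hits), 2)


for product in ("OZEO FLAT AUTO 2V", "OZEO FLAT H 2"):
    print(product, token_hits(product, big_string), word_score(product, big_string))
```

Each product has exactly one word that is missing as a whole token ('2V' for the first, 'H' for the second), so both end up at 3 out of 4, i.e. 0.75. The score cannot tell that '2V' is almost '2' while 'H' has nothing in common with 'AUTO'.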


Dani98
  • What exactly is your question? – mkrieger1 Oct 07 '22 at 10:48
  • How can I get a better similarity score for products in big strings with Python? For example, a score of 0.8-0.9 for the product "OZEO FLAT AUTO 2V" against the big string. – Dani98 Oct 07 '22 at 10:54
  • You could multiply the resulting scores by 1.2 to make the numbers bigger. – mkrieger1 Oct 07 '22 at 10:55
  • But I only want a bigger score when the product is more similar to the big string. 'OZEO FLAT AUTO 2V' should get 0.8-0.9, but the product 'OZEO FLAT H 2' should not. – Dani98 Oct 07 '22 at 11:00
  • I see. You would have to modify the `score` function somehow. We can't see it so we can't suggest what you could change. – mkrieger1 Oct 07 '22 at 11:02
  • I added my current score function. – Dani98 Oct 07 '22 at 11:06

1 Answer

Try using SequenceMatcher from difflib:

big_string ="""Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective 
housing with its corresponding CE marking and manufacturer's certificates group of 
recyclable plastic material with 5 extraction outlets."""

from difflib import SequenceMatcher
print(SequenceMatcher(None, big_string, 'OZEO FLAT AUTO 2V').ratio()*10)
print(SequenceMatcher(None, big_string, 'OZEO FLAT H 2').ratio()*10)

# Output:
# 0.9846153846153846     with 'OZEO FLAT AUTO 2V'
# 0.7476635514018691     with 'OZEO FLAT H 2'
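
The factor of 10 above only compensates for the length of this particular text: `ratio()` is divided by the combined length of both strings, so a short product name compared against a long description always gives a small number. One way to avoid a hardcoded factor, as a rough sketch and not the only option, is to compare the product against every run of consecutive words in the text that has the same number of words, and keep the best ratio. The helper name `best_window_score` is made up for this example.

```python
from difflib import SequenceMatcher

big_string = """Supply and installation of extraction unit type OZEO FLAT AUTO 2 or similar of
dimensions 500 mm x 460 mm x 185 mm ideal for installation in false ceiling of collective 
housing with its corresponding CE marking and manufacturer's certificates group of 
recyclable plastic material with 5 extraction outlets."""


def best_window_score(product, text):
    """Best SequenceMatcher ratio between `product` and any run of
    consecutive words in `text` with the same word count as `product`."""
    product_words = product.split()
    text_words = text.split()
    n = len(product_words)
    if n == 0 or len(text_words) < n:
        return 0.0
    target = " ".join(product_words)
    best = 0.0
    # Slide a window of n words over the text, one word at a time.
    for start in range(len(text_words) - n + 1):
        window = " ".join(text_words[start:start + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return round(best, 2)


print(best_window_score("OZEO FLAT AUTO 2V", big_string))
print(best_window_score("OZEO FLAT H 2", big_string))
```

Because both sides of each comparison now have a similar length, the ratio stays between 0 and 1 without any scaling, and 'OZEO FLAT AUTO 2V' scores clearly higher than 'OZEO FLAT H 2'. It also only rewards words that appear together in the text. The cost is one comparison per word position, which is fine for descriptions of this size but may need a faster library for large volumes.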
Sachin Kohli
  • This works for these 2 examples, but you multiply by 10 to get the expected answer. When I use it with different products and strings, the result is not correct. So it solves this case, but only because the factor is hardcoded. – Dani98 Oct 10 '22 at 09:06
  • You should probably drop the x10 factor. You can also look through this post; there are many ways to do this kind of string matching, so try them out and pick the one that works best for you. Hope that helps: https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings – Sachin Kohli Oct 10 '22 at 09:11