
I am trying to find the closest matches for names in a list of stock names, and I want to give more priority to the words at the front than to the words at the back, even though the words at the back may have more characters.

E.g.

"SG HOLDINGS" vs "S2 HOLDINGS"

The sequence matcher gives these two names a higher similarity ratio than "SG HOLDINGS" vs "SG Corp", but the latter is actually the company I am looking for. How can I put more weight on the words at the front of a stock name? Is there another library I can use?

Thanks

Grace

1 Answer

If all names have the format PREFIX SUFFIX, you can split each name, apply your matcher first to the prefixes and then to the suffixes, and pack the resulting distances (let's say Levenshtein distances) back into tuples. You get:

1. ('SG', 'HOLDINGS') vs ('S2', 'HOLDINGS') → (1, 0)
2. ('SG', 'HOLDINGS') vs ('SG', 'Corp')     → (0, 8)
3. ('SG', 'HOLDINGS') vs ('SG', 'HOLD')     → (0, 4)
4. ('SG', 'HOLDINGS') vs ('S2', 'HOLDING')  → (1, 1)

Tuples compare element by element from left to right, so the first word's distance dominates. When you sort those tuples of distances in ascending order, the ordering will be [3, 2, 1, 4].
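As a rough sketch of the two-word case above: the `levenshtein` helper below is a hand-written edit distance (so no third-party package is assumed), and `word_distances` is a hypothetical name for the "split, compare word by word, pack into a tuple" step. It assumes every name has the same number of words.

```python
def levenshtein(a: str, b: str) -> int:
    """Case-sensitive edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_distances(query: str, candidate: str) -> tuple:
    """Per-word distances; assumes both names have the same word count."""
    return tuple(levenshtein(q, c)
                 for q, c in zip(query.split(), candidate.split()))


query = "SG HOLDINGS"
candidates = ["S2 HOLDINGS", "SG Corp", "SG HOLD", "S2 HOLDING"]

# Tuples sort lexicographically, so the first word carries the most weight.
ranked = sorted(candidates, key=lambda name: word_distances(query, name))
print(ranked)
```

Because the sort key is a tuple, no explicit weights are needed: a mismatch in the first word always outranks any mismatch in later words.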

If the stock names contain a different number of words, you could count the words in the longest name (e.g. say the longest name is "Samsung Electronics Ord Shares"; it contains 4 words) and then extend all the other name-parts tuples with empty strings to match this length before computing the distances. I.e., you would be working with: ('SG', 'HOLDINGS', '', '').

The new distances:

1. ('SG', 'HOLDINGS', '', '') vs ('S2', 'HOLDINGS', '', '') → (1, 0, 0, 0)
2. ('SG', 'HOLDINGS', '', '') vs ('SG', 'Corp', '', '')     → (0, 8, 0, 0)
3. ('SG', 'HOLDINGS', '', '') vs ('Samsung', 'E', 'O', 'S') → (6, 8, 1, 1)

These now sort as [2, 1, 3].
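The padded variant could be sketched along these lines. The helper names (`levenshtein`, `pad_words`, `padded_distances`) are illustrative, not from any library, and the edit distance is hand-written and case-sensitive. The padding width is taken from the longest name among the query and the candidates.

```python
def levenshtein(a: str, b: str) -> int:
    """Case-sensitive edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pad_words(name: str, width: int) -> list:
    """Split a name into words and right-pad with empty strings to `width`."""
    words = name.split()
    return words + [""] * (width - len(words))


def padded_distances(query: str, candidate: str, width: int) -> tuple:
    """Per-word distances after padding both names to the same word count."""
    return tuple(levenshtein(q, c)
                 for q, c in zip(pad_words(query, width),
                                 pad_words(candidate, width)))


query = "SG HOLDINGS"
candidates = ["S2 HOLDINGS", "SG Corp", "Samsung E O S"]

# Word count of the longest name, used as the common tuple length.
width = max(len(name.split()) for name in [query, *candidates])

ranked = sorted(candidates, key=lambda name: padded_distances(query, name, width))
print(ranked)
```

The distance from an empty string to a word is just the word's length, so extra words in a candidate count against it only after all earlier positions have tied.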

K3---rnc
  • Hi, the stock names may contain multiple words, e.g. "Samsung Electronics Ord Shares". Is there a way to handle a variable number of words? Thanks – Grace Jul 06 '18 at 01:59
  • Thanks K3-mc, I might not have phrased the question correctly. What I am trying to say is that there are examples such as "Samsung Electronics Ord Shares" vs "Samsung Electronics Pref Shares". How do I compute the distances and order them programmatically? Is there a package I can use? Thanks – Grace Jul 06 '18 at 03:27
  • Following the above-described method, the two strings ("Samsung Electronics Ord Shares" vs "Samsung Electronics Pref Shares") would have a distance of `(0, 0, 3, 0)` (the Levenshtein distance between "Ord" and "Pref" is 3) and would sort accordingly. But you'd have to code it yourself. The whole thing comes in under 20 lines. – K3---rnc Jul 06 '18 at 03:31