If all names have the format PREFIX SUFFIX, you can split each name and apply your sequence matcher first to the prefixes, then to the suffixes. Packing the resulting distances (say, Levenshtein distance) back into tuples, you get:
1. ('SG', 'HOLDINGS') vs ('S2', 'HOLDINGS') → (1, 0)
2. ('SG', 'HOLDINGS') vs ('SG', 'Corp') → (0, 8)
3. ('SG', 'HOLDINGS') vs ('SG', 'HOLD') → (0, 4)
4. ('SG', 'HOLDINGS') vs ('S2', 'HOLDING') → (1, 1)
When you sort those tuples of distances in ascending order, the ordering will be [3, 2, 1, 4].
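A minimal sketch of that idea in Python, assuming a plain case-sensitive Levenshtein distance (which is what produces the 8 for 'HOLDINGS' vs 'Corp'). The helper names `levenshtein` and `part_distances` are made up for illustration; any sequence matcher that returns a number could be swapped in. Python compares tuples element by element, so sorting by the tuple ranks on the prefix distance first and uses the suffix distance as a tie-breaker.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (case-sensitive)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def part_distances(query: str, candidate: str) -> tuple:
    """Split both names on whitespace and compare them part by part."""
    return tuple(
        levenshtein(q, c) for q, c in zip(query.split(), candidate.split())
    )


query = "SG HOLDINGS"
candidates = ["S2 HOLDINGS", "SG Corp", "SG HOLD", "S2 HOLDING"]

# Tuples sort lexicographically: prefix distance first, then suffix distance.
ranked = sorted(candidates, key=lambda name: part_distances(query, name))
print(ranked)  # ['SG HOLD', 'SG Corp', 'S2 HOLDINGS', 'S2 HOLDING']
```

Note that `zip` silently drops extra words, so this version only makes sense when every name really has exactly two parts.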
If the stock names contain different numbers of words, you could count the words in the longest name (say the longest name is "Samsung Electronics Ord Shares", which contains 4 words) and then pad every other tuple of name parts with empty strings to that length before computing the distances. That is, you would be working with ('SG', 'HOLDINGS', '', '').
The new distances:
1. ('SG', 'HOLDINGS', '', '') vs ('S2', 'HOLDINGS', '', '') → (1, 0, 0, 0)
2. ('SG', 'HOLDINGS', '', '') vs ('SG', 'Corp', '', '') → (0, 8, 0, 0)
3. ('SG', 'HOLDINGS', '', '') vs ('Samsung', 'E', 'O', 'S') → (6, 8, 1, 1)
These now sort as [2, 1, 3].
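The padded variant could be sketched like this, again assuming a case-sensitive Levenshtein distance. The helpers `pad` and `padded_distances` are hypothetical names, and "Samsung E O S" is just the abbreviated four-word name from the list above. The distance from an empty string to a word is the word's length, so names with extra words are penalised in the trailing positions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (case-sensitive)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pad(name: str, width: int) -> tuple:
    """Split a name into words and right-pad with '' up to `width` parts."""
    parts = tuple(name.split())
    return parts + ("",) * (width - len(parts))


def padded_distances(query: str, candidate: str, width: int) -> tuple:
    """Per-position distances between two names padded to the same width."""
    return tuple(
        levenshtein(q, c)
        for q, c in zip(pad(query, width), pad(candidate, width))
    )


query = "SG HOLDINGS"
candidates = ["S2 HOLDINGS", "SG Corp", "Samsung E O S"]

# Width is the word count of the longest name among query and candidates.
width = max(len(name.split()) for name in [query, *candidates])

ranked = sorted(candidates, key=lambda n: padded_distances(query, n, width))
print(ranked)  # ['SG Corp', 'S2 HOLDINGS', 'Samsung E O S']
```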