You will probably have to tweak this and performance-test it.
I essentially feed every proper prefix of each word in data (every leading substring shorter than the word itself)
into a Counter
and create a ranking based on len(substring) * occurrences
- penalizing substrings that occur only once by using 0.1 instead of their count:
data = [
    'abcdef',
    'abcxyz',
    'xyz',
    'def',
]
from collections import Counter


def magic(d):
    """Applies magic(tm) to the list of strings given as 'd'.

    Returns a list of (substring, count) tuples, best candidate first."""
    myCountings = Counter()

    def allParts(word):
        """Generator that yields every proper prefix of word."""
        for i in range(1, len(word)):
            yield word[:i]

    for part in d:
        # count them all
        myCountings.update(allParts(part))

    # get all as tuples and sort based on heuristic length*occurrences,
    # counting a single occurrence as 0.1
    return sorted(myCountings.most_common(),
                  key=lambda x: len(x[0]) * (x[1] if x[1] > 1 else 0.1),
                  reverse=True)


m = magic(data)
print(m)  # the best candidate is m[0][0]
Output:
[('abc', 2), ('ab', 2), ('a', 2), ('abcde', 1), ('abcxy', 1),
('abcd', 1), ('abcx', 1), ('xy', 1), ('de', 1), ('x', 1), ('d', 1)]
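To see why the list comes out in this order, here is a small sketch that recomputes the score for a few of the entries. The helper name `score` is mine and not part of the code above; it only repeats the lambda used as the sort key.

```python
def score(prefix, count):
    """Same heuristic as the sort key: length * count, a count of 1 weighted as 0.1."""
    return len(prefix) * (count if count > 1 else 0.1)


# a few (prefix, count) pairs taken from the output
for prefix, count in [('abc', 2), ('ab', 2), ('a', 2), ('abcde', 1), ('x', 1)]:
    print(prefix, score(prefix, count))
```

'abc' gets 3 * 2 = 6, while the longer 'abcde' only gets 5 * 0.1 because it occurs once, so it lands behind even the single-character 'a' with 1 * 2 = 2.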
You would have to tweak the sorting criteria a bit and only use the first entry of the resulting list - but you can use this as a starting point.
Tweaking could be done by weighting the length more heavily (a factor or an exponent on len(substring)) if you prefer longer substrings over frequent short ones - that depends on your data...
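One possible way to make the heuristic tunable is sketched below. The function name `rank_prefixes` and the parameters `length_power` and `single_penalty` are my own assumptions, not part of the code above. Note that a constant factor applied to every length would scale all scores alike and leave the order unchanged, so this sketch uses an exponent on the length instead.

```python
from collections import Counter


def rank_prefixes(words, length_power=1.0, single_penalty=0.1):
    """Rank proper prefixes by len(prefix)**length_power * weight(count).

    length_power and single_penalty are hypothetical tuning knobs;
    the defaults reproduce the length*count heuristic with the 0.1 penalty.
    """
    counts = Counter()
    for word in words:
        # every proper prefix: word[:1] .. word[:len(word) - 1]
        counts.update(word[:i] for i in range(1, len(word)))

    def score(item):
        prefix, count = item
        weight = count if count > 1 else single_penalty
        return len(prefix) ** length_power * weight

    return sorted(counts.most_common(), key=score, reverse=True)


data = ['abcdef', 'abcxyz', 'xyz', 'def']
print(rank_prefixes(data)[0][0])                  # default weighting
print(rank_prefixes(data, length_power=2.0)[:4])  # longer prefixes weigh more
```

With `length_power=2.0` the once-seen 'abcde' (25 * 0.1 = 2.5) overtakes 'a' (1 * 2 = 2), while 'abc' (9 * 2 = 18) stays on top. Whether that is the behaviour you want depends on your data.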