I have a large-ish list of strings (no more than 2k) and I want to find the most common partial string match within the list. For example, I'm trying to satisfy the following test case efficiently.

data = [
    'abcdef',
    'abcxyz',
    'xyz',
    'def',
]
result = magic_function(data)
assert result == 'abc'

I've tried the following, inspired by this Stack Overflow post, but it breaks down because some elements in the list are completely different from the others.

from itertools import takewhile

def magic_function(data):
    return ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), zip(*data)))
Patrick Artner
user5038859

1 Answer

You will probably have to tweak this and performance-test it.

I essentially feed all proper prefixes of each word in data (every leading slice shorter than the word itself) into a Counter and rank them by len(substring) * occurrences, penalizing prefixes that occur only once by multiplying by 0.1 instead:

data = [
    'abcdef',
    'abcxyz',
    'xyz',
    'def',
]    

def magic(d):
    """Applies magic(tm) to the list of strings given as 'd'.
    Returns a list of ratings which might be the coolest substring."""
    from collections import Counter
    myCountings = Counter()

    def allParts(word):
        """Generator that yields all possible word-parts."""
        for i in range(1,len(word)):
            yield word[:i]

    for part in d:
        # count them all
        myCountings.update(allParts(part))

    # get all as tuples and sort based on the heuristic length * occurrences
    return sorted(myCountings.most_common(), 
                  key=lambda x:len(x[0])*(x[1] if x[1] > 1 else 0.1), reverse=True)

m = magic(data)
print(m)  # e.g. use m[0][0] to get the top-ranked substring

Output:

 [('abc', 2), ('ab', 2), ('a', 2), ('abcde', 1), ('abcxy', 1), 
  ('abcd', 1), ('abcx', 1), ('xy', 1), ('de', 1), ('x', 1), ('d', 1)]

You would have to tweak the sorting criteria a bit and use only the first entry of the resulting list, but this can serve as a starting point.
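As a minimal sketch of "use only the first entry", the ranking can be wrapped so that it returns a single string, which is the shape the test case in the question expects. The name most_common_prefix is hypothetical, and the scoring simply repeats the heuristic above; it is not a tuned solution.

```python
from collections import Counter

def most_common_prefix(strings):
    """Return the best-scoring shared prefix, or '' if there are no prefixes.

    Hypothetical helper: same heuristic as above (length * occurrences,
    with prefixes seen only once weighted by 0.1).
    """
    counts = Counter()
    for word in strings:
        # proper prefixes only: word[:1] .. word[:len(word) - 1]
        counts.update(word[:i] for i in range(1, len(word)))

    if not counts:
        return ''

    def score(item):
        prefix, n = item
        return len(prefix) * (n if n > 1 else 0.1)

    # max() avoids sorting the whole list when only the winner is needed
    return max(counts.items(), key=score)[0]

data = ['abcdef', 'abcxyz', 'xyz', 'def']
assert most_common_prefix(data) == 'abc'
```

Using max() instead of sorted() skips the full sort, which may matter a little for larger lists, but that should be measured on real data.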

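Tweaking could be done by weighting the length more strongly (for example with an exponent, since a constant factor applied to every length leaves the order unchanged) if you prefer longer matches over multiple short ones. That depends on your data.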
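One possible way to make the length preference adjustable is sketched below. It rests on an assumption about the intent: the length is raised to a power rather than multiplied by a constant, because a constant would scale every score equally. The names rank_prefixes and length_power are hypothetical.

```python
from collections import Counter

def rank_prefixes(strings, length_power=1.0):
    """Rank proper prefixes by len(prefix) ** length_power * occurrences.

    length_power is a hypothetical tuning knob: values above 1 favor
    longer prefixes, 1.0 reproduces the original heuristic.
    """
    counts = Counter()
    for word in strings:
        counts.update(word[:i] for i in range(1, len(word)))

    def score(item):
        prefix, n = item
        # prefixes seen only once are still penalized with weight 0.1
        return (len(prefix) ** length_power) * (n if n > 1 else 0.1)

    return sorted(counts.items(), key=score, reverse=True)

sample = ['abcdef', 'abcdxy', 'abq', 'abr', 'abs']
print(rank_prefixes(sample)[0])                  # ('ab', 5)
print(rank_prefixes(sample, length_power=2)[0])  # ('abcd', 2)
```

With the default weighting the short but frequent 'ab' wins (2 * 5 = 10 against 4 * 2 = 8 for 'abcd'); squaring the length flips that (4 * 5 = 20 against 16 * 2 = 32).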

Patrick Artner
  • Cool, this seems to work well! I'm going to do some profiling to see if it can be improved performance-wise, but I think this will work for now. Thanks! – user5038859 Jul 31 '18 at 20:58