8

I can handle basic regex well enough, but this is slightly different: I don't know in advance what the pattern is going to be.

For example, I have a list of similar strings:

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']

In this case the common pattern is two segments of common text, 'sometxt' and 'moretxt', preceded and separated by other text of variable length.

The common and variable strings can of course occur in any order and any number of times.

What would be a good way to condense/compress the list of strings into their common parts and individual variations?

An example output might be:

c = ['sometxt', 'moretxt']

v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]
nhahtdh
iCy
  • This sounds hard. It might help if you gave more background, like telling us why you must use exactly this compression scheme. – unwind Dec 16 '09 at 12:05
  • Wait, you don't know what `sometxt` and `moretxt` are or where they are? That would be quite a trick, then. – SilentGhost Dec 16 '09 at 12:08
  • Yes! Give us test cases! – codeape Dec 16 '09 at 13:12
  • Sorry if I didn't make myself clear in the question: no, I have no knowledge of what 'sometxt' and/or 'moretxt' are going to be. The inputs are a series of user-defined texts and machine-generated character sequences. I have to figure out which parts of each input item are static strings and which are dynamic sequences. – iCy Dec 16 '09 at 14:49
  • How about looking for an external tool such as Frak: https://github.com/noprompt/frak – Christopher Oezbek Dec 28 '19 at 14:20

6 Answers

8

This solution finds the two longest common substrings and uses them to delimit the input strings:

def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))

find_delimiters and its helpers find the delimiters:

import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    for i in xrange(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring


def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]

split_strings splits the strings using the delimiters:

import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    for item in lst:
        parts = re.split("|".join(delimiters), item)
        yield tuple(part for part in parts if part != '')
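
All of the snippets above carry doctests; assuming they are collected into a single module, a minimal way to exercise them is:

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs every docstring example above and reports any failures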
codeape
  • Have an upvote to counteract (and a bit) the random driveby. There's some interesting and useful stuff in your answer. – mavnn Dec 16 '09 at 13:18
  • We might be using a different version of python, 2.4.3 is what I am using. I got TypeError: min() takes no keyword arguments. – iCy Dec 17 '09 at 10:53
  • Yes, I'm on 2.5. I edited the answer to make it work with 2.4 – codeape Dec 17 '09 at 13:08
3

Here is a scary one to get the ball rolling.

>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')

I hope its sheer ugliness will inspire better solutions :)
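
A rough annotation of how the generated pattern works (a sketch, using the same sample input as above):

import re

inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']

# For the first string: lazily guess the variable parts (.*?) around two
# greedily matched "constant" segments (.+).
head = '(.*?)(.+)(.*?)(.+)(.*?)'
# Every later string must reuse those same constants via the back-references
# \2 and \4, so the engine backtracks until it finds constants that fit all
# of the concatenated inputs at once.
tail = '(.*)(\\2)(.*)(\\4)(.*)'

groups = re.match(head + tail * (len(inp) - 1), ''.join(inp)).groups()
print(groups[1], groups[3])  # the two constants: 'sometxt' and 'moretxt'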

Constantin
  • Beautiful solution, a bit slow, but I can wait. Ugliness doesn't bother me; good code is worth more than gold. I never thought of using regex that way, but this is way beyond my basic regex knowledge. I am trying to figure out how to separate the output into separate lists of constants and variables. – iCy Dec 16 '09 at 14:40
  • @iCy, it works significantly faster if there is a character that never occurs in the input data, like '\0'. In that case replace both ''.join calls with '\0'.join. It also gives a prettier result. – Constantin Dec 16 '09 at 15:34
  • Brilliant! Yea, the strings only contain printable characters. '\0'.join made it much faster. – iCy Dec 16 '09 at 15:46
  • Yes, that speeds it up a great deal. My solution is still 200 times faster, though :-) Having more test cases would be nice... – codeape Dec 16 '09 at 21:16
2

This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there which you may be able to adapt to your needs.

I assume you have no a priori knowledge of these substrings that repeat often.
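
For illustration, a minimal LZW-style encoder sketch (the name lzw_encode is made up here; it shows the dictionary-building idea, not the question's desired output format):

def lzw_encode(text):
    """Return a list of integer codes; repeated substrings reuse previously assigned codes."""
    codes = dict((chr(i), i) for i in range(256))  # start with single-character codes
    next_code = 256
    current = ''
    out = []
    for ch in text:
        candidate = current + ch
        if candidate in codes:
            current = candidate            # keep extending the current phrase
        else:
            out.append(codes[current])     # emit the longest known phrase
            codes[candidate] = next_code   # and remember the new, longer one
            next_code += 1
            current = ch
    if current:
        out.append(codes[current])
    return out

print(lzw_encode('sometxt0sometxt1sometxt'))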

Ber
2

This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such is the simplest, especially since it apparently is non-heuristic.

The first link contains a detailed discussion and (pseudo)code examples. Assuming, of course, I'm not completely off track here.
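
For a quick experiment, the standard library's difflib.SequenceMatcher (based on a Ratcliff/Obershelp-style matcher rather than Hunt-McIlroy) already exposes the matching blocks a diff is built from; a rough sketch on two of the sample strings:

import difflib

a = 'asometxt0moretxt'
b = 'zzsometxt999moretxt'

matcher = difflib.SequenceMatcher(None, a, b)
for i, j, size in matcher.get_matching_blocks():
    if size:  # the last block is a zero-length sentinel
        print(a[i:i + size])  # prints the shared runs, e.g. 'sometxt' and 'moretxt'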

Mizipzor
1

I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.

I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
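
As a point of reference, a naive fixed-length substring counter (a sketch of the brute-force baseline that a generalized suffix tree improves upon):

from collections import Counter

def substring_counts(strings, length):
    """Count, for each substring of the given length, how many input strings contain it."""
    counts = Counter()
    for s in strings:
        # Collect into a set first so each substring is counted at most once per string.
        counts.update({s[i:i + length] for i in range(len(s) - length + 1)})
    return counts

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
print(substring_counts(lst, 7).most_common(2))  # 'sometxt' and 'moretxt' occur in all four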

TC.
-1

How about subbing out the known text, and then splitting?

import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]
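
As the comments below point out, re.split can do this in one step (still assuming, as this answer does, that the fixed text is known up front):

import re

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
v = [tuple(p for p in re.split('sometxt|moretxt', s) if p) for s in lst]
# v == [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]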
EMiller
  • OP doesn't know what `sometxt` and `moretxt` are – SilentGhost Dec 16 '09 at 12:16
  • @SilentGhost I'm not sure if he does or not, to be honest. The question leaves it pretty unclear. – mavnn Dec 16 '09 at 12:19
  • @mavnn: Let's read it again: *namely I don't know what the pattern is going to be.* – SilentGhost Dec 16 '09 at 12:21
  • @SilentGhost: Given it follows _I can do basic regex..._ it could be read as _I don't know how to construct the regex pattern needed_ rather than _I don't know the pattern of sometxt and moretxt_. It's a much more interesting question in the latter case, though. – mavnn Dec 16 '09 at 12:26
  • This is what re.split is for. – codeape Dec 16 '09 at 13:11
  • Yeah I thought of re.split first - but it leaves sometxt & moretxt in the result. re.sub + split seemed like fewer steps to get the results requested. – EMiller Dec 16 '09 at 13:50