Finding an unknown pattern in a string python

Question

I am well aware of following question which also exists on stack overflow String Unknown pattern Matching but the answer there doesn't really work for me.

My problem is next. I get a string of characters e.g

'1211' and what I need to do is see that 1 is most often repeated and this 2 times in a row.

But it can also be "121212112" where 12 is repeated 3 times in a row.

But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

here are some results I like to get (the only numbers ever used are 1 and 2's)

>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2

the outcome I want is how many times it occurs.

I have no idea how to even start looking for a pattern since it is not known on forehand and I did some research online and don't find anything similar.

Does anyone have any idea how I even start to tackle this problem ? All help is welcome and if you want more information don't hesitate to let me know.

Can you describe the exact behaviour you want? For example, why is “12121212” to be treated as “1212” twice in a row, rather than as “12” four times or “12121212” once? — Ben, Sep 01 '18 at 13:17
Are you always looking to get the longest substring? Let's take `12121212`, can that not be `12` four times? — Paolo, Sep 01 '18 at 13:24
Also, in order to implement something reasonable, you will have to know some contraints, e.g. the length of the longest expected input string, the longest substring looked for. — zvone, Sep 01 '18 at 13:29
so this is a misunderstanding from my side (i am sorry i fix he question right now ) so if i got 12121221221 i need the 221 as sub string so in that sense i need the longest but indeed when it comes to 12121212 4 times 12 is ok (i am working for and exercise I am sorry for the misunderstanding ) — Jirka, Sep 01 '18 at 13:40
Should it not be `122` twice in `12121221221`, rather than `221` ? — Paolo, Sep 01 '18 at 13:51
I am clearly bad at examples.the point is that it should be the longest. There can't be a equally length and equally amount of appearance (as here 122 and 221 appear 2 times ) — Jirka, Sep 01 '18 at 13:56
You say that for `12121212` you want 4x `12`. Why not 2x `1212` ? — Jirka, Sep 01 '18 at 14:09
because this input "lvs('2111222122222221211221222112211')" should give output 4. I am working with an exercise so I don't know the logic behind it other then it is what is asked from me — Jirka, Sep 01 '18 at 14:34

score 2 · Answer 1 · answered Sep 01 '18 at 14:16

2

Really inefficient, but you can

find all substrings (https://stackoverflow.com/a/22470047/264596)
put them into a set to avoid duplicates
for each of the substring, find all its occurrences - and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)

Obviously you can use some datastructure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output is, I can give you only this.

answered Sep 01 '18 at 14:16

Jirka

4,184
30
40

what i need if i use the first function you recommended (thanks for that) is that i need the duplicates. because i need to get the longest that is repeated. The idea of a for loops sparks to mind for this but I am not sure if that is a huge detour or not – Jirka Sep 01 '18 at 15:36
@LibertyVerleysen Yes, you need duplicates, but not in step 2. In step 3, you will get for each unique substring, how many times it repeats and you can choose which one is the best (longer, more common, some balance between those). – Jirka Sep 01 '18 at 15:44
ok i get that but my only problem now is that i have for example "1211" so 1 is repeated 2 times. but if i count 1 i get 3. is there a elegant way of solving this ? – Jirka Sep 01 '18 at 15:50

Schalton · Answer 2 · 2018-09-02T21:37:14.023

I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:

#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]

            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos

sub_string_cts('21212')

{'2': 3,
 '1': 2,
 '21': 2,
 '12': 2,
 '212': 2,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

After your comment I think this is more what you're looking for:

#Func2
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            combos[substring]  = max([len(i) for i in re.findall("((?:{})+)".format(substring), string)])//len(substring)

    return combos

sub_string_cts('21212')

{'2': 1,
 '1': 1,
 '21': 2,
 '12': 2,
 '212': 1,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

You could narrow that down to the 'best' candidates by collapsing on the highest occuring instance of each string length:

def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}

    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results

#Func1:
max_by_len(sub_string_cts('21212'))

{1: {'2': 3},
 2: {'21': 2, '12': 2},
 3: {'212': 2},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

#Func2:
max_by_len(sub_string_cts('21212'))

{1: {'2': 1, '1': 1},
 2: {'21': 2, '12': 2},
 3: {'212': 1, '121': 1},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

Assuming we wouldn't select '2121' or '1212' because their occurrence matches '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212' we could limit our viable candidates down to '2', '212', and '21212' with the following code:

def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}

    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]

        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict

#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}

#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{2: {'21': 2, '12': 2}, 5: {'21212': 1}}

results:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))

<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice

<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice

<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

These functions together give you the highest occurrence of each pattern by length. The key for the dictionary is the length, with a sub-dictionary of the highest (multiple if tied) occuring patterns.

Func2 requires the patterns be sequential, whereas Func1 does not -- it is strictly occurrence based.

Note:

With your example:

3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

the code solves this ambiguity in your desired output (2 or 3) by giving you both:

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1

<Input: '1212'
<Output: 2

<Input: '21212'
<Output: 2

<Input: '12221221'
<Output: 1

I think i fail to express myself. i only need the patterns which follow each other . since is never followed by another this can never be a pattern. As for the function all i got is following assignment "Write a function that takes a string. The function must return the length of the longest doubled suffix of this string." — Jirka, Sep 02 '18 at 20:43
updated, I think func2 using regex is your desired output @LibertyVerleysen — Schalton, Sep 02 '18 at 21:16
At this point I've answered ever combination of the ambiguous things in your question. #AcceptIt — Schalton, Sep 02 '18 at 21:37

Finding an unknown pattern in a string python

2 Answers2

Note: