
A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.

"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!

How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.

Brian Tompsett - 汤莱恩
user173342
  • It's not quite clear what should be considered a pattern. For example, do `abadad` and `ebefef` match fully? – raina77ow Sep 11 '12 at 16:52
  • Yes, they do. If some two sub-strings when run through that simple indexing algorithm would return the same result, then those two sub-strings are matching sub-patterns. But I want to apply it to every possible sub-string in every word, while I'm only converting each whole word into patterns. – user173342 Sep 11 '12 at 17:06

3 Answers


Try replacing each character with min(K, distance back to the previous occurrence of that character), where K is a tunable constant, so that babcbc and dababd become something like KK2K22 and KKK225. You could then use a suffix tree or suffix array to find repeats in the transformed text.
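
A minimal sketch of that transform (Python; rendering K as the numeric sentinel 9, so the examples come out as 992922 and 999225):

def transform(s, K=9):
    """Replace each character with min(K, distance back to its previous
    occurrence); a first occurrence just gets K."""
    last = {}  # char -> position of its most recent occurrence
    out = []
    for i, ch in enumerate(s):
        out.append(min(K, i - last[ch]) if ch in last else K)
        last[ch] = i
    return "".join(map(str, out))

# transform("babcbc") == "992922"  (i.e. KK2K22)
# transform("dababd") == "999225"  (i.e. KKK225)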

mcdowella

Your algorithm loses information when it compresses the string's original data, so I'm not sure you can recover the full information set without doing far more work than comparing the original strings. Also, while your representation is easier for a human to read, it currently takes up as much space as the original string; a difference map of the string (where each value is the distance back to the prior occurrence of the current character) may carry a more comparable information set.

However, as to how you can detect all common subsets, you should look at Longest Common Subsequence (LCS) algorithms to find the largest matching pattern. It is a well-defined algorithm and is efficient -- O(n * m), where n and m are the lengths of the strings. See LCS on SO and Wikipedia. If you also want to see patterns which wrap around a string (treating it as a circular string -- where abeab and eabab should match) then you'll need a circular LCS, which is described in a paper by Andy Nguyen.
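
For reference, here is the textbook LCS-length dynamic program over two transformed strings (Python sketch; the standard O(n * m) table):

def lcs_length(a, b):
    """Longest common subsequence length via the standard DP table."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

# lcs_length("010202", "012120") == 4  (e.g. the subsequence "0120")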

You'll need to change the algorithm slightly to account for the number of variations seen so far. My advice would be to add two additional dimensions to the LCS table, representing the number of unique numbers encountered in the past k characters of each original string, alongside your compressed version of each string. Then you could do an LCS solve where you always move in the direction which matches on your compressed strings AND matches the same number of unique characters in both strings over the past k characters. This should encode all possible unique substring matches.

The tricky part will be always choosing the direction which maximizes the k that contains the same number of unique characters. Thus at each element of the LCS table you'll have an additional search for the best value of k. Since a longer sequence always contains all possible smaller sequences, if you maximize your choice of k at each step you know that the best k on the next iteration is at most one step away, so once the 4D table is filled out it should be solvable in a similar fashion to the original LCS table. Note that because you have a 4D table the logic does get more complicated, but if you read up on how LCS works you'll be able to see how to define consistent rules for moving toward the upper-left corner at each step. Thus the LCS algorithm stays the same, just scaled to more dimensions.

This solution is quite complicated once it's complete, so before you start writing such an algorithm you may want to rethink what you're trying to achieve, and whether this pattern representation encodes the information you actually want.

Pyrce
  • It seems LCS wouldn't be able to handle any upset to the order of indices. For instance, "dcece" vs "abab" become "01212" vs "0101", but the "cece" should match with "abab". Or any other confounding possibilities to mess with what indices represent "the same thing". But regardless, I know there's loss of information, it doesn't matter. – user173342 Sep 11 '12 at 17:58
  • Circular LCS can handle that case easily. LCS with your information set only finds common sequences which start the same; circular LCS resolves this by trying each rotated start location -- with a naive implementation you can quickly test each rotation of one string up to the limit of the original string end (i.e. rotate by 1 character and shorten the string by 1 character at a time). – Pyrce Sep 11 '12 at 18:16
  • I suppose it would work, but it seems really inefficient. Considering the simplification of information with this pattern representation, combined with my storage of all of the patterns, sub-patterns, and their inter-relations, it seems like there should be significant shortcuts that could be taken in finding matches... I was hoping this sort of problem had already been examined and solved, but maybe not I guess. – user173342 Sep 11 '12 at 18:29
  • I agree, but with the information set given you would otherwise need to do all permutations of all substrings of both sets to find what you're looking at -- which is inherently much slower, though easier to write a solver for. It might be that the simplification you applied doesn't buy you much processing time saved. Did you have a particular use case in mind? – Pyrce Sep 11 '12 at 20:13
  • Well, the pattern matches/relations are the end, not the means, so I'm only trying to optimize their usage. – user173342 Sep 11 '12 at 20:19
  • The reason I asked is that this data format seems to only be useful in very specific scenarios. It doesn't extend generally to comparing strings in a variety of cases. And for such specific cases where it is useful you might be able to do something easier with regex or other pattern matches. As a general concept, I don't think solving for the longest common subpattern with this data set is ever going to be trivial. – Pyrce Sep 11 '12 at 20:56

Here is a solution that uses Prolog's unification capabilities and attributed variables to match templates:

% pattern_i(Word, Pattern, Maps): stored template and char->variable map per word
:- dynamic pattern_i/3.

test:-
  retractall(pattern_i(_,_,_)),
  add_pattern(abab),
  add_pattern(bcbc),
  add_pattern(babcbc),
  add_pattern(dababd),
  show_similarities.

% Failure-driven loop: enumerate every stored word and print its matches
show_similarities:-
  call(pattern_i(Word, Pattern, Maps)),
  match_pattern(Word, Pattern, Maps),
  fail.
show_similarities.

% Succeeds when Word's whole template unifies with a contiguous
% subtemplate of some other stored word's template
match_pattern(Word, Pattern, Maps):-
  all_dif(Maps), % all variables should be unique
  call(pattern_i(MWord, MPattern, MMaps)),
  Word\=MWord,
  all_dif(MMaps),
  append([_, Pattern, _], MPattern), % Matches patterns
  writeln(words(Word, MWord)),
  write('mapping: '),
  match_pattern1(Maps, MMaps). % Prints mappings

% Print the character substitutions implied by the unification
match_pattern1([], _):-
  nl,nl.
match_pattern1([Char-Char|Maps], MMaps):-
  select(MChar-Char, MMaps, NMMaps),
  write(Char), write('='), write(MChar), write(' '),
  !,
  match_pattern1(Maps, NMMaps).

% Build a word's template and store it in the database
add_pattern(Word):-
  word_to_pattern(Word, Pattern, Maps),
  assertz(pattern_i(Word, Pattern, Maps)).

% One fresh variable per distinct character, e.g. abcbc -> [X,Y,Z,Y,Z]
word_to_pattern(Word, Pattern, Maps):-
  atom_chars(Word, Chars),
  chars_to_pattern(Chars, [], Pattern, Maps).

% Reuse the variable of an already-seen character, otherwise allocate a new one
chars_to_pattern([], Maps, [], RMaps):-
  reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps):-
  member(Char-PChar, Maps),
  !,
  chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps):-
  chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

% Post dif/2 constraints so distinct template variables cannot be unified
all_dif([]).
all_dif([_-Var|Maps]):-
  all_dif(Var, Maps),
  all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]):-
  dif(Var, MVar),
  all_dif(Var, Maps).

The idea of the algorithm is:

  • For each word, generate a list of unbound variables, using the same variable for the same character in the word. E.g. for the word abcbc the list would look something like [X,Y,Z,Y,Z]. This defines the template for this word.
  • Once we have the list of templates we take each one and try to unify the template with a subtemplate of every other word. So for example if we have the words abcbc and zxzx, the templates would be [X,Y,Z,Y,Z] and [H,G,H,G]. Then there is a subtemplate of the first template which unifies with the template of the second word (H=Y, G=Z).
  • For each template match we show the substitutions needed (variable renamings) to yield that match. So in our example the substitutions would be z=b, x=c (see the example query after this list).
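
For example, with the code above loaded, a single pair can be checked by hand; the all_dif/1 calls post the dif/2 constraints that stop distinct template variables from collapsing into one another:

?- word_to_pattern(abcbc, P1, M1), all_dif(M1),
   word_to_pattern(zxzx, P2, M2), all_dif(M2),
   append([_, P2, _], P1).  % zxzx's template matches the bcbc part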

Output for test (words abab, bcbc, babcbc, dababd):

?- test.

words(abab,bcbc)
mapping: a=b b=c 

words(abab,babcbc)
mapping: a=b b=c 

words(abab,dababd)
mapping: a=a b=b 

words(bcbc,abab)
mapping: b=a c=b 

words(bcbc,babcbc)
mapping: b=b c=c 

words(bcbc,dababd)
mapping: b=a c=b 
gusbro
  • The mapping is interesting, but what about cases that don't have a clean mapping? Such as "bcbcbaba" and "dedeefef"? Never used Prolog, so my question might be dumb. Anyway, I'm more wondering about the underlying algorithm at play rather than having a working implementation. – user173342 Sep 11 '12 at 19:09