2

Is there an algorithm to find the minimal sequence(s) within a large sequence?

Without any prior knowledge of what those sequences would be?

For example, given the sequence {2,3,1,2,3,1}, it would return {2,3,1}

Additionally, if there is a non-repeating sequence alongside it, e.g. {2,3,1,2,3,1,1,2,3,4}, that sequence would simply be left alone, and you would get {2,3,1} and {1,2,3,4}.

And lastly, if there is a non-repeating sequence in between, e.g. {2,3,1,1,2,3,4,2,3,1}, you would also get {2,3,1} and {1,2,3,4}.

Any guidance in this area would be appreciated. I've been playing around with regexes to try to get this to work, but I am not sure whether that is the best way to go, and even if it is, I have not been able to write a regex that performs this operation.

jekelija
  • I don't understand what you mean. How is `{1,2,3,4}` a repeating sequence in `{2,3,1,2,3,1,1,2,3,4}` ? It appears just once... The minimal repeating sequence I see here is `{1}`, because there's two ones next to each other. I really don't get what you're trying to get. – Kewin Dousse Aug 24 '16 at 13:30
  • You can do this by customizing a [Suffix Tree](https://en.wikipedia.org/wiki/Suffix_tree). – Mazdak Aug 24 '16 at 13:37
  • @Protectator It's basically about identifying any patterns that you can find, and leaving any "non-patterns" alone. In the context of the problem I'm attempting to tackle, they are essentially a "pattern that repeats once" (the one time it occurs). An alternative would be simply to return all patterns found ({2,3,1}), and then I would just pass through and find all elements that didn't match any of the patterns found – jekelija Aug 24 '16 at 13:50
  • Which language do you use? – Thomas Ayoub Aug 24 '16 at 14:03
  • @ThomasAyoub C++, although I'm familiar with most other common ones, so feel free to answer in any language; I'm sure an equivalent can be written in C++ – jekelija Aug 24 '16 at 14:05

2 Answers

2

Adding to @Thomas's answer: you can capture those non-repeating sequences with an alternation. If the third capturing group is not empty, the input contains such a sequence. I also made the middle `.*` pattern lazy (non-greedy):

((?:\d,)+)(.*?)\1+|((?:\d,)+)

Live demo
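
To illustrate how the capturing groups of the pattern above could be read out, here is a minimal sketch. It assumes Python's `re` module (not something used in this thread), and the helper name `split_sequences` is made up for the example. It also assumes the input is written as single digits, each followed by a comma, because that is what `(?:\d,)+` expects.

```python
import re

# The first regex from this answer, unchanged.
PATTERN = re.compile(r"((?:\d,)+)(.*?)\1+|((?:\d,)+)")

def split_sequences(text):
    """Hypothetical helper: returns (repeated units, non-repeating runs)."""
    repeated, other = [], []
    for m in PATTERN.finditer(text):
        if m.group(1):                   # first branch: a unit that occurs again
            repeated.append(m.group(1))
            if m.group(2):               # whatever sat between the occurrences
                other.append(m.group(2))
        else:                            # second branch: leftover run (group 3)
            other.append(m.group(3))
    return repeated, other

print(split_sequences("2,3,1,2,3,1,1,2,3,4,"))  # (['2,3,1,'], ['1,2,3,4,'])
print(split_sequences("2,3,1,1,2,3,4,2,3,1,"))  # (['2,3,1,'], ['1,2,3,4,'])
```

Because group 1 is greedy, a fully periodic input such as `1,2,3,1,2,3,1,2,3,1,2,3,` yields the longer unit `1,2,3,1,2,3,` with this version.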

Update based on comments:

((?:\d,)+?)\1+$|((?:\d,)+)((?:\d,)*?)\2+|((?:\d,)+)

Live demo
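
A minimal sketch of how the updated pattern could be applied, again assuming Python's `re` module and a made-up helper name (`split_sequences_v2`); neither comes from this thread. The input is assumed to be single digits, each followed by a comma. In this version the repeated unit can land in group 1 (the lazy branch anchored with `$`) or group 2, the gap in group 3, and a leftover run in group 4.

```python
import re

# The updated regex from this answer, unchanged.
PATTERN = re.compile(r"((?:\d,)+?)\1+$|((?:\d,)+)((?:\d,)*?)\2+|((?:\d,)+)")

def split_sequences_v2(text):
    """Hypothetical helper: returns (repeated units, non-repeating runs)."""
    repeated, other = [], []
    for m in PATTERN.finditer(text):
        unit = m.group(1) or m.group(2)  # repeated unit from branch 1 or 2
        if unit:
            repeated.append(unit)
            if m.group(3):               # items between two occurrences
                other.append(m.group(3))
        else:
            other.append(m.group(4))     # third branch: leftover run
    return repeated, other

print(split_sequences_v2("1,2,3,1,2,3,1,2,3,1,2,3,"))  # (['1,2,3,'], [])
print(split_sequences_v2("2,3,1,1,2,3,4,2,3,1,"))      # (['2,3,1,'], ['1,2,3,4,'])
```

The lazy first branch appears to be what lets a fully periodic input collapse to its shortest unit; inputs that do not repeat all the way to the end fall through to the greedy second branch.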

revo
  • Is there a way to reduce this further? E.g. in https://regex101.com/r/iW3gE9/2, with the example 1,2,3,1,2,3,1,2,3,1,2,3, it returns 1,2,3,1,2,3, not 1,2,3. – jekelija Aug 24 '16 at 17:04
  • That is fantastic. Haha, I will have to do some performance testing, as it seems to use a ton of steps, but you definitely did some regex magic there... probably a subject I need to learn a little better... – jekelija Aug 24 '16 at 19:31
  • Besides the heavy backtracking in the current regex, lazy quantifiers work that way: they make the engine stop at each position to check whether it can continue with the next part of the pattern. More steps mean more CPU time. @jekelija – revo Aug 24 '16 at 20:04
  • You should use `((?:\d+,)+?)\1+$|((?:\d+,)+)((?:\d+,)*?)\2+|((?:\d+,)+)` to allow numbers with multiple digits in the array – Thomas Ayoub Aug 25 '16 at 10:43
0

Something like `((?:\d,?)+)(.*)\1(.*)` could match repeated sequences thanks to the back-reference to the capturing group; see the live demo.
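
For anyone who wants to try this outside the demo, here is a small sketch. It assumes Python's `re` module, and `first_repeat` is a made-up helper name; the thread itself does not specify a language for the regex. It only reports the first match, so it is an illustration of the back-reference idea rather than a full solution.

```python
import re

# The regex from this answer, unchanged.
PATTERN = re.compile(r"((?:\d,?)+)(.*)\1(.*)")

def first_repeat(text):
    """Hypothetical helper: (unit, text between, text after) or None."""
    m = PATTERN.search(text)
    return m.groups() if m else None

print(first_repeat("2,3,1,2,3,1"))          # ('2,3,1', ',', '')
print(first_repeat("2,3,1,2,3,1,1,2,3,4"))  # ('2,3,1,', '', '1,2,3,4')
print(first_repeat("1,2,3,4"))              # None
```

Group 1 is the unit that the back-reference `\1` finds again; groups 2 and 3 hold whatever lies between and after the two occurrences.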

Thomas Ayoub
  • That seems like a great start. Thank you very much. However, it seems to have some trouble identifying more than one pattern, e.g. https://regex101.com/r/zT9hU7/2. – jekelija Aug 24 '16 at 14:00