8

I take an input string from the user that consists of a certain substring repeated throughout. I need to output that substring or its length (a.k.a. the period).

Say

S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI

I could start with a single character and check whether it matches the next character; if it does not, I could try two characters, then three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
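For reference, the brute-force idea above might look like the following minimal sketch. It assumes "period" means the smallest p such that every character equals the one p positions later (so a trailing partial repetition, as in S3, is allowed); the function name `period_naive` is made up for illustration.

```python
def period_naive(s: str) -> int:
    """Return the smallest p such that s[i] == s[i + p] for all valid i.

    Tries p = 1, 2, 3, ... in turn, so the worst case is O(N^2).
    """
    n = len(s)
    for p in range(1, n + 1):
        # p is a period if shifting the string by p lines up with itself.
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty string, where n == 0


for text in ("AAAA", "ABAB", "ABCAB", "EFIEFI"):
    p = period_naive(text)
    print(text, p, text[:p])
```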

Aditya
  • 2
    Why is S2 substring not AB? – takendarkk Jan 16 '14 at 17:46
  • 1
    Now I'm confused about S3. ABC does not repeat, or is that also a typo? Sorry for being picky, I'm just trying to figure out exactly what you want for output. – takendarkk Jan 16 '14 at 17:49
  • Well, every character in the input string is part of the repeating substring. It doesn't matter what the string length is. @csmckelvey – Aditya Jan 16 '14 at 17:53
  • [Here is similar question containing solution to this problem](http://stackoverflow.com/questions/8347812/given-string-s-find-the-shortest-string-t-such-that-tm-s) – Evgeny Kluev Jan 16 '14 at 17:53
  • 1
    Now I'm thoroughly confused. Moving on, good luck! – takendarkk Jan 16 '14 at 17:54
  • @EvgenyKluev I don't get it. This question is essentially the same as his, so why are people recommending such complex things as the Z algorithm when this could essentially be accomplished by the tortoise and hare algorithm, which is easier to implement and much more efficient? – Aditya Jan 16 '14 at 18:06
  • Because tortoise and hare algorithm does not solve the problem. Also you could read this paper: ["On the Complexity of Determining the Period of a String"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.9627&rep=rep1&type=pdf). – Evgeny Kluev Jan 16 '14 at 18:08
  • The only case I can think of when the tortoise and hare doesn't work is when the string is small. Like the third example. It works for ABAB. It works fine for ABCABCABC. Since it works for even and odd length substring I guess I can safely say it works in all cases. @EvgenyKluev – Aditya Jan 16 '14 at 18:21
  • 2
    For string "abababxxxxabababxxxx" tortoise and hare gives you period 2 while actual period is 10. – Evgeny Kluev Jan 16 '14 at 18:26
  • Just one question: Do we know that there must be a period or in case there is no period we need to find it out? – Łukasz Kidziński Jan 16 '14 at 18:36
  • It was an interview question and he asked me to assume that there is a period. I guess without that assumption this would be immensely more complicated. @ŁukaszKidziński – Aditya Jan 16 '14 at 18:38

6 Answers

8

You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).

user541686
tmyklebu
  • +1 for a good reference. The algorithm is really there, page 340. Doesn't seem like an interview solution though :) – Łukasz Kidziński Jan 16 '14 at 20:26
  • 1
    @R, looking briefly at the text, in the signature of `Next_MS` I suspect `x[1…m]` should be `x[1...n]` (since `m` is not used and `n` is used without definition); `then end` should almost certainly be `then begin`; `ms := j; j := ms+1;` looks suspicious; in `Per` there's a `then` which should probably be `then begin` and an `else end` which should probably be `else begin`. But I'm not certain that those are the errors I saw 15 months ago, and an official errata list would in any case be far more valuable than my opinion. – Peter Taylor Nov 20 '18 at 17:18
3

Let me assume that the length n of the string is at least twice the period p.

Algorithm

  1. Let m = 1, and S the whole string
  2. Take m = m*2
    • Find the next occurrence of the substring S[:m]
    • Let k be the start of the next occurrence
    • Check if S[:k] is the period
    • if not go to 2.

Example

Suppose we have a string

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC

For each m that is a power of 2, we find repetitions of the first m characters. Then we extend this prefix up to its second occurrence. Let's start with m = 2^1 = 2, so CD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD   CDCD   CDCD   CD

We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD

Now let's extend our string to the first repetition. We get

CDCDFBF

we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC      CDCDFBFC      

we try to extend and we get

CDCDFBFCDCDFDF

and this indeed is our period.

I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
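The doubling procedure above could be sketched roughly as follows. This is not the O(n log n) version: it swaps in a naive search where a KMP-like matcher would go, and it assumes one way of settling the edge cases, namely that an occurrence running off the end of the string still counts as a match. The helper names are hypothetical.

```python
def is_period(s: str, k: int) -> bool:
    """True if shifting s by k positions lines up with itself."""
    return all(s[i] == s[i + k] for i in range(len(s) - k))


def next_occurrence(s: str, m: int) -> int:
    """Smallest k >= 1 where s[:m] occurs again at position k.

    Assumption: a match that runs past the end of s is accepted.
    Naive scan standing in for a KMP-like search.
    """
    prefix = s[:m]
    for k in range(1, len(s) + 1):
        if prefix.startswith(s[k:k + m]):
            return k
    return len(s)


def period_by_doubling(s: str) -> int:
    if not s:
        return 0
    m = 1
    while True:
        m *= 2  # step 2: double the prefix length
        k = next_occurrence(s, m)
        if is_period(s, k):  # check whether S[:k] is the period
            return k


print(period_by_doubling("CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC"))
```

Under that overhang assumption the candidate k never exceeds the true period, so the first k that passes the check is the period, and the loop stops at the latest once m reaches the string length.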

Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.

A very nice problem though.

Łukasz Kidziński
1

You can build a suffix tree for the entire string in linear time (suffix trees are easy to look up online), and then recursively compute and store N(v), the number of suffix tree leaves below each internal node v (that is, the number of occurrences of the suffix prefix encoded at v). Also recursively compute and store L(v), the length of the suffix prefix at each node of the tree. Then, at an internal node v, the suffix prefix encoded at v is a repeating substring that generates your string if N(v) equals the total length of the string divided by L(v).

user2566092
  • You certainly can do this with a suffix tree, but that will be orders of magnitude slower than doing KMP preprocessing. – tmyklebu Jan 16 '14 at 19:59
  • @tmyklebu How do you figure that the constants for suffix trees are orders of magnitude larger than other algorithms? That means like constants of 1000. A suffix tree certainly does not have hidden constants of 1000. – user2566092 Jan 16 '14 at 22:07
  • I haven't actually seen an implementation that doesn't. You're welcome to link me to your favourite implementation and I can compare against a good implementation of the linear-time, constant-space algorithm I referenced, but suffix trees being a factor of 1000 slower really wouldn't surprise me. – tmyklebu Jan 17 '14 at 04:01
1

We can improve the time complexity by building a Z array, which takes O(n) time and O(n) space. Say the string is S1 = abababab. Its Z array would look like z[] = {8,0,6,0,4,0,2,0}. To calculate the period, iterate over the Z array and find the smallest i > 0 that satisfies i + z[i] = S1.length. That i is the period.
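A minimal sketch of the Z-array approach, assuming the usual convention that z[0] is the full string length (as in the example above). The function names are illustrative only.

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known inside the window.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def period_from_z(s: str) -> int:
    n = len(s)
    z = z_array(s)
    for i in range(1, n):
        if i + z[i] == n:  # the suffix starting at i matches a prefix to the end
            return i
    return n  # no shorter period exists


print(z_array("abababab"))
print(period_from_z("abababab"))
```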

0

Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched character is your repeating string.

  • Oh I didn't see that. But since you said it has to include a cycle, that method's complexity will be O(N) because you are just comparing first character with the whole string in the worst case scenario. @csmckelvey Pseudocode will be like this: `char first = input[0]; String repeatingString = first; int i = 1; char nextChar = input[i]; while(first!=nextChar) { repeatingString += nextChar; i++; nextChar = input[i]; } ` – Mehmet Sedat Güngör Jan 16 '14 at 18:14
  • It's O(N^2) in the worst case because it would first check with 1 character, then 2 characters, then 3, up to N/2, so it is 1+2+3+...+N/2 = (N/2)(N/2+1)/2. Dropping all constants, you have N^2. – Aditya Jan 16 '14 at 18:24
  • @Aditya No, with the above method, it will compare the first character with the second one, then the third one, etc. So it will be 1+1+1..., and in the worst-case scenario it will do N (the length of the input) character comparisons. You are not comparing A with AB, and then A with ABC. You are comparing A with B and then A with C. So it will be O(N). Just look at the pseudocode I provided above. It will loop at most as many times as the input string's length. – Mehmet Sedat Güngör Jan 16 '14 at 18:28
  • Sure, the pseudocode is O(n) but it returns just a string of repeated first character. – Łukasz Kidziński Jan 16 '14 at 19:04
  • @ŁukaszKidziński No, next characters until the match is found will be appended to String. So, it will have repeated cycle String value in it. Look at the code, it appends the nextChar of input to String until a match found. – Mehmet Sedat Güngör Jan 17 '14 at 02:22
  • The pseudo-code doesn't work, consider `ababacababac` (the correct substring is `ababac`, not `ab`). And if you modify it so that it does work, it will no longer be O(n). – Bernhard Barker Jan 18 '14 at 20:45
  • @Dukeling I repeat, this is not a general solution for finding a substring that repeats. OP mentioned that "every character in the input string is part of the repeating substring" and gave some examples, so that was a specific solution. Tortoise and Hare will also give wrong output for that case. So it has to be bigger than O(N), no different thoughts. – Mehmet Sedat Güngör Jan 19 '14 at 14:31
0

I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.

First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?

In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraint on the part of the match that would extend past the end. This means that you can find the period by taking any left-to-right string matching algorithm and applying it to the string itself, considering a partial match that hits the end of the haystack/text as a match. The time and space requirements are the same as those of whatever string matching algorithm you use.
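To make the "pattern within itself" framing concrete, here is a deliberately naive sketch. It is quadratic in the worst case and is not one of the time-space-optimal algorithms discussed here; the function name is made up. It looks for the first shift k >= 1 at which the string matches itself, where a match truncated by the end of the string counts.

```python
def period_by_self_match(s: str) -> int:
    """First k >= 1 where s, shifted by k, matches itself up to the end."""
    n = len(s)
    for k in range(1, n):
        # s[k:] is the part of the shifted pattern that still fits in the text;
        # anything past the end is unconstrained.
        if s.startswith(s[k:]):
            return k
    return n


print(period_by_self_match("ABCAB"))
```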

The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.

The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.

Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
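As an illustration of the KMP route with O(n) working space, the following sketch computes the standard prefix (failure) function and reads the period off its last entry: the period is the length minus the longest proper border. The function names are illustrative, not taken from any of the references above.

```python
def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    n = len(s)
    pi = [0] * n
    k = 0
    for i in range(1, n):
        # Fall back through shorter borders until one can be extended.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def period_kmp(s: str) -> int:
    if not s:
        return 0
    # Longest border of the whole string determines the smallest period.
    return len(s) - prefix_function(s)[-1]


for text in ("ABCAB", "abababxxxxabababxxxx", "ababacababac"):
    print(text, period_kmp(text))
```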

R.. GitHub STOP HELPING ICE