Upper bound on 4 digit sequences in pi

Question

If this is not the right SE site for this question, please let me know.

A friend shared this interview question he received over the phone, which I have tried to solve myself. I will paraphrase:

The value of pi up to n digits as a string is given.

How can I find all duplicate 4 digit sequences in this string?

This part seems fairly straightforward. Add each 4-character sequence to a hash table, advancing one character at a time. Before inserting, check whether the current 4-character sequence already exists in the hash table. If so, you have found a duplicate. Store it somewhere, and repeat the process. I was told this was more or less correct.
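For reference, the sliding-window approach described above might be sketched as follows. This is only an illustration: the function name `find_duplicate_sequences` and the `width` parameter are made up here, and the input is assumed to be a string of digits with the decimal point already removed.

```python
def find_duplicate_sequences(digits, width=4):
    """Return the set of width-character sequences that occur more than once.

    Assumes `digits` contains only the characters 0-9 (no decimal point).
    """
    seen = set()        # every sequence encountered so far
    duplicates = set()  # sequences encountered at least twice
    # Slide a window of `width` characters one position at a time.
    for i in range(len(digits) - width + 1):
        seq = digits[i:i + width]
        if seq in seen:
            duplicates.add(seq)
        else:
            seen.add(seq)
    return duplicates


if __name__ == "__main__":
    # First 50 decimal digits of pi, used purely as sample input.
    pi_digits = "14159265358979323846264338327950288419716939937510"
    print(sorted(find_duplicate_sequences(pi_digits)))
```

Each window is looked up and inserted once, so the scan is linear in the length of the string (amortized, because of the hash set).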

The issue I have is on the second question:

What is the upper bound?

n = 10,000,000 was an example.

My algorithm background is admittedly very rusty. My first thought is that the upper bound must be related to n somehow, but I was told it is not.

How do I calculate this?

EDIT:

I would also be open to a solution that disregards the constraint that the upper bound is not related to n. Either is acceptable.

@PascalCuoq Approximately how many iterations would it take to find all duplicates? — Josh, Jan 02 '15 at 22:24
Both the hash table and user3386109's solutions are obviously _O(n)_ (amortized in the hash table case, strict for the preallocated array), and there can be no simpler solution because every digit in the array must be looked at. — kkm inactive - support strike, Jan 02 '15 at 22:47

score 2 · Accepted Answer · answered Jan 02 '15 at 22:36

There are only 10,000 possible sequences of four digits (0000 to 9999), so at some point you will have found that every sequence has been duplicated, and there's no need to process further digits.

If you assume that the digits of pi behave like a perfectly uniform random number generator, then each new digit that's processed produces one more four-digit sequence, so at the very earliest you could have found duplicates for all 10,000 sequences after about 20,000 digits. Since the sequences will not be spread that evenly, you may need significantly more digits before you duplicate all sequences, but 100,000 would be a reasonable guess at the upper bound.
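One way to sanity-check this estimate is a small simulation. The sketch below is an assumption-laden illustration, not a statement about pi itself: it feeds pseudo-random digits from Python's `random` module (standing in for pi's digits) into a scan that stops as soon as every sequence has been seen twice. The names `digits_until_all_duplicated` and `random_digits` are hypothetical. A coupon-collector style argument suggests the typical result for 4-digit sequences is on the order of 100,000 digits, well above the 20,000-digit minimum, but running the code is the easiest way to see that for yourself.

```python
import random


def digits_until_all_duplicated(digit_source, width=4):
    """Consume digits until every width-digit sequence has appeared twice.

    `digit_source` is any iterable of ints in 0..9. Returns the number of
    digits consumed, or None if the source runs out first.
    """
    total = 10 ** width      # number of distinct sequences (10,000 for width 4)
    counts = [0] * total     # one counter per possible sequence
    duplicated = 0           # how many sequences have been seen at least twice
    value = 0                # numeric value of the current window
    for consumed, digit in enumerate(digit_source, start=1):
        # Rolling update: append the new digit, drop the oldest one.
        value = (value * 10 + digit) % total
        if consumed < width:
            continue         # the first full window is not available yet
        counts[value] += 1
        if counts[value] == 2:
            duplicated += 1
            if duplicated == total:
                return consumed  # early exit: nothing left to find
    return None


def random_digits(seed, limit):
    """Yield `limit` pseudo-random digits (a stand-in for pi, by assumption)."""
    rng = random.Random(seed)
    for _ in range(limit):
        yield rng.randrange(10)


if __name__ == "__main__":
    for seed in range(3):
        print(seed, digits_until_all_duplicated(random_digits(seed, 1_000_000)))
```

Because the loop returns as soon as the last sequence is duplicated, the work done depends on how quickly the sequences fill in rather than on n, which is one reading of the "not related to n" hint in the question.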

Also, since there are only 10,000 possibilities, you don't really need a hash table. You can simply use an array of 10,000 counters (int count[10000]) and increment the count for each sequence you find.
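A minimal sketch of the counter-array idea, written in Python rather than the C-style declaration above. The function names are invented for illustration, and the input is assumed to be a string of digits only. Each 4-digit window is converted to an integer and used directly as an index.

```python
def count_sequences(digits, width=4):
    """Count how often each width-digit sequence occurs in `digits`.

    Returns a list of 10**width counters; index 12 holds the count for "0012".
    Assumes `digits` contains only the characters 0-9.
    """
    counts = [0] * (10 ** width)
    for i in range(len(digits) - width + 1):
        counts[int(digits[i:i + width])] += 1
    return counts


def duplicates_from_counts(counts, width=4):
    """List the zero-padded sequences whose counter is greater than one."""
    return [f"{value:0{width}d}" for value, c in enumerate(counts) if c > 1]


if __name__ == "__main__":
    counts = count_sequences("0012300123")
    print(duplicates_from_counts(counts))
```

The array has a fixed size regardless of n, so memory use stays constant however many digits are scanned.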

score 0 · Answer 2 · answered Jan 02 '15 at 22:25

The upper bound of your solution is the size of the hash table that you can fit into memory.

An alternate technique is to generate all the sequences and sort them. Then the duplicates will be adjacent and easy to detect. You can generally fit more into a linear data structure than you can a hash table, and if you still exhaust memory you can sort to/from disk.
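The sort-based alternative could look roughly like this in memory. The function name is hypothetical, and the on-disk variant mentioned above is not shown. After sorting, equal sequences sit next to each other, so a single pass over adjacent pairs finds the duplicates.

```python
def duplicates_by_sorting(digits, width=4):
    """Find repeated width-character sequences by sorting all windows.

    Returns the duplicated sequences in sorted order, each listed once.
    """
    windows = sorted(digits[i:i + width]
                     for i in range(len(digits) - width + 1))
    duplicates = []
    # Equal neighbours in the sorted list are repeats of the same sequence.
    for previous, current in zip(windows, windows[1:]):
        if previous == current and (not duplicates or duplicates[-1] != current):
            duplicates.append(current)
    return duplicates


if __name__ == "__main__":
    print(duplicates_by_sorting("1234512345"))
```

Sorting costs O(n log n) comparisons, so for this particular problem it is slower than the linear scans discussed in the other answer. Its advantage is the one stated above: a flat list is compact and can be sorted externally.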

Edit: unless "upper bound" means the O(n) of the algorithm, which should be easy to figure out.

answered Jan 02 '15 at 22:25 by Mark Ransom

To your edit, yes, I'm fairly certain this was the desired answer. Like I said, I'm extremely rusty. As simple as it may be, how do I calculate O(n)? – Josh Jan 02 '15 at 22:28
@iThink looking at your question again, O(n) is *obviously* related to n, but you state that it is not. That makes me very confused. – Mark Ransom Jan 02 '15 at 22:30
I agree. That is what really confused me. That being said, I am open to any interpretation on this question. – Josh Jan 02 '15 at 22:34
