Generate shortest NOT substring for given string

Question

I'm working on MIME message generation code and I'd like to generate boundaries that are as short as possible for any given input, even input of unknown length processed in streaming mode.

Right now I have a good-enough solution based on a random generator. Basically, I generate a random string of 32 Base64 symbols and look for the shortest substring of it that is not a substring of the MIME message body.
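For illustration, the random approach described above might look roughly like the following sketch. It is not the original code: the names `random_material` and `shortest_absent_substring` are made up here, the standard Base64 alphabet is assumed, and the body is assumed to be fully in memory (so this version is not streaming).

```python
import secrets
import string

# Assumption: the standard Base64 alphabet (letters, digits, "+" and "/").
BASE64_ALPHABET = string.ascii_letters + string.digits + "+/"


def random_material(length=32):
    """Return `length` random Base64 symbols to draw a boundary from."""
    return "".join(secrets.choice(BASE64_ALPHABET) for _ in range(length))


def shortest_absent_substring(material, body):
    """Return the shortest substring of `material` that does not occur in `body`.

    Returns None if every substring of `material` occurs in `body`.
    """
    # Try candidate lengths from shortest to longest, left to right.
    for size in range(1, len(material) + 1):
        for start in range(len(material) - size + 1):
            candidate = material[start:start + size]
            if candidate not in body:
                return candidate
    return None
```

A call such as `shortest_absent_substring(random_material(), body)` then yields the boundary candidate, or `None` in the unlikely case that the whole random string occurs in the body.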

This is not a perfect solution because:

The boundary is not always the shortest possible. As a very simplified example: for alpha-only text the boundary could be a single digit, but the generated boundary material might contain only letters.
It needs a random generator and a unique seed for it each time I run the application. Ideally, I would prefer a deterministic algorithm.

So here is what I want to know: is it possible to keep the streaming property, work in a fixed amount of memory, be deterministic, and still generate the ideal shortest boundary? Or can only some of these properties be achieved, through trade-offs?

I think [suffix automaton](https://en.wikipedia.org/wiki/Suffix_automaton) will be helpful to you — throwit, Nov 30 '15 at 05:55
Why do you care about the length of the boundary string? A reasonably long unique string is also more likely to work correctly with other tools which take less care to make sure they have a unique boundary string. — tripleee, Dec 02 '15 at 12:48
There is a similar question [here](http://cs.stackexchange.com/questions/21896/algorithm-request-shortest-non-existing-substring-over-given-alphabet). — mik, Dec 28 '15 at 16:51

score 2 · Answer 1 · answered Dec 30 '15 at 14:41

All boundary delimiters start with -- and sit on a line of their own. You can use this to build a list of all "boundary-like" words in the body, then pick a word that is not in that list (e.g. the lexicographically first unused one).

Furthermore, assuming you have fewer than 26 parts, you can simply use single letters if you want the "shortest possible" boundaries. In this case the scanning can be done with a regex:

^--([a-z])$

This (in multiline mode) will match all single-letter "boundary-like" tokens in the email body.

Assuming you put the matched values in a hash set, you can then generate the tokens with something like

('a'...'z').where(!tokenHashSet.contains)

All of the above is pseudocode; hopefully it's clear.
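As a rough, runnable version of the pseudocode above, here is a minimal Python sketch. The function name `unused_single_letter_boundaries` is hypothetical, and the pattern is extended with an optional `\r` on the assumption that the body may use CRLF line endings. Like the regex above, it only looks at lines of the form `--x` and ignores closing delimiters of the form `--x--`.

```python
import re
import string

# A line consisting of "--" followed by exactly one lowercase letter.
# Assumption: lines end with "\n" or "\r\n"; the optional "\r" covers the latter.
SINGLE_LETTER_DELIMITER = re.compile(r"^--([a-z])\r?$", re.MULTILINE)


def unused_single_letter_boundaries(body):
    """Return the lowercase letters that never appear as a "--x" line in `body`."""
    # findall returns the captured letter for every matching line.
    used = set(SINGLE_LETTER_DELIMITER.findall(body))
    return [letter for letter in string.ascii_lowercase if letter not in used]
```

The first element of the returned list is the lexicographically smallest free letter. Note that this sketch reads the whole body at once; a streaming variant would have to apply the same check line by line while collecting the used letters.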
