I'm working on MIME message generation code and I'd like to generate the shortest possible boundaries for any given input, even input of unknown length processed in streaming mode.

Right now I have a good-enough solution based on a random generator. Basically, I generate a random string of 32 Base64 symbols and try to find the shortest substring of it that is not a substring of the MIME message body.

This is not a perfect solution because:

  1. The boundary is not always the shortest possible. As a very simplified example: for alpha-only text the boundary could be just one digit, but the generated boundary material might contain only letters.

  2. I need a random generator and a unique seed for it each time I run the application. Ideally I would rather have a deterministic algorithm.

So here is what I want to know: is it possible to keep the streaming property, work in a fixed amount of memory, be deterministic, and still generate the ideal shortest boundary? Or can we achieve only some of these properties, through tradeoffs?

  • I think [suffix automaton](https://en.wikipedia.org/wiki/Suffix_automaton) will be helpful to you – throwit Nov 30 '15 at 05:55
  • Why do you care about the length of the boundary string? A reasonably long unique string is also more likely to work correctly with other tools which take less care to make sure they have a unique boundary string. – tripleee Dec 02 '15 at 12:48
  • Thanks @tripleee, that's a really interesting point. – Felix Vanorder Dec 25 '15 at 21:38
  • There is a similar question [here](http://cs.stackexchange.com/questions/21896/algorithm-request-shortest-non-existing-substring-over-given-alphabet). – mik Dec 28 '15 at 16:51

1 Answer

All boundary delimiters start with -- and sit on a line of their own. You can use this to build a list of all "boundary-like" words in the body, then pick a word that is not in that list (e.g. the first unused one in lexicographic order).
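As a rough illustration of that idea, here is a minimal Python sketch. The function names, the lowercase-plus-digits alphabet, and the choice to treat a token as taken when it is a prefix of a boundary-like line (to stay safe with parsers that only compare the start of a line) are my assumptions, not part of the original suggestion. Note that it keeps every boundary-like line in memory, so it is not fixed-memory.

```python
from itertools import count, product
import string

# Assumed token alphabet; its order defines the "lexicographic" order used below.
ALPHABET = string.ascii_lowercase + string.digits


def boundary_like(lines):
    """Collect the text after '--' for every line that starts with '--'."""
    found = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("--"):
            found.add(line[2:])
    return found


def shortest_unused_boundary(lines, alphabet=ALPHABET):
    """Return the first token, by length and then alphabet order, that is
    not a prefix of any boundary-like line in the body."""
    taken = boundary_like(lines)
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            token = "".join(chars)
            if not any(t.startswith(token) for t in taken):
                return token
```

Because `lines` can be any iterable (for example an open file), the body is read once as a stream; only the lines that begin with `--` are retained.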

Furthermore, assuming that you have fewer than 26 parts, you can simply use single letters if you want the "shortest possible" boundaries. In this case the scanning could be done using a regex:

^--([a-z])$

This (in multiline mode) will match all single-letter "boundary-like" tokens in the email body.

Assuming you put the matched values in a hash set, you can then generate the tokens with something like

('a'...'z').where(!tokenHashSet.contains)

All the above is in pseudocode, hopefully it's clear.
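For what it's worth, the pseudocode could look roughly like this in Python. This is only a sketch: the function name is made up, and the optional `\r` in the pattern is an added assumption so that CRLF line endings, which are usual in mail, still match.

```python
import re
import string

# Same pattern as above, in multiline mode; the optional \r is an
# assumption added here so that CRLF-terminated lines still match.
SINGLE_LETTER = re.compile(r"^--([a-z])\r?$", re.MULTILINE)


def free_single_letters(body):
    """Return the letters a-z that never occur as a '--x' line in the body."""
    used = set(SINGLE_LETTER.findall(body))
    return [c for c in string.ascii_lowercase if c not in used]
```

Like the regex above, this only looks at lines of the exact form `--x`, and it needs the whole body as one string rather than a stream.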

Sklivvz