
I have some "tokenized" templates, for example (I call tokens the part between double braces):

var template1 = "{{TOKEN1}} is a {{TOKEN2}} and it has some {{TOKEN3}}";

I want to extract an array from this sentence, in order to have something like:

Array("{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}");

I've tried to achieve that with the following Regex code:

Regex r = new Regex(@"({{[^\}]*}})");
var n = r.Split(template1);

And the result is:

Array("",
      "{{TOKEN1}}",
      " is a ",
      "{{TOKEN2}}", 
      " and it has some ", 
      "{{TOKEN3}}",
      "");

The first issue was that I was not able to recover the tokens from the sentence. I solved this just by adding the parentheses to the Regex expression, even though I'm not sure why that solves it.

The issue I'm currently facing is the extra empty term at the beginning and/or end of the array when the first and/or last terms of the template are "tokens". Why is this happening? Am I doing something wrong, or should I always check these two positions for emptiness?

In my code, I will need to know which term came from a token and which was fixed text in the template. With this solution, I will have to check every array position for a string starting with "{{" and ending with "}}", which I don't think is the best approach. So, if someone comes up with a better way to break these things apart, I'll be glad to know!

Thank you!

Edit: as requested, I'll post a simple example of why I need this distinction between tokens and text.

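public abstract class TextParts
{
    // The raw text of this part, e.g. "{{TOKEN1}}" or " is a "
    public string Text { get; private set; }
    protected TextParts(string text) { Text = text; }
}
public class TextToken : TextParts { public TextToken(string text) : base(text) { } }
public class TextConstant : TextParts { public TextConstant(string text) : base(text) { } }

var list = new List<TextParts>();
list.Add(new TextToken("{{TOKEN1}}"));
list.Add(new TextConstant(" is a "));
list.Add(new TextToken("{{TOKEN2}}"));
/* and so on */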

This way, I'll have a list of the parts that compose my string, and I'll be able to record it in my database to allow future manipulation and substitution. In fact, each of these tokens will be replaced by a Regex string.

The objective is that users will be able to input messages like "{{SERVER}} is not listening on port {{PORT}}", and I'll be able to replace "{{SERVER}}" with [a-zA-Z0-9 ]+ and "{{PORT}}" with \d{1,5}. Makes sense?

I hope this makes the post clearer.

tyron
  • Why do you need to break apart the lines? Is it not enough to just replace the tokens? (because just removing empty strings from beginning and end WOULD be the best solution if your final goal is to retrieve the first array you posted; but I suppose you want to accomplish something with that array, and we could probably give better answers if we knew what you want to achieve ;)) – Martin Ender Oct 13 '12 at 18:00
  • I'm sorry for not answering your question before. In my models, I will need to create different objects for the tokens and for the text. I'm updating the last part of my post just to reflect this. – tyron Oct 13 '12 at 20:34

2 Answers


If you split a string along delimiters, and the string starts or ends with a delimiter, you get an empty element before the first delimiter or after the last one.

Imagine the following line in a CSV file:

,a,b,c,

That CSV row contains the elements "", "a", "b", "c", and "".
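To make the analogy concrete, here is a minimal sketch. The class and method names (`SplitDemo`, `SplitCsv`, `SplitTemplate`) are made up for illustration. It shows the empty elements a plain split produces, and one possible workaround if you prefer to stay with `Regex.Split`: filter out the empty strings afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Plain string split: a leading or trailing delimiter
    // produces an empty element at that end of the array.
    public static string[] SplitCsv(string row)
    {
        return row.Split(',');
    }

    // Regex split that keeps the delimiters (capturing group)
    // and then drops every empty string from the result.
    public static string[] SplitTemplate(string template)
    {
        return Regex.Split(template, @"(\{\{[^}]*\}\})")
                    .Where(part => part.Length > 0)
                    .ToArray();
    }
}
```

Note that the filter also removes the empty string that would appear between two adjacent tokens such as `{{A}}{{B}}`, which is probably what you want here.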

The same thing happens with your {{TOKEN}}. You could use a different method:

MatchCollection allMatchResults = null;
Regex regexObj = new Regex(@"\{\{[^{}]*\}\}|[^{}]+");
allMatchResults = regexObj.Matches(subjectString);

If single braces may occur within or between tokens, you can also use

Regex regexObj = new Regex(@"\{\{(?:(?!\}\}).)*\}\}|(?:(?!\{\{).)+");

which will be a bit less efficient, though, because of all the lookahead assertions, so you should use this only if you need to.
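Since you also need to know which parts are tokens, one option is to put the token alternative into a named group and check whether that group took part in each match. The following is only a sketch: the names `TemplateTokenizer`, `Tokenize` and the group name `token` are my own, and it uses the first (simpler) regex, so it assumes no single braces occur in the template.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateTokenizer
{
    // The named group "token" succeeds only when the {{...}} alternative matched.
    private static readonly Regex PartRegex =
        new Regex(@"(?<token>\{\{[^{}]*\}\})|[^{}]+");

    // Returns every part of the template in order, paired with a flag
    // that is true for tokens and false for constant text.
    public static List<KeyValuePair<string, bool>> Tokenize(string template)
    {
        var parts = new List<KeyValuePair<string, bool>>();
        foreach (Match m in PartRegex.Matches(template))
        {
            bool isToken = m.Groups["token"].Success;
            parts.Add(new KeyValuePair<string, bool>(m.Value, isToken));
        }
        return parts;
    }
}
```

With this, there is no need to test each string for a leading "{{" and trailing "}}"; the flag could be used directly to decide between creating a `TextToken` and a `TextConstant`.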

Edit: I just noticed that there was another question in your post: Why did you need to add parentheses around your regex to make it "work"? Answer: Usually, a split() command only returns the contents between the delimiters. If you enclose the delimiters (or parts thereof) in capturing parentheses, then whatever is matched within those parentheses will also be added to the resulting list.
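A small illustration of that difference, using `Regex.Split` with and without a capturing group (the class and method names are hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class CaptureSplitDemo
{
    // No capturing group: the delimiters (the tokens) are discarded.
    public static string[] WithoutGroup(string input)
    {
        return Regex.Split(input, @"\{\{[^}]*\}\}");
    }

    // Capturing group: the text matched by the group is
    // inserted into the result between the split pieces.
    public static string[] WithGroup(string input)
    {
        return Regex.Split(input, @"(\{\{[^}]*\}\})");
    }
}
```

For the input `a{{X}}b`, the first method returns two elements (`a`, `b`) and the second returns three (`a`, `{{X}}`, `b`).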

Tim Pietzcker
  • one should note that the reason this works is, that matches can never overlap (otherwise token contents would become separate matches) – Martin Ender Oct 13 '12 at 18:17
  • also this solution is problematic if single or unmatched curly brackets appear in the text – Martin Ender Oct 13 '12 at 18:18
  • @m.buettner: Good point. I'll add another regex for this contingency. – Tim Pietzcker Oct 13 '12 at 18:22
  • Really good explanation, as well as the examples provided! Thank you. I didn't think about your CSV example; it made it pretty obvious why there are empty elements. As pointed out by @m.buettner, the second example works better because of the single braces. But I didn't get whether there's any difference between your second regex + `regexObj.Matches()` and my simpler regex + `regexObj.Split()`. Is there any? – tyron Oct 13 '12 at 19:48
  • @tyron: There were a few minor differences in the way single braces would be treated. For example, in the situation `{{{TOKEN}}`, your `split()` would have split off the first `{` whereas mine would have made it part of the token. There was one error in my second regex that led to empty matches; I have now changed the final `*` to a `+` (as in the first regex), so now it should be OK. – Tim Pietzcker Oct 14 '12 at 06:54

Try this pattern; it will get your tokens out as matches.

\b*\{{2}\w+\}{2}\b*
Rob
  • This answer has several problems, the most important being that `\b` can't match where you expect it to unless there happens to be a word character immediately before and immediately after each token (e.g., `5{{TOKEN}}q`). The reason your regex *seems* to work is that the `*` allows it to match `\b` zero times (i.e., to ignore it). – Alan Moore Oct 14 '12 at 08:29
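
The point about `\b` can be checked with a quick sketch. The pattern below is the one from the answer with the `*` removed so that both word boundaries become mandatory; the class and method names are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Same pattern as in the answer, but with \b required (no '*').
    private const string StrictPattern = @"\b\{{2}\w+\}{2}\b";

    // True only if a token is found with a word boundary on both sides.
    public static bool MatchesStrict(string input)
    {
        return Regex.IsMatch(input, StrictPattern);
    }
}
```

`MatchesStrict("5{{TOKEN}}q")` is true because word characters touch the braces on both sides, while `MatchesStrict("{{TOKEN}} is a")` is false: there is no word boundary between the start of the string (or a space) and a `{`.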