I am trying to write a tokenizer that parses a text body (the input string) using a Regex. I want to split the input into individual tokens and store them in a List<Token>, where Token is a C# class like:
class Token {
    string value;
    string type; // "identifier", "string", "intliteral", ...
}
I want to use a regex like the one below for splitting up the input string:
public static Regex tokenPattern = new Regex (
@"
( (?<identifier>(?:\p{L}|_)\w*)
| (?<string>""[^""]*"")
| (?<intliteral>(?:-|\+)?\d+(?![.\d]))
| (?<realliteral>(?:-|\+)?\d+(?:\.\d+)?)
| (?<comma>,)
| (?<lpar>\()
| (?<rpar>\))
| ...
| (?<undefined>\S+)
)
",
RegexOptions.ExplicitCapture |
RegexOptions.IgnorePatternWhitespace |
...
);
My problem is that it is easy to obtain the value part of each Token, but there does not seem to be an easy way to get the type part, i.e. the group name. I expected that a Regex Group would have a Name property containing "identifier" etc., but that does not seem to be the case.
Is there a way to determine a group name without iterating over all group names/numbers for each token? That is, I am looking for an approach with complexity O(n) instead of O(nm), where n is the number of tokens in the input string and m is the number of token types.
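For reference, here is a minimal sketch of the per-token iteration I would like to avoid. It is only an illustration: the pattern is a simplified stand-in with three token types, the NaiveTokenizer class and its Tokenize method are hypothetical names, and the Token fields are made public so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Token
{
    public string Value;
    public string Type;
}

static class NaiveTokenizer
{
    // Simplified stand-in for the full pattern (assumption: three token types only).
    static readonly Regex TokenPattern = new Regex(
        @"(?<identifier>[\p{L}_]\w*) | (?<intliteral>\d+) | (?<comma>,)",
        RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);

    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        // GetGroupNames() also returns "0", the name of the whole-match group.
        string[] names = TokenPattern.GetGroupNames();

        foreach (Match m in TokenPattern.Matches(input))
        {
            // Inner loop over all group names: this is the O(nm) part.
            foreach (string name in names)
            {
                if (name != "0" && m.Groups[name].Success)
                {
                    tokens.Add(new Token { Value = m.Value, Type = name });
                    break;
                }
            }
        }
        return tokens;
    }

    static void Main()
    {
        foreach (Token t in Tokenize("foo, 42"))
            Console.WriteLine(t.Type + ": " + t.Value);
    }
}
```

This works, but every match probes the named groups until it finds the one that succeeded, which is the cost I am hoping to get rid of.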