How to implement an efficient tokenizer with a regex, using group names

Question

I am trying to write a tokenizer for parsing a text body (input string) using a Regex. What I want is to split the input into individual tokens and store these in a List<Token>, where Token is a (C#) class like

class Token {
  string value;
  string type; // "identifier", "string", "intliteral", ...
}

I want to use a regex like the one below for splitting up the input string:

public static Regex tokenPattern = new Regex (
@"
  ( (?<identifier>(?:\p{L}|_)\w*)
  | (?<string>""[^""]*"")
  | (?<intliteral>(?:-|\+)?\d+(?![.\d]))
  | (?<realliteral>(?:-|\+)?\d+(?:\.\d+)?)
  | (?<comma>,)
  | (?<lpar>\()
  | (?<rpar>\))
  | ...
  | (?<undefined>[^\s]*?)
  )
",
  RegexOptions.ExplicitCapture |
  RegexOptions.IgnorePatternWhitespace | 
  ...
);

My problem is that it is easy to obtain the value part of each Token, but there does not seem to be an easy way to get the type part, i.e. the group name. I expected that a Regex Group would have a Name property containing "identifier" etc., but that does not seem to be the case.

Is there a way to determine a group name without iterating over all group names/numbers for each token (i.e. an approach with complexity O(n) instead of O(nm), where n is the number of tokens in the input string and m the number of token types)?

According to [this documentation](https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.group.name?view=netframework-4.7.2#System_Text_RegularExpressions_Group_Name) the `Group` does have a `Name` property. But I don't know if that really helps you, since the `GroupCollection` seems to have all groups, matching or not. — rici, Oct 14 '18 at 13:51
Yes, that is the problem. The GroupCollection contains all the Group names, not only the names of the group(s) to which the match belongs — John Pool, Oct 14 '18 at 15:36
Writing lexers is probably not the primary expected use case of the regex library. There are tools like Flex available for generating efficient lexers, and I'm pretty sure you can find a C# port. But I've used the same technique as you describe here for writing q&d lexers in JavaScript and it works fine, although it's worthwhile reducing the number of patterns to a minimum. In JS you can do a lexical loop with global search&replace, because you can use a function as the replacement arg. I don't know if C# does that. — rici, Oct 14 '18 at 15:46
... but it's still O(nm) because the called function needs to check through its arguments for the one which actually matched. — rici, Oct 14 '18 at 15:47
Thanks for your answer. I am going to have a look at Flex. Actually TypeScript / JavaScript is my target language. — John Pool, Oct 15 '18 at 07:16
That would have been useful to know before. Your question makes it appear that C# is your target; it's a different language with a different regex library. However, JS doesn't help; its regex library has the same issue, as noted in my comment above. — rici, Oct 15 '18 at 17:42
As you said, that would not have made a lot of difference, I used C# for doing some tests, and I was aware that both regex libraries do not give group name(s) for matches. Thanks again -- John — John Pool, Oct 16 '18 at 07:24
The difference is that I have (old) JS code kicking around somewhere, and I would have dug it out had I known it was relevant. Although it's several versions of JS behind the times, so maybe it wouldn't have been that useful. — rici, Oct 16 '18 at 07:33
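
The search-and-replace loop mentioned in the comments can be sketched in TypeScript roughly as follows. This is only an illustration under assumptions: the token names come from the question, the patterns are simplified ASCII-only adaptations of the C# ones, and `rules`, `combined`, and `tokenize` are hypothetical names. The inner `for` loop is the per-token scan over the m groups that keeps the whole thing at O(nm).

```typescript
// Token shape from the question; `type` is the name of the rule that matched.
interface Token {
  value: string;
  type: string;
}

// Earlier rules win. The patterns must not contain capturing groups of their
// own, otherwise the group positions no longer line up with this array.
const rules: Array<[string, string]> = [
  ["identifier", "[A-Za-z_]\\w*"],
  ["string", '"[^"]*"'],
  ["realliteral", "[-+]?\\d+\\.\\d+"],
  ["intliteral", "[-+]?\\d+"],
  ["comma", ","],
  ["lpar", "\\("],
  ["rpar", "\\)"],
  ["whitespace", "\\s+"],
  ["undefined", "\\S"], // any other single character
];

// One capturing group per rule, joined into a single alternation.
const combined = new RegExp(rules.map(([, p]) => "(" + p + ")").join("|"), "g");

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  // replace() is used only for its callback; the returned string is discarded.
  input.replace(combined, (whole: string, ...args: unknown[]): string => {
    // args holds the capture groups first, then the offset and the input.
    // Scanning for the group that matched is the O(m) step per token.
    for (let i = 0; i < rules.length; i++) {
      if (args[i] !== undefined) {
        const type = rules[i][0];
        if (type !== "whitespace") tokens.push({ value: whole, type });
        break;
      }
    }
    return "";
  });
  return tokens;
}
```

Calling `tokenize('foo(12, 3.5, "hi")')` yields one `Token` per lexeme, with whitespace dropped. Putting the most frequent token types first in `rules` shortens the average scan, but the worst case still touches every group.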

score 0 · Answer 1 · answered Oct 21 '18 at 13:36

This would be a multi-phase operation, and having one regex do all of it would not be a good use of processor time. What I recommend is to divvy up the phases of the operation, such as:

1. Parse each value using a basic regex into the token.
2. Have a specific operation to identify what type of token is encountered and set that value accordingly.

You would most likely have to keep breaking down the steps after the 2nd step to achieve greater efficiency.

I have to agree with the sentiment that regex is not a tool for language processing beyond identifying individual tokens or, within the processing of a single token, identifying that token's attributes.
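
One possible reading of the two phases, sketched in TypeScript since that is the target language mentioned in the comments. The answer does not say how the second phase should classify a token; dispatching on the first and last character is an assumption of this sketch, and `lexeme`, `punctuation`, `classify`, and `tokenizeTwoPhase` are hypothetical names.

```typescript
interface Token {
  value: string;
  type: string;
}

// Phase 1: a basic pattern that only finds token boundaries (no named groups).
// Whitespace matches none of the alternatives, so it is skipped.
const lexeme = /"[^"]*"|[-+]?\d+(?:\.\d+)?|[A-Za-z_]\w*|\S/g;

// Single-character tokens are classified with a plain lookup.
const punctuation: Record<string, string> = { ",": "comma", "(": "lpar", ")": "rpar" };

// Phase 2: decide the type by inspecting the lexeme itself. The number of
// checks is fixed here; it does not grow when entries are added to the
// punctuation table.
function classify(value: string): string {
  const first = value.charAt(0);
  const last = value.charAt(value.length - 1);
  if (first === '"' && value.length > 1) return "string";
  if (/[A-Za-z_]/.test(first)) return "identifier";
  // Only number lexemes can still end in a digit at this point.
  if (last >= "0" && last <= "9") {
    return value.indexOf(".") >= 0 ? "realliteral" : "intliteral";
  }
  return punctuation[first] || "undefined";
}

function tokenizeTwoPhase(input: string): Token[] {
  const values = input.match(lexeme) || [];
  return values.map((value) => ({ value, type: classify(value) }));
}
```

For example, `tokenizeTwoPhase('f(x, -2, 1.5, "s") @')` classifies `-2` as `intliteral`, `1.5` as `realliteral`, and the stray `@` as `undefined`. The trade-off is that the token grammar is now written twice, once in `lexeme` and once in `classify`, and the two have to be kept in sync by hand.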

Thanks for your answer. I am aware of this two-stage approach, and my question was precisely how to avoid it. That is what I mean by O(n) rather than O(nm). — John Pool, Oct 22 '18 at 07:43
