I'm developing a little network command interpreter for .NET Micro Framework 4.3 running on a Netduino. I use a regular expression to parse user input that arrives from the network via a stream socket. Commands are in the following format:

<T1,0,CommandVerb=Payload>

That's a device address, a transaction ID which can be any integer, a command verb, followed by an equals sign followed by any text. The whole thing is delimited by angle brackets much like an XML tag, which helps with parsing.

Here's the regular expression I use:

    /*
     * Regex matches command strings in format "<Dn,TT,CommandVerb=Payload>"
     * D is the Device class
     * n is the device index
     * TT is a numeric transaction ID of at least 1 digit.
     * CommandVerb is, obviously, the command verb ;-)
     * Payload is optional and is used to supply any parameter values to the command.
     * 
     * N.B. Micro Framework doesn't support named captures and will throw exceptions if they are used.
     */

    const string CommandRegex = @"<(\w\d),(\d+),([A-Za-z]\w+)(=((\d+)|(.+)))?>";
    static readonly Regex Parser = new Regex(CommandRegex);

This expression is designed to tease out the various parts of the command so I can access them easily in code. The last part (=((\d+)|(.+)))? differentiates between a numeric payload and a text payload, or no payload at all.

This has been working well for me, and validates OK in ReSharper's regex validator. Here's the output I expect to get (I think this is subtly different from the results you'd get from the full NetFX; I had to work this out by trial and error):

        /* Command with numeric payload has the following groups
         * Group[0] contains [<F1,234,Move=12345>]
         * Group[1] contains [F1]
         * Group[2] contains [234]
         * Group[3] contains [Move]
         * Group[4] contains [=12345]
         * Group[5] contains [12345]
         * Group[6] contains [12345]  
         * -----
         * Command with text payload has the following groups:
         * Group[0] contains [<F1,234,Nickname=Fred>]
         * Group[1] contains [F1]
         * Group[2] contains [234]
         * Group[3] contains [Nickname]
         * Group[4] contains [=Fred]
         * Group[5] contains [Fred]
         * Group[7] contains [Fred]
         * -----
         * Command with verb only (no payload) produces these groups:
         * Group[0] contains [<F1,234,Stop>]
         * Group[1] contains [F1]
         * Group[2] contains [234]
         * Group[3] contains [Stop]
         */

...and it does work like that. Right up to the point where I tried to pass a URL as the payload. As soon as I have a dot (.) in my payload string, the regex breaks and I actually get back the third form, where it clearly thinks there's no payload at all. As an example:

<W1,0,HttpPost=http://deathstar.com/route>

What I expect to get back is the 'command with text payload' result, but what I actually get back is the 'command with no payload' result. If I take out the dot, then it parses as I expect and I get 'command with text payload'. As soon as I put the dot back in, then (ironically) .+ no longer seems to match.

Again note: this validates correctly in ReSharper's regex validator and appears to work on the normal 'desktop' framework as expected, but not in .NET Micro Framework. The Micro Framework regex implementation is a subset of the full version, but the documentation about what is supposed to work and what doesn't is pretty much non-existent.

I can't understand why .+ doesn't match text with a dot in it. Can anyone see why it's not working?
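
For comparison, the behaviour on the full framework can be checked with a minimal console sketch like the one below. It assumes desktop .NET's `System.Text.RegularExpressions` (not the Micro Framework port), and the class and method names (`DesktopRegexCheck`, `DescribeGroups`) are made up for illustration. It only shows what the desktop engine returns for the same pattern; it says nothing about why the Micro Framework differs.

```csharp
using System;
using System.Text.RegularExpressions;

// Desktop .NET only: reports what the full framework's regex engine captures.
public static class DesktopRegexCheck
{
    // Same pattern as in the question.
    const string CommandRegex = @"<(\w\d),(\d+),([A-Za-z]\w+)(=((\d+)|(.+)))?>";

    // Returns one line per group; groups that took no part in the match are marked.
    public static string[] DescribeGroups(string input)
    {
        Match match = Regex.Match(input, CommandRegex);
        if (!match.Success)
            return new[] { "No match" };

        var lines = new string[match.Groups.Count];
        for (int i = 0; i < match.Groups.Count; i++)
        {
            Group g = match.Groups[i];
            lines[i] = "Group[" + i + "]: " + (g.Success ? g.Value : "(not captured)");
        }
        return lines;
    }
}
```

Calling `DesktopRegexCheck.DescribeGroups("<W1,0,HttpPost=http://deathstar.com/route>")` on the desktop framework yields 8 lines, with the URL in Group[5] and Group[7] and Group[6] marked as not captured, which is the 'command with text payload' shape described above.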

UPDATE 1 - added diagnostics

Here's the output:

[Cmd Processor     ] Parser matched 8 groups
[Cmd Processor     ]   Group[0]: <W1,0,HttpPost=http://deat
[Cmd Processor     ]   Group[1]: W1
[Cmd Processor     ]   Group[2]: 0
[Cmd Processor     ]   Group[3]: HttpPost
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll

So it's not that Group[4] is null, it's throwing an ArgumentOutOfRangeException for that indexer, even though there are 8 groups. Also, Group[0] is mysteriously truncated. Hmmm...

UPDATE 2 - improved the diagnostic

I added this diagnostic method to my code, based on an answer from @Shar1er80:

    [Conditional("DEBUG")]
    static void PrintMatches(Match match)
        {
        if (!match.Success)
            {
            Dbg.Trace("No match", Source.CommandProcessor);
            return;
            }
        Dbg.Trace("Parser matched "+match.Groups.Count + " groups", Source.CommandProcessor);
        for (int i = 0; i < match.Groups.Count; i++)
            {
            string value;
            try
                {
                var group = match.Groups[i];
                value = group == null ? "null group" : group.Value ?? "null value";
                }
            catch (Exception ex)
                {
                value = "threw " + ex.GetType() + " " + (ex.Message ?? string.Empty);
                }
            Dbg.Trace("  Groups[" + i + "]: " + value, Source.CommandProcessor);
            }
        }

With the test input of <W1,0,HttpPost=http://deathstar.com> the output was:

[Cmd Processor     ] Parser matched 8 groups
[Cmd Processor     ]   Groups[0]: <W1,0,HttpPost=http://deaths
[Cmd Processor     ]   Groups[1]: W1
[Cmd Processor     ]   Groups[2]: 0
[Cmd Processor     ]   Groups[3]: HttpPost
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
[Cmd Processor     ]   Groups[4]: threw System.ArgumentOutOfRangeException Exception was thrown: System.ArgumentOutOfRangeException
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
[Cmd Processor     ]   Groups[5]: threw System.ArgumentOutOfRangeException Exception was thrown: System.ArgumentOutOfRangeException
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
[Cmd Processor     ]   Groups[6]: threw System.ArgumentOutOfRangeException Exception was thrown: System.ArgumentOutOfRangeException
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
[Cmd Processor     ]   Groups[7]: threw System.ArgumentOutOfRangeException Exception was thrown: System.ArgumentOutOfRangeException
A first chance exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll

Clearly that's not right, because 8 groups are reported but trying to access anything above Groups[3] throws an exception. The stack trace for the exception is:

    System.String::Substring
    System.Text.RegularExpressions.Capture::get_Value
    TA.NetMF.WeatherServer.CommandParser::PrintMatches
    TA.NetMF.WeatherServer.CommandParser::ParseCommand
    [snip]

I have opened an issue against .NET Micro Framework.

Tim Long
  • Why don't you turn `((\d+)|(.+))` to `(.+)` – Avinash Raj Jul 19 '15 at 14:06
  • Don't be surprised if you've found a bug, or if you're using unsupported parts of regex. Either way, could you strip this issue down to the bare minimum? Are you saying that if you have regex "(.+)" it won't match ".", or are you saying that if you have ((\d+)|(.+)) it won't match "."? Your regex seems okay to me. – Erti-Chris Eelmaa Jul 19 '15 at 14:34
  • It seems to be the whole last part `(=((\d+)|(.+)))?` that is the problem. I'm still trying to narrow it down. It's Group[4] that's giving me a problem, it is empty when I come to examine it. – Tim Long Jul 19 '15 at 15:05
  • I think you might be right that this could be a bug. It's not that Group[4] is null, it's throwing an `ArgumentOutOfRangeException` for that indexer - see update in question – Tim Long Jul 19 '15 at 16:09
  • @AvinashRaj: »The last part (=((\d+)|(.+)))? differentiates between a numeric payload and a text payload, or no payload at all.« Sounds about right to me. – Joey Jul 19 '15 at 16:24
  • OK this is clearly a bug, since it is reporting 8 matches then throwing exceptions on anything above 3. I have raised an issue against .Net Micro Framework: https://netmf.codeplex.com/workitem/2515 – Tim Long Jul 19 '15 at 17:20
  • I tried the suggestion to collapse my last term to just (.+) and when that didn't work I tried (.*). Ugh! The regex code is so broken that I get different results on multiple attempts using the same input string. Sometimes it will throw on one group, other times that group will work fine only to throw on the next one. UGH! – Tim Long Jul 20 '15 at 01:38
  • Just a wild stab in the dark, perhaps it's actually a stack overflow (no, I'm not trying to be funny). Regexes make notoriously heavy use of the stack. If it were me I'd ditch the whole Regex idea and write an iterative parser. – Peter Wone Aug 11 '15 at 08:24
  • In the end, I did ditch regular expressions completely. The more I looked, the more broken I found they were in .Net Micro Framework. In one case I found that my matches were just being arbitrarily truncated after thirty or forty characters. It seems like there might be some static memory allocation going on and when it runs out, it runs out. Whatever it is, beyond a simple boolean Match/No-match indication, RegEx's are very crippled in .net MF even though the API suggests otherwise. I think the code might have been written by an intern on his lunch break :( – Tim Long Aug 11 '15 at 21:50
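
The hand-written parser route suggested in the comments above can be sketched roughly as follows. This is not the code the author ended up with: `ParsedCommand` and `CommandScanner` are hypothetical names, the character checks are ASCII-only (narrower than `\w` on the desktop framework), and the payload ends at the first `>` rather than the last. It deliberately sticks to `IndexOf`, `Substring` and string indexing and avoids generics, but it has not been tested on the Micro Framework.

```csharp
using System;

// Hypothetical result type for illustration; not from the original code.
public class ParsedCommand
{
    public string Device;         // e.g. "W1"
    public string TransactionId;  // digits only, e.g. "0"
    public string Verb;           // e.g. "HttpPost"
    public string Payload;        // null when the command has no payload
}

public static class CommandScanner
{
    static bool IsDigit(char c) { return c >= '0' && c <= '9'; }
    static bool IsLetter(char c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
    static bool IsWordChar(char c) { return IsLetter(c) || IsDigit(c) || c == '_'; }

    // Returns null unless the text contains a well-formed <Dn,TT,Verb[=Payload]> command.
    // Unlike the regex, the payload ends at the FIRST '>' after the opening '<'.
    public static ParsedCommand Parse(string text)
    {
        if (text == null) return null;
        int start = text.IndexOf('<');
        if (start < 0) return null;
        int end = text.IndexOf('>', start + 1);
        if (end < 0) return null;
        int comma1 = text.IndexOf(',', start + 1);
        if (comma1 < 0 || comma1 > end) return null;
        int comma2 = text.IndexOf(',', comma1 + 1);
        if (comma2 < 0 || comma2 > end) return null;

        string device = text.Substring(start + 1, comma1 - start - 1);
        string id = text.Substring(comma1 + 1, comma2 - comma1 - 1);
        string rest = text.Substring(comma2 + 1, end - comma2 - 1);

        // Device: one word character followed by one digit, like \w\d.
        if (device.Length != 2 || !IsWordChar(device[0]) || !IsDigit(device[1])) return null;

        // Transaction ID: one or more digits, like \d+.
        if (id.Length == 0) return null;
        for (int i = 0; i < id.Length; i++)
            if (!IsDigit(id[i])) return null;

        // Split the verb from the optional payload at the first '='.
        int eq = rest.IndexOf('=');
        string verb = eq < 0 ? rest : rest.Substring(0, eq);
        string payload = eq < 0 ? null : rest.Substring(eq + 1);
        if (payload != null && payload.Length == 0) return null; // "Verb=" with nothing after it

        // Verb: a letter followed by at least one word character, like [A-Za-z]\w+.
        if (verb.Length < 2 || !IsLetter(verb[0])) return null;
        for (int i = 1; i < verb.Length; i++)
            if (!IsWordChar(verb[i])) return null;

        ParsedCommand cmd = new ParsedCommand();
        cmd.Device = device;
        cmd.TransactionId = id;
        cmd.Verb = verb;
        cmd.Payload = payload;
        return cmd;
    }
}
```

With this sketch, `CommandScanner.Parse("<W1,0,HttpPost=http://deathstar.com/route>")` returns a command whose `Payload` is the full URL, and `CommandScanner.Parse("<F1,234,Stop>")` returns one whose `Payload` is null. Telling numeric payloads from text payloads is left to the caller.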

1 Answer

Dot matches any character (other than a newline, by default). The last part, "(=((\d+)|(.+)))?>", means:

1. Create a tagged expression (the trailing '?' means it's optional).
2. It must start with an equals sign, and contain either
   2.1. an integer, or
   2.2. anything, of any size.

2.2 will match the rest of the expression, no matter what it is.

Then, when the time comes to match the trailing closing '>', if what followed '=' wasn't an integer, there's nothing left in the buffer. Ergo, no match.

Perhaps you could try something like the following instead for the last part:

"(=([^>]+))?>".

tamlin
  • Hmmm, Isn't `.+` non greedy by default? Therefore it should only match up to the next literal, no? Anyway, I'm desperate enough to try anything at this stage, so I did try it, and it still fails (interestingly it doesn't see the closing angle bracket). – Tim Long Jul 20 '15 at 01:52
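
For reference, on the full .NET Framework quantifiers such as `.+` are greedy by default, and `.+?` is the lazy form. A greedy `.+` still lets a trailing `>` match there, because the engine backtracks. The sketch below compares the three variants discussed here on desktop .NET; `QuantifierDemo` and `Capture` are illustrative names, and none of this is a statement about how the Micro Framework port behaves.

```csharp
using System;
using System.Text.RegularExpressions;

// Desktop .NET only: compares greedy, lazy and negated-class captures.
public static class QuantifierDemo
{
    // Returns the text captured by group 1, or null if the pattern does not match.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the input `<a=b>c>`, `Capture(@"=(.+)>", ...)` returns `b>c` (greedy, backtracks to the last `>`), while `Capture(@"=(.+?)>", ...)` and `Capture(@"=([^>]+)>", ...)` both return `b` (they stop at the first `>`).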