How to deal with overlapping character groups in different tokens in an EBNF grammar?

Question

I'm using an LL(k) EBNF grammar to parse a character stream. I need three different types of tokens:

CHARACTERS

  letter = 'A'..'Z' + 'a'..'z' .
  digit = "0123456789" .
  messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .

TOKENS

  num = ['-'] digit { digit } [ '.' digit { digit } ] .
  ident = letter { letter | digit | '_' } .
  message = messageChar { messageChar } .

The first two token declarations are fine, because they don't share any common characters.

However, the third, message, is invalid because some strings could be both num and message (such as "123"), and others could be both ident and message (such as "Hello"). Hence the tokenizer can't tell them apart.
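To make the overlap concrete, here is a minimal sketch that checks which token definitions accept a whole string. The regular expressions are my own translations of the three token definitions above (an assumption, not anything Coco/R generates), and `matching_tokens` is a hypothetical helper name.

```python
import re

# Hypothetical regex translations of the token definitions in the question.
NUM = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# messageChar: U+0020..U+007E minus space, '(' and ')'.
MESSAGE = re.compile(r"[\x21-\x27\x2a-\x7e]+")

RULES = [("num", NUM), ("ident", IDENT), ("message", MESSAGE)]


def matching_tokens(text):
    """Return the names of every token definition that matches all of text."""
    return [name for name, pattern in RULES if pattern.fullmatch(text)]


if __name__ == "__main__":
    for sample in ("123", "Hello", "a+b"):
        print(sample, matching_tokens(sample))
```

Any string that yields more than one name is one the scanner cannot classify from the characters alone.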

Another example is differentiating between integers and real numbers. Unless you require all real numbers to have at least one decimal place (meaning 1 would need to be encoded as 1.0, which isn't an option for me), I can't get the grammar to distinguish between these two numeric types. I've had to express all values as reals and do the checking after the fact. That's fine, but sub-optimal. My real problem is with the message token. I can't find a workaround for that.

So the question is, can I do this with an LL(k) EBNF grammar? I'm using Coco/R to generate the parser and scanner.

If I can't do it with LL(k) EBNF, then what other options might I look into?

EDIT: This is the output I get from Coco/R:

Coco/R (Apr 23, 2010)
Tokens double and message cannot be distinguished
Tokens ident and message cannot be distinguished
...
9 errors detected

Andre Artus · Accepted Answer · 2011-04-09T18:56:41.833

Try this:

CHARACTERS

    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
    message = messageChar { messageChar } CONTEXT (")") .

Note that '\u0020' is the Unicode SPACE character, which you then remove again with "- ' '", so the range could simply start at '\u0021'. You can also use CONTEXT (')') if you don't need more than one character of lookahead. However, this does not work in your case, since all of the tokens above can appear before a ')'.

FWIW: CONTEXT does not consume the enclosed sequence, you must still consume it in your production.
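The "does not consume" behaviour is analogous to a regex lookahead. The sketch below is only an analogy in Python, not Coco/R's implementation; `scan_message` is a hypothetical helper and the character class is my translation of messageChar.

```python
import re

# message followed by ')' as lookahead: the ')' must be there,
# but it is not part of the match and is not consumed.
MESSAGE_BEFORE_PAREN = re.compile(r"[\x21-\x27\x2a-\x7e]+(?=\))")


def scan_message(text, pos=0):
    """Return (lexeme, next_position) or None if no message precedes a ')'."""
    match = MESSAGE_BEFORE_PAREN.match(text, pos)
    if match is None:
        return None
    return match.group(0), match.end()


if __name__ == "__main__":
    lexeme, next_pos = scan_message("hello)")
    # The closing parenthesis is still waiting at next_pos.
    print(lexeme, repr("hello)"[next_pos]))
```

Because the ')' is left in the input, the production still has to match it explicitly after the message token.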

EDIT:

Ok, this seems to work. Really, I mean it this time :)

CHARACTERS
    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
//    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
//    message = letter { messageChar } CONTEXT (')') .

// MessageText<out string m> = message               (. m = t.val; .)
// .

HearExpr<out HeardMessage message> =
    (.
        TimeSpan time; 
        Angle direction = Angle.NaN; 
        string messageText = ""; 
    .)
    "(hear" 
    TimeSpan<out time>
        ( "self" | AngleInDegrees<out direction> )
//         MessageText<out messageText>
    {
        ANY (. messageText += t.val; .)
    }
    ')'
    (. 
        message = new HeardMessage(time, direction, new Message(messageText)); 
    .)
    .

ANY reads characters until it hits ')' or whitespace, so t.val holds one whitespace-delimited chunk per iteration. I put it in a loop that concatenates each value, but you may not want to do that. You will probably still want the loop, though, so that for input such as "over here" the parser consumes both chunks rather than stopping after "over". You can then do a simple length check on messageText, or other validity checks, such as adding each t.val to a List and checking the count. You can also test messageText against a regex to make sure it matches whatever pattern you require.
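The collect-then-validate idea can be sketched outside Coco/R as follows. This is an illustration in Python under my own assumptions: the input is already split into chunks, `collect_message` and `validate` are hypothetical names, and the 20-character limit is a placeholder rather than something this answer specifies.

```python
import re

# Printable ASCII without space, '(' and ')': my translation of messageChar.
MESSAGE_PATTERN = re.compile(r"[\x21-\x27\x2a-\x7e]+")


def collect_message(tokens, start):
    """Gather chunks from tokens[start:] up to, but not including, the next ')'."""
    parts = []
    index = start
    while index < len(tokens) and tokens[index] != ")":
        parts.append(tokens[index])
        index += 1
    if index == len(tokens):
        raise ValueError("missing ')' after message")
    return parts, index  # index points at the ')'


def validate(parts, max_len=20):
    """Accept exactly one chunk that fits the pattern and the assumed length limit."""
    text = "".join(parts)
    return (
        len(parts) == 1
        and len(text) <= max_len
        and MESSAGE_PATTERN.fullmatch(text) is not None
    )


if __name__ == "__main__":
    tokens = ["(hear", "0.4", "self", "over", "here", ")"]
    parts, close_index = collect_message(tokens, 3)
    print(parts, close_index, validate(parts))
```

Keeping the chunks in a list rather than concatenating immediately makes the count check trivial, which is the List-based variant suggested above.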

EDIT (8 Apr 2011): Example using Coco/R with integers and reals

COMPILER Calculator
CHARACTERS
    digit       = "0123456789".

TOKENS
    intNumber    = ['-'] digit { digit } .
    realNumber   = ['-'] { digit } "." digit { digit } 
                         [("e" | "E") ["+" | "-"] digit {digit}] .

PRODUCTIONS
    Calculator  = { Expression "=" } .
    Expression  = Term { "+" Term | "-" Term }.
    Term        = Factor { "*" Factor | "/" Factor }.
    Factor      = intNumber | realNumber .

END Calculator.
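The two tokens in the grammar above stay distinguishable because realNumber requires a "." and intNumber cannot contain one. A small Python sketch of the same two definitions as regular expressions (my translation; `classify` is a hypothetical helper):

```python
import re

# Regex translations of the intNumber and realNumber tokens above.
INT_NUMBER = re.compile(r"-?[0-9]+")
REAL_NUMBER = re.compile(r"-?[0-9]*\.[0-9]+(?:[eE][+-]?[0-9]+)?")


def classify(text):
    """Return which numeric token matches the whole text, or None."""
    if REAL_NUMBER.fullmatch(text):
        return "realNumber"
    if INT_NUMBER.fullmatch(text):
        return "intNumber"
    return None


if __name__ == "__main__":
    for sample in ("1234", "12.5", ".5e-3", "12."):
        print(sample, classify(sample))
```

No string can match both patterns, so the order of the two checks in `classify` does not matter.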

EDIT (9 Apr 2011)

Factor<out double value>
    (. value = 0.0; .)
= 
    ( 
        intNumber 
        (. value = Convert.ToDouble(t.val); .)
        | 
        realNumber 
        (. value = Convert.ToDouble(t.val); .)
    ) 
    | "(" Expression<out value> ")"         
.

or

Factor<out double value>
    (. value = 0.0; .)
=
    ( intNumber | realNumber ) 
    (. value = Convert.ToDouble(t.val); .)
    | "(" Expression<out value> ")"
.

Hey Andre, thanks for this. I've only just gotten around to revisiting this issue and testing your code out. I tried a bunch of things but your answer was the only one that worked. It seems that the Coco/R scanner is fairly limited. For example, it's not possible to have tokens for int and float types, as they overlap in the same way. Anyway, thanks again! — Drew Noakes, Apr 08 '11 at 13:23
Coco/R is limited in that it is LL(1), but you can have tokens for ints and floats, as shown in the example I added. You just need a way to differentiate. — Andre Artus, Apr 08 '11 at 14:43
Ah ok I see what I was missing. I guess what I was hoping for was that a `realNumber` token might have an optional decimal place and real part. In your example `1234` is not a real number, even though mathematically it is. I guess you could define a production `Real` that has the same definition as `Factor` in your update. I will play with this some more soon. Thanks again for your expert help with this. I'd vote for you twice if I could. — Drew Noakes, Apr 08 '11 at 23:30
@Drew: You are not really bound by types at this level. You could define a number token that matches ints, reals, etc. Then compute types in the parser or by walking the AST. You can promote ints to reals where it makes sense to do so. I'll put up another example. — Andre Artus, Apr 09 '11 at 18:35
@Andre, thanks for your update. What you're displaying isn't so far from what I am doing right now. I was hoping to capture the fact that a real is invalid where an int is required, and have Coco/R throw some kind of error without having to explicitly do so in my attributions. I might have a play with this, but as I'm parsing s-expressions, I'm hoping I can find a better solution than Coco/R (specifically one that doesn't require loading the entire string before parsing.) — Drew Noakes, Apr 10 '11 at 04:08
@Drew: You would of course only check for both classes (int & real) where both are valid, and check for int in the production where only ints are valid. It would actually be simpler to hand code a scanner/parser for the [Robots 3D] language: the rules are very simple. Last year I sent an email to an address you had on your website, I don't know whether you received it. — Andre Artus, Apr 10 '11 at 07:24
@Drew: Most parsers work with scanners that produce a stream of tokens, but it does not have to be so. You can write a scanner/parser pair such that the parser requests a specific class depending on context, e.g. GetWS, GetInteger, GetReal, GetKeyword, GetIdent --as opposed to just GetToken. Similar to what you would do if you wrote a normal recursive-descent scanner, but you make these methods public. You can still have a catchall (GetToken) if it makes sense. I'll be happy to write an example for you. — Andre Artus, Apr 10 '11 at 07:38
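
The context-driven scanner described in the last comment can be sketched roughly like this. It is a minimal illustration in Python, not code from the thread: the class and the method names `get_integer`, `get_real` and `get_ident` are my own renderings of GetInteger, GetReal and GetIdent, and the patterns are assumptions.

```python
import re


class Scanner:
    """Scanner whose caller asks for a specific token class at each step."""

    _WS = re.compile(r"\s*")
    # The lookahead stops an integer request from accepting the "1" of "1.5".
    _INT = re.compile(r"-?[0-9]+(?![0-9.])")
    _REAL = re.compile(r"-?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)")
    _IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _take(self, pattern, what):
        # Skip whitespace, then require the requested token class right here.
        self.pos = self._WS.match(self.text, self.pos).end()
        match = pattern.match(self.text, self.pos)
        if match is None:
            raise SyntaxError(f"expected {what} at offset {self.pos}")
        self.pos = match.end()
        return match.group(0)

    def get_integer(self):
        return int(self._take(self._INT, "integer"))

    def get_real(self):
        # An integer lexeme is accepted here too and promoted to a real.
        return float(self._take(self._REAL, "real"))

    def get_ident(self):
        return self._take(self._IDENT, "identifier")


if __name__ == "__main__":
    scanner = Scanner("count 12 scale 3")
    print(scanner.get_ident(), scanner.get_integer())
    print(scanner.get_ident(), scanner.get_real())
```

Because the parser names the class it expects, a real where an int is required fails in the scanner itself, and overlapping lexemes such as "123" never need to be classified out of context.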

score 2 · Answer 2 · answered Jun 21 '10 at 06:22

You may want to look into a PEG generator which has context sensitive tokenization.

http://en.wikipedia.org/wiki/Parsing_expression_grammar

I cannot think of a way you will get around this using Coco/R or similar, as each token needs to be unambiguous.

If messages were surrounded by quotes, or some other way of disambiguating then you would not have a problem. I really think PEG may be your answer, as it also has ordered choice (first match).
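
As a rough illustration of ordered choice combined with context-dependent tokenization, the sketch below tries alternatives in order and takes the first that matches, and only offers the message rule where a message is expected. It is plain Python, not the output of any PEG tool; the names and patterns are assumptions.

```python
import re

NUM = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MESSAGE = re.compile(r"[\x21-\x27\x2a-\x7e]+")

# Which alternatives are tried depends on where the parser is.
VALUE_CONTEXT = [("num", NUM), ("ident", IDENT)]
MESSAGE_CONTEXT = [("message", MESSAGE)]


def ordered_choice(text, pos, alternatives):
    """Return (name, lexeme, end) for the first alternative matching at pos."""
    for name, pattern in alternatives:
        match = pattern.match(text, pos)
        if match:
            return name, match.group(0), match.end()
    return None


if __name__ == "__main__":
    print(ordered_choice("123", 0, VALUE_CONTEXT))
    print(ordered_choice("123", 0, MESSAGE_CONTEXT))
```

The same input "123" is a num in one context and a message in the other, which is exactly the distinction a context-free scanner cannot make.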

Also take a look at:

http://tinlizzie.org/ometa/

Andre Artus

Awesome. This sounds exactly like what I need. I managed to put this off until now, so your answer is timed perfectly. I was considering merging all tokens into a generic 'symbol' definition, but this sounds much better. Will let you know how I get on. Can you comment upon any potential performance impact? – Drew Noakes Jun 21 '10 at 10:10
You may find it slightly slower, depending on the parser generator. It should be really easy to craft something by hand if speed is a concern. If you can tell me which language/platform you intend to build against (e.g. Java/JVM, C#/.NET, C++) then I may be able to make some recommendations. – Andre Artus Jun 21 '10 at 20:38
@Drew: If you can put up a sanitized example of the input you want to process, then that may help too. When I design a DSL I tend to write a few samples first and work back from there (the samples also come to serve as input for some unit tests). – Andre Artus Jun 21 '10 at 20:52
@Andre: Actually I'm parsing someone else's format. It's a series of SExpressions. Each SExpression should be turned into a different object type. I asked a different question (http://stackoverflow.com/questions/3051254/) about parsing SExpressions explicitly, as maybe a full-blown grammar isn't necessary for such a simply structured data format. You can see examples of the data here: http://simspark.sourceforge.net/wiki/index.php/Perceptors there are several repeating patterns. For example, `(pol )` should map to my `PolarCoordinate` type. – Drew Noakes Jun 23 '10 at 16:10
I have a character stream of SExpressions: `(...)(...)(...)...`. Ideally I'd like to process the stream directly, and spit out one object for each expression in the series. – Drew Noakes Jun 23 '10 at 16:13
@Drew: What do you want to do when there is incorrect data on the stream? That is, do you need some kind of error recovery, or do you bail out? – Andre Artus Jun 23 '10 at 20:47
It will be a bit difficult to describe a possible solution in the comments, so I might have to either amend my original answer, or create a new one. The format looks very simple (if it's exactly like the "Perceptors" one). I would not even bother with a parser generator; it's going to take longer to sort that out than to code the solution by hand. – Andre Artus Jun 23 '10 at 21:10
@Drew, what language are you coding in? If it's something I know I may be able to give you code you can use. – Andre Artus Jun 23 '10 at 21:11
@Drew: is it possible to start reading the stream in the middle of an incomplete message e.g. "torso) (rt 0.01 0.07 0.46))". That is, is the stream character, or message, based? – Andre Artus Jun 23 '10 at 23:42
Hi @Andre, the stream is message based. I read a four-byte length value, then read that many one-byte ASCII characters, then the same again. So it's not possible to start reading from the middle of a stream (good question though), and if something goes wrong in the parsing then recovery would be to fast-forward the appropriate number of bytes and start the next message. Recovery might just move forward to the next top-level SExpression. The data comes over TCP from a server that doesn't have too many surprises in store so I'm not overly concerned about errors in the stream. – Drew Noakes Jun 24 '10 at 03:07
Actually we're kind of diverging away from the question above into the other question I posted. If you think you need a second answer, then you might post there (http://stackoverflow.com/questions/3051254/). I'm developing this in C# and already have a parser generator that works, except for the HearPerceptor (http://simspark.sourceforge.net/wiki/index.php/Perceptors#Hear_Perceptor) which has a 20-byte payload of characters ranging [0x20; 0x7E]. I can't make a rule at the token level that covers that range without overlapping with `ident` and `num`. – Drew Noakes Jun 24 '10 at 03:12
The project I'm working on is open source. You can see the parser grammar here: http://code.google.com/p/tin-man/source/browse/trunk/TinMan/PerceptorParsing/perceptors.atg One of the reasons I am looking at using a grammar is because there's another SExp format I might need to parse later which is a bit more involved: http://simspark.sourceforge.net/wiki/index.php/Network_Protocol#Server.2FMonitor_Communication BTW you seem to be a parsing expert, and I really appreciate you taking the time to help me with this. – Drew Noakes Jun 24 '10 at 03:19
@Drew: It is good to know what your inputs are. I had devised a whole scheme to handle partial data, and now you don't need it :D. I agree that this is starting to diverge from the question above, and I will be happy to post the answer in another question. Let me take a look at what you have, as there may be ways to sort it out now that I know what we are dealing with. As to being an expert, I'm actually more of an enthusiast. If you feel that this has run its course then perhaps it's time to close the question. I will post in either the linked Q or a new more specific one. – Andre Artus Jun 24 '10 at 04:05
@Drew: You seem to be working on some cool stuff. Is this an AI project? – Andre Artus Jun 24 '10 at 04:06
I see you have quotes around your "MessageText" which is not part of the original spec, is this to get around some issues? – Andre Artus Jun 24 '10 at 04:14
Is there a reason why you are not wrapping with BufferedStream? – Andre Artus Jun 24 '10 at 04:25
The sun is coming up, so it is time for me to go to bed. I will continue looking at this tonight. – Andre Artus Jun 24 '10 at 04:38
@Drew: I posted an answer on the linked question. But come to think of it it may apply here too. – Andre Artus Jun 24 '10 at 05:34
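
The framing described in the comments (a four-byte length, then that many one-byte ASCII characters, repeated) can be read with a loop like the one below. This is a sketch in Python rather than the C# used in the project, and it makes two assumptions the thread does not state: the length is an unsigned big-endian integer, and the stream is a buffered binary stream whose `read(n)` returns fewer than n bytes only at end of input. `read_messages` is a hypothetical name.

```python
import io
import struct


def read_messages(stream):
    """Yield each length-prefixed ASCII message from a binary stream."""
    while True:
        header = stream.read(4)
        if len(header) == 0:
            return  # clean end of stream
        if len(header) < 4:
            raise EOFError("truncated length prefix")
        # Assumption: unsigned 32-bit length in big-endian (network) order.
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated message body")
        yield payload.decode("ascii")


if __name__ == "__main__":
    data = struct.pack(">I", 5) + b"(a b)" + struct.pack(">I", 3) + b"(c)"
    for message in read_messages(io.BytesIO(data)):
        print(message)
```

Since every message arrives whole, a parse failure can be handled by discarding that message and moving on to the next length prefix, as suggested in the comments.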

score 1 · Answer 3 · answered Jun 15 '10 at 14:44

Despite the title, this all seems to relate to the scanner, not the parser. I haven't used Coco/R, so I can't comment on it directly, but in a typical (e.g., lex/Flex) scanner the longest match wins and, when two rules match the same length, the rule listed first is chosen. Most scanners I've written include a '.' (i.e., match anything) as their last pattern, to display an error message if some input doesn't match any other rule.
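
An ordered rule list with a catch-all last rule can be sketched as follows. This is a toy Python scanner, not lex/Flex itself; it assumes the longest match wins and that rule order breaks ties, and the rule names and patterns are my own.

```python
import re

# Rules in priority order; the last one catches any single character.
RULES = [
    ("num", re.compile(r"-?[0-9]+(?:\.[0-9]+)?")),
    ("ident", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("message", re.compile(r"[\x21-\x27\x2a-\x7e]+")),
    ("error", re.compile(r".", re.DOTALL)),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(tokenize("123 Hello a+b ("))
```

With this policy "123" and "Hello" resolve to num and ident by rule order, which also means a message consisting only of digits or letters would never be reported as a message.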

Jerry Coffin

In Coco/R you specify the tokens and grammar all in one file. Coco/R seems to be checking for this ambiguity. I've tried reordering my declarations but haven't seen any difference. I'll try a few more times. – Drew Noakes Jun 15 '10 at 14:59
