How to parse a line with double character tokens

Question

I'm trying to write an Xtext parser for a simple markup language. The markup uses double characters to style text; for example, !! marks bold. I'm struggling to work out the grammar, in particular how to handle the double-character symbols. As an example:

The following text !!is bold! !! but not this.

I want to parse this into the following AST:

Lines
- Line
  - Text "The following text "
  - BoldText "is bold! "
  - Text " but not this."
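To make the target shape concrete, here is a minimal Python sketch (not Xtext) of the AST I am after. The names Text, BoldText, parse_line and parse_lines are placeholders for illustration only, and the sketch assumes the !! markers are always balanced within a line:

```python
from dataclasses import dataclass


@dataclass
class Text:
    text: str


@dataclass
class BoldText:
    text: str


def parse_line(line, marker="!!"):
    """Split one line on the marker; odd-numbered segments are bold."""
    nodes = []
    for index, segment in enumerate(line.split(marker)):
        if not segment:
            continue  # skip empty pieces, e.g. a line that starts with the marker
        nodes.append(BoldText(segment) if index % 2 else Text(segment))
    return nodes


def parse_lines(source):
    """One list of nodes per input line, split on newlines."""
    return [parse_line(line) for line in source.splitlines()]
```

Whitespace inside each segment is kept as-is. An unbalanced marker is not detected here: everything after it would simply be treated as bold.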

Does anyone have any good approaches?

Should I use:

terminal BOLD: '!!';

or

Bold: '!' '!';

I think I have to use the second rule: to handle this, I would need single-character terminals and then use parser rules for everything.

My grammar at the moment is:

  grammar org.xtext.example.mydsl.MyDsl

  import "http://www.eclipse.org/emf/2002/Ecore" as ecore

  generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"

  Lines:
      lines+=Line*
  ;

  Line:
        {Line} content+=(PlainText|BoldText)*
        NL
  ;

  PlainText:
        text =  Text
  ;

  Text returns ecore::EString:
        (CHAR|WS)+
  ;

  BoldText:
        BOLD
        {BoldText} text += PlainText*
        BOLD
  ;

  terminal BOLD: '!!';

  terminal WS: (' ' | '\t')+;

  terminal NL: '\r'? '\n';

  terminal CHAR: !(' '|'\t'|'\r'|'\n');

But this produces warnings, because a run of characters can be matched either as repetitions of PlainText or as (CHAR|WS)+ inside Text, and I don't know how to get rid of them.

I forgot to mention that I need to capture white space and split on new lines. — Daniel Walton, Oct 29 '15 at 06:21

score 1 · Accepted Answer · answered Oct 29 '15 at 06:31

I would suggest defining the terminal as '!!' (the first option); a parser rule matching '!' followed by another '!' (the second option) should also work for this use case.

How should your parser behave when it sees "!!!" in a row? Most likely it will group the first two characters as "!!" and leave the third as a literal '!'. I would suggest adding a way to escape '!', e.g. "\!", so that "\!!!" reads as a literal '!' followed by the '!!' terminal. Another idea would be to implement some form of recursion so that only the rightmost pair is taken as the '!!' terminal.
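As a rough illustration of that left-to-right grouping, here is a minimal Python sketch rather than Xtext. It assumes the lexer prefers '!!' over a single '!' wherever both could match, which the sketch emulates by listing BOLD before CHAR in the alternation. BOLD, WS, NL and CHAR mirror the terminals in the question; ESC and the tokenize function are hypothetical, and the "\!" escape is only the suggestion above, not part of the original grammar.

```python
import re

# Alternatives are tried in order, so ESC and BOLD win over a lone CHAR.
TOKEN_RE = re.compile(r"""
    (?P<ESC>\\!)           # hypothetical escape: backslash followed by '!'
  | (?P<BOLD>!!)           # double-character bold marker
  | (?P<NL>\r?\n)          # line break
  | (?P<WS>[ \t]+)         # run of spaces or tabs
  | (?P<CHAR>[^ \t\r\n])   # any other single character, including a lone '!'
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, lexeme) pairs, scanning left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # e.g. a bare carriage return, which no alternative accepts
            raise ValueError(f"cannot tokenize input at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With this ordering, tokenize("!!!") yields a BOLD token followed by a CHAR token, while the four characters \!!! yield ESC followed by BOLD. Taking the rightmost pair instead would need lookahead or a parser-level rule, which this sketch does not attempt.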

Best of luck!

thanks for the answer. nice point, I'm not sure on that behaviour, probably the same as in markdown *****bold me*** — Daniel Walton, Oct 29 '15 at 08:32
