I'm making a custom language support plugin according to this tutorial, and I'm stuck on a few .bnf concepts. Let's say I want to parse a simple calculator language that supports `+`, `-`, `*`, `/`, unary `-`, and parentheses. Here's what I currently have:

Flex:

package com.intellij.circom;

import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;

%%

%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{  return;
%eof}

WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+

%%

{WHITESPACE}    { return TokenType.WHITE_SPACE; }
{NUMBER}        { return CircomTypes.NUMBER; }

Bnf:

{
  parserClass="com.intellij.circom.parser.CircomParser"

  extends="com.intellij.extapi.psi.ASTWrapperPsiElement"

  psiClassPrefix="Circom"
  psiImplClassSuffix="Impl"
  psiPackage="com.intellij.circom.psi"
  psiImplPackage="com.intellij.circom.psi.impl"

  elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
  elementTypeClass="com.intellij.circom.psi.CircomElementType"
  tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}

expr ::=
   expr ('+' | '-') expr
  | expr ('*' | '/') expr
  | '-' expr
  | '(' expr ')'
  | literal;
literal ::= NUMBER;

First, it complains that expr is recursive. How do I rewrite it so that it is not recursive? Second, when I compile and run it, the IDEA test instance freezes while trying to parse this syntax; it looks like an endless loop.

Poma

1 Answer

Calling the grammar files "BNF" is a bit misleading, since they are actually written in a modified PEG (parsing expression grammar) format. PEG allows certain extended operators, including grouping, repetition, optionality, and ordered choice (which is semantically different from the usual unordered meaning of |).

Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.

As far as I can see from the documentation I found, Grammar-Kit does not provide for operator precedence declarations. If you really need to produce a correct parse that takes operator precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:

expr  ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )

(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
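
As a rough sketch of that split, assuming Grammar-Kit accepts the same operators as in the grammar above, the two-level version might look like this. The rule names `root`, `additive` and `multiplicative` are made up for illustration; only `unary`, `literal` and `NUMBER` come from the grammars already shown.

```bnf
// Hypothetical rule names; only unary, literal and NUMBER appear in the original grammars.
root           ::= additive
// Lowest precedence: a chain of operands joined by '+' or '-'.
additive       ::= multiplicative (('+' | '-') multiplicative)*
// Higher precedence: a chain of operands joined by '*' or '/'.
multiplicative ::= unary (('*' | '/') unary)*
// Unary minus binds tightest; parentheses restart at the lowest level.
unary          ::= '-'* ( '(' additive ')' | literal )
literal        ::= NUMBER
```

With this shape, `1+2*3` would put `2*3` under a single `multiplicative` node. None of the rules is left-recursive: each level starts with the next-tighter level, and the only recursion back to `additive` happens after a `'('` has been consumed. The repetition gives a flat list of operands per level, so any left-to-right grouping is left to whatever consumes the tree.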

Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.
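
One way to set that up, sketched under the assumption that the `tokens` attribute of the Grammar-Kit header is used to declare the operators, is shown below. The token names (`PLUS`, `MINUS` and so on) are invented for illustration. A hand-written .flex file would then need a matching rule for each of them, along the lines of `"+" { return CircomTypes.PLUS; }`.

```bnf
{
  // Hypothetical token names; each maps a name to the literal text used in the rules.
  // This list would sit alongside the existing attributes in the grammar header.
  tokens=[
    PLUS='+'
    MINUS='-'
    MUL='*'
    DIV='/'
    LPAREN='('
    RPAREN=')'
  ]
}
```

With these declared, a quoted `'+'` in a rule should refer to the `PLUS` token type rather than to raw text, which is what the lexer and parser need to agree on.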

rici
  • That's a great idea, but I can't get it to work. It doesn't compile as is (code generation fails). I've tried adding `root ::= expr` as the first line, and now parsing `1+2` returns `expected +,-,*,/, , got '+2'` – Poma May 16 '19 at 21:11
  • @poma: I don't know how you are generating your lexical scanner, but it looks to me like it's not returning single-character tokens by default. You probably have to add some rule which explicitly does that. I added that to the answer. – rici May 16 '19 at 23:45