
For example, consider the following grammar:

source_file: $ => $._expression,

_expression: $ => choice(
  $.identifier,
  $.operator
),

identifier: $ => /\w*[A-Za-z]\w*/,

operator: $ => seq(
  repeat1(seq($._expression, '\\X')),
  $._expression
)

So if I have the input string a \X b \X c \X d, I want it to match as:

(source_file
  (operator
    (identifier)
    (identifier)
    (identifier)
    (identifier)))

However, the only way I can actually get this behavior is to do the following:

operator: $ => choice(
  $._operator_2,
  $._operator_3,
  ...
),

_operator_2: $ => prec(1, seq(
  $._expression, '\\X', $._expression
)),

_operator_3: $ => prec(2, seq(
  $._expression, '\\X', $._expression, '\\X', $._expression
)),

...

So I have to hardcode a rule for every expression length, with precedence increasing as the length increases, and I can't figure out how to write a catch-all _operator_n rule. How do I accomplish this? Would some combination of declaring a conflict and then assigning dynamic precedence work?

ahelwer

1 Answer

I realize that this question is a bit stale at this point, but I stumbled across it as I was in the process of learning tree-sitter myself and it seemed like an interesting challenge.

Based on the expected output provided, it appears that the intent of the operator expression is to consume all expressions that are separated by \X tokens. The grammar is ambiguous because it defines the operator expression itself as one of the expressions that could be consumed by an operator expression. As a result, it is impossible for the parser to figure out how it should group the expression sequence.

You can convince tree-sitter to generate a valid parser by applying precedence and associativity, but the best you can accomplish with this is to force the parser to break the sequence up into a series of operator expressions where each expression has at most one \X operator token. The operator rule is a visible node so, instead of the expected result, you end up with something like the following (for a right-associative operator):

(source_file
  (operator
    (operator
      (operator
        (identifier)
        (identifier))
      (identifier))
    (identifier)))

However, the expected result indicates that the operator expression should never nest at all, but should produce a single operator node containing a list of all the delimited expressions. The implication is that the \X tokens don't delimit arbitrary expressions, as the grammar currently defines them, but only expressions other than another operator expression.

Therefore, the simplest solution to the ambiguity seems to be to separate the expressions into two types: the operator expression, and all other "non-operator" expressions. You can then define the operator expression so that it only repeats the non-operator expressions. In my testing, the following grammar rules will produce the expected output.

source_file: $ => $._expression,

_expression: $ => choice(
  $.operator,
  $._non_operator_expression,
),

_non_operator_expression: $ => choice(
  $.identifier,
  // [maybe others]
),

operator: $ => seq(
  repeat1(seq($._non_operator_expression, '\\X')),
  $._non_operator_expression,
)
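
To make the effect of that split easier to see outside of tree-sitter, here is a minimal hand-written sketch in plain JavaScript. It is not generated by tree-sitter and does not use its API; the function names (tokenize, parseNonOperator, parseExpression, parseSourceFile) are hypothetical and exist only for illustration. It assumes that identifiers are the only non-operator expression and that whitespace between tokens is skipped. Because the operands of an operator can never themselves be operators, the loop has exactly one way to group the sequence, so the result is always flat.

```javascript
// Illustrative sketch only: a tiny hand-rolled parser, NOT tree-sitter output.
const SEP = '\\X'; // the two-character separator token: backslash followed by X

// Split the input into identifier and separator tokens, skipping whitespace.
function tokenize(input) {
  const tokens = [];
  // Sticky regex: the separator first, then the identifier pattern from the grammar.
  const pattern = /\s*(\\X|\w*[A-Za-z]\w*)/y;
  let pos = 0;
  while (pos < input.length) {
    pattern.lastIndex = pos;
    const match = pattern.exec(input);
    if (match === null) {
      if (input.slice(pos).trim() === '') break; // only trailing whitespace left
      throw new SyntaxError(`Unexpected input at offset ${pos}`);
    }
    tokens.push(match[1]);
    pos = pattern.lastIndex;
  }
  return tokens;
}

// _non_operator_expression: only identifiers in this sketch (an assumption).
function parseNonOperator(tokens, i) {
  const tok = tokens[i];
  if (tok === undefined || tok === SEP) {
    throw new SyntaxError(`Expected an identifier at token ${i}`);
  }
  return { node: '(identifier)', next: i + 1 };
}

// _expression: an operator if at least one separator follows the first
// operand, otherwise just the non-operator expression itself.
function parseExpression(tokens, i) {
  const first = parseNonOperator(tokens, i);
  const children = [first.node];
  let next = first.next;
  while (tokens[next] === SEP) {
    // Operands are always non-operator expressions, so no nesting can occur.
    const operand = parseNonOperator(tokens, next + 1);
    children.push(operand.node);
    next = operand.next;
  }
  const node = children.length === 1
    ? children[0]
    : `(operator ${children.join(' ')})`;
  return { node, next };
}

// source_file: a single expression that must consume every token.
function parseSourceFile(input) {
  const tokens = tokenize(input);
  const result = parseExpression(tokens, 0);
  if (result.next !== tokens.length) {
    throw new SyntaxError(`Unexpected token at position ${result.next}`);
  }
  return `(source_file ${result.node})`;
}

console.log(parseSourceFile('a \\X b \\X c \\X d'));
// prints: (source_file (operator (identifier) (identifier) (identifier) (identifier)))
```

A lone identifier such as a comes back as (source_file (identifier)), and a dangling separator such as a \X is rejected, which is the same shape the grammar rules above describe.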
Kenny Pitt