How to properly parse a multi-character token in a tree-sitter scanner function

Question

Tree-sitter allows you to use an external scanner for those tokens that are tricky to parse or that depend on specific states like multiline strings.

The scanner takes a convenient lexer object with several methods that allow you to "scan" the document looking for the proper token characters. Two of the key parts of this lexer are lookahead, which tells you the next character the lexer is "looking at", and advance, which moves the lexer pointer to the next character.

However, after reading the docs and several other parsers that make use of this, it's still not clear to me whether calling these methods will "affect" the overall tree-sitter parser or whether their effects are just local to my function invocation.

Especially tricky is trying to parse a multi-character token (more than 2 characters, in fact), because you need to "advance" the lexer, consuming the next characters, which may be part of other tokens. One possible escape is to just return false after consuming the characters and let tree-sitter go to the next step in the parsing, but that may skip other valid tokens that depend on the characters I already consumed. Of course, I can move this parsing to the bottom of the scan function, but then other, shorter tokens may shadow this longer one and also produce an incorrect parse.

As far as I know, there is no way to "rewind" the parser to undo the "consumption" of the characters, so I am not sure how to deal with this.

The tokens that I'm trying to parse are {js| for string opening and |js} for string closing.
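
For concreteness, here is a minimal sketch of the matching loop described above. It does not include tree-sitter's real header: MockLexer, mock_advance, mock_lexer_new and scan_literal are hypothetical stand-ins that only mimic the lookahead and advance members mentioned in the question, so the snippet can be compiled on its own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tree-sitter's TSLexer, reduced to the two
 * members discussed above. The real struct lives in tree_sitter/parser.h;
 * the input/pos fields are assumptions made only for this sketch. */
typedef struct MockLexer MockLexer;
struct MockLexer {
    int32_t lookahead;                        /* character the lexer is "looking at" */
    void (*advance)(MockLexer *, bool skip);  /* move to the next character */
    const char *input;                        /* sketch-only: backing buffer */
    size_t pos;                               /* sketch-only: current offset */
};

/* Moves one character forward; lookahead becomes 0 at end of input. */
static void mock_advance(MockLexer *lexer, bool skip) {
    (void)skip; /* the skip flag is ignored in this sketch */
    if (lexer->input[lexer->pos] != '\0') {
        lexer->pos++;
    }
    lexer->lookahead = (unsigned char)lexer->input[lexer->pos];
}

static MockLexer mock_lexer_new(const char *input) {
    MockLexer lexer = { (unsigned char)input[0], mock_advance, input, 0 };
    return lexer;
}

/* Tries to match `literal` one character at a time. On a mismatch it
 * returns false, but the characters matched so far have already been
 * advanced past, which is exactly the situation the question is about. */
static bool scan_literal(MockLexer *lexer, const char *literal) {
    for (const char *c = literal; *c != '\0'; c++) {
        if (lexer->lookahead != (unsigned char)*c) {
            return false;
        }
        lexer->advance(lexer, false);
    }
    return true;
}
```

Calling scan_literal on the input "{jsx" with the literal "{js|" returns false only after three characters have been advanced past, which is the partial consumption I am worried about. The real TSLexer also has a mark_end function, which the tree-sitter docs describe as fixing the end of the token so that later advance calls act only as lookahead; whether that covers the failed-match case is part of what I am unsure about.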
