make treesitter match classname

Question

I am using Tree-sitter to parse Clojure code. Specifically, I would like to distinguish between symbols, class names, and Java interop.

This is my grammar:

module.exports = grammar({
    name: 'clojure',
    extras: $ => [/[\s,]/],
    rules: {
        program: $ => repeat($._anything),
        _anything: $ => choice($.symbol, $.classname, $.member_access, $.new_class),

        symbol: $ => $._symbol_chars,

        classname: $ => prec.left(3, seq($._symbol_chars, repeat1($._classname_part ))),
        _classname_part: $ => prec.right(3, seq($._dot, $._symbol_chars)),

        member_access: $ => seq($._dot, $._class_chars),
        new_class: $ => prec(2, seq( choice($.symbol, $.classname), $._dot)),

        _dot: $ => /\.{1}/,
        _symbol_chars: $ =>   /[a-zA-Z\*\+\!\-_\?][\w\*\+\!\-\?\':]*/,
        _class_chars: $ => /[a-zA-Z_]\w*/
    }
})

I would expect

foo
java.lang.String
.toUpperCase
java.awt.Point.

to be parsed to

(program
    (symbol)
    (classname)
    (member_access)
    (new_class (classname)))

But Tree-sitter keeps producing (new_class (classname)) (classname) instead of (classname) for java.lang.String. I suppose that I need some kind of greedy matching, and I have tried prec.right() in different places to no avail. What am I missing?

score 2 · Accepted Answer · answered Feb 11 '20 at 10:14

I'm a tree-sitter newbie, so please take that into account when processing the following :)

extras contains whitespace for this grammar. IIUC, this means that unless a seq is wrapped in token, tree-sitter will allow whitespace to occur between the items of that seq.

For example, for:

seq($._dot, $._class_chars)

tree-sitter will accept $._dot and $._class_chars separated by whitespace as valid. But IIUC, in Clojure that is not necessarily equivalent to the case where they are not separated by whitespace.

It appears that token cannot be used everywhere though, so just putting it around the above sorts of uses of seq may not work. My guess is that, roughly, if all arguments to seq are tokens, token may be used around seq.
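To make the whitespace point concrete, here is a rough sketch in plain JavaScript. It is not tree-sitter itself: it assumes that a rule wrapped in token(seq(...)) behaves roughly like a single regular expression over contiguous characters, and the names TOKENS and classify are made up for illustration.

```javascript
// Plain-regex approximation of token(seq(...)) rules; this is NOT tree-sitter's lexer.
const JAVA_ID = '[a-zA-Z_]\\w*';

// Each pattern has to match the whole chunk of text with no whitespace inside,
// which is the effect token() is assumed to have on a seq.
const TOKENS = [
    ['new_class',     new RegExp(`^${JAVA_ID}(\\.${JAVA_ID})*\\.$`)],
    ['member_access', new RegExp(`^\\.${JAVA_ID}(\\.${JAVA_ID})*$`)],
    ['symbol',        new RegExp(`^${JAVA_ID}(\\.${JAVA_ID})*$`)],
];

// Hypothetical helper: name of the first rule whose pattern matches, or null.
function classify(text) {
    const hit = TOKENS.find(([, pattern]) => pattern.test(text));
    return hit ? hit[0] : null;
}

for (const input of ['foo', 'java.lang.String', '.toUpperCase', 'java.awt.Point.']) {
    console.log(input, '->', classify(input));
}

// With whitespace after the dot, no single token pattern matches.
console.log(JSON.stringify('. toUpperCase'), '->', classify('. toUpperCase'));
```

Under these assumptions, the trailing or leading dot decides the rule, and a dot followed by whitespace is no longer part of the same token. A real grammar would still need to be checked with tree-sitter-cli.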

Below is an example that appears to handle the 4 test cases provided. The exact parse trees differ from the ones you expected, but AFAICT one can still make the appropriate distinctions.

const JAVA_ID = /[a-zA-Z_]\w*/;

module.exports = grammar({

    name: 'clojure',

    extras: $ =>
        [/[\s,]/],

    rules: {
        program: $ =>
            repeat($._anything),

        _anything: $ =>
            choice($.symbol,
                   $.member_access,
                   $.new_class),

        symbol: $ =>
            choice($._symbol_chars,
                   $.scoped_identifier),

        // XXX: approximate, see: https://clojure.org/reader
        _symbol_chars: $ =>
            /[a-zA-Z\*\+\!\-_\?][\w\*\+\!\-\?\':]*/,

        // XXX: except $ can be used too for inner classes?
        scoped_identifier: $ =>
            token(seq(JAVA_ID,
                      repeat(seq('.', JAVA_ID)))),

        // e.g. .toUpperCase
        member_access: $ =>
            token(seq('.',
                      JAVA_ID,
                      repeat(seq('.', JAVA_ID)))),

        // e.g. java.lang.String.
        new_class: $ =>
            token(seq(JAVA_ID,
                      repeat(seq('.', JAVA_ID)),
                      '.')),

    }
});

function sep1 (rule, separator) {
    return seq(rule, repeat(seq(separator, rule)));
}
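The sep1 helper is defined but not used in the grammar above. As a sketch of how it could replace the repeated seq/repeat pattern, the snippet below builds the scoped_identifier rule both ways and compares them. The seq, repeat and token functions here are simple stand-ins written for this example so that it runs under plain Node; in a real grammar.js they are provided by tree-sitter-cli, and the object shapes used here are an assumption.

```javascript
const util = require('util');

// Stand-ins for the DSL functions (assumption: only for this sketch).
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = (content) => ({ type: 'REPEAT', content });
const token = (content) => ({ type: 'TOKEN', content });

const JAVA_ID = /[a-zA-Z_]\w*/;

// One or more `rule`, separated by `separator`.
function sep1(rule, separator) {
    return seq(rule, repeat(seq(separator, rule)));
}

// scoped_identifier written with the helper...
const viaHelper = token(sep1(JAVA_ID, '.'));

// ...and written out by hand, as in the grammar above.
const byHand = token(seq(JAVA_ID, repeat(seq('.', JAVA_ID))));

// Both constructions produce the same rule structure.
console.log(util.isDeepStrictEqual(viaHelper, byHand));
```

Because every argument passed to sep1 here is a terminal (a regex or a string), the result can still be wrapped in token.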

Note that the version of tree-sitter-cli may matter -- I used 0.16.4. When I tried your grammar, I didn't get the same output you did.

(The scoped_identifier bit was somewhat inspired by something of the same name from tree-sitter-java's grammar.)

(On a side note, ATM it appears that questions about tree-sitter are being fielded at the tree-sitter GitHub repository. Some issues there mention the possibility of other places for discussion, but I haven't seen anything come of that yet. You might get better answers there.)

What I had been missing was the use of `token`. This is the Answer I was looking for. Thank you! — Kolja, Feb 11 '20 at 12:38
