Matching arbitrary text (both symbols and spaces) with ANTLR?

Question

How do I match arbitrary text in ANTLR 4, that is, text which is unknown at the time the grammar is written?

My grammar is as follows:

grammar Anytext;

line :
    comment;

comment : '#' anytext;

anytext: ANY*;

WS : [ \t\r\n]+;

ANY : .;

And my code is as follows:

    String line = "# This_is_a_comment";
    ANTLRInputStream input = new ANTLRInputStream(line);
    AnytextLexer lexer = new AnytextLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    AnytextParser parser = new AnytextParser(tokens);
    ParseTree tree = parser.comment();
    System.out.println(tree.toStringTree(parser)); // print LISP-style tree

The output is as follows:

line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext   T h i s _ i s _ a _ c o m m e n t))

If I change the ANY rule to

ANY : [ \t\r\n.];

it stops recognizing any symbols at all.

UPDATE 1

There is no end-of-line character at the end of my input.

UPDATE 2

So, as I understand it, it is impossible to match arbitrary text in the lexer, because the lexer cannot assign more than one type to a token. If I define a lexer rule that matches any symbol, it will either hide all the other rules or not work at all.

But the question remains.

How do I match all symbols at the parser level, then?

Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write

infoline :
    ( codepoint WS 'field1' WS field1Value ) |
    ( codepoint WS 'field2' WS field2Value ) |
    ( codepoint WS anytext );

Here I parse a row if its second column contains field1 or field2, and ignore the row otherwise.

How can I accomplish this?

score 9 · Answer 1 · answered May 13 '13 at 14:03


It's important to remember that ANTLR breaks your complete input into tokens before the parser ever sees the first token (or at least it behaves as if it does). Your lexer grammar is effectively the following:

T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment

WS : [ \t\r\n]+;

ANY : .;

For your input, the tokens are the following:

# (type T__0)
[space] (type WS)
T (type ANY)
h (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
a (type ANY)
_ (type ANY)
c (type ANY)
o (type ANY)
m (type ANY)
m (type ANY)
e (type ANY)
n (type ANY)
t (type ANY)
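
To see why the space ends up as WS rather than ANY, here is a rough sketch in plain Java (no ANTLR dependency) that imitates the two choices behind the list above: the longest match wins, and on a tie the rule that appears first in the grammar wins. The class `LexerSketch` and its `tokenize` method are hypothetical names made up for illustration, and `java.util.regex` only approximates what the generated lexer does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: imitates how a lexer gives each token one type.
class LexerSketch {
    // Rules in grammar order; the implicit '#' token comes first, as in the listing above.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("T__0", Pattern.compile("#"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
        RULES.put("ANY", Pattern.compile(".", Pattern.DOTALL));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule listed earlier keeps the token.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            // ANY matches every character, so bestType is never null here.
            tokens.add(bestType + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("# Hi"));
    }
}
```

Calling `LexerSketch.tokenize("# Hi")` yields a T__0 token for `#`, a WS token for the space, and one ANY token per letter. The space matches both WS and ANY with the same length, and WS is listed first.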

Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:

// remember that '#' is its own token
anytext: (ANY | WS | '#')*;

What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):

grammar Anytext;

line : COMMENT;

COMMENT : '#' ~[\r\n]*;

WS : [ \t\r\n]+;

ANY : .;

answered May 13 '13 at 14:03

Sam Harwell


I don't understand why you wrote `[space] (type WS)`. From my point of view it is also `ANY`. Why not? – Suzan Cioc May 13 '13 at 17:16

@SuzanCioc ANTLR never assigns more than one type to a token. The space character matches the rule `WS` and `ANY`. To resolve the ambiguity, since `WS` appears before `ANY` in the grammar the token is assigned the `WS` type. The ambiguity is resolved and the token type assigned before the parser sees the token, so the parser will never see a space character token with the type `ANY`. – Sam Harwell May 13 '13 at 17:41
What about trees? Are they also prohibited in the lexer? What if I write `WS : [ \t\r\n]; ANY : WS | .;`? Will a space be marked with both `ANY` and `WS`? – Suzan Cioc May 13 '13 at 18:02
If this is true, then this is the answer: the lexer does not allow ambiguity or trees. – Suzan Cioc May 13 '13 at 18:04

score 1 · Answer 2 · edited May 11 '13 at 14:52


Use the following rule for line comments:

LINE_COMMENT
    :   '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

It matches '#' and every following character up to and including the end of the line (Unix or Windows line breaks).

Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:

LINE_COMMENT
    :   '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
    ;
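
Note that the rule above requires a trailing line terminator, so it cannot match a comment on the last line of input that lacks one (as in the question's test string). A minimal sketch of the difference, using `java.util.regex` as a stand-in for the lexer rules; the class `LineCommentSketch` and its method are hypothetical, and a regular expression is only an approximation of what ANTLR does:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for the two lexer rules.
class LineCommentSketch {
    // Rough analogue of: '#' ~[\r\n]* '\r'? '\n'
    static final Pattern WITH_TERMINATOR = Pattern.compile("#[^\\r\\n]*\\r?\\n");
    // Rough analogue of: '#' ~[\r\n]*
    static final Pattern WITHOUT_TERMINATOR = Pattern.compile("#[^\\r\\n]*");

    // True if the pattern matches the entire string.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String lastLine = "# This_is_a_comment"; // no line break at the end
        System.out.println(matchesWhole(WITH_TERMINATOR, lastLine));
        System.out.println(matchesWhole(WITHOUT_TERMINATOR, lastLine));
    }
}
```

With the terminator left out of the pattern, the line break stays in the input for a whitespace rule such as WS to consume.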

edited May 11 '13 at 14:52

Sam Harwell


answered May 11 '13 at 14:03

hoaz


I edited your post to give exactly the same rule in ANTLR 4 syntax. On a separate note, I recommend *not* including the `'\r'? '\n'` line terminator as part of the `LINE_COMMENT` rule itself (make it consume characters up to, but not including the end of line). There are a few reasons I recommend this, but the biggest is the fact that in the current form `LINE_COMMENT` will not match a comment on the last line of a file if it's not followed by an explicit line terminator. – Sam Harwell May 11 '13 at 14:54
Why is it so complex? Is it possible to write it more simply? Why does my rule not work? – Suzan Cioc May 11 '13 at 19:09
@280Z28 can you provide an answer done your way, without including the end-of-line characters? – Suzan Cioc May 11 '13 at 19:14

When you use a `.*` rule, it "eats" line breaks and thus matches everything to the end of the stream. Use the following if you do not want to include the end-of-line characters: `LINE_COMMENT: '#' ~[\r\n]*;` – hoaz May 11 '13 at 22:24
@hoaz I have no line break characters at the end, see the code. I am parsing string variable. – Suzan Cioc May 12 '13 at 21:52
@hoaz do you mean it is impossible to match any symbol except by negative class? What is wrong with `[ \t\r\n.]`? Will just `.` match spaces? – Suzan Cioc May 13 '13 at 07:44

You do not need to mix `\t\r\n` and `.` because `.` matches everything anyway. If you want everything after pound use this: `LINE_COMMENT: '#' .*;` – hoaz May 13 '13 at 11:22
