ANTLR rules to match unquoted or quoted multiline string

Question

I would like my grammar to be able to match either a single line string assignment terminated by a newline (\r\n or \n), possibly with a comment at the end, or a multiline assignment, denoted by double quotes. So for example:

key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes 
you can also do 
multiline"

Is that doable? I've been bashing my head on this, and got everything working except the multiline. It seems so simple, but the rules simply won't match appropriately.

Add on: this is just a part of a bigger grammar.

Given this is just a small part of the language, it would help if you add your complete grammar and some real examples of the input you're trying to parse. — Bart Kiers, Jun 16 '20 at 18:58
A concept of what I'm trying to do is here https://tbeernot.wordpress.com/2020/06/12/fcl/ but it will change as it progresses. The last committed version of the grammar is here https://bitbucket.org/tbee/tecl/src/master/tecl/src/main/antlr4/org/tbee/tecl/antlr/TECL.g4 — tbeernot, Jun 16 '20 at 19:51
If a solution does not fall into place easily enough, I simply will scrap the unquoted version and only aim for quoted multiline strings. But I figured I'd better ask first. — tbeernot, Jun 16 '20 at 20:53
This was asked and answered years ago. I can't find it now, but Sam Harwell presented a nice solution to this some time back, if memory serves. Combination of two lexer tokens as I recall, an "unterminated string" token and a "string termination" token... — TomServo, Jun 17 '20 at 22:21
That is the approach I attempted as well. But somehow terminating at newline or at string-quote keeps conflicting. I'll google a bit. — tbeernot, Jun 18 '20 at 06:02

Bart Kiers · Accepted Answer · 2020-06-17T05:53:11.137

Looking at your example input:

# This is the most simple configuration
title = "FML rulez"

# We use ISO notations only, so no local styles
releaseDateTime = 2020-09-12T06:34

# Multiline strings
description = "So,
I'm curious 
where this will end."

# Shorcut string; no quotes are needed in a simple property style assignment
# Or if a string is just one word. These strings are trimmed.
protocol = http

# Conditions allow for overriding, best match wins (most conditions)
# If multiple condition sets equally match, the first one will win.
title[env=production] = "One config file to rule them all"
title[env=production & os=osx] = "Even on Mac"

# Lists
hosts = [alpha, beta]

# Hierarchy is implemented using groups denoted by curly brackets
database {

    # indenting is allowed and encouraged, but has no semantic meaning
    url = jdbc://...
    user = "admin"

    # Strings support default encryption with a external key file, like maven
    password = "FGFGGHDRG#$BRTHT%G%GFGHFH%twercgfg"

    # groups can nest
    dialect {
        database = postgres
    }
}

servers {
    # This is a table:
    # - the first row is a header, containing the id's
    # - the remaining rows are values
    | name     | datacenter | maxSessions | settings                    |
    | alpha    | A          | 12          |                             |
    | beta     | XYZ        | 24          |                             |
    | "sys 2"  | B          | 6           |                             |
    # you can have sub blocks, which are id-less groups (id is the column)
    | gamma    | C          | 12          | {breaker:true, timeout: 15} |
    # or you reference to another block
    | tango    | D          | 24          | $environment                |
}

# environments can be easily done using conditions
environment[env=development] {
    datasource = tst
}
environment[env=production] {
    datesource = prd
}

I'd go for something like this:

grammar TECL;

input_file
 : configs EOF
 ;

configs
 : NL* ( config ( NL+ config )* NL* )?
 ;

config
 : property
 | group
 | table
 ;

property
 : WORD conditions? ASSIGN value
 ;

group
 : WORD conditions? NL* OBRACE configs CBRACE
 ;

conditions
 : OBRACK property ( AMP property )* CBRACK
 ;

table
 : row ( NL+ row )*
 ;

row
 : PIPE ( col_value PIPE )+
 ;

col_value
 : ~( PIPE | NL )*
 ;

value
 : WORD
 | VARIABLE
 | string
 | list
 ;

string
 : STRING
 | WORD+
 ;

list
 : OBRACK ( value ( COMMA value )* )? CBRACK
 ;

ASSIGN : '=';
OBRACK : '[';
CBRACK : ']';
OBRACE : '{';
CBRACE : '}';
COMMA  : ',';
PIPE   : '|';
AMP    : '&';

VARIABLE
 : '$' WORD
 ;

NL
 : [\r\n]+
 ;

STRING
 : '"' ( ~[\\"] | '\\' . )* '"'
 ;

WORD
 : ~[ \t\r\n[\]{}=,|&]+
 ;

COMMENT
 : '#' ~[\r\n]* -> skip
 ;

SPACES
 : [ \t]+ -> skip
 ;

which will parse the example into the following parse tree:

[Image: parse tree of the example configuration]

And the input:

key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes 
you can also do 
multiline"

into the following:

[Image: parse tree of the key/value input]

For now: multiline quoted works, spaces in unquoted string not.

As you can see in the tree above, it does work. I suspect you used part of the grammar in your existing one and that doesn't work.
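
The words of an unquoted value are matched, but the spaces between them are gone: SPACES is skipped in the lexer and the string rule collects separate WORD tokens, so the text of the string context is the words concatenated. One possible variation, offered as a sketch and not as part of the grammar above, is to send the spaces to the hidden channel so that the tokens still exist in the token stream:

```antlr
// Variation on the SPACES rule (an assumption, not the rule used in this answer).
// Hidden-channel tokens are ignored by the parser but stay in the token stream.
SPACES
 : [ \t]+ -> channel(HIDDEN)
 ;
```

The parse tree stays the same with this change. In the Java target, asking the token stream for the text of a context (for example tokens.getText(ctx) on a CommonTokenStream) should then return the original text with its inner spaces. Spaces after the last WORD are not included, because the context ends at that token.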

[...] and am in the process of inserting the actions.

I would not embed actions (target code) inside your grammar: it makes the grammar hard to read and harder to change. And of course, your grammar will then only work for one target language. Better to use a listener or visitor instead of these actions.
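
To make the listener or visitor route more concrete, here is a minimal sketch that labels the alternatives of the value rule. The label names are made up for illustration. ANTLR generates one enter/exit pair (listener) and one visit method (visitor) per label, so the target code can live outside the grammar:

```antlr
// Same alternatives as the value rule above, with hypothetical labels.
// If one alternative of a rule is labeled, all of them must be.
value
 : WORD      # WordValue
 | VARIABLE  # VariableValue
 | string    # StringValue
 | list      # ListValue
 ;
```

With these labels the generated listener would get methods such as enterWordValue and exitWordValue, and the visitor visitWordValue, which is where the code that would otherwise be an embedded action could go.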

Good luck!

No, this indeed is just one aspect of the language, the most basic even. There also are groups, tables and more. Those I've got working, it's this thing that is wreaking havoc on my plans :-) I was trying the approach of "assignment followed by a quote: match until the next quote" (yes, issues with escaped quotes) and "assignment followed by not-a-quote: match until NEWLINE or hash". I can always use Java to clean up the matched value. Could lexer modes help there? — tbeernot, Jun 16 '20 at 18:48
Thank you. I've replaced my grammar with yours (because it is more complete already) and am in the process of inserting the actions. It is a hobby project, so I'll get back to you once that is done and have feedback. For now: multiline quoted works, spaces in unquoted string not. But that is an okay change to make. — tbeernot, Jun 17 '20 at 05:02
You're welcome. Note that my example does work with unquoted strings (I added an example of this). — Bart Kiers, Jun 17 '20 at 05:53
Unquoted strings lose their spaces according to my unit tests. But that would be an okay price to pay; want spaces? Use quoted. And indeed I'm on the fence between using a listener or simple call outs. Because in a previous ANTLR implementation the listener turned out to be not much more than forwarding to the call out methods I'm now calling directly. It was just an additional layer. So yes, I agree. — tbeernot, Jun 17 '20 at 06:50
Yes, they do lose their spaces. If the grammar will only be used by 1 target language, then it's simply a case of personal preference (listener/visitor or embedded actions). I'd still go for a separate listener or visitor, but I understand the decision to go for actions. Anyway, good luck with your project! — Bart Kiers, Jun 17 '20 at 07:43
Everything parses so far :-) Not the way I hoped, but it was getting obvious my hopes were set a bit too high, and concessions had to be made. I did include the debug Java-code-in-bash-script that you wrote on medium. That actually helps a lot in finding problems! — tbeernot, Jun 18 '20 at 05:59
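
The lexer-mode idea raised in the comments ("assignment followed by a quote: match until the next quote", otherwise "match until NEWLINE or hash") could look roughly like the sketch below. It is not taken from the answer and covers only the plain key = value lines from the question. The grammar name, KEY and the VALUE_* rule names are hypothetical. Modes are only available in a lexer grammar, so the parser would live in a separate grammar that imports these tokens via the tokenVocab option.

```antlr
lexer grammar AssignmentLexer;       // hypothetical name

// Default mode: keys, structure, comments.
KEY     : [a-zA-Z_] [a-zA-Z0-9_]* ;  // hypothetical key rule
ASSIGN  : '=' -> pushMode(VALUE) ;   // whatever follows '=' is lexed in VALUE mode
NL      : '\r'? '\n' ;
COMMENT : '#' ~[\r\n]* -> skip ;
SPACES  : [ \t]+ -> skip ;

mode VALUE;

// Leading spaces after '=' are dropped.
VALUE_SPACES  : [ \t]+ -> skip ;
// Quoted value: may span lines; a backslash escapes the next character.
QUOTED        : '"' ( ~[\\"] | '\\' . )* '"' -> popMode ;
// Unquoted value: runs up to the newline or '#', inner spaces are kept.
UNQUOTED      : ~[ \t\r\n#"] ~[\r\n#]* -> popMode ;
// No value at all: skip a comment, hand the newline back as a normal NL.
VALUE_COMMENT : '#' ~[\r\n]* -> skip ;
VALUE_NL      : '\r'? '\n' -> type(NL), popMode ;

// Matching parser rule (in the separate parser grammar), as an assumption:
//   property : KEY ASSIGN ( QUOTED | UNQUOTED )? ;
```

Two caveats apply. An UNQUOTED token can still carry trailing spaces before a comment, so the value would need trimming in target code, as the comment about cleaning up in Java anticipates. And in the full grammar '=' also occurs inside conditions such as [env=production], and a value can be a list, so pushing VALUE mode on every '=' would swallow those; the mode switch would have to be made more selective there.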
