How do I figure out ANTLR grammar failure to parse input, special timestamp in logfile

Question

INPUT:

Mar 9 10:19:07 west info tmm1[17280]: 01870003:6:
/Common/mysaml.app/mysaml:Common:00000000: helloasdfasdf asdfadf vgnfg

GRAMMAR:

grammar scratch;
lines :       datestamp hostname level proc msgnum  module msgstring;
datestamp:    month day time;
//month :       MONTH;
day  :        INTEGER;
time :        INTEGER ':' INTEGER ':' INTEGER;
hostname :    STRING;
level :       ALPHA;
proc:         procname '[' procnum ']' ':';
procname :    STRING;
procnum :     INTEGER;
msgnum :      INTEGER ':' DIGIT':';
module :      '/' DOTSLASHSTRING ':' PARTITION ':' SESSID ':';
PARTITION:     STRING;
sessid :      HEX;
msgstring:      MSGSTRING;
DOTSLASHSTRING : [a-zA-Z./]+;
SESSID :      HEX;
INTEGER :     [0-9]+;
DIGIT:        [0-9];
STRING :      [a-zA-Z][a-zA-Z0-9]*;
HEX :         [a-f0-9]+;
//ALPHA:        [a-zA-Z]+;
ALPHA:         ('['|'(') .*? (']'|')');
MSGSTRING :   [a-zA-Z0-9':,_(). ]+ [\r];
 //         |   'Agent' MSGSTRING;
month : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
WS :          [ \t\r\n]+ -> skip;

PROBLEM: The parse tree shows that the month is populated properly, but the next item, day, is not. In the parse tree, day appears to be set to the entire rest of the input. I don't see how this is possible.

Error from parser is:

line 1:4 mismatched input '9' expecting INTEGER

[Screenshot: parse tree for the input above]

When tokens appear in red, that doesn't mean that they have been matched by the rule that's shown as their parent. It means that they've been discarded while trying to match the parent. So it's not that `day` has been set to the rest of the input, but rather the entire rest of the input has been discarded while trying to find something with which to populate `day`. — sepp2k, Mar 06 '19 at 21:07
You should print out the tokens that are generated. I'm guessing that the 9 (as well as the other numbers) simply isn't tokenized as an INTEGER and thus the rule does not match. From a glance at your lexer rules, it looks like it might be tokenized as a SESSID instead. — sepp2k, Mar 06 '19 at 21:08
Thanks for your reply. I really appreciate the help. The lexer tokens show... 1. T__0=1 T__1=2 T__2=3 T__3=4 T__4=5 T__5=6 T__6=7 T__7=8 T__8=9 T__9=10 T__10=11 T__11=12 T__12=13 T__13=14 T__14=15 T__15=16 PARTITION=17 DOTSLASHSTRING=18 SESSID=19 INTEGER=20 DIGIT=21 STRING=22 HEX=23 ALPHA=24 MSGSTRING=25 WS=26 ':'=1 '['=2 ']'=3 '/'=4 'Jan'=5 'Feb'=6 'Mar'=7 'Apr'=8 'May'=9 'Jun'=10 'Jul'=11 'Aug'=12 'Sep'=13 'Oct'=14 'Nov'=15 'Dec'=16 — Joel D, Mar 06 '19 at 23:15
I'm not sure what the format of the lexer tokens is, but I would assume that, as a first pass, tokenization happens (with some form of FSM), and the token numbers are then used to drive the parser (for the terminals). I'm not sure what the `T__` prefix means. — Joel D, Mar 06 '19 at 23:25
Sorry, I meant the tokens that are produced from your input, not the token definitions in the generated code. You can display those by running `grun GrammarName tokens -tokens` or by iterating over the `TokenStream` in your Java code. — sepp2k, Mar 07 '19 at 07:50
The input in the question spans two lines, while the screenshot has one line. It might be better to use source code formatting for the input (like you did for the grammar) to make it clear whether it is in fact a single line or two lines, since source code formatting does not wrap lines. — Jiri Tousek, Mar 07 '19 at 08:25

Jiri Tousek · Answer 1 · 2019-03-07T08:23:55.563

The parser (i.e. rules starting with a lowercase letter) and the lexer (rules starting with an uppercase letter) behave differently:

- The parser knows which token it expects and tries to match it (except when it has multiple alternatives; then it looks ahead at the next tokens to decide which alternative to choose).
- The lexer, however, knows nothing about the parser rules. It matches whatever it can match at the current input position. When multiple lexer rules can match a prefix of the input:
  - It will match (and emit a token for) the rule that matches the longest sequence.
  - If multiple rules can match the same longest sequence, the rule earlier in the file (closer to the top) wins.

So your input would most likely be tokenized as follows*:

MONTH          Mar
(WS)
SESSID         9  - SESSID matches and is higher up than INTEGER
(WS)
SESSID         10
':'            :
SESSID         19
':'            :
SESSID         07
(WS)
PARTITION      west  - same as STRING but higher up - STRING will never be matched
(WS)
PARTITION      info
(WS)
PARTITION      tmm1
ALPHA          [17280]  - matches longer sequence than just '[' in rule "proc"
':'            :
(WS)
SESSID         01870003
':'            :
SESSID         6
':'            :
(WS)
DOTSLASHSTRING /Common/mysaml.app/mysaml  - longer than just '/' in rule "module"
MSGSTRING      :Common:00000000: helloasdfasdf asdfadf vgnfg  - the rest can be matched to this rule

As you can see, these are quite different tokens than your parser expects.
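
As a rough illustration, the two lexer rules described above (longest match first, then the earliest rule in the file) can be simulated in Python with regular expressions. This is only a sketch, not ANTLR itself: the `RULES` table is a hand translation of the lexer rules from the question, `tokenize` is a hypothetical helper, only the first part of the input line is used, and placing the implicit literal tokens (`':'`, `'Mar'`, ...) ahead of the named rules is an assumption based on the `T__n` numbering shown in the comments.

```python
import re

# Hand-translated lexer rules from the question, in priority order.
# Assumption: literals used in parser rules (':' , '[' , 'Mar', ...) become
# implicit tokens that are ordered before the explicitly named lexer rules.
RULES = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in [
    ("':'", r":"),
    ("'['", r"\["),
    ("']'", r"\]"),
    ("'/'", r"/"),
    ("'Mar'", r"Mar"),
    ("PARTITION", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("DOTSLASHSTRING", r"[a-zA-Z./]+"),
    ("SESSID", r"[a-f0-9]+"),
    ("INTEGER", r"[0-9]+"),
    ("DIGIT", r"[0-9]"),
    ("STRING", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("HEX", r"[a-f0-9]+"),
    ("ALPHA", r"[\[(].*?[\])]"),
    ("MSGSTRING", r"[a-zA-Z0-9':,_(). ]+\r"),
    ("WS", r"[ \t\r\n]+"),
]]


def tokenize(text):
    """Return (rule_name, matched_text) pairs, skipping WS tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer matches replace the current best, so on a tie
            # the rule that appears earlier in RULES keeps the token.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for name, value in tokenize("Mar 9 10:19:07 west info tmm1[17280]:"):
        print(f"{name:<10} {value}")
```

In this simulation the `9` comes out as `SESSID` and never as `INTEGER`, and `[17280]` comes out as a single `ALPHA` token, which is consistent with the parser error `mismatched input '9' expecting INTEGER`. The month is reported as the literal token `'Mar'` here, where the listing above calls it MONTH.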

The bottom line is that you have too much logic in your lexer rules: you tried to put semantic meaning into the lexer, and it is not suited to that task. If a single input sequence can mean different things (for example, 123 might be an integer, a hex number or a session ID), that distinction needs to go into the parser, since it can only be decided from context (where in the sentence the sequence occurred) and not from the content of 123 itself. Similarly, if [17280] can be either an ALPHA (whatever that is) or an INTEGER in brackets, that decision needs to go into the parser, because it cannot be made solely by looking at [17280]. At the moment the lexer makes that decision, because of the ALPHA rule.
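
To make the idea concrete, here is a minimal Python sketch (again a simulation, not ANTLR) in which the lexer only classifies text by shape, and the meaning of each token is decided by its position in the line. The token kinds `NUMBER`, `WORD` and `PUNCT` and the functions `lex` and `parse_header` are hypothetical names chosen for this example, and only the header part of the log line (up to the process ID) is handled.

```python
import re

# The lexer knows only about shapes; the alternatives cannot match the
# same first character, so there is no ambiguity to resolve.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>[0-9]+)"
    r"|(?P<WORD>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<PUNCT>[\[\]:/.])"
    r"|(?P<WS>\s+)"
)


def lex(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_header(text):
    """Parse 'month day hh:mm:ss host level proc[pid]:' by position."""
    tokens = lex(text)
    pos = 0

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"expected {value or kind}, got end of input")
        found_kind, found_value = tokens[pos]
        if found_kind != kind or (value is not None and found_value != value):
            raise SyntaxError(f"expected {value or kind}, got {found_value!r}")
        pos += 1
        return found_value

    # The same NUMBER token kind is a day, a time field or a process ID,
    # depending only on where it appears.
    month = expect("WORD")
    day = int(expect("NUMBER"))
    hour = expect("NUMBER")
    expect("PUNCT", ":")
    minute = expect("NUMBER")
    expect("PUNCT", ":")
    second = expect("NUMBER")
    hostname = expect("WORD")
    level = expect("WORD")
    procname = expect("WORD")
    expect("PUNCT", "[")
    procnum = int(expect("NUMBER"))
    expect("PUNCT", "]")
    expect("PUNCT", ":")
    return {
        "month": month, "day": day, "time": f"{hour}:{minute}:{second}",
        "hostname": hostname, "level": level,
        "procname": procname, "procnum": procnum,
    }


if __name__ == "__main__":
    print(parse_header("Mar 9 10:19:07 west info tmm1[17280]:"))
```

The equivalent change in the grammar would be to keep a small set of non-overlapping lexer rules and to express day, time, procnum and so on as parser rules built from them.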

* The likely tokenization is based on the input from your screenshot, which is all on one line, while the input in the question itself spans two lines. I'm not sure whether that is intentional or a result of line wrapping.
