
I'm trying to build a tool like ANTLR from scratch in Swift (just for fun), but I don't understand how a grammar knows that no whitespace is allowed in some places (identifier example: "_myIdentifier123"):

Identifier
 : Identifier_head Identifier_characters?

And that whitespace is expected in others (example: "is String"):

type_casting_operator
  : 'is' type
  | 'as' type
  | 'as' '?' type
  | 'as' '!' type
  ;

I searched for WS in ANTLR's source code but found nothing; there is no "WS" string in the Java code: https://github.com/antlr/antlr4

Can anyone explain the algorithm behind this? How does it decide whether tokens are separated by whitespace or not?

artyom.razinov
  • Do you know about this project: https://github.com/janyou/ANTLR-Swift-Target? The Swift target is also discussed here: https://github.com/antlr/antlr4/issues/945 – Ivan Kochurkin Mar 16 '16 at 12:59

2 Answers


Good luck with that project. Without knowledge of even the most basic algorithms, the non-trivial task of creating a parser generator becomes even more ambitious. You should at least read a book or two on the subject (a classic is the Dragon Book by Aho, Sethi, and Ullman).

But back to your question. The principle is this: whitespace must be handled like any other input, but usually the grammar contains a WS or Whitespace lexer rule that matches the various kinds of whitespace (spaces, line breaks, tabs, etc.) and puts them on a hidden channel. The parser only sees tokens from the default channel and hence never receives whitespace as tokens. This is the most common approach because the presence of whitespace usually doesn't matter (except for separating two lexical entries that need to be recognized as two different tokens).
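The hidden-channel idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ANTLR is implemented: the rule table, the channel numbers, and the names `lex` and `visible` are all made up for this example, and keywords such as `is` are lexed as plain identifiers to keep it short.

```python
import re
from typing import NamedTuple

DEFAULT, HIDDEN = 0, 1  # channel numbers chosen for this sketch


class Token(NamedTuple):
    kind: str
    text: str
    channel: int


# Ordered lexer rules as (kind, pattern, channel). The WS entry plays the role
# of an ANTLR rule such as:  WS : [ \t\r\n]+ -> channel(HIDDEN) ;
RULES = [
    ("WS", r"[ \t\r\n]+", HIDDEN),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*", DEFAULT),
    ("PUNCT", r"[?!]", DEFAULT),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern, _ in RULES))
CHANNEL = {kind: channel for kind, _, channel in RULES}


def lex(source):
    """Turn every character of the input into a token, whitespace included."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup
        tokens.append(Token(kind, match.group(), CHANNEL[kind]))
        pos = match.end()
    return tokens


def visible(tokens):
    """Return the token texts a parser would see: the default channel only."""
    return [token.text for token in tokens if token.channel == DEFAULT]


print(visible(lex("is String")))         # two tokens, the WS token is hidden
print(visible(lex("isString")))          # one identifier token
print(visible(lex("_myIdentifier123")))  # one identifier token
```

The lexer still produces a WS token for "is String"; it is simply tagged so that the parser-facing view filters it out. Without the space, the identifier pattern consumes "isString" as a single token, which is why whitespace matters only as a separator.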

Mike Lischke
  • That book is useful for my task; I've read the first three chapters, which is enough for lexical analysis (so I don't need to read the other chapters now). However, the book doesn't give a complete view of lexical analysis, the different parsing algorithms, or the architecture of lexical-analysis software for parsing complicated languages. – artyom.razinov Mar 13 '16 at 18:16
  • And back to the question: I understand that there are two types of production rules, lexer rules and parser rules. If a rule consists of a sequence of terminals and nonterminals, then in a lexer rule no whitespace is allowed between them, and in a parser rule they can be separated by zero or more whitespace characters. Am I right? Is that the complete algorithm for handling whitespace? – artyom.razinov Mar 13 '16 at 18:19
  • Read my answer again. It has nothing to do with being a lexer or parser rule. Whitespace must be tokenized like any other input, and you can decide what to do with a whitespace token. If you put it on a hidden channel the parser won't see it; otherwise you have to handle it like any other token in your parser grammar. Btw: production rules only concern a parser. A lexer has no productions, it's simply a tokenizer. – Mike Lischke Mar 13 '16 at 22:19

The first rule is a lexer rule (note the capital first letter), while the second rule is a parser rule.

The white space token typically is not passed to the parser (in this case there must be a rule to skip white space in the lexer), so the second rule does not see it. Whitespace can appear anywhere between other tokens.

Lexer rules, in contrast, see every character of the input, so any whitespace must be matched explicitly.

Henry
  • So if a parser rule consists of, for example, 3 subrules (a sequence of 3 terminals or nonterminals), then I should expect 0 or more whitespace characters or comments between each of them? Is that the complete algorithm, with no exceptions? – artyom.razinov Mar 13 '16 at 18:22
  • Yes, if you define white space and comments to be skipped, they are completely invisible to the parser and can appear between any two tokens. – Henry Mar 13 '16 at 19:16
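To make the last point concrete, here is a small Python sketch of a hand-written parser for the `type_casting_operator` rule from the question. It is an assumption-laden illustration rather than ANTLR's algorithm: `tokenize` and `parse_type_casting_operator` are hypothetical names, `type` is simplified to a single identifier, and the comment syntax (`//` and `/* */`) is assumed. The tokenizer drops whitespace and comments, so the parser function contains no whitespace handling at all.

```python
import re

# Hypothetical token pattern: the SKIP alternative is matched and then dropped.
TOKEN = re.compile(r"""
    (?P<SKIP>\s+|//[^\n]*|/\*.*?\*/)     # whitespace, line and block comments
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)    # identifiers and keywords
  | (?P<PUNCT>[?!])
""", re.VERBOSE | re.DOTALL)

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Return only the tokens the parser should see."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.group())
        pos = match.end()
    return tokens


def parse_type_casting_operator(tokens):
    """type_casting_operator : 'is' type | 'as' type | 'as' '?' type | 'as' '!' type ;"""
    if not tokens or tokens[0] not in ("is", "as"):
        raise SyntaxError("expected 'is' or 'as'")
    operator = tokens[0]
    rest = tokens[1:]
    if operator == "as" and rest and rest[0] in ("?", "!"):
        operator += rest[0]
        rest = rest[1:]
    # Simplification for this sketch: a type is exactly one identifier token.
    if len(rest) != 1 or not IDENT.fullmatch(rest[0]):
        raise SyntaxError("expected a type name")
    return (operator, rest[0])


# All three spellings reach the parser as the same token sequence.
print(parse_type_casting_operator(tokenize("as?String")))
print(parse_type_casting_operator(tokenize("as ? String")))
print(parse_type_casting_operator(tokenize("as /* cast */ ?\n String // done")))
```

Because skipping happens before the parser runs, whitespace and comments may appear between any two tokens, or not at all where the token boundaries are already unambiguous. An input like "isString" fails here only because the tokenizer produces one identifier instead of the keyword followed by a type.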