I'm confused by the Java spec about how this code should be tokenized:

ArrayList<ArrayList<Integer>> i;

The spec says:

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

As I understand it, applying the "longest match" rule would result in the tokens:

  • ArrayList
  • <
  • ArrayList
  • <
  • Integer
  • >>
  • i
  • ;

which would not parse. But of course this code is parsed just fine.
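To make the concern concrete, here is a minimal sketch of a longest-match ("maximal munch") lexer for a tiny subset of Java tokens. The class name `LongestMatchLexer` and its operator table are hypothetical, invented for illustration; this is not how javac's scanner is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical maximal-munch lexer covering only identifiers and a few operators.
class LongestMatchLexer {
    // Operators are listed longest first, so the first hit is the longest match.
    private static final String[] OPERATORS = {">>>=", ">>>", ">>=", ">>", ">=", ">", "<", ";"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++; // whitespace only separates tokens
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = pos;
                while (pos < input.length() && Character.isJavaIdentifierPart(input.charAt(pos))) {
                    pos++;
                }
                tokens.add(input.substring(start, pos));
            } else {
                String op = matchOperator(input, pos);
                if (op == null) {
                    throw new IllegalArgumentException("Unexpected character at index " + pos);
                }
                tokens.add(op);
                pos += op.length();
            }
        }
        return tokens;
    }

    // Returns the longest operator that starts at pos, or null if none matches.
    private static String matchOperator(String input, int pos) {
        for (String op : OPERATORS) {
            if (input.startsWith(op, pos)) {
                return op;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints [ArrayList, <, ArrayList, <, Integer, >>, i, ;]
        System.out.println(tokenize("ArrayList<ArrayList<Integer>> i;"));
    }
}
```

The two closing brackets come out as a single `>>` token, which is exactly the token list above.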

What is the correct specification for this case?

Does it mean that a correct lexer must be context-free? It doesn't seem possible with a regular lexer.

Matt Fenwick
  • Related: http://stackoverflow.com/questions/2623966/java-syntax-of/2624125#comment2646754_2624125 – Matt Fenwick May 28 '13 at 23:37
  • I assume you meant `i` instead of `1` in your list of tokens. – rgettman May 28 '13 at 23:42
  • Maybe you can submit a bug. – johnchen902 May 29 '13 at 00:15
  • @johnchen902 is it a bug? I hadn't considered that possibility. I don't really think it is, though. – Matt Fenwick May 29 '13 at 00:57
  • I don't think it's a bug. Maybe a bug in the documentation. You can tell how generics are parsed from the following code: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/com/sun/tools/javac/parser/JavacParser.java#JavacParser.typeArguments%28boolean%29 – MartinTeeVarga May 29 '13 at 01:02
  • @smartus I see, their approach makes sense. Do you mind if I write that up as an answer or would you prefer to do it? – Matt Fenwick May 29 '13 at 01:12
  • To be honest I don't know how to write the answer properly. I understand the code and I remember a few things about grammars from Uni, but I'd prefer a person who actually understands the whole picture to write the answer. You do it ;) – MartinTeeVarga May 29 '13 at 01:17
  • It has been a well-known problem in C++: http://stackoverflow.com/a/71706/2158288. No doubt the language designers are very aware of this problem, but they don't seem to be concerned enough to bring it up in the spec. – ZhongYu May 29 '13 at 01:33
  • At the beginning of generic type parameters in Java, one had to insert a space, as in `> >`. – Joop Eggen Feb 05 '20 at 15:56

2 Answers

Based on reading the code linked by @sm4, it looks like the strategy is:

  • tokenize the input normally. So A<B<C>> i; would be tokenized as A, <, B, <, C, >>, i, ; -- 8 tokens, not 9.

  • during hierarchical parsing, when working on parsing generics and a > is needed, if the next token starts with > -- >>, >>>, >=, >>=, or >>>= -- just knock the > off and push a shortened token back onto the token stream. Example: when the parser gets to >>, i, ; while working on the typeArguments rule, it successfully parses typeArguments, and the remaining token stream is now the slightly different >, i, ;, since the first > of >> was pulled off to match typeArguments.

So although tokenization does happen normally, some re-tokenization occurs in the hierarchical parsing phase, if necessary.
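The token-splitting step can be sketched as follows. This is a simplified illustration under stated assumptions, not javac's actual code: `SplittingTypeParser`, its toy grammar (`Type := Identifier [ '<' Type { ',' Type } '>' ]`), and the bracketed output format are all hypothetical, and the input is assumed to be already tokenized by a longest-match lexer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical recursive-descent parser that splits '>'-prefixed tokens on demand.
class SplittingTypeParser {
    private final Deque<String> tokens;

    SplittingTypeParser(List<String> tokens) {
        this.tokens = new ArrayDeque<>(tokens);
    }

    // Parses one type and returns a bracketed rendering of its structure.
    // Assumes the next token is an identifier; a real parser would validate it.
    String parseType() {
        String name = tokens.removeFirst();
        if (!"<".equals(tokens.peekFirst())) {
            return name; // no type arguments
        }
        tokens.removeFirst(); // consume '<'
        StringBuilder sb = new StringBuilder(name).append('[');
        sb.append(parseType());
        while (",".equals(tokens.peekFirst())) {
            tokens.removeFirst(); // consume ','
            sb.append(", ").append(parseType());
        }
        expectClosingAngle();
        return sb.append(']').toString();
    }

    // Accepts any token that starts with '>' (>, >>, >>>, >=, >>=, >>>=),
    // consumes only its first character, and pushes the rest back.
    private void expectClosingAngle() {
        String next = tokens.pollFirst();
        if (next == null || !next.startsWith(">")) {
            throw new IllegalStateException("Expected '>' but found " + next);
        }
        if (next.length() > 1) {
            tokens.addFirst(next.substring(1));
        }
    }

    // Tokens not yet consumed, in order.
    List<String> remaining() {
        return new ArrayList<>(tokens);
    }

    public static void main(String[] args) {
        SplittingTypeParser parser =
                new SplittingTypeParser(Arrays.asList("A", "<", "B", "<", "C", ">>", "i", ";"));
        System.out.println(parser.parseType());  // prints A[B[C]]
        System.out.println(parser.remaining());  // prints [i, ;]
    }
}
```

When the inner type `B<C` needs its closing bracket, the parser finds `>>`, takes one `>`, and pushes the other back, so the outer type `A<...` then sees a plain `>`.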

Matt Fenwick

Java 10 Language Specification (3.2 Lexical Translations) states:

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >.
The input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
Without the rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
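One way to read the quoted exception is as a lexer that takes a "type context" flag. The sketch below is only an illustration: the spec does not say how a lexer learns that it is in a type context, so the `inTypeContext` parameter, the class name `TypeContextLexer`, and the method `nextGreaterThanToken` are all assumptions made here.

```java
// Hypothetical lexer fragment that handles only tokens beginning with '>'.
class TypeContextLexer {
    // Operators are listed longest first, so the first hit is the longest match.
    private static final String[] GT_OPERATORS = {">>>=", ">>>", ">>=", ">>", ">=", ">"};

    // Returns the token that starts at pos; pos must point at a '>' character.
    // inTypeContext is assumed to be supplied by the caller (for example, a parser).
    static String nextGreaterThanToken(String input, int pos, boolean inTypeContext) {
        if (inTypeContext) {
            int end = pos;
            while (end < input.length() && input.charAt(end) == '>') {
                end++;
            }
            // Two or more consecutive '>' followed by a non-'>' character:
            // each '>' becomes its own token, so emit just the first one.
            if (end - pos >= 2 && end < input.length()) {
                return ">";
            }
        }
        // Otherwise apply the ordinary longest-match rule.
        for (String op : GT_OPERATORS) {
            if (input.startsWith(op, pos)) {
                return op;
            }
        }
        throw new IllegalArgumentException("No '>' token at index " + pos);
    }

    public static void main(String[] args) {
        String type = "List<List<String>> x";
        int pos = type.indexOf('>');
        System.out.println(nextGreaterThanToken(type, pos, true));  // prints >
        System.out.println(nextGreaterThanToken(type, pos, false)); // prints >>
    }
}
```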

Earlier versions of C++ (before C++11) apparently suffered from the same problem and therefore required at least one space between two adjacent closing angle brackets (>), as in vector<vector<int> >. Fortunately, that is no longer the case.

Seshadri R