Are ">>"s in type parameters tokenized using a special rule?

Question

I'm confused by the Java spec about how this code should be tokenized:

ArrayList<ArrayList<Integer>> i;

The spec says:

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

As I understand it, applying the "longest match" rule would result in the tokens:

ArrayList
<
ArrayList
<
Integer
>>
i
;

which would not parse. But of course this code is parsed just fine.
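
To make the longest-match reading concrete, here is a minimal sketch of a tokenizer that always takes the longest operator it can. The class name LongestMatchDemo, the method tokenize, and the small operator table are hypothetical and exist only for illustration; this is not javac's scanner and it only covers the characters needed for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny longest-match tokenizer, not javac's scanner.
class LongestMatchDemo {
    // Operators listed longest first, so the first prefix hit is the longest match.
    private static final String[] OPERATORS = {">>>=", ">>>", ">>=", ">>", ">=", ">", "<", ";"};

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++; // whitespace only separates tokens
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = pos;
                while (pos < src.length() && Character.isJavaIdentifierPart(src.charAt(pos))) {
                    pos++;
                }
                tokens.add(src.substring(start, pos));
            } else {
                String matched = null;
                for (String op : OPERATORS) {
                    if (src.startsWith(op, pos)) {
                        matched = op;
                        break;
                    }
                }
                if (matched == null) {
                    throw new IllegalArgumentException("Unexpected character: " + c);
                }
                tokens.add(matched);
                pos += matched.length();
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [ArrayList, <, ArrayList, <, Integer, >>, i, ;]
        System.out.println(tokenize("ArrayList<ArrayList<Integer>> i;"));
    }
}
```

The two closing brackets come out as a single >> token, which is exactly the token list above.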

What is the correct specification for this case?

Does it mean that a correct lexer must be context-free? It doesn't seem possible with a regular lexer.

Related: http://stackoverflow.com/questions/2623966/java-syntax-of/2624125#comment2646754_2624125 — Matt Fenwick, May 28 '13 at 23:37
I assume you meant `i` instead of `1` in your list of tokens. — rgettman, May 28 '13 at 23:42
@johnchen902 is it a bug? I hadn't considered that possibility. I don't really think it is, though. — Matt Fenwick, May 29 '13 at 00:57
I don't think it's a bug. Maybe a bug in the documentation. You can tell how generics are parsed from the following code: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/com/sun/tools/javac/parser/JavacParser.java#JavacParser.typeArguments%28boolean%29 — MartinTeeVarga, May 29 '13 at 01:02
@smartus I see, their approach makes sense. Do you mind if I write that up as an answer or would you prefer to do it? — Matt Fenwick, May 29 '13 at 01:12
To be honest I don't know how to write the answer properly. I understand the code and I remember a few things about grammars from Uni, but I'd prefer a person who actually understands the whole picture to write the answer. You do it ;) — MartinTeeVarga, May 29 '13 at 01:17
it has been a well known problem in C++ - http://stackoverflow.com/a/71706/2158288 no doubt the language designers are very aware of this problem, but they don't seem to be concerned enough to bring it up in the spec. — ZhongYu, May 29 '13 at 01:33
At the beginning of generic type parameters in java, one had to insert a space in `> >`, — Joop Eggen, Feb 05 '20 at 15:56

Matt Fenwick · Accepted Answer · 2013-05-29T02:03:03.207

4

Based on reading the code linked by @sm4, it looks like the strategy is:

1. Tokenize the input normally. So A<B<C>> i; would be tokenized as A, <, B, <, C, >>, i, ; -- 8 tokens, not 9.
2. During hierarchical parsing, when the parser is working on generics and needs a >, check whether the next token starts with > (that is, whether it is >>, >>>, >=, >>=, or >>>=). If so, knock the leading > off and push the shortened token back onto the token stream. Example: when the parser gets to >>, i, ; while working on the typeArguments rule, it successfully parses typeArguments, and the remaining token stream is now the slightly different >, i, ;, since the first > of >> was pulled off to match typeArguments.

So although tokenization does happen normally, some re-tokenization occurs in the hierarchical parsing phase, if necessary.
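
The token-splitting idea can be sketched roughly as follows. This is a hypothetical mini-parser written for this answer, not the javac source: the class SplitTokenDemo, its methods, and the tiny grammar (an identifier optionally followed by type arguments in angle brackets) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: a hypothetical mini-parser, not javac's JavacParser.
class SplitTokenDemo {
    private final Deque<String> tokens;

    SplitTokenDemo(List<String> input) {
        this.tokens = new ArrayDeque<>(input); // first token ends up at the head
    }

    // Type := Identifier [ '<' Type { ',' Type } '>' ]
    void parseType() {
        String name = tokens.poll();
        if (name == null || !Character.isJavaIdentifierStart(name.charAt(0))) {
            throw new IllegalStateException("Expected identifier but got " + name);
        }
        if ("<".equals(tokens.peek())) {
            tokens.poll();
            parseType();
            while (",".equals(tokens.peek())) {
                tokens.poll();
                parseType();
            }
            expectCloseAngle();
        }
    }

    // Consumes exactly one '>'. If the next token merely starts with '>',
    // the leading '>' is knocked off and the shortened token is pushed back.
    private void expectCloseAngle() {
        String next = tokens.poll();
        if (next == null || !next.startsWith(">")) {
            throw new IllegalStateException("Expected '>' but got " + next);
        }
        if (next.length() > 1) {
            tokens.push(next.substring(1));
        }
    }

    List<String> remaining() {
        return new ArrayList<>(tokens);
    }

    public static void main(String[] args) {
        SplitTokenDemo parser =
                new SplitTokenDemo(Arrays.asList("A", "<", "B", "<", "C", ">>", "i", ";"));
        parser.parseType();
        // Prints [i, ;]
        System.out.println(parser.remaining());
    }
}
```

After the inner type B<C> is closed, the >> token has been shortened to >, which then closes the outer type; the lexer itself never needed to know it was inside generics.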

edited May 29 '13 at 02:03

answered May 29 '13 at 01:26


Why doesn't the compiler re-tokenize things such as `a--b`? – johnchen902 May 30 '13 at 23:34
@johnchen902 why should it? Retokenization isn't a general strategy for rescuing failed parses, it's only used in one special case so that you don't have to write `A<B<C> >`. – Matt Fenwick May 31 '13 at 03:12

score 1 · Answer 2 · answered Feb 05 '20 at 15:47

Java 10 Language Specification (3.2 Lexical Translations) states:

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >.
The input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
Without the rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
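
A rough sketch of how that exception changes the result for a run of > characters is below. It is only an illustration under stated assumptions: the class TypeContextRuleDemo and its tokenize method are hypothetical, the input is restricted to '>' and '=' characters, the inTypeContext flag is assumed to be supplied by whoever calls the lexer, and end of input is treated like a following non-> character.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not code from javac or the JLS.
class TypeContextRuleDemo {
    // Longest operators first, so the first prefix hit is the longest match.
    private static final String[] OPERATORS = {">>>=", ">>>", ">>=", ">>", ">=", ">", "="};

    // Tokenizes input made up solely of '>' and '=' characters.
    // Assumption: the caller knows whether it is in a type context.
    static List<String> tokenize(String src, boolean inTypeContext) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            // Length of the run of consecutive '>' characters starting at pos.
            int run = 0;
            while (pos + run < src.length() && src.charAt(pos + run) == '>') {
                run++;
            }
            if (inTypeContext && run >= 2) {
                // The exception: every '>' in the run becomes its own token.
                for (int i = 0; i < run; i++) {
                    tokens.add(">");
                }
                pos += run;
                continue;
            }
            // Otherwise fall back to the ordinary longest-match rule.
            String matched = null;
            for (String op : OPERATORS) {
                if (src.startsWith(op, pos)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("Unexpected character: " + src.charAt(pos));
            }
            tokens.add(matched);
            pos += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(">>>>", false)); // prints [>>>, >]
        System.out.println(tokenize(">>>>", true));  // prints [>, >, >, >]
    }
}
```

Outside a type context the sketch keeps the usual longest match; inside one, four closing brackets become four separate > tokens.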

Earlier versions of C++ (before C++11) apparently suffered from this too, and hence required at least one space between two adjacent greater-than (>) symbols that close nested template arguments, as in vector<vector<int> >. Fortunately, not any more.
