Suppose we have the two lexical definitions:

  lexical DQChr = ![\"] | [\"][\"];
  lexical String = "\"" DQChr* "\"";

Then the following cases parse successfully:

  parse(#String,"\"a\"");
  parse(#String,"\"\u0001\"");

However, a quoted string containing the null character gives a parse error:

  parse(#String,"\"\u0000\"");

How can I allow a null-character in my quoted string?

1 Answer

It seems you can just add `[\u0000]` as an explicit alternative next to the negated character class:

Version: 0.28.2
rascal>lexical DQChr = ([\u0000] | ![\"]) | [\"][\"];
ok
rascal>lexical String = "\"" DQChr* "\"";
ok
rascal>parse(#String,"\"a\"");
String: (String) `"a"`
rascal>parse(#String,"\"\u0001\"");
String: (String) `""`
rascal>parse(#String,"\"\u0000\"");
String: (String) `""`

Marijn
  • Thank you, that does the trick. I am still a bit puzzled why the null character was not allowed, but at least I can move on. – Steven Klusener Aug 16 '23 at 14:42
  • As a side remark, quoted strings normally do not contain null characters. If they do, the characters were inserted somewhere in the process, which may not have been intended. So while there are valid uses, I'd say it does make sense to disallow these characters in strings. – Marijn Aug 16 '23 at 14:49
  • I have to parse legacy Cobol code which happens to contain these string constants with null characters, so I cannot disallow these cases. – Steven Klusener Aug 17 '23 at 07:11
  • It's an obscure semantics of the character class negation to exclude the 0 character, so `![]` becomes `[\u0001-\uFFFF]` and not `[\u0000-\uFFFF]`. The underlying reason is that meta variables in concrete syntax always start and end with a 0 character, and the use of `![]` would then accidentally always lead to ambiguity in concrete syntax patterns. – Jurgen Vinju Aug 17 '23 at 09:07
  • So the semantics of `![]` explains @StevenKlusener's puzzlement. – Jurgen Vinju Aug 17 '23 at 09:10
  • I would use character class union to avoid stack operations during long string parsing, as follows: `lexical Q = (![] || [\u0000])`. With `||`, the character classes are merged into `[\u0000-\uFFFF]` at parser generation time. By contrast, the single `|` keeps the disjunction separate until parse time, so the parser predicts both sides at every character position in the string constant, only to throw one of them away again. – Jurgen Vinju Aug 17 '23 at 09:18
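
Applied to the definitions from the question, the `||` suggestion in the last comment might look like the sketch below. It assumes that `||` composes with `![\"]` the same way the comment shows for `![]`, so that the negated class and `[\u0000]` become one character class when the parser is generated. The `import ParseTree;` line is an assumption about what the session needs for `parse` to be in scope.

```rascal
import ParseTree;

// Union (||) instead of alternative (|): the negated class and [\u0000]
// are meant to be merged into one character class at parser generation time.
lexical DQChr = (![\"] || [\u0000]) | [\"][\"];
lexical String = "\"" DQChr* "\"";

// The quoted null character from the question.
parse(#String, "\"\u0000\"");
```

The doubled-quote alternative `[\"][\"]` still uses `|`, since it is a sequence of two characters rather than a single character class.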