I am trying to come up with a PEG grammar that parses a hostname according to the following BNF from RFC 2396:

  hostname      = *( domainlabel "." ) toplabel [ "." ]
  domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
  toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

There is no problem with domainlabel and toplabel.

The rule for hostname, however, does not seem to be expressible in PEG.

Here is why I think so:

If we take the grammar as written in BNF, the whole input is consumed by *( domainlabel "." ). PEG repetition is greedy and does not back off once it has matched, and since every toplabel is also a valid domainlabel, the loop cannot tell where toplabel [ "." ] should begin.

A simplified, self-contained illustration:

h = (d '.')* t '.'?
d = [dt]
t = [t]

This parses t and d.d.t and fails on d.d.d, which is expected, but it also fails to parse t. and d.d.t., both of which are valid.
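The behaviour described above can be reproduced with a hand-written matcher. This is only a minimal sketch: `match_h` is a hypothetical name, and the function hard-codes the toy grammar rather than using any PEG library.

```python
def match_h(s: str) -> bool:
    """PEG-style match of  h = (d '.')* t '.'?  with d = [dt], t = [t].

    Returns True only if the whole input is consumed.
    """
    pos = 0
    # (d '.')* : greedy; once the loop stops, PEG never gives back
    # an iteration that already succeeded.
    while pos + 1 < len(s) and s[pos] in "dt" and s[pos + 1] == ".":
        pos += 2
    # t
    if pos < len(s) and s[pos] == "t":
        pos += 1
    else:
        return False
    # '.'?
    if pos < len(s) and s[pos] == ".":
        pos += 1
    return pos == len(s)


for text in ["t", "d.d.t", "d.d.d", "t.", "d.d.t."]:
    print(text, match_h(text))
# t True
# d.d.t True
# d.d.d False
# t. False      (the loop already consumed "t.", so nothing is left for t)
# d.d.t. False  (same reason)
```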

If we add a lookahead, it accepts t. and d.d.t., but fails on d.t.t.:

h = (!(t '.'?)d '.')* t '.'?
d = [dt]
t = [t]
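The same kind of sketch for the lookahead version shows where it goes wrong. Again, `match_h_lookahead` is a hypothetical name and the grammar is hard-coded. Because `'.'?` can never fail, the predicate `!(t '.'?)` reduces to "the next character is not t".

```python
def match_h_lookahead(s: str) -> bool:
    """PEG-style match of  h = (!(t '.'?) d '.')* t '.'?  with d = [dt], t = [t]."""
    pos = 0
    while True:
        # !(t '.'?) : the iteration is refused as soon as a t is seen here.
        if pos < len(s) and s[pos] == "t":
            break
        # d '.'
        if pos + 1 < len(s) and s[pos] in "dt" and s[pos + 1] == ".":
            pos += 2
        else:
            break
    # t
    if pos < len(s) and s[pos] == "t":
        pos += 1
    else:
        return False
    # '.'?
    if pos < len(s) and s[pos] == ".":
        pos += 1
    return pos == len(s)


for text in ["t.", "d.d.t.", "d.t.t."]:
    print(text, match_h_lookahead(text))
# t. True
# d.d.t. True
# d.t.t. False  (the loop stops at the first t, leaving the final "t." unconsumed)
```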

So I am out of ideas. Is there a way to express this BNF in PEG?

Trident D'Gao

1 Answer

If you just need to check validity, you can do it like this:

/* Unchanged */
toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
/* Diff with above */
nontoplabel   = digit | digit *( alphanum | "-" ) alphanum
/* Rephrase, keeping the "." that separates consecutive labels */
hostname      = *( nontoplabel "." ) toplabel *( "." *( nontoplabel "." ) toplabel ) [ "." ]

Since nontoplabel and toplabel are distinguishable by their first character, there is no possible ambiguity in the last expression.
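As a rough illustration on the simplified grammar from the question, here is a sketch of the same idea with the two label kinds made disjoint: n = [d] plays the role of nontoplabel and t = [t] the role of toplabel. The rule matched is `h = (n '.')* t ('.' (n '.')* t)* '.'?`, with the separating dot between groups written out explicitly. The function name and this exact phrasing of the rule are assumptions made for the example, not part of the answer.

```python
def match_h_disjoint(s: str) -> bool:
    """PEG-style match of  h = (n '.')* t ('.' (n '.')* t)* '.'?
    with n = [d] and t = [t], so the two label kinds never overlap."""

    def skip_n_groups(pos: int) -> int:
        # (n '.')* : safe to be greedy, because n can never match a t.
        while pos + 1 < len(s) and s[pos] == "d" and s[pos + 1] == ".":
            pos += 2
        return pos

    pos = skip_n_groups(0)
    # t
    if not (pos < len(s) and s[pos] == "t"):
        return False
    pos += 1
    # ('.' (n '.')* t)*
    while pos < len(s) and s[pos] == ".":
        nxt = skip_n_groups(pos + 1)
        if nxt < len(s) and s[nxt] == "t":
            pos = nxt + 1
        else:
            break  # failed iteration: position stays before the '.'
    # '.'?
    if pos < len(s) and s[pos] == ".":
        pos += 1
    return pos == len(s)


for text in ["t", "t.", "d.d.t.", "d.t.t.", "d.d.d", "d.t.d"]:
    print(text, match_h_disjoint(text))
# t True
# t. True
# d.d.t. True
# d.t.t. True
# d.d.d False
# d.t.d False
```

All three inputs that tripped up the earlier attempts (t., d.d.t., d.t.t.) are accepted, and inputs that do not end in a t label are still rejected.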

The transformation is one of the many possible regular expression identities:

(a | b)* a ==> (b* a)+

You can always replace b in a|b with b-a (using - as the set difference operator).
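Both claims are easy to sanity-check by brute force. The sketch below uses Python's `re` module (ordinary backtracking regular expressions, not PEG) to compare the two sides of the identity on every string over {a, b} up to length 8, and then checks the set-difference remark on the character sets from the question. It is a spot check on short strings, not a proof.

```python
import re
from itertools import product

lhs = re.compile(r"(?:a|b)*a")   # (a | b)* a
rhs = re.compile(r"(?:b*a)+")    # (b* a)+

# Collect every short string on which the two expressions disagree.
mismatches = []
for length in range(9):
    for chars in product("ab", repeat=length):
        s = "".join(chars)
        if bool(lhs.fullmatch(s)) != bool(rhs.fullmatch(s)):
            mismatches.append(s)

print(mismatches)  # []

# Replacing b with b - a does not change the union a | b.
a_set = set("t")    # first characters accepted by t in the toy grammar
b_set = set("dt")   # first characters accepted by d in the toy grammar
print(a_set | b_set == a_set | (b_set - a_set))  # True
```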

rici
  • I am interested about your last point on the rewrite of `a|b`. Can you give some references? – Seki Feb 08 '17 at 22:41
  • @seki: it's just simple set theory, since `|` is the union of two sets. A set doesn't have duplicate elements, so when you compute the union of A and B, you can remove from B the elements already in the union because they are in A. – rici Feb 08 '17 at 22:58
  • Thanks for the explanation :) – Seki Feb 08 '17 at 23:12