
I have the following fake.dtd file:

<!ELEMENT outer - - (#PCDATA, foo, bar) >
<!ELEMENT foo - o (#PCDATA) >
<!ELEMENT bar - - (#PCDATA) >

And the following SGML document:

<!DOCTYPE outer SYSTEM "fake.dtd">
<OUTER>Document Title
    <FOO>1234
    <BAR>wxyz</BAR>
</OUTER>

I am getting a validation error using nsgmls:

4:19:E: character data is not allowed here

Note that putting </OUTER> on the same line as </BAR> solves the problem; the error refers to the line-break.

Is there a way to keep the SGML as is (because I already have thousands of documents like this), but change the DTD so that it validates?

Adding another #PCDATA to the end of the outer element seems silly because that would make characters other than newline legal.

ChrisP

2 Answers


The SGML Standard (ISO 8879:1986/A1:1988, 11.2.4) explicitly recommends against content models like (#PCDATA, foo, bar):

NOTE - It is recommended that “#PCDATA” be used only when data characters are to be permitted anywhere in the content of the element; that is, in a content model where it is the sole token, or where “or” is the only connector used in any model group.

Although #PCDATA appears only as the first token of the group, your outer element type is still declared to have mixed content, so data characters are recognized anywhere in it. That is why the line break (a "record end") after </BAR> is treated as a data character rather than as a mere separator. Since no #PCDATA token is left in the content model to absorb it at that point, the parser reports the error. (The line break after <FOO>1234 escapes the same error only because the </FOO> end-tag is omitted: foo is still open there, so the line break becomes part of foo's #PCDATA content.)


The proper and common approach in this case is to place the "Document Title" text into an actual title element, for which both the start-tag and the end-tag may be omitted:

<!ELEMENT outer - - (title, foo, bar) >
<!ELEMENT title o o (#PCDATA) >

Now

  • your document instance is valid without modification,
  • the outer content model still reflects the proper order of elements,
  • the outer element has element content (no longer mixed content),
  • and the "Document Title" text ends up in its own title element, as it should.

(The same technique is used in several Standard DTDs, like the "General Document" example in annex E of the Standard.)
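
For reference, here is a sketch of the complete fake.dtd with this change applied. The foo and bar declarations are copied unchanged from the question; only outer and title differ:

```sgml
<!-- outer now has element content: line breaks between its children are just separators -->
<!ELEMENT outer - - (title, foo, bar) >
<!-- both tags omissible, so the bare "Document Title" text opens an implied title -->
<!ELEMENT title o o (#PCDATA) >
<!ELEMENT foo   - o (#PCDATA) >
<!ELEMENT bar   - - (#PCDATA) >
```

With these declarations, the parser should treat the unmodified document roughly as if the omitted tags had been written out like this (my reading of the tag-omission rules, not actual nsgmls output):

```sgml
<OUTER><TITLE>Document Title
    </TITLE><FOO>1234
    </FOO><BAR>wxyz</BAR>
</OUTER>
```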

mrtnhfmnn

Whitespace that looks innocuous is in fact significant character data, which results in an error. This is sometimes referred to as "pernicious mixed content". You have already hinted at a solution (allowing #PCDATA after the bar element):

<!ELEMENT outer - - (#PCDATA, foo, bar, #PCDATA) >

Another option is to allow #PCDATA and elements in any order (this is how mixed content must be declared in XML):

<!ELEMENT outer - - (#PCDATA|foo|bar)* >

I cannot think of anything else. It is not possible to restrict #PCDATA content to certain characters only.
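
To make the trade-off of the second declaration concrete: it no longer constrains the order or number of the child elements. As far as I can tell, a hypothetical instance like the following (not one of the asker's documents), which the original content model rejects, would then validate as well:

```sgml
<!DOCTYPE outer SYSTEM "fake.dtd">
<OUTER>
    <BAR>wxyz</BAR>
    <BAR>more</BAR>
    <FOO>1234
</OUTER>
```

The first declaration keeps the element order, but it allows arbitrary text after </BAR>, which is the drawback already noted in the question.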

mzjn