I'm working on a program that uses DocumentBuilder to parse an old HTML file so that it can be processed. The HTML file contains the following DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Here's the code snippet that does the reading:

DocumentBuilderFactory documentBuilderFactory;
DocumentBuilder documentBuilder;

documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilder = documentBuilderFactory.newDocumentBuilder();

Document doc = documentBuilder.parse(htmlSource);

The parsing then fails with the following error:

Error 1:    The declaration for the entity "HTML.Version" must end with '>'.
      Column Number:    3
      System Identifer: null
      toString:         org.xml.sax.SAXParseException; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
      Line Number:      31
      Public Identifer: null
      Caused By:

      The declaration for the entity "HTML.Version" must end with '>'.
      Trace Follows:

org.xml.sax.SAXParseException; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
        at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
        at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1597)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:2021)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(XMLDTDScannerImpl.java:299)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1165)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1040)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:943)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:541)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
        at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at com.rockwellcollins.ana.xml.XmlParser.parse(XmlParser.java:490)
        at com.rockwellcollins.ana.xml.XmlParser.parse(XmlParser.java:592)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.HtmlParser.parseHtml(HtmlParser.java:301)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.DoorsMapper.applicationSpecificDoIt(DoorsMapper.java:232)
        at com.rockwellcollins.application.common.ApplicationBase.doIt(ApplicationBase.java:795)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.DoorsMapper.main(DoorsMapper.java:300)

It's complaining about this section of the DTD file:

<!ENTITY % HTML.Version "-//W3C//DTD HTML 4.01 Transitional//EN"
  -- Typical usage:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    ...
    </head>
    <body>
    ...
    </body>
    </html>

    The URI used as a system identifier with the public identifier allows
    the user agent to download the DTD and entity sets as needed.

    The FPI for the Strict HTML 4.01 DTD is:

        "-//W3C//DTD HTML 4.01//EN"

    This version of the strict DTD is:

        http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd

    Authors should use the Strict DTD unless they need the
    presentation control for user agents that don't (adequately)
    support style sheets.

    If you are writing a document that includes frames, use 
    the following FPI:

        "-//W3C//DTD HTML 4.01 Frameset//EN"

    This version of the frameset DTD is:

        http://www.w3.org/TR/1999/REC-html401-19991224/frameset.dtd

    Use the following (relative) URIs to refer to 
    the DTDs and entity definitions of this specification:

    "strict.dtd"
    "loose.dtd"
    "frameset.dtd"
    "HTMLlat1.ent"
    "HTMLsymbol.ent"
    "HTMLspecial.ent"

-->

From my initial investigation, it's complaining about the -- comments inside the declarations. If I remove those, the first error disappears and the parser moves on to the next one. My question is: why is DocumentBuilder unable to read the DTD file correctly?

One more detail: we can't remove the DTD reference from the HTML file, because the HTML provided is HTML 4 specific and, without the DTD, parsing fails because of the HTML 4 formatting.

Jim Garrison
Dylan
  • That's just... weird. Since this is a w3.org file I expect there's an explanation, but it sure doesn't look like a valid DTD to me. It looks like there's a missing `>` at the end of each `ENTITY` declaration and a missing `<!` at the start of each comment. I await an answer to this with anticipation. – Jim Garrison Jul 24 '19 at 18:46
  • I'm in the same boat. I feel like it should work since the DTD is coming from w3. This is also a broken element in the DTD according to Document Builder: `<!ENTITY % ContentType "CDATA" -- media type, as per [RFC2045] -->` But if I remove the comments and have it like so: `<!ENTITY % ContentType "CDATA" >` There is no issue. – Dylan Jul 24 '19 at 18:50
  • See https://stackoverflow.com/q/33230063/18157, it appears the DTD is valid SGML, but the DTD parser only accepts the XML DTD subset of SGML syntax. I'm going to mark this as a duplicate of that question. This is a nice "signpost" question that will guide others to the answer. Unfortunately it doesn't provide a solution, since the answer appears to be that "modern" DTD parsers are not backward-compatible with older syntax. This is a bug in the DTD parser IMHO. – Jim Garrison Jul 24 '19 at 19:10
  • I decided not to mark this as a duplicate for now since it's Java 13 specific... there may have been a change introduced sometime after Java 8 that breaks this. I'll be curious to see what others have to say, and am adding the [tag:xml] tag to make it more visible to XML experts. Good luck, I hope there's a solution out there. – Jim Garrison Jul 24 '19 at 19:20

1 Answer

The HTML 4.01 DTD is an SGML DTD, not an XML DTD (XML is a subset of SGML), and HTML 4 can't be parsed using an XML parser. You're right about the comments: in contrast to XML, SGML allows comments to appear inside markup declarations, anywhere and multiple times. For example, the following is a valid SGML element declaration:

<!ELEMENT e - - (#PCDATA)
  -- declaration for e --
  -- ... other comment -->

The declaration also hints at one of the features that the XML subset of SGML doesn't support but that is needed for parsing HTML, namely tag inference (tag omission). The two characters following the element name e are the tag omission indicators for the start-element tag and the end-element tag, in that order: "-" (hyphen-minus) means the tag is required, and "O" (the letter O) means it may be omitted. The - - in this declaration therefore requires both tags, whereas - O would allow end-element tag omission but no start-element tag omission. Other needed features that XML doesn't support are SGML/HTML-style empty elements such as img (without an end-element tag) and attribute minimization (as in <div hidden>).
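
The comment behaviour can be reproduced without downloading loose.dtd. The following is a minimal sketch, not code from the question: it wraps a single declaration in an internal DTD subset and reports whether the JDK's DocumentBuilder accepts it. The class name SgmlCommentRepro, the helper parses, and the root element name are made up for illustration; the two declarations are the ContentType lines quoted in the comments above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SgmlCommentRepro {

    // Hypothetical helper: wraps one declaration in an internal DTD subset and
    // returns true if the XML parser accepts the resulting document.
    static boolean parses(String declaration) throws Exception {
        String xml = "<!DOCTYPE root [\n" + declaration + "\n]>\n<root/>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors in the DTD surface here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // XML-style declaration: no comment inside the declaration.
        System.out.println(parses("<!ENTITY % ContentType \"CDATA\">"));
        // SGML-style declaration: a -- comment -- before the closing '>'.
        System.out.println(parses("<!ENTITY % ContentType \"CDATA\" -- media type, as per [RFC2045] -->"));
    }
}
```

The second call should be rejected for the same reason as the HTML.Version declaration in the question: an XML parser expects '>' directly after the entity value.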

imhotap
  • Thank you for your response. This ended up being the issue. Between what you and Jim said, I've ruled out using this as a parser and have moved onto JSoup :) – Dylan Jul 25 '19 at 14:53
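
For readers who want to stay with DocumentBuilder, one commonly used workaround is to stop the parser from reading the external DTD at all. The sketch below does this with an EntityResolver that answers every external entity request with an empty stream. It rests on an assumption that does not hold for the file in the question: the markup must already be well-formed XML and must not rely on entities defined in the DTD (such as &nbsp;). The class name SkipExternalDtd and the method parseIgnoringDtd are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SkipExternalDtd {

    // Hypothetical helper: parses markup while ignoring its external DTD.
    static Document parseIgnoringDtd(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request (including loose.dtd) with an
        // empty stream, so the SGML DTD is never downloaded or scanned.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        String markup =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
            + "\"http://www.w3.org/TR/html4/loose.dtd\">"
            + "<html><head><title>t</title></head><body><p>hello</p></body></html>";
        Document doc = parseIgnoringDtd(markup);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

This only sidesteps the DTD. Unclosed tags, minimized attributes, and undeclared entities will still be rejected, which is why an HTML parser remains the better fit for real HTML 4 input.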