I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I have some legacy XHTML 1.1 files I want to process. They take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

To keep the parser from stalling while it connects to the Internet, I install a custom EntityResolver that loads known entities (looked up by their public IDs, such as -//W3C//ELEMENTS XHTML Inline Style 1.0//EN) stored in the program resources. This DefaultEntityResolver class also prints debug messages indicating which entities the parser is loading.
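For readers without access to DefaultEntityResolver, here is a minimal sketch of the same idea. The class name LoggingEntityResolver and the in-memory map are assumptions made for illustration; the real class loads its entities from program resources, which this sketch replaces with strings keyed by public ID.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for DefaultEntityResolver; not the actual class. */
final class LoggingEntityResolver implements EntityResolver {

  // Assumption: entity text is held in memory, keyed by public ID.
  private final Map<String, String> entityTextByPublicId;
  private final List<String> resolvedPublicIds = new ArrayList<>();

  LoggingEntityResolver(Map<String, String> entityTextByPublicId) {
    this.entityTextByPublicId = Map.copyOf(entityTextByPublicId);
  }

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    String text = publicId != null ? entityTextByPublicId.get(publicId) : null;
    if (text == null) {
      // Unknown entity: returning null lets the parser fall back to its
      // default behavior, which may mean opening a network connection.
      return null;
    }
    resolvedPublicIds.add(publicId);
    System.err.println("Resolved locally: " + publicId + " (" + systemId + ")");
    InputSource source = new InputSource(new StringReader(text));
    source.setPublicId(publicId);
    source.setSystemId(systemId); // keep the system ID so relative references still resolve
    return source;
  }

  /** Public IDs served from the local map, in the order the parser requested them. */
  List<String> getResolvedPublicIds() {
    return List.copyOf(resolvedPublicIds);
  }
}
```

An instance would be passed to documentBuilder.setEntityResolver(...) in place of DefaultEntityResolver.getInstance().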

Here is the basic form of my parsing:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
documentBuilder.setEntityResolver(DefaultEntityResolver.getInstance());
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}

Because of the debug messages in DefaultEntityResolver, I can see that the parser loaded the following entities, in this order.

  • -//W3C//DTD XHTML 1.1//EN (http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd)
  • -//W3C//ELEMENTS XHTML Inline Style 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod)
  • -//W3C//ENTITIES XHTML Datatypes 1.0//EN (http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod)
  • -//W3C//ENTITIES XHTML Modular Framework 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod)
  • -//W3C//ENTITIES XHTML Datatypes 1.0//EN (http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod)
  • -//W3C//ENTITIES XHTML Qualified Names 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod)
  • -//W3C//ENTITIES XHTML Intrinsic Events 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod)
  • -//W3C//ENTITIES XHTML Common Attributes 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod)
  • -//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod)
  • -//W3C//ENTITIES XHTML Character Entities 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod)
  • -//W3C//ENTITIES Latin 1 for XHTML//EN (http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent)
  • -//W3C//ENTITIES Symbols for XHTML//EN (http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent)
  • -//W3C//ENTITIES Special for XHTML//EN (http://www.w3.org/MarkUp/DTD/xhtml-special.ent)
  • -//W3C//ELEMENTS XHTML Text 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Structural 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Phrasal 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod)
  • -//W3C//ELEMENTS XHTML Block Structural 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod)
  • -//W3C//ELEMENTS XHTML Block Phrasal 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod)
  • -//W3C//ELEMENTS XHTML Hypertext 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-hypertext-1.mod)
  • -//W3C//ELEMENTS XHTML Lists 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-list-1.mod)
  • -//W3C//ELEMENTS XHTML Editing Elements 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-edit-1.mod)
  • -//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-bdo-1.mod)
  • -//W3C//ELEMENTS XHTML Ruby 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-ruby-1.mod)
  • -//W3C//ELEMENTS XHTML Presentation 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-pres-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Presentation 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-inlpres-1.mod)
  • -//W3C//ELEMENTS XHTML Block Presentation 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-blkpres-1.mod)
  • -//W3C//ELEMENTS XHTML Link Element 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-link-1.mod)
  • -//W3C//ELEMENTS XHTML Metainformation 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-meta-1.mod)
  • -//W3C//ELEMENTS XHTML Base Element 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-base-1.mod)
  • -//W3C//ELEMENTS XHTML Scripting 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-script-1.mod)
  • -//W3C//ELEMENTS XHTML Style Sheets 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-style-1.mod)
  • -//W3C//ELEMENTS XHTML Images 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-image-1.mod)
  • -//W3C//ELEMENTS XHTML Client-side Image Maps 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-csismap-1.mod)
  • -//W3C//ELEMENTS XHTML Server-side Image Maps 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-ssismap-1.mod)
  • -//W3C//ELEMENTS XHTML Param Element 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-param-1.mod)
  • -//W3C//ELEMENTS XHTML Embedded Object 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-object-1.mod)
  • -//W3C//ELEMENTS XHTML Tables 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-table-1.mod)
  • -//W3C//ELEMENTS XHTML Forms 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-form-1.mod)
  • -//W3C//ELEMENTS XHTML Document Structure 1.0//EN (http://www.w3.org/MarkUp/DTD/xhtml-struct-1.mod)

Note that some of these entities no longer exist at the indicated URL; nevertheless my DefaultEntityResolver has these entities already stored and keyed to their public IDs, and thus still provides them to the parser.

So far so good. But when I immediately call document.normalizeDocument(), the program pauses and then prints:

[Error] xhtml11.dtd:129:43: The entity "LanguageCode.datatype" was referenced, but not declared.
[Error] xhtml11.dtd:130:44: The entity "LanguageCode.datatype" was referenced, but not declared.
[Error] xhtml11.dtd:194:47: The entity "Common.attrib" was referenced, but not declared.

Note this is not my program printing these errors; it's apparently something inside document.normalizeDocument(). In addition, here are two other curiosities:

  • This does not happen if I run my application from within Eclipse.
  • This does not happen if I disable my network connection.

My best guess is that document.normalizeDocument() is not using the custom EntityResolver I installed in the document builder. Because some of the entities no longer exist at their expected URLs (e.g. http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod), they cannot be loaded and therefore the indicated referenced entities never get defined. The web server, however, takes a long time to respond that the entities are missing (as you can test manually), which makes the program seem to pause. This also might explain why the error messages don't appear when my network connection is disabled; I'm guessing none of the external entities can be loaded, failing immediately, but this is not considered an error. (None of this explains why this works with no pause or error message inside Eclipse, though.)

In fact the DOMConfiguration documentation hints that I need to set some sort of resource-resolver parameter, although I'm not sure why DOMConfiguration doesn't default to the entity resolver I set in the original document builder used to parse the XML document.
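As a rough sketch of what setting that parameter might look like: the DOM Level 3 "resource-resolver" parameter takes an org.w3c.dom.ls.LSResourceResolver, which can be set on document.getDomConfig() before normalizing. The class name NormalizeResolverSupport and the in-memory map keyed by public ID are assumptions for illustration, and I have not verified that this removes the pause or the errors described above.

```java
import java.io.StringReader;
import java.util.Map;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

/** Hypothetical helper; the name and the map-based lookup are illustrative only. */
final class NormalizeResolverSupport {

  /**
   * Tries to install a resolver that serves entity text from the given map.
   * Returns false if the DOM implementation lacks Load/Save support or
   * refuses the "resource-resolver" parameter.
   */
  static boolean installResourceResolver(Document document, Map<String, String> entityTextByPublicId) {
    Object feature = document.getImplementation().getFeature("LS", "3.0");
    if (!(feature instanceof DOMImplementationLS)) {
      return false;
    }
    DOMImplementationLS ls = (DOMImplementationLS) feature;

    LSResourceResolver resolver = (type, namespaceUri, publicId, systemId, baseUri) -> {
      String text = publicId != null ? entityTextByPublicId.get(publicId) : null;
      if (text == null) {
        return null; // defer to the implementation's default resolution
      }
      LSInput input = ls.createLSInput();
      input.setPublicId(publicId);
      input.setSystemId(systemId);
      input.setBaseURI(baseUri);
      input.setCharacterStream(new StringReader(text));
      return input;
    };

    DOMConfiguration config = document.getDomConfig();
    if (!config.canSetParameter("resource-resolver", resolver)) {
      return false;
    }
    config.setParameter("resource-resolver", resolver);
    return true;
  }
}
```

The call would go between parsing and document.normalizeDocument(). Checking canSetParameter first avoids a DOMException on implementations that do not support the optional parameter.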

To make things a little stranger, I put the skeleton XHTML 1.1 document above in my resources, and created a unit test exactly like the code above, followed by document.normalizeDocument(), and the test passed with no pause and no errors, even from the command line!

But then if I put a loop for(int i = 0; i < 100; i++) in the unit test to load, parse, and normalize the document 100 times (but using the same DocumentBuilderFactory), my unit test crashes the forked unit test JVM altogether!!

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (default-test) on project [...]: There are test failures.

Please refer to [...]\xml\target\surefire-reports for the individual test results.
Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was cmd.exe /X /C [...]
Process Exit Code: 0
Crashed tests:
[...].XmlDomTest
org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was cmd.exe /X /C [...]
Process Exit Code: 0
Crashed tests:
com.globalmentor.xml.XmlDomTest
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:282)
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:245)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:957)
        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:289)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:193)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)

    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: There are test failures.

So I'm thinking I want to avoid document.normalizeDocument(), but I welcome any clarifications of this behavior.
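One possible way to avoid it, if all that is needed is merging adjacent text nodes: Node.normalize() only puts text nodes into normal form and takes no DOMConfiguration. The sketch below uses a hypothetical class name and builds a tiny document in memory rather than parsing the XHTML file; it does not perform the namespace fixup or other operations that normalizeDocument() may do.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demo class; illustrates Node.normalize() only. */
final class TextNormalizeSketch {

  /** Builds an element with two adjacent text nodes, normalizes, and returns the child count. */
  static int countChildrenAfterNormalize() throws ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document document = factory.newDocumentBuilder().newDocument();

    Element root = document.createElementNS("http://www.w3.org/1999/xhtml", "html");
    document.appendChild(root);
    root.appendChild(document.createTextNode("foo"));
    root.appendChild(document.createTextNode("bar")); // adjacent to the previous text node

    // Merges adjacent text nodes throughout the subtree; no entities are loaded.
    document.normalize();
    return root.getChildNodes().getLength();
  }
}
```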

Garret Wilson
  • *"But when I immediately call `document.normalizeDocument()`"* Why would you do that? A parsed XML document is already normalized. – Andreas Mar 09 '20 at 01:03
  • @Andreas: 1) _Why_ I would do that is irrelevant. 2) There is no guarantee the XML document is already normalized; if there is, please point me to the API specification that says so. 3) If the document is already normalized, then why isn't the method call a no-op? Why is it exhibiting the behavior I mentioned? 4) You are not even addressing the distinction between `normalize()` and `normalizeDocument()`. 5) Would you like to respond directly to the question, to clarify the behavior? – Garret Wilson Mar 09 '20 at 02:48
  • I suspect that part of the answer (only part) is that the Xerces parser does some things lazily, so it might be doing things (like entity expansion) during `normalizeDocument()` that you would expect to be performed earlier. – Michael Kay Mar 09 '20 at 08:07
  • @GarretWilson #2) Javadoc of [`normalizeDocument()`](https://docs.oracle.com/javase/8/docs/api/org/w3c/dom/Document.html#normalizeDocument--): *This method acts as if the document was going through a save and load cycle, putting the document in a "normal" form.* --- That explicitly tells you that `normalizeDocument()` will have the same effect as saving the `Document` to XML file and reloading (parsing) it. Ergo, save+load will normalize. – Andreas Mar 09 '20 at 13:42
  • @GarretWilson #4) I wasn't answering your question, I was commenting on it, specifically on why you'd even make the call to `normalizeDocument()` in the first place, so why would I be addressing `normalize()` and the difference between the two? My point is that you shouldn't call either. – Andreas Mar 09 '20 at 13:42
  • Andreas please read the entire API documentation for `normalizeDocument()`. The sentence you quoted is only part of it, and in fact is only the _minimum_ it will do. (Plus you ignored the "save" part; I'm only loading the document, and the save process might make further normalizations of the DOM, such as adding needed `xmlns` attributes.) It goes on to say, "Otherwise, the actual result depends on the features being set on the `Document.domConfig` object and governing what operations actually take place." It even says to refer to `DOMConfiguration`, which has a host of options. – Garret Wilson Mar 09 '20 at 14:49
  • And `normalize()` is very relevant here, as the API docs say that `normalizeDocument()` will do everything `normalize()` does, but probably more. In all these cases, there is nothing that guarantees that a parsed XML document will be in normalized form. Nevertheless, this question was not about whether I should call `normalizeDocument()` (or `normalize()`), but rather to clarify its unexpected behavior _if I were to call it_. – Garret Wilson Mar 09 '20 at 14:52

1 Answer

Not really an answer, but information you may find helpful: Saxon has built-in copies of the relevant DTD files, and uses its own EntityResolver, so I thought I would try that. It parsed the document as follows:

Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/Desktop/temp/test.xhtml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/xhtml11/xhtml11.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-framework-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-datatypes-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-qname-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-events-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-attribs-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml11-model-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-charent-1.mod
Fetching Saxon copy of w3c/xhtml-lat1.ent
Fetching Saxon copy of w3c/xhtml-symbol.ent
Fetching Saxon copy of w3c/xhtml-special.ent
Fetching Saxon copy of w3c/xhtml11/xhtml-text-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstruct-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlphras-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkstruct-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkphras-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-hypertext-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-list-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-edit-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-bdo-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-ruby-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-pres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlpres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkpres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-link-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-meta-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-base-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-script-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-style-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-image-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-csismap-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-ssismap-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-param-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-object-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-form-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-struct-1.mod
Tree built in 88.44306ms

I haven't tried building a DOM using that EntityResolver, but it's certainly possible in principle. And I haven't compared this list of entities with the list you reported.

Further information: searching the DTD entities that Saxon has local copies of, I found the entity LanguageCode.datatype declared in a number of places, including xhtml-math11-f.dtd, xhtml-math11-f-a.dtd, svg-datatypes.mod, svg11-flat.dtd, xhtml-datatypes-1.mod (which is in your list) and several others.

The list of entities present in Saxon was accumulated over a period of a couple of years and involved a lot of trial and error. There is no single definitive list at W3C. There are also many inconsistencies in the W3C collection: for example, modules with no public ID, modules with several public IDs or system IDs, and multiple modules with the same public ID. The Saxon list has been stable for a few years so it's hopefully usable now, but there's no real way of knowing.

Michael Kay
  • "The list of entities present in Saxon was accumulated over a period of a couple of years and involved a lot of trial and error. There is no single definitive list at W3C." This actually answers my separate question https://stackoverflow.com/q/60568284/421049 . I have been going through all the DTDs I know about and in turn downloading all the referenced entities and keying them to public IDs for my own entity resolver. Not so fun. – Garret Wilson Mar 09 '20 at 14:45