I need to convert an org.w3c.dom.Document to an org.jdom.Document.

I have tried the following:

InputStream inputStream = new ByteArrayInputStream(str.getBytes());

Tidy tidy = new Tidy();
tidy.setMakeClean(false);
tidy.setShowWarnings(true);
tidy.setTidyMark(false);
tidy.setNumEntities(true);
tidy.setQuoteAmpersand(true);
tidy.setQuoteMarks(true);
tidy.setQuoteNbsp(false);
tidy.setHideEndTags(false);
tidy.setDropEmptyParas(false);

Document tidyDOM = tidy.parseDOM(inputStream, null);
DOMBuilder domBuilder = new DOMBuilder();
org.jdom.Document jdomDoc = domBuilder.build(tidyDOM);

domBuilder.build(tidyDOM) throws the following exception:

org.jdom.IllegalNameException: The name "html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"" is not legal for JDOM/XML DocTypes: XML names cannot contain the character " ".
    at org.jdom.DocType.setElementName(DocType.java:171)
    at org.jdom.DocType.<init>(DocType.java:111)
    at org.jdom.DocType.<init>(DocType.java:144)
    at org.jdom.DefaultJDOMFactory.docType(DefaultJDOMFactory.java:118)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:332)
    at org.jdom.input.DOMBuilder.buildTree(DOMBuilder.java:170)
    at org.jdom.input.DOMBuilder.build(DOMBuilder.java:135)
    at test.JaxenTest.testParsingVisitor(JaxenTest.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
Kevin
Komal Goyal

2 Answers

Add these two settings and everything should work.

tidy.setXHTML(true);
tidy.setDocType("omit");

The first setting tells JTidy to output XHTML, which is well-formed XML.

The second setting tells JTidy not to emit a DOCTYPE declaration. JDOM appears to reject the DOCTYPE that JTidy produces for HTML/XHTML documents.

    In fairness, this is not a JDOM problem. I think you will find that the DOM Document 'feeding' JDOM is inaccurate... you cannot have an element called: "html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"" – rolfl May 03 '12 at 02:32
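To illustrate the point in the comment above, here is a rough sketch using only JDK classes. It shows what a well-formed DOM DocumentType looks like (the name is just "html", and the public identifier is stored separately) and how one might check for, or drop, a suspicious doctype before handing the document to DOMBuilder. The class and method names (DoctypeCheck, hasUsableDoctypeName, dropDoctype, sampleDocument) are made up for this example. Whether JTidy's own DOM implementation supports removeChild on the doctype node is an assumption I have not verified, so treat this as a starting point rather than a confirmed fix.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class, not part of JTidy or JDOM.
class DoctypeCheck {

    // True if the document has no doctype, or if the doctype name is
    // non-empty and contains no whitespace. This is only a rough check,
    // not a full XML Name validation.
    static boolean hasUsableDoctypeName(Document doc) {
        DocumentType doctype = doc.getDoctype();
        if (doctype == null) {
            return true;
        }
        String name = doctype.getName();
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            if (Character.isWhitespace(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Removes the doctype node, if any, so that a converter never sees it.
    static void dropDoctype(Document doc) {
        DocumentType doctype = doc.getDoctype();
        if (doctype != null) {
            doc.removeChild(doctype);
        }
    }

    // Builds a small document with a correctly structured HTML 4.01 doctype,
    // using the JDK's DOM implementation (no parsing, no network access).
    static Document sampleDocument() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            DocumentType doctype = impl.createDocumentType(
                    "html",
                    "-//W3C//DTD HTML 4.01 Transitional//EN",
                    "http://www.w3.org/TR/html4/loose.dtd");
            return impl.createDocument(null, "html", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println("name:     " + doc.getDoctype().getName());
        System.out.println("publicId: " + doc.getDoctype().getPublicId());
        System.out.println("usable:   " + hasUsableDoctypeName(doc));
        dropDoctype(doc);
        System.out.println("doctype after drop: " + doc.getDoctype());
    }
}
```

In the question's code, the same check could be applied to tidyDOM right before domBuilder.build(tidyDOM). If the check fails there, that would support the idea that the doctype node coming out of JTidy is the culprit.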
It looks to me as if JTidy is creating a malformed DocType node. I suggest using a different HTML parser.

I recommend The Validator.nu HTML Parser but there are plenty of others.

Alohci