I get the following error in my Java program when the HTML contains the following Japanese text.

ファミリーコンパクト 270ml ●植物成分使用●型番:コンパクト●容量(mL):270●しつこい油汚れをスッキリ落とします。

I use org.w3c.tidy.Tidy to parse the HTML and then org.xhtmlrenderer.pdf.ITextRenderer to generate the PDF.

Error:

    ERROR:  'The content of elements must consist of well-formed character data or markup.'
    Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException; lineNumber: 187; columnNumber: 65; The content of elements must consist of well-formed character data or markup.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:191)
    at org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:71)
    at org.xhtmlrenderer.swing.NaiveUserAgent.getXMLResource(NaiveUserAgent.java:211)
    at org.xhtmlrenderer.pdf.ITextRenderer.loadDocument(ITextRenderer.java:134)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocument(ITextRenderer.java:149)
    at me.preekmr.Main.convertHTMLToPDF(Main.java:66)
    at me.preekmr.Main.main(Main.java:27)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 187; columnNumber: 65; The content of elements must consist of well-formed character data or markup.
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:189)
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 187; columnNumber: 65; The content of elements must consist of well-formed character data or markup.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:659)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
    ... 8 more
  • Sounds like an encoding error. You are probably loading the characters in an encoding that does not support them, so some of them end up looking like XML markup characters, which makes the whole XML not well-formed. – Ben Apr 12 '18 at 08:04
  • You might want to make your issue reproducible by adding code and sample input. Most likely @Ben is correct. It would probably suffice to add some encoding indication to the output of `org.w3c.tidy.Tidy` before feeding it to `org.xhtmlrenderer.pdf.ITextRenderer`... – mkl Apr 12 '18 at 13:14

1 Answer

The issue was the output encoding of the Tidy parser.

Previously, Tidy read the HTML with UTF-8 as the input encoding and wrote its output in UTF-8 as well. The XML parser that the renderer uses to load the document (the source of the org.xml.sax.SAXParseException in the stack trace) was unable to parse some of the UTF-8 characters in that output, hence the parse exception.

After setting Tidy's output encoding to ASCII, Tidy converts non-ASCII characters to entities (character entity references), so the XML parser can read the output properly.

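    // Read the source HTML as UTF-8.
    tidy.setInputEncoding("UTF-8");
    // Write ASCII so that non-ASCII characters are emitted as entities.
    tidy.setOutputEncoding("ASCII");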
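
To illustrate what an ASCII output encoding does to the Japanese text, here is a minimal standard-library sketch. It is not Tidy's actual implementation: the class `EntityEscaper` and the method `toAsciiWithEntities` are hypothetical names made up for this example, and the choice of decimal numeric references (`&#NNNN;`) is an assumption about the output format. The point is only that every character outside the ASCII range is replaced by a reference made of plain ASCII characters, so the result no longer depends on which encoding the XML parser assumes.

```java
final class EntityEscaper {

    private EntityEscaper() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Hypothetical helper: returns a copy of the input in which every code point
     * outside the 7-bit ASCII range is replaced by a decimal numeric character
     * reference. ASCII characters, including markup such as '<' and '&', are
     * copied through unchanged.
     */
    static String toAsciiWithEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate over code points rather than chars so that characters outside
        // the Basic Multilingual Plane become one reference, not two surrogates.
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

Passing a fragment of the product description from the question through `EntityEscaper.toAsciiWithEntities` leaves `270ml` as it is and turns each katakana character into a reference, which an XML parser resolves back to the original character when it loads the document.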