IText Pdf creation from Html fails if HTML contains special/illegal characters

Question

I am using iText to create a PDF from HTML content. I build the HTML as a table using a Java StringBuffer. A Map contains the files' metadata as key-value pairs, and I iterate over these keys and values to build the HTML table. The problem is that some of the metadata values in the map are meaningless or invalid symbols, so PDF creation fails with the following exception.

java.io.IOException: Expected > for tag: <{1}/> near line 1, column 717
    at com.lowagie.text.xml.simpleparser.SimpleXMLParser.throwException(SimpleXMLParser.java:568)
    at com.lowagie.text.xml.simpleparser.SimpleXMLParser.go(SimpleXMLParser.java:331)
    at com.lowagie.text.xml.simpleparser.SimpleXMLParser.parse(SimpleXMLParser.java:579)
    at com.lowagie.text.html.simpleparser.HTMLWorker.parse(HTMLWorker.java:141)


The content that caused the exception is:
“$é6èŽšÆuCÅ ©À SÀF;r 1Ì/XQ‡,Ô<ÒÐ"‡(¢ËÄòÅ1¡Ø€ÌÅc

So my question is: what are these characters (non-ASCII, unsupported in UTF)? Is there any way to identify and skip them while building the HTML?

The only bad character here is the `<`, which should not appear unescaped in your HTML. Converting it to its proper escaped form `&lt;` should fix it. - Jongware, Sep 26 '14 at 09:30
@Jongware: I am escaping all possible HTML characters. After escaping, the content is "“$é6èŽšÆuCÅ ©À SÀF;r 1Ì/XQ‡,Ô<ÒÐ"‡(¢ËÄòÅ1¡Ø€ÌÅc". Even then it fails. - Vijay, Sep 26 '14 at 09:38
"It fails" is **not** a helpful description of the problem. Your original error was `Expected > for tag`; surely you must be getting a new error message? - Jongware, Sep 26 '14 at 09:51

score 2 · Answer 1 · answered Sep 26 '14 at 10:36

It is difficult to identify and skip such characters at runtime while building the HTML. You can use Apache commons-lang to escape the HTML instead:

StringEscapeUtils.escapeHtml("“$é6èŽšÆuCÅ ©À SÀF;r 1Ì/XQ‡,Ô<ÒÐ\"‡(¢ËÄòÅ1¡Ø€ÌÅc")

The output of the above is

&ldquo;$&eacute;6&egrave;&#381;&scaron;&AElig;uC&Aring; &copy;&Agrave; S&Agrave;F;r 1&Igrave;/XQ&Dagger;,&Ocirc;&lt;&Ograve;&ETH;&quot;&Dagger;(&cent;&Euml;&Auml;&ograve;&Aring;1&iexcl;&Oslash;&euro;&Igrave;&Aring;c
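If adding commons-lang is not an option, the same idea can be sketched with the JDK alone. The class and method names below (MetadataHtmlTable, escape, buildTable) are hypothetical and not part of iText or commons-lang. The sketch escapes the five markup characters, drops control characters that XML does not allow, and writes every other non-ASCII character as a numeric character reference rather than a named entity. Whether the parser and the PDF font in use handle those references is an assumption to verify.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the HTML table from metadata with every
// key and value escaped before it is appended to the markup.
final class MetadataHtmlTable {

    // Escapes one value so it cannot be mistaken for markup.
    static String escape(String raw) {
        if (raw == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    if (cp == '\t' || cp == '\n' || cp == '\r') {
                        out.append((char) cp);          // whitespace XML allows
                    } else if (cp < 0x20 || cp == 0x7F
                            || (cp >= 0xD800 && cp <= 0xDFFF)
                            || cp == 0xFFFE || cp == 0xFFFF) {
                        // control chars, unpaired surrogates, noncharacters: skip
                    } else if (cp < 0x7F) {
                        out.append((char) cp);          // printable ASCII
                    } else {
                        out.append("&#").append(cp).append(';'); // e.g. &#212;
                    }
            }
        }
        return out.toString();
    }

    // Builds a two-column table, one row per metadata entry.
    static String buildTable(Map<String, String> metadata) {
        StringBuilder html = new StringBuilder("<table>");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            html.append("<tr><td>").append(escape(e.getKey()))
                .append("</td><td>").append(escape(e.getValue()))
                .append("</td></tr>");
        }
        return html.append("</table>").toString();
    }
}
```

The important point is that escaping happens per value, before the value is appended to the markup. Escaping the finished HTML string would also escape the table tags themselves.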
