Please look at the following simplified example:

    public static void main(String[] args) {
        String html = "<html>\n" +
                      " <head></head>\n" +
                      " <body>\n" +
                      "  <div> \n" +
                      "   <p> 2 <= X </p> \n" +
                      "  </div>\n" +
                      " </body>\n" +
                      "</html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.select("p").outerHtml());
    }

This prints <p> 2 &lt;= X </p>, but I expect the selected HTML to be printed exactly as it was: <p> 2 <= X </p>. How can I tell jsoup not to convert the '<' symbol?

RedSea
  • Your input is not valid HTML. Jsoup is correct to escape it for you. – Dec 16 '16 at 14:15
  • I have no control over the input. Is there a way to tell jsoup to ignore the validity of the HTML and parse the document as it is, so that I get the output I want? – RedSea Dec 16 '16 at 14:21

2 Answers

It is possible to use jsoup.

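With jsoup 1.8.3 you can use the method Parser.unescapeEntities, which turns escaped entities such as &lt; back into the original characters.

import org.jsoup.parser.Parser;
...
// second argument (inAttribute): whether the string is an attribute value, which is decoded in strict mode
String html = Parser.unescapeEntities(original_html, false);

Some earlier releases do not have this method.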

Read more from this link.

Community
M K
  • Thanks. That did the trick. But what does the second boolean parameter do? – RedSea Dec 16 '16 at 14:36
  • It does not seem to make any difference whether it is set to true or false? – RedSea Dec 16 '16 at 14:40
  • @RedSea - The boolean parameter is described here: https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#unescapeEntities-java.lang.String-boolean- – Naman Dec 16 '16 at 14:41

You could use StringEscapeUtils.unescapeHtml4() from Apache Commons Lang for this:

System.out.println(StringEscapeUtils.unescapeHtml4(doc.select("p").outerHtml()));

http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html#unescapeHtml4(java.lang.String)
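
If adding Commons Lang only for this one call is not desirable, the same idea can be sketched with the standard library alone. This is only a rough illustration: the class and method names (EntityUnescapeSketch, unescapeBasic) are hypothetical and not part of jsoup or Commons Lang, and it assumes that only the five entities listed in the map need to be reversed. Numeric references and all other named entities are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescapeSketch {

    // Hypothetical lookup table: only these five entities are handled (an assumption).
    private static final Map<String, String> ENTITIES = Map.of(
            "lt", "<",
            "gt", ">",
            "amp", "&",
            "quot", "\"",
            "nbsp", "\u00A0");

    // Matches exactly the entity names listed above, including the trailing semicolon.
    private static final Pattern ENTITY = Pattern.compile("&(lt|gt|amp|quot|nbsp);");

    // Hypothetical helper: replaces each known entity with its character in a single pass.
    static String unescapeBasic(String escaped) {
        Matcher m = ENTITY.matcher(escaped);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement text literal
            m.appendReplacement(out, Matcher.quoteReplacement(ENTITIES.get(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped string from the question, as printed by outerHtml()
        String escaped = "<p> 2 &lt;= X </p>";
        System.out.println(unescapeBasic(escaped)); // prints: <p> 2 <= X </p>
    }
}
```

Because the replacement happens in one pass, an input such as `a &amp;lt; b` becomes `a &lt; b` rather than being unescaped twice. For anything beyond these basic entities, one of the library methods above is the safer choice.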

Developer Guy