I have a resource (a static HTML page) that I want to use in a test. But when I load the static page, some characters come back wrongly encoded. I tried the StringEscapeUtils class, but it doesn't help. My function:

  private HtmlPage getStaticPage() throws IOException, ClassNotFoundException {
    final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");
    final StringWebResponse response = new StringWebResponse(StringEscapeUtils.unescapeHtml4(IOUtils.toString(reader)), StandardCharsets.UTF_8, new URL(URL_PAGE));
    return HTMLParser.parseHtml(response, WebClientFactory.getInstance().getCurrentWindow());
  }

import org.apache.commons.lang3.StringEscapeUtils;

laaf
  • What does "doesn't work" mean? Can you attach your page? What version of HtmlUnit do you use? – RBRi Mar 07 '18 at 16:37
  • It doesn't work because the page comes back with the characters garbled in the same way. I can't attach the page (it's confidential). The HtmlUnit version is 2.25. Some data from the HTML document: – laaf Mar 07 '18 at 17:00

1 Answer

final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");

For the reader, use the encoding of the file itself (from your comment, I guess this is windows-1252 in your case). Then read the file into a string (e.g. with commons.io).
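To illustrate why the reader's charset matters, here is a minimal JDK-only sketch. The class name CharsetDemo, the helper readAll, and the sample text are made up for illustration, and it assumes the JVM provides the windows-1252 charset (common JDKs do, but it is not one of the charsets every JVM must support). The same bytes are decoded once with the charset they were written in and once with UTF-8:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringWriter out = new StringWriter();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the charset is available on this JVM.
        Charset windows1252 = Charset.forName("windows-1252");

        // Simulates the content of a file that was saved as windows-1252.
        byte[] fileBytes = "caf\u00e9".getBytes(windows1252);

        // Decoding with the file's real charset restores the original text.
        String right = readAll(new ByteArrayInputStream(fileBytes), windows1252);

        // Decoding the same bytes as UTF-8 does not: the byte for the accented
        // letter is not valid UTF-8 on its own.
        String wrong = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);

        System.out.println("decoded as windows-1252 matches original: " + right.equals("caf\u00e9"));
        System.out.println("decoded as UTF-8 matches original: " + wrong.equals("caf\u00e9"));
    }
}
```

In your test, the stream would come from getResourceAsStream instead of a byte array. Unescaping HTML entities afterwards cannot repair text that was already decoded with the wrong charset.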

Then you can process it like this:

final StringWebResponse tmpResponse = new StringWebResponse(anHtmlCode,
    new URL("http://www.wetator.org/test.html"));
final WebClient tmpWebClient = new WebClient(aBrowserVersion);
try {
  final HtmlPage tmpPage = HTMLParser.parseHtml(tmpResponse, tmpWebClient.getCurrentWindow());
  return tmpPage;
} finally {
  tmpWebClient.close();
}

If you still have problems, please build a simple sample from your page that shows the issue and upload it here together with your code.

RBRi
  • Thanks, I just changed the reader line to: final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "Windows-1252") – laaf Mar 08 '18 at 11:13