
I am trying to read the HTML content from a URL. When I print the content to the console, umlauts ("Umlaute") such as ä, ö, ü are displayed incorrectly.

URL url = new URL("http://www.lauftreff.de/laeufe/halbmarathon-1-2017.html");
URLConnection conn = url.openConnection();
InputStreamReader input = new InputStreamReader(conn.getInputStream(),StandardCharsets.ISO_8859_1);
BufferedReader bi = new BufferedReader(input);
String inputLine;
while((inputLine = bi.readLine()) != null){
    System.out.println(inputLine);
}

The charset declared in the header of the HTML is ISO-8859-1. Reading it as UTF-8 does not work either. Does anyone have an idea what to do?

Sigma

1 Answer


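On that website the umlauts are encoded as HTML entities (such as &uuml; for ü), so the charset is not the problem: you need to decode the entities. The code below, which uses StringEscapeUtils from Apache Commons, should work, but it is untested.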

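// Assumption: StringEscapeUtils comes from Apache Commons Text
// (the same method also exists in org.apache.commons.lang3, where it is deprecated).
import org.apache.commons.text.StringEscapeUtils;

URL url = new URL("http://www.lauftreff.de/laeufe/halbmarathon-1-2017.html");
URLConnection conn = url.openConnection();
InputStreamReader input = new InputStreamReader(conn.getInputStream(), StandardCharsets.ISO_8859_1);
BufferedReader bi = new BufferedReader(input);
String inputLine;
while ((inputLine = bi.readLine()) != null) {
    // Turn entities such as &auml; back into the characters they stand for.
    inputLine = StringEscapeUtils.unescapeHtml4(inputLine);
    System.out.println(inputLine);
}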
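If adding a library is not an option, the same idea can be sketched with the JDK alone. This is only an illustration of what entity decoding does, not a replacement for unescapeHtml4: the class name UmlautEntityDecoder and its decode method are made up for this example, the table covers only the German umlaut entities plus &amp;, and it assumes Java 9 or later (for Map.of and the StringBuilder overload of appendReplacement).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of any library.
class UmlautEntityDecoder {
    // Deliberately tiny table: only the German entities relevant here plus &amp;.
    // A complete decoder needs the full HTML named-entity list.
    private static final Map<String, String> NAMED = Map.of(
            "auml", "\u00e4", "ouml", "\u00f6", "uuml", "\u00fc",
            "Auml", "\u00c4", "Ouml", "\u00d6", "Uuml", "\u00dc",
            "szlig", "\u00df", "amp", "&");

    // Matches &name;, decimal &#228; and hexadecimal &#xE4; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z]+);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(2), 16), m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(1)), m.group());
            } else {
                // Unknown names are left exactly as they appeared in the input.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to text, or keeps the original if it is not a valid code point.
    private static String fromCodePoint(int codePoint, String original) {
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }

    public static void main(String[] args) {
        System.out.println(decode("L&auml;ufe in M&uuml;nchen"));
    }
}
```

In the loop above, inputLine = UmlautEntityDecoder.decode(inputLine); would take the place of the unescapeHtml4 call. If the console still shows wrong characters after decoding, the console's own encoding is a separate thing to check.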
Chrisstar