Reading entire html file to String?

Question

Are there better ways to read an entire HTML file into a single String variable than:

    String content = "";
    try {
        BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
        String str;
        while ((str = in.readLine()) != null) {
            content +=str;
        }
        in.close();
    } catch (IOException e) {
    }

score 29 · Answer 1 · answered Aug 20 '12 at 09:42

You should use a StringBuilder:

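StringBuilder contentBuilder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new FileReader("mypage.html"))) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() strips the line terminator, so lines are joined without newlines
        contentBuilder.append(str);
    }
} catch (IOException e) {
    e.printStackTrace(); // don't silently swallow I/O errors
}
String content = contentBuilder.toString();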

answered Aug 20 '12 at 09:42

Jean Logeart

For me this doesn't work: the whole HTML content comes back as one single string, and in.readLine() just reads the whole content on the first call – narancs Aug 04 '17 at 20:34
how does it know where mypage.html is located? – CraZyDroiD Jan 22 '19 at 05:54
@CraZyDroiD you need to pass the path relative to the project root folder. For example, if your mypage.html is located inside the root folder, right next to /src, you can just use "mypage.html", but if you put it in a folder, you have to reference that folder too, as in "myfolder/mypage.html" (without a leading slash, which would make the path absolute) – Lucas Mendonca May 23 '21 at 16:59
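
To see where a relative name such as "mypage.html" actually points, here is a minimal sketch using only the JDK. It assumes the usual behavior that relative paths resolve against the JVM's working directory (often the project root when launched from an IDE, but that depends on the run configuration). The class name RelativePathDemo and the helper resolve are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class RelativePathDemo {
    // Hypothetical helper: turns a relative name into the absolute path it
    // resolves to, i.e. against the JVM's working directory.
    static Path resolve(String name) {
        return Paths.get(name).toAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("mypage.html -> " + resolve("mypage.html"));
        System.out.println("myfolder/mypage.html -> " + resolve("myfolder/mypage.html"));
    }
}
```

Printing the resolved path before opening the file is a quick way to find out why a FileReader throws FileNotFoundException.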

Johan Sjöberg · Accepted Answer · 2012-08-20T09:47:21.623

28

There's the IOUtils.toString(..) utility from Apache Commons.

If you're using Guava there's also Files.readLines(..) and Files.toString(..).

edited Aug 20 '12 at 09:47

answered Aug 20 '12 at 09:39

Johan Sjöberg

The first link is dead – SpringLearner May 18 '16 at 11:31
Both links are dead now. – Muhammad Ramzan Jan 14 '19 at 14:30

score 7 · Answer 3 · answered Aug 20 '12 at 09:43

You can use Jsoup.
It's a very capable HTML parser for Java.

answered Aug 20 '12 at 09:43

SAbbasizadeh

score 5 · Answer 4 · answered Sep 03 '18 at 19:15

As Jean mentioned, using a StringBuilder instead of += would be better. But if you're looking for something simpler, Guava, IOUtils, and Jsoup are all good options.

Example with Guava:

String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();

Example with IOUtils:

// new URL(..) needs a protocol; a bare "/path/..." throws MalformedURLException
InputStream in = new URL("file:///path/to/mypage.html").openStream();
String content;

try {
    content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
    IOUtils.closeQuietly(in);
}

Example with Jsoup:

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();

or

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();

NOTES:

Files.readLines() and Files.toString()

These are now deprecated as of Guava release version 22.0 (May 22, 2017). Files.asCharSource() should be used instead as seen in the example above. (version 22.0 release diffs)

IOUtils.toString(InputStream) and Charsets.UTF_8

Deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed the InputStream and the Charset as seen in the example above. Java 7's StandardCharsets should be used instead of Charsets as seen in the example above. (deprecated Charsets.UTF_8)
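
If you would rather avoid a third-party dependency entirely, the same StandardCharsets constant also works with the JDK's own java.nio.file.Files (Java 7+). This is only a sketch, not something taken from the libraries above: the class name ReadWholeFile and the helper readUtf8 are made up, and it assumes the file is UTF-8 and small enough to fit in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadWholeFile {
    // Hypothetical helper: reads all bytes of the file and decodes them as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "mypage.html" is the file name used in the question.
        String content = readUtf8(Paths.get("mypage.html"));
        System.out.println(content);
    }
}
```

Unlike a readLine() loop, this keeps the original line terminators.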

score 4 · Answer 5 · edited Jan 07 '20 at 19:36

I prefer using Guava:

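import com.google.common.base.Charsets;
import com.google.common.io.Files;
import java.io.File;

File file = new File("/path/to/file");
// the charset is an argument of Files.toString(..), not of the File constructor
String content = Files.toString(file, Charsets.UTF_8);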

edited Jan 07 '20 at 19:36

Jake Perkins

answered Aug 20 '12 at 09:46

jknair

Note: a ) is missing after the filepath. – logi0517 May 15 '19 at 09:34

score 3 · Answer 6 · answered Aug 20 '12 at 09:42

For string operations, use the StringBuilder or StringBuffer class to accumulate blocks of string data. Do not use += on String objects: String is immutable, so each += creates a new String object, and producing a large number of them at runtime hurts performance.

Use the .append() method of a StringBuilder/StringBuffer instance instead.
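
A minimal sketch of that advice, under a few assumptions: the class name AppendDemo and the helper readLines are made up for illustration, and a '\n' is appended after every line because readLine() strips the terminator (drop that call if you really want everything on one line).

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

class AppendDemo {
    // Hypothetical helper: accumulates all lines of the reader in one StringBuilder.
    static String readLines(Reader source) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // append() reuses one growing buffer instead of creating a new String per line
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // "mypage.html" is the file name used in the question.
        System.out.println(readLines(new FileReader("mypage.html")));
    }
}
```

StringBuilder is usually the one to pick here; StringBuffer only adds synchronization, which a local variable does not need.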

score 0 · Answer 7 · answered Nov 21 '18 at 21:37

Here's a solution to retrieve the HTML of a web page using only standard Java libraries:

import java.io.*;
import java.net.*;

String urlToRead = "https://google.com";
URL url; // The URL to read
HttpURLConnection conn; // The actual connection to the web page
BufferedReader rd; // Used to read results from the web page
String line; // An individual line of the web page HTML
String result = ""; // A long string containing all the HTML
try {
 url = new URL(urlToRead);
 conn = (HttpURLConnection) url.openConnection();
 conn.setRequestMethod("GET");
 rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
 while ((line = rd.readLine()) != null) {
  result += line;
 }
 rd.close();
} catch (Exception e) {
 e.printStackTrace();
}

System.out.println(result);

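
The loop above drops line terminators and decodes with the platform default charset. As a hedged variation (not part of the linked source), here is a sketch that reads any InputStream in chunks with an explicit charset. The class name StreamToString and the helper readFully are made up, and UTF-8 is an assumption; a real page may declare a different encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamToString {
    // Hypothetical helper: decodes the whole stream with the given charset,
    // keeping line terminators exactly as they appear in the input.
    static String readFully(InputStream in, Charset charset) throws IOException {
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                result.append(buffer, 0, n); // append only the chars actually read
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream() so the sketch runs without a network.
        InputStream in = new ByteArrayInputStream("<p>a</p>\n<p>b</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFully(in, StandardCharsets.UTF_8));
    }
}
```

With the connection code above, the call would be readFully(conn.getInputStream(), StandardCharsets.UTF_8).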
score 0 · Answer 8 · answered May 25 '21 at 13:40

import org.apache.commons.io.IOUtils;
import java.io.IOException;

try {
    var content = new String(IOUtils.toByteArray(this.getClass().getResource("/index.html")));
} catch (IOException e) {
    e.printStackTrace();
}

// Java 10 code above, assuming index.html is available inside the resources folder.
