105

I'm trying to read from a text/plain file over the internet, line-by-line. The code I have right now is:

URL url = new URL("http://kuehldesign.net/test.txt");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
LinkedList<String> lines = new LinkedList();
String readLine;

while ((readLine = in.readLine()) != null) {
    lines.add(readLine);
}

for (String line : lines) {
    out.println("> " + line);
}

The file, test.txt, contains ¡Hélló!, which I am using in order to test the encoding.

When I review the OutputStream (out), I see it as > ¬°H√©ll√≥!. I don't believe this is a problem with the OutputStream since I can do out.println("é"); without problems.

Any ideas for reading from the InputStream as UTF-8? Thanks!

Tiny
Chris Kuehl
  • 1
    The HTTP protocol specifies the encoding. Why aren’t you using a library API that handles that for you? You should never have to guess the encoding like this. I don’t mean to be negative: you’re doing great! I just wonder whether there isn’t an easier way. – tchrist Feb 11 '11 at 01:25
  • 1
    I won't have access to the server which is serving the `text/plain` file, unfortunately, and it's not using a UTF-8 encoding. I wasn't aware of any good network libraries; any suggestions? – Chris Kuehl Feb 11 '11 at 01:39
  • 1
    Looking at the [docs](http://download.oracle.com/javase/6/docs/api/java/net/URL.html), I wouldn’t think you would have to specify the encoding at all. I am surprised they give you a byte stream! You do have access to the underlying [URLConnection](http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html), from which you can check the charset parameter of the Content-Type header, then open an InputStreamReader with the correct argument. A quick check of the source doesn’t turn up anything that seems to do that for you, which seems pretty darned lame and error prone, so I probably missed something. – tchrist Feb 11 '11 at 01:48
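
When a server declares a charset, it travels as a parameter of the `Content-Type` header (for example `text/plain; charset=ISO-8859-1`), which `URLConnection.getContentType()` returns as a string. A rough sketch of picking it out follows; the class `ContentTypeCharset`, the helper `charsetOf`, and the choice of fallback are hypothetical names for illustration, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value, returning the fallback when none is present or usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

With a live connection this might be used as `charsetOf(conn.getContentType(), StandardCharsets.UTF_8)`, passing the result to `new InputStreamReader(conn.getInputStream(), cs)`. It only helps when the server declares the charset correctly.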

4 Answers

208

Solved my own problem. This line:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

needs to be:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));

or since Java 7:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
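
To see why the charset argument matters, here is a small offline sketch that decodes the same bytes twice. An in-memory stream stands in for `url.openStream()`, and ISO-8859-1 is used only as an example of a wrong charset (the garbling in the question appears consistent with a different platform default, Mac OS Roman). The names `DecodeDemo` and `firstLine` are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String firstLine(InputStream in, Charset cs) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The test string from the question, written with escapes so the
        // source file's own encoding cannot interfere
        byte[] utf8 = "\u00A1H\u00E9ll\u00F3!".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 7: one char per original character
        System.out.println(wrong.length()); // 10: each accented character became two
    }
}
```

The single-argument `InputStreamReader` constructor behaves like the second call whenever the platform default charset is not UTF-8.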
tobijdc
Chris Kuehl
  • 4
    I’m pretty sure that form of the constructor won’t raise an exception on invalid input. You need to use the one with a `CharsetDecoder dec` argument. This is the same Java design bug that the `OutputStreamWriter` constructors have: only one of the four actually condescends to tell you when something goes wrong. You again have to use the fancy `CharsetDecoder dec` argument there, too. The only safe and sane thing to do is to consider all other constructors deprecated, because they cannot be trusted to behave. – tchrist Feb 11 '11 at 01:22
  • 7
    Since Java 7 it is possible to provide the Charset as a constant rather than as a String: `StandardCharsets.UTF_8` – tobijdc Apr 16 '15 at 09:14
18
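String file = "";

String UTF8 = "utf8";
int BUFFER_SIZE = 8192;

// try-with-resources closes the reader (and the underlying file) when done
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), UTF8), BUFFER_SIZE)) {
    String str;
    while ((str = br.readLine()) != null) {
        // readLine() strips the line terminator, so lines are joined without separators
        file += str;
    }
} catch (IOException e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
}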

Try this. :-)

Ahmed Ashour
Rohith
    Instead of file += str, create a StringBuilder and append to that. The compiler might be able to optimize the string appending, but it's likely creating a lot of garbage – seand Aug 20 '13 at 19:35
  • 2
    If you want to convert a BufferedReader into a string, use Apache Commons; do not reinvent the wheel: String myStr = org.apache.commons.io.IOUtils.toString(myBufferedReaderInstance); – Jaime Marín Oct 11 '16 at 22:53
  • 8
    UTF8 = "utf8", nice variable ;) – Nicofisi Jan 09 '18 at 21:13
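
The `StringBuilder` suggestion from the comments could look roughly like the sketch below. `ReadAll` and `readAll` are hypothetical names, and joining lines with `\n` is an assumption, since `readLine()` discards the original line terminators.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ReadAll {
    // Reads the whole stream as UTF-8, appending each line to a StringBuilder
    // instead of repeatedly concatenating Strings.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() drops the terminator, so add one back per line
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "first\nsecond".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(sample)));
    }
}
```

For a file, the call would be something like `ReadAll.readAll(new FileInputStream(filename))`.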
12

I ran into the same problem: every special character was rendered as ��. To solve this, I tried the ISO-8859-1 encoding:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("txtPath"), "ISO-8859-1"));
String line;

while ((line = br.readLine()) != null) {
    // process each decoded line here
}
br.close();

I hope this can help anyone who sees this post.

1

If you use the constructor InputStreamReader(InputStream in, Charset cs), bad characters are silently replaced. To change this behaviour, use a CharsetDecoder:

public static Reader newReader(InputStream is) {
  // REPORT makes the decoder signal invalid input instead of replacing it
  return new InputStreamReader(is,
      StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
  );
}

Then catch java.nio.charset.CharacterCodingException.
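
A small sketch contrasting the two behaviours on a deliberately invalid byte sequence. The class `StrictDecodeDemo` and its methods are hypothetical, and returning `null` on bad input is just one way to surface the error for the demonstration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {
    // Decodes one line leniently: malformed bytes become U+FFFD.
    static String lenient(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    // Decodes one line strictly: returns null if the bytes are not valid UTF-8.
    static String strict(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data),
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)))) {
            return br.readLine();
        } catch (CharacterCodingException e) {
            return null; // invalid input was reported instead of replaced
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        System.out.println(lenient(bad).equals("a\uFFFDb")); // true
        System.out.println(strict(bad) == null);             // true
    }
}
```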

grigouille