How to skip invalid characters in stream in Java/Scala?

Question

For example, I have the following code:

Source.fromFile(new File(path), "UTF-8").getLines()

and it throws an exception:

Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:319)

I don't care if some lines are not read, but how can I skip the invalid characters and continue reading lines?

score 33 · Accepted Answer · edited Apr 15 '13 at 02:49

You can influence the way that the charset decoding handles invalid input by calling CharsetDecoder.onMalformedInput.

Usually you never see a CharsetDecoder object directly, because it is created behind the scenes for you. So if you need access to it, you have to use an API that lets you pass the CharsetDecoder itself (instead of just the encoding name or the Charset).

The most basic example of such an API is the InputStreamReader:

InputStream in = ...;
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
Reader reader = new InputStreamReader(in, decoder);

Note that this code uses the Java 7 class StandardCharsets. For earlier versions, you can simply replace it with Charset.forName("UTF-8") (or use the Charsets class from Guava).
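To make the snippet above concrete, here is a minimal, self-contained sketch that reads lines through such a decoder. The class name LenientLineReader, the method name, and the sample bytes are made up for illustration; only the JDK calls come from the answer. It also sets onUnmappableCharacter, which the answer does not mention, so treat that as an optional extra.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
public class LenientLineReader {

    // Reads every line from the stream as UTF-8 and silently drops
    // byte sequences that are not valid UTF-8.
    static List<String> readLinesIgnoringMalformed(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);

        List<String> lines = new ArrayList<>();
        // The InputStreamReader constructor that takes a CharsetDecoder is the
        // one that lets us control the error handling.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample input: "ab", a stray 0xFF byte (never valid in UTF-8), "cd", newline, "ef".
        byte[] data = {'a', 'b', (byte) 0xFF, 'c', 'd', '\n', 'e', 'f'};
        List<String> lines = readLinesIgnoringMalformed(new ByteArrayInputStream(data));
        System.out.println(lines); // prints [abcd, ef]
    }
}
```

For a real file you would pass a FileInputStream instead of the in-memory stream used here.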

edited Apr 15 '13 at 02:49

Jeanne Boyarsky

answered Sep 02 '11 at 08:37

Joachim Sauer

+1 for introducing me to StandardCharsets. I've wanted that for so long. No more `catch (UnsupportedEncodingException e) { // never happens }` – Thilo Sep 02 '11 at 08:39
@Thilo: if you're stuck with Java 6, then Guava provides [the `Charsets` class](http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/Charsets.html) which does the same thing. – Joachim Sauer Sep 02 '11 at 08:40
Note: If you’re writing files you can get a similar error. You can set the same onMalformedInput: IGNORE on the CharsetEncoder too. – Richard EB Dec 11 '21 at 20:57
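
As a rough illustration of the encoder side mentioned in the comment above, the sketch below configures a CharsetEncoder the same way. The class and method names are hypothetical. On the encoding side, "malformed input" means an unpaired surrogate, while a character the target charset cannot represent is reported as "unmappable", so the sketch sets both actions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class LenientEncoder {

    // Encodes the text and drops anything the encoder cannot handle
    // instead of throwing a CharacterCodingException.
    static byte[] encodeLeniently(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)        // e.g. a lone surrogate
                .onUnmappableCharacter(CodingErrorAction.IGNORE);  // e.g. the euro sign in ISO-8859-1
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // A lone high surrogate between 'a' and 'b' is malformed UTF-16 input.
        byte[] utf8 = encodeLeniently("a\uD800b", StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // prints ab

        // U+20AC (euro sign) has no ISO-8859-1 representation.
        byte[] latin1 = encodeLeniently("10\u20AC", StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // prints 10
    }
}
```

To write to a stream with the same behaviour, the configured encoder can be passed to the OutputStreamWriter constructor that accepts a CharsetEncoder.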

score 13 · Answer 2 · answered Sep 02 '11 at 16:07

Well, if it isn't UTF-8, it is something else. The trick is finding out what that something else is, but if all you want is to avoid the errors, you can use an encoding in which every byte is valid, such as latin1:

Source.fromFile(new File(path), "latin1").getLines()
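
The trade-off of this approach can be seen in a small Java sketch (the class name and the sample string are made up for illustration). ISO-8859-1 maps each byte to exactly one character, so decoding cannot fail, but any multi-byte UTF-8 sequence in the input comes out as several wrong characters rather than being skipped.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class Latin1Fallback {

    // Decoding as ISO-8859-1 never throws: every byte value 0x00..0xFF
    // maps directly to the code point with the same number.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" in UTF-8 is 5 bytes: 63 61 66 C3 A9.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String decoded = decodeAsLatin1(utf8);

        System.out.println(decoded.length());                    // prints 5 (one char per byte)
        System.out.println(decoded.equals("caf\u00C3\u00A9"));   // prints true: the é became two chars
    }
}
```

So this works well when the non-ASCII content does not matter, and less well when it does.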

answered Sep 02 '11 at 16:07

Daniel C. Sobral

Unfortunately, some sources provide "mostly UTF-8" data that contains malformed input. In those cases it might be acceptable to skip the broken characters and still decode the correct ones. – Joachim Sauer Oct 03 '18 at 08:09

score 1 · Answer 3 · answered Aug 13 '13 at 20:28

I had a similar issue, and one of Scala's built-in codecs did the trick for me:

Source.fromFile(new File(path))(Codec.ISO8859).getLines()

answered Aug 13 '13 at 20:28

Assaf Israel

whoa, I have no idea why this worked, but you saved my evening! – habitats Sep 16 '15 at 21:32

score 0 · Answer 4 · answered Jul 19 '15 at 16:03

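If you want to tolerate invalid characters in Scala, I found this worked for me. Note that CodingErrorAction.REPLACE substitutes the replacement character (U+FFFD) for bad input rather than dropping it; use CodingErrorAction.IGNORE if you want it skipped.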

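import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

object HelloWorld {

  def main(args: Array[String]): Unit = {
    // Source.fromURL picks up this implicit Codec.
    implicit val codec: Codec = Codec("UTF-8")

    // Substitute U+FFFD for bad input instead of throwing MalformedInputException.
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    val dataSource = Source.fromURL("https://www.foo.com")

    try {
      for (line <- dataSource.getLines()) {
        println(line)
      }
    } finally {
      dataSource.close() // release the underlying stream
    }
  }
}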
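
For comparison, the three CodingErrorAction values behave differently on the same bad input. The Java sketch below (class name, method name, and sample bytes are assumptions for illustration) decodes one malformed byte sequence with each action using the plain JDK decoder, which is also what the Scala Codec configures underneath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class ErrorActionDemo {

    // Decodes the bytes as UTF-8, applying the given action to both
    // malformed input and unmappable characters.
    static String decodeUtf8(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in valid UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        try {
            // IGNORE drops the bad byte: result is "ab" (length 2).
            System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE).length());
            // REPLACE inserts U+FFFD: result is "a\uFFFDb" (length 3).
            System.out.println(decodeUtf8(bad, CodingErrorAction.REPLACE).length());
            // REPORT is the strict behaviour and throws MalformedInputException.
            decodeUtf8(bad, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("REPORT threw " + e.getClass().getSimpleName());
        }
    }
}
```

IGNORE loses the information that something was wrong with the line, whereas REPLACE leaves a visible marker, which can be easier to debug later.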
