
I am attempting to validate that the files I am ingesting are all strictly UTF-8 compliant, and from my reading I have concluded that, for the validation to be done correctly, the original, untampered bytes of the data must be analyzed. If one inspects the resulting string after the fact, they are unlikely to find non-compliant characters, because Java will already have converted or substituted them during decoding.

I am reading the files normally: I receive an InputStream from the file, feed it to an InputStreamReader, then feed that to a BufferedReader. It looks something like:

InputStream is = new FileInputStream(fileLocation);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);

I can override the BufferedReader class to add some validation for each character it stumbles across.

The issue is that BufferedReader has a char[], not a byte[], for the buffer. That means the bytes get auto-converted to chars.

So, my question is: can this validation be done at the char[] level located in BufferedReader? Although I am somewhat "enforcing" UTF8 here:

InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);

I am seeing characters get transformed from non-UTF-8 encodings (say, UTF-16) into UTF-8, which is breaking some downstream systems. I suspect the char[] is basically "too late" for this validation. Is it truly?

John Lexus
    "*can this validation be done at the `char[]` level located in `BufferedReader`?*" - no. By that time, the UTF-8 has already been parsed and decoded into the `char[]`. If there was a problem, the `char[]` will not have the correct characters in it. So, you really need to validate the original `byte[]` instead. – Remy Lebeau Jul 14 '21 at 21:52
  • @RemyLebeau thanks for the reply! – John Lexus Jul 14 '21 at 22:08
  • Hey @RemyLebeau, I have a question. If a file contains UTF16 encoded glyphs, and I set BufferedReader to use UTF8, I don't get an error, and the value is returned normally in a string when I call `readLine()`. What is happening here? – John Lexus Jul 15 '21 at 14:24
  • I assume you meant `InputStreamReader` instead of `BufferedReader`? The `BufferedReader` will just store and return whatever UTF-16 codeunits the `InputStreamReader`'s assigned charset outputs from decoding the bytes into characters. If the encoding is mismatched, wrong characters can be returned or even omitted. By default, decoding will return `'\uFFFD'` characters for decoding errors. If you want to change that behavior, create the `InputStreamReader` using a `CharsetDecoder` whose `malformedInput` and `unmappableCharacter` actions have been customized as needed. – Remy Lebeau Jul 15 '21 at 15:49
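To make the comment above concrete, here is a minimal sketch of the strict setup described there: a CharsetDecoder with both error actions set to REPORT, so malformed bytes raise an exception instead of silently becoming U+FFFD. The two-byte array is just an invented invalid sample; in real code the bytes would come from the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Read {
    public static void main(String[] args) throws IOException {
        // REPORT makes the decoder throw instead of substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // 0xC3 needs a continuation byte; 0x28 is not one
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bad), decoder))) {
            br.readLine();
            System.out.println("valid");
        } catch (MalformedInputException e) {
            // MalformedInputException extends IOException, so readLine() propagates it
            System.out.println("malformed at byte " + e.getInputLength());
        }
    }
}
```

With this setup the validation happens at the byte level, inside the decoder, before any char[] exists.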

1 Answer


Define "UTF-8 compliant". There are 2 events that you can reasonably call 'invalid'. UTF-8 as a format converts numbers (code points) into byte sequences, and can't convert just any number, only a limited range (but every number that could possibly come up in Unicode can be converted).

  • A valid conversion for a non-existing glyph.

Not every number that UTF-8 can structurally encode is actually an assigned Unicode code point. However, Unicode expands all the time: what isn't valid today might be valid tomorrow. There is no real way to know this stuff unless you have the entire Unicode table loaded.
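One hedged middle ground: the JDK itself ships a Unicode table (for whatever Unicode version that JDK release targets), exposed through Character.isDefined, so you can at least flag code points that were unassigned as of your JDK's table. A sketch; U+0378 is chosen here only because it is unassigned at the time of writing, which a future Unicode version could change:

```java
public class DefinedCheck {
    public static void main(String[] args) {
        // U+0041 'A' is an assigned code point in every Unicode version
        System.out.println(Character.isDefined(0x41));
        // U+0378 is unassigned as of current Unicode versions, so this is
        // expected to print false on today's JDKs, but that may change
        System.out.println(Character.isDefined(0x0378));
    }
}
```

This check is only as current as the JDK you run on, which is exactly the caveat the answer raises.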

  • An invalid sequence

Usually when converting bytes to text (char, String, Reader, Writer, StringBuilder - anything that is character oriented), if you attempt to convert an invalid byte sequence, you either get an exception or, if the process is in lenient mode, each failure is replaced with a character that means 'this was not valid' (U+FFFD, the replacement character).

If the exception occurred, then you couldn't possibly have a char array (the exception was thrown instead of a char array being returned). If it didn't, you have that replacement glyph in your characters, so just search for that.
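A minimal sketch of that search, relying on the fact that Java's default lenient decoding (for example, the String(byte[], Charset) constructor) substitutes U+FFFD for malformed input. The byte array is an invented invalid sample:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementScan {
    public static void main(String[] args) {
        byte[] bad = {'h', 'i', (byte) 0xFF, '!'}; // 0xFF can never appear in valid UTF-8
        // Default decoding is lenient: each malformed sequence becomes U+FFFD
        String s = new String(bad, StandardCharsets.UTF_8);
        System.out.println(s.indexOf('\uFFFD') >= 0 ? "invalid input" : "ok");
    }
}
```

One caveat: a genuinely valid file may contain a literal U+FFFD of its own, so byte-level validation with a REPORT-mode decoder is the stricter and more reliable of the two approaches.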

rzwitserloot
  • Those are some solid points. I appreciate it. I think I am mostly interested in the scenario of "valid conversions for non-existing glyphs." I take your point though; I'd have to have the entire unicode table loaded. I suppose this requires more effort than is worth it. Thanks for your insight. – John Lexus Jul 14 '21 at 16:30
  • Of course, the class that's going to give you most control is [java.nio.charset.CharsetDecoder](https://docs.oracle.com/javase/10/docs/api/java/nio/charset/CharsetDecoder.html) – g00se Jul 14 '21 at 17:18