I am attempting to validate that the files I am ingesting are all strictly UTF-8 compliant, and from my reading so far I have concluded that, for the validation to be done correctly, the original, untampered bytes of the data must be analyzed. If you examine the String after the fact, you are unlikely to detect non-UTF-8 input, because Java will already have converted (or replaced) the offending bytes during decoding.
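If the raw bytes are indeed what has to be checked, I assume a strict check would look roughly like this sketch (Utf8Check and isValidUtf8 are placeholder names, and it reads the whole file into memory for simplicity, which is not what I want for my streaming setup):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    // Strict check: decode the raw bytes and have the decoder REPORT
    // (throw) on any malformed sequence instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // invalid byte sequence encountered
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(raw) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}

That works on bytes, though, not on the stream I actually read from.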
I am reading the files normally: I receive an InputStream from the file, feed it to an InputStreamReader, and then feed that to a BufferedReader. It would look something like:
InputStream is = new FileInputStream(fileLocation);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);
I can override the BufferedReader class to add some validation for each character it encounters.
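To make that concrete, the kind of override I had in mind would look roughly like this (ValidatingBufferedReader is just a placeholder name, and checking for U+FFFD only catches characters the decoder has already replaced, which is exactly my concern):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical subclass: inspects every char returned by this read overload.
// Note that readLine() and the single-char read() do not necessarily
// route through this method, so they would not be covered.
class ValidatingBufferedReader extends BufferedReader {

    ValidatingBufferedReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            // U+FFFD is the replacement character the decoder substitutes
            // for malformed input, so it only hints that something was lost.
            if (cbuf[i] == '\uFFFD') {
                throw new IOException("possibly malformed input near char " + i);
            }
        }
        return n;
    }
}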
The issue is that BufferedReader works with a char[] buffer, not a byte[] one. That means the bytes have already been decoded into chars by the InputStreamReader before BufferedReader ever sees them.
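For example, if I push a deliberately invalid byte sequence through a plain InputStreamReader, the bad byte is replaced before any char[] exists (ReplacementDemo is just an illustrative name):

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        // 0xC3 followed by 0x28 is not a valid UTF-8 sequence.
        byte[] invalid = { (byte) 0xC3, (byte) 0x28 };
        InputStreamReader isr = new InputStreamReader(
                new ByteArrayInputStream(invalid), StandardCharsets.UTF_8);
        int c;
        while ((c = isr.read()) != -1) {
            // Prints: fffd 0028, i.e. the malformed byte was silently
            // replaced with U+FFFD before I ever saw a char.
            System.out.printf("%04x ", c);
        }
    }
}

From the chars alone I cannot tell whether the input genuinely contained U+FFFD or whether the decoder put it there.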
So, my question is: can this validation be done at the char[] level inside BufferedReader? Although I am somewhat "enforcing" UTF-8 here:
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
I am still seeing characters from non-UTF-8 input (say, UTF-16) get silently converted to UTF-8, which breaks some downstream systems. My suspicion is that the char[] is basically "too late" for this validation. Is it truly?