InputStream/byte[] is binary, Reader/String is text. The bridging InputStreamReader has an optional parameter for the encoding to be used in the conversion. The encoding defaults to the platform encoding.
InputStream in = new FileInputStream(myFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
Testing a file for its encoding is an art in itself. Violations of the UTF-8 multibyte encoding can be detected. UTF-16LE and UTF-16BE can often be detected by 0x00 bytes at odd or even positions when the text contains ASCII. I combined finding the encoding with identifying the language, as detecting words with non-ASCII characters can help narrow down the encoding used. Using the 100 most frequent words of every language, plus the encodings in common use per language, already helps.
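A minimal sketch of the two heuristics mentioned above: a strict UTF-8 validity check via a CharsetDecoder configured to report malformed input, and a zero-byte count for UTF-16LE. The class name, method names, and the one-quarter threshold are my own illustrative choices, not a fixed algorithm:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingHeuristics {

    /** Returns true when the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Heuristic: mostly-ASCII UTF-16LE text has 0x00 at odd positions. */
    static boolean looksLikeUtf16le(byte[] bytes) {
        int zerosAtOdd = 0;
        for (int i = 1; i < bytes.length; i += 2) {
            if (bytes[i] == 0) zerosAtOdd++;
        }
        // Assumed threshold: more than a quarter of the bytes are such zeros.
        return bytes.length >= 2 && zerosAtOdd > bytes.length / 4;
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "héllo".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16le = "hello".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(utf8));         // true
        System.out.println(isValidUtf8(latin1));       // false: lone 0xE9 is malformed
        System.out.println(looksLikeUtf16le(utf16le)); // true
    }
}
```

Note the asymmetry this exploits: most ISO-8859-1 text containing non-ASCII bytes is invalid as UTF-8, while valid UTF-8 is rarely produced by accident.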
You need to work on bytes: call getChannel() for a FileChannel and then use a ByteBuffer. Simply reading the first 4 KB will cause problems at the last bytes: a UTF-8 sequence could be truncated, or a UTF-16 pair split.
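One way to handle that truncation problem is to read a byte sample and then drop a possibly incomplete trailing UTF-8 sequence before decoding. The class name, method names, and 4 KB sample size below are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class SampleReader {

    /** Reads up to maxBytes from the start of the file. */
    static byte[] readSample(Path file, int maxBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
            channel.read(buffer);
            buffer.flip();
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        }
    }

    /** Drops a trailing, possibly truncated UTF-8 multibyte sequence. */
    static byte[] trimIncompleteUtf8Tail(byte[] bytes) {
        int end = bytes.length;
        // Walk back over continuation bytes (10xxxxxx) to the last lead byte.
        int i = end - 1;
        while (i >= 0 && (bytes[i] & 0xC0) == 0x80) i--;
        if (i < 0) return bytes;
        int lead = bytes[i] & 0xFF;
        // Expected sequence length from the lead byte's high bits.
        int expected = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        return (end - i) < expected ? Arrays.copyOfRange(bytes, 0, i) : bytes;
    }

    public static void main(String[] args) {
        byte[] full = "abcé".getBytes(StandardCharsets.UTF_8); // 5 bytes
        byte[] cut = Arrays.copyOfRange(full, 0, 4);           // splits the 2-byte é
        System.out.println(trimIncompleteUtf8Tail(cut).length); // prints 3
    }
}
```

For example, readSample(path, 4096) followed by trimIncompleteUtf8Tail gives a sample that is safe to feed to a strict UTF-8 decoder. For UTF-16 the analogous fix is to truncate the sample to an even length.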
There are Charset constants in StandardCharsets, but only for the standard charsets that are guaranteed to be available with every Java SE installation (StandardCharsets.UTF_8 and StandardCharsets.ISO_8859_1, for instance). That is not very useful in your case, but you can test which charsets are available:
Charset.availableCharsets()
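A short sketch of both ways to check: Charset.isSupported(name) for a single charset, and Charset.availableCharsets(), which returns a map from canonical names to Charset instances. The specific names queried are just examples:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Single lookup; typically true on desktop JDKs, but not guaranteed.
        System.out.println(Charset.isSupported("windows-1252"));

        // Enumerate all installed charsets whose canonical name starts with "UTF".
        Charset.availableCharsets().keySet().stream()
                .filter(name -> name.startsWith("UTF"))
                .forEach(System.out::println);
    }
}
```

Prefer Charset.isSupported for a single check; building the full map in availableCharsets() is comparatively expensive.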