In Java, there are a couple of libraries for detecting the encoding of text files, such as Google's juniversalchardet and TikaEncodingDetector.
However, running them over huge files takes too much time.
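For reference, juniversalchardet is driven by feeding raw bytes to its UniversalDetector, so it can just as easily be handed a slice of a file instead of the whole thing. A minimal sketch based on the library's documented API (the helper name detect is mine):

import org.mozilla.universalchardet.UniversalDetector;

// Run the detector on an in-memory chunk and return the charset name, or null if unsure.
static String detect(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String charset = detector.getDetectedCharset();  // null when the detector is not confident
    detector.reset();
    return charset;
}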
One approach is to run these libraries on a sample of the file (e.g. the first 1000 bytes). The problem is that the sample may cut the last word in the middle, and the resulting "rubbish" bytes can make the chunk be recognized as a different encoding.
My proposal: remove bytes from the end until we see a whitespace (ASCII 32). That way we guarantee not to "break" any word.
[In UTF-16LE every ASCII byte is followed by '\0', so to take care of that case: if the byte that follows the whitespace is '\0', we will try to detect the chunk both with and without that '\0'.]
Do you think it might work?
int i = bytes.length - 1;
while (i >= 0 && bytes[i] != 32) {   // walk back to the last ASCII space in the sample
    i -= 1;
}
// Note: if no space is found at all, i ends up as -1 and Arrays.copyOf will throw.
return DETECT(Arrays.copyOf(bytes, i));   // DETECT = placeholder for the library call
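To make the UTF-16LE part concrete, here is one possible reading of the "with and without" step, building on the detect sketch above (the method name detectSample and the preference for the shorter chunk are just one choice, not a fixed decision):

import java.util.Arrays;

// If the byte after the space is '\0', the sample may be UTF-16LE: try both the chunk
// cut before the space and the chunk that keeps the 0x20 0x00 pair of the space itself.
static String detectSample(byte[] bytes) {
    int i = bytes.length - 1;
    while (i >= 0 && bytes[i] != 32) {
        i -= 1;
    }
    if (i < 0) {
        return detect(bytes);                        // no space in the sample: use it as-is
    }
    String withoutZero = detect(Arrays.copyOf(bytes, i));
    if (i + 1 < bytes.length && bytes[i + 1] == 0) {
        String withZero = detect(Arrays.copyOf(bytes, i + 2));  // keep the space and its '\0'
        return withoutZero != null ? withoutZero : withZero;    // preference here is arbitrary
    }
    return withoutZero;
}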