
In Java, there are a couple of libraries for detecting the encoding of text files, such as Google's juniversalchardet and Tika's EncodingDetector.

However, for huge files this takes too much time.

One approach is to run these libraries on only a sample of the file (e.g. the first 1000 bytes). The problem with that is that the cut may split the last word in the middle, turning it into "rubbish" that could be recognized as a different encoding.
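To illustrate the sampling step, here is a minimal sketch of reading a fixed-size chunk from the start of a file (the `SampleReader` class and `readSample` name are my own, not from any library; it assumes Java 9+ for `InputStream.readNBytes`):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SampleReader {
    // Read at most sampleSize bytes from the start of the file.
    // The encoding detector would then run on this small chunk only,
    // instead of on the whole (possibly huge) file.
    public static byte[] readSample(Path file, int sampleSize) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readNBytes(sampleSize);
        }
    }
}
```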

My proposal: remove bytes from the end until we reach a whitespace (ASCII 32). That way we guarantee not to break any word.

[In UTF-16LE every ASCII byte is followed by a '\0' byte, so to take care of that: if the byte after the whitespace is '\0', we will try to detect the chunk both with and without that '\0'.]

Do you think it might work?

int i = bytes.length - 1;
// Walk backwards until we hit an ASCII space (32) so the cut never splits a word.
while (i >= 0 && bytes[i] != 32) {
    i -= 1;
}
// If the sample contains no whitespace at all, fall back to the whole array
// (otherwise i would be -1 and Arrays.copyOf would throw).
return DETECT(Arrays.copyOf(bytes, i < 0 ? bytes.length : i));
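The UTF-16LE caveat described above could be sketched as a helper that returns one or two candidate chunks to feed the detector — one cut before the space, and, when a '\0' follows the space, one that keeps the full two-byte space unit. The class and method names here are hypothetical, not from any library:

```java
import java.util.Arrays;

public class WhitespaceTruncator {
    // Returns the candidate byte chunks to run encoding detection on.
    // Cuts the sample at the last ASCII space (32) so no word is split.
    // If the byte after that space is 0 (as in UTF-16LE, where a space is
    // 0x20 0x00), also returns a candidate that keeps the space plus its
    // zero byte, so UTF-16LE code units stay intact.
    public static byte[][] candidates(byte[] bytes) {
        int i = bytes.length - 1;
        while (i >= 0 && bytes[i] != 32) {
            i--;
        }
        if (i < 0) {
            // No whitespace at all: keep the whole sample.
            return new byte[][] { bytes };
        }
        if (i + 1 < bytes.length && bytes[i + 1] == 0) {
            return new byte[][] {
                Arrays.copyOf(bytes, i),      // without the trailing 0x20 0x00
                Arrays.copyOf(bytes, i + 2)   // with the full UTF-16LE space unit
            };
        }
        return new byte[][] { Arrays.copyOf(bytes, i) };
    }
}
```

The detector would then be run on each candidate and the results compared.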
  • I think you are over-engineering. Encoding and languages are often disconnected, and most algorithms do not look at entire words, only at nearby letters. Just use 4096 bytes or something not very large but also not too small; one word is not a problem. And we often have text with citations in a different language. – Giacomo Catenazzi Apr 16 '21 at 12:17
    Have you actually determined that these two libraries look at the entire file when detecting the encoding? – Robert Harvey Apr 16 '21 at 12:20
  • @GiacomoCatenazzi thanks, I started with this approach but encountered some cases it's completely wrong – Oz Zafar Apr 16 '21 at 12:21
  • For example, Tika says: *"By looking for special ("magic") patterns of bytes **near the start of the file,** it is often possible to detect the type of the file."* – Robert Harvey Apr 16 '21 at 12:22
  • @RobertHarvey No, but they expect a byte array. So I have to read bytes from the file and it takes too much time to read all of it – Oz Zafar Apr 16 '21 at 12:23
  • You cannot expect 100% accuracy. No library can do it. Did you check whether a larger snippet gives better results (I doubt it)? If you have some extra data (region/country), you should use it. – Giacomo Catenazzi Apr 16 '21 at 12:24
  • Tika says it will work with a `stream` object. See http://tika.apache.org/1.26/detection.html#The_Detector_Interface – Robert Harvey Apr 16 '21 at 12:25
  • Note: in a large file you may get different encodings in different parts (look e.g. at mbox: every mail can have its own encoding, specified in that mail's header). But it is also common to find source code with mixed encodings (especially copyright notices, just copy-pasted from other documents). – Giacomo Catenazzi Apr 16 '21 at 12:27
  • @RobertHarvey I'll try it out thanks – Oz Zafar Apr 16 '21 at 12:27

0 Answers