I'm running xdmp:encoding-language-detect on a number of documents and getting results like those below. The documents are definitely in English, and they are considerably larger than the "few hundred bytes" that the documentation suggests is needed for good detection.
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding>
<language>en</language>
<score>9.88</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding>
<language>el</language>
<score>10.24</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding>
<language>zu</language>
<score>17.55</score>
</encoding-language>
That detects three languages (English, Greek, and Zulu) in that order, but with increasing scores.
The documentation says:
Scores of 10 and above are high confidence recommendations. The results are given in order of decreasing score. Accuracy may be poor for short documents.
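To make the mismatch concrete, here is a minimal Python sketch that parses the output pasted above and checks it against the two claims in that documentation excerpt. The helper names (parse_results, is_decreasing, HIGH_CONFIDENCE) are my own and not part of MarkLogic; the threshold of 10 is taken from the quoted text, and the XML is copied from the result shown earlier.

```python
import xml.etree.ElementTree as ET

# Namespace used by the result elements in the output above.
NS = {"d": "xdmp:encoding-language-detect"}

# Threshold quoted from the documentation ("Scores of 10 and above").
HIGH_CONFIDENCE = 10.0

RAW = """
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>en</language><score>9.88</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>el</language><score>10.24</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>zu</language><score>17.55</score>
</encoding-language>
"""

def parse_results(raw):
    """Return (language, score) pairs in the order they were returned."""
    # The result is a sequence of sibling elements, so wrap it in a dummy
    # root element to obtain a single well-formed document.
    root = ET.fromstring("<results>" + raw + "</results>")
    return [
        (el.findtext("d:language", namespaces=NS),
         float(el.findtext("d:score", namespaces=NS)))
        for el in root.findall("d:encoding-language", NS)
    ]

def is_decreasing(results):
    """True if each score is >= the score that follows it."""
    scores = [score for _, score in results]
    return all(a >= b for a, b in zip(scores, scores[1:]))

results = parse_results(RAW)
high_confidence = [lang for lang, score in results if score >= HIGH_CONFIDENCE]

print("returned order:", results)
print("in decreasing score order:", is_decreasing(results))
print("languages at or above threshold:", high_confidence)
```

On the output above, this reports that the scores are not in decreasing order and that only el and zu reach the high-confidence threshold, which is exactly what does not match the documentation.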
So I'm confused. Should I assume the first match is the most likely one, even though in this case its score is below 10? Or does a higher score not necessarily mean a more reliable match?