I'm running xdmp:encoding-language-detect on a number of documents and getting results like those below. The documents are definitely in English, and each is considerably larger than the "few hundred bytes" the documentation suggests for good detection.

<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>en</language>
  <score>9.88</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>el</language>
  <score>10.24</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>zu</language>
  <score>17.55</score>
</encoding-language>

That output lists three languages (English, Greek, and Zulu) in that order, but with increasing scores rather than decreasing ones.

The documentation says:

Scores of 10 and above are high confidence recommendations. The results are given in order of decreasing score. Accuracy may be poor for short documents.
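
To make the mismatch concrete, here is a minimal Python sketch that parses the three results shown above and checks whether they really arrive in decreasing score order. The `<results>` wrapper element and the helper names `parse_results` and `is_decreasing` are my own, added only so the fragments can be parsed as one document; nothing here calls MarkLogic itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by the result elements in the output above.
NS = {"d": "xdmp:encoding-language-detect"}

# The three results copied from above, wrapped in a hypothetical <results>
# root (an assumption, only needed to make the fragments one XML document).
RAW = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>en</language>
  <score>9.88</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>el</language>
  <score>10.24</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>zu</language>
  <score>17.55</score>
</encoding-language>
</results>"""


def parse_results(xml_text):
    """Return (language, score) pairs in the order they appear."""
    root = ET.fromstring(xml_text)
    return [
        (
            el.findtext("d:language", namespaces=NS),
            float(el.findtext("d:score", namespaces=NS)),
        )
        for el in root.findall("d:encoding-language", NS)
    ]


def is_decreasing(scores):
    """True if each score is >= the one that follows it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


results = parse_results(RAW)
scores = [score for _, score in results]

print(results)                # order as returned
print(is_decreasing(scores))  # does the order match the documentation?
print(sorted(results, key=lambda r: r[1], reverse=True))  # highest score first
```

For the output above, the check reports that the results are not in decreasing order, and sorting by score would put Zulu first rather than English.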

So I'm confused. Should I assume the first match is the most likely one (though in this case it has a score < 10)? Does a higher score not necessarily mean a more reliable match?

eaolson
