Counting words from a mixed-language document

Question

Given a set of lines containing Chinese characters, Latin-alphabet words, or a mixture of both, I want to obtain the word count.

To wit:

this is just an example
这只是个例子

should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.

Obviously, wc -w is not going to work: it treats the run of 6 Chinese characters (5 words) as 1 "word" and returns a total of 6.
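The same undercount can be reproduced in Python with plain whitespace splitting. This is only an illustrative check; the variable names are made up for the example.

```python
text = "this is just an example\n这只是个例子"

# str.split() with no arguments splits on runs of whitespace,
# which is roughly what wc -w does.
tokens = text.split()
whitespace_count = len(tokens)

# The whole Chinese line ends up as a single token.
last_token = tokens[-1]
```

Here the five English words are counted correctly, but the Chinese line contributes just one token.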

How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.

score 3 · Answer 1 · edited May 23 '17 at 10:31

You should split the text on Unicode word boundaries, then count the elements that contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes, but these will only break on simple word boundaries. Also see the question "Split unicode string on word boundaries".
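As a rough illustration of the regex approach, here is a minimal sketch using only the standard re module. It is not a UAX #29 implementation: it assumes that each Han ideograph counts as one "word" (the 11-count fallback from the question) and that any other run of word characters is one word. The code point ranges and the name count_words are choices made for this example.

```python
import re

# Assumed ranges: CJK Unified Ideographs, Extension A, and Extension B.
# Other ideograph blocks are not covered by this sketch.
_HAN = "\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df"

# Either a single Han ideograph, or a run of word characters that are not Han.
_TOKEN = re.compile(f"[{_HAN}]|[^\\W{_HAN}]+")

def count_words(text):
    """Count Han ideographs individually and other word runs as one word each."""
    return len(_TOKEN.findall(text))

example_count = count_words("this is just an example\n这只是个例子")
mixed_count = count_words("这是test")
```

With this, the two example lines from the question give 11, and a mixed token such as 这是test is split into two ideographs plus one Latin word. Getting 例子 counted as a single word would still need a dictionary-based segmenter.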

Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

+1 simply for unicode word boundaries. Didn't know there were guidelines for those. — icedwater, Nov 27 '13 at 01:11

score 0 · Answer 2 · answered Nov 26 '13 at 11:27

I thought about a quick hack, since most Chinese characters (those in the Basic Multilingual Plane) are 3 bytes long in UTF-8:

(pseudocode)

for each byte in the line:
    if the byte's high bit is 1 (it is part of a multi-byte UTF-8 sequence):
        add 1 to total chinese bytes
    if it is a space:
        add 1 to total "normal" words
    if it is a newline:
        break

Then take (total chinese bytes / 3) + total words to get the sum for each line. This will give an erroneous count for mixed-language lines, but should be a good start.

这是test

The line above will give a total of 2 (1 for each of the Chinese characters); "test" is not counted because no space precedes it. A space between the two languages would be needed to give the correct count.
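For reference, a direct Python transcription of the hack might look like the sketch below. The function name byte_hack_count is made up for the example, and it assumes the input is a single line whose non-ASCII characters are all 3-byte UTF-8 sequences.

```python
def byte_hack_count(line):
    """Estimate a word count from raw UTF-8 bytes (single line, rough heuristic)."""
    data = line.encode("utf-8")
    # Every byte of a multi-byte UTF-8 sequence has its high bit set.
    high_bytes = sum(1 for b in data if b & 0x80)
    # Each space is taken to mark one "normal" word.
    spaces = data.count(b" ")
    # Assumes 3 bytes per Chinese character.
    return high_bytes // 3 + spaces

mixed = byte_hack_count("这是test")
spaced = byte_hack_count("这是 test")
latin_only = byte_hack_count("this is just an example")
```

This reproduces the undercount for 这是test and the corrected result once a space is inserted. It also shows that counting spaces misses one word on a purely Latin line, so the heuristic really is only a starting point.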
