Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w
is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.