Creating an effective word counter including Chinese/Japanese and other accented languages

Question

I've been trying to build an effective word counter for strings. I know about PHP's built-in str_word_count, but unfortunately it doesn't do what I need: I have to count words in text that includes English, Chinese, Japanese and other languages with accented characters.

However, str_word_count fails to count such words unless you list the extra characters in the third argument. That is impractical: I would have to add every single character used in Chinese, Japanese and the accented languages, which is not what I want.

Tests:

str_word_count('The best tool'); // int(3)
str_word_count('最適なツール'); // int(0)
str_word_count('最適なツール', 0, '最ル'); // int(5)

I also found this function online. It looked like it could do the job, but it fails to count Japanese text that has no spaces:

function word_count($str)
{
    if($str === '')
    {
        return 0;
    }

    return preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str);
}

Tests:

word_count('The best tool'); // int(3)
word_count('最適なツール'); // int(1)

// With spaces
word_count('最 適 な ツ ー ル'); // int(5)
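
To see why the unspaced string counts as a single word, it can help to print what the pattern actually matches. The sketch below reuses the pattern from word_count() unchanged; the helper name word_matches() is made up for illustration.

```php
<?php
// Hypothetical helper: returns the substrings matched by the word_count() pattern.
function word_matches(string $str): array
{
    // Same pattern as word_count(): one letter, then any run of letters,
    // combining marks, dashes, or apostrophes.
    preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str, $matches);

    return $matches[0];
}

print_r(word_matches('The best tool')); // three matches: The, best, tool
print_r(word_matches('最適なツール'));   // one match: the whole string
```

Every character in 最適なツール is a letter (\p{L}), so without spaces or punctuation the pattern never finds a place to stop.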

Basically, I'm looking for a good UTF-8-aware word counter that can count words in any common script, including accented characters and Chinese or Japanese text. Is there a possible solution to this?

Counting words in languages that do not use spaces is a hard problem, basically solvable only with a dictionary and an algorithm tuned for that particular language. PHP has nothing like that built in, and you may even be hard pressed to find any such library written in PHP. – deceze, Jun 18 '12 at 14:14

score 1 · Answer 1 · answered Jun 18 '12 at 14:33

There's the Kuromoji morphological analyzer for Japanese that can be used for word counting. Unfortunately it's written in Java, not PHP. Since porting it all to PHP is quite a huge task, I'd suggest writing a small wrapper around it so you can call it on the command line, or look into other PHP-Java bridges.

I don't know how applicable it is to languages other than Japanese. You may want to look into the Apache Tika project for similar libraries.
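
The PHP side of such a command-line wrapper could look roughly like the sketch below. It assumes an external tokenizer that reads UTF-8 text on stdin and prints one token per line; that contract, the function name count_tokens_via_cli(), and any command you pass in (for example a small Java program you write around Kuromoji) are assumptions for illustration, not an existing tool. Passing the command as an array to proc_open() needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper: pipes $text to an external tokenizer and counts the
// lines it prints. Assumes the tokenizer reads stdin and writes one token per line.
function count_tokens_via_cli(string $text, array $command): int
{
    $descriptors = [
        0 => ['pipe', 'r'], // child's stdin: we write the text here
        1 => ['pipe', 'w'], // child's stdout: we read the tokens here
    ];

    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start the tokenizer process');
    }

    // Send the text, then close stdin so the child sees end-of-input.
    fwrite($pipes[0], $text);
    fclose($pipes[0]);

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $exitCode = proc_close($process);
    if ($exitCode !== 0) {
        throw new RuntimeException("Tokenizer exited with code $exitCode");
    }

    // Split on any line break and ignore empty lines.
    $tokens = preg_split('/\R/u', (string) $output, -1, PREG_SPLIT_NO_EMPTY);

    return $tokens === false ? 0 : count($tokens);
}
```

Starting a JVM for every string is slow, so for more than occasional use it would probably make sense to batch the input or keep the tokenizer running as a service.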

score 1 · Answer 2 · answered Sep 16 '16 at 12:34

I've had good results using the Intl extension's break iterator, which tokenizes strings using locale-aware word boundaries. For example:

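<?php
// Create a word-boundary iterator; the locale argument is required.
$words = IntlBreakIterator::createWordInstance('zh');
$words->setText('最適なツール');

$count = 0;
foreach ($words as $offset) {
  // Skip boundaries that end a run of spaces or punctuation (WORD_NONE).
  if (IntlBreakIterator::WORD_NONE !== $words->getRuleStatus()) {
    $count++;
  }
}

printf("%u words", $count); // 3 words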

As I don't understand Chinese I can't verify that "3" is the correct answer. However, it produces accurate results for scripts I do understand, and I am trusting in the ICU library to be solid.

I also note that the passing of the "zh" parameter seems to make no difference to the result, but the argument is mandatory.

I'm running Intl PECL-3.0.0 and ICU version is 55.1. I discovered that my CentOS servers were running older versions than these and they didn't work for Chinese. So make sure you have the latest versions.
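
For reuse, the loop above can be wrapped in a function. This is a minimal sketch that assumes the intl extension is loaded; the name intl_word_count() and the default locale are made up for the example. It compares against WORD_NONE_LIMIT rather than WORD_NONE, which treats the whole "not a word" status range as non-words.

```php
<?php
// Hypothetical helper wrapping the break-iterator loop; requires the intl extension.
function intl_word_count(string $text, string $locale = 'en'): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    foreach ($iterator as $boundary) {
        // Statuses below WORD_NONE_LIMIT mark spaces and punctuation;
        // anything at or above it ends a number, letter, kana, or ideograph run.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }

    return $count;
}

var_dump(intl_word_count('The best tool'));      // int(3)
var_dump(intl_word_count('最適なツール', 'ja')); // depends on the installed ICU version
```

Note that numbers also count as words here, which may or may not be what you want.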

score 0 · Accepted Answer · Boris Guéry · 2012-06-18T14:29:31.383

You can take a look at the mbstring extension to work with UTF-8 strings.

mb_split() splits a multibyte string using a regular-expression pattern.

<?php 
printf("Counting words in: %s\n", $argv[1]);
mb_regex_encoding('UTF-8');
mb_internal_encoding("UTF-8");
$r = mb_split(' ', $argv[1]); 
print_r($r); 
printf("Word count: %d\n", count($r));

$ php mb.php "foo bar"
Counting words in: foo bar
Array
(
    [0] => foo
    [1] => bar
)
Word count: 2


$ php mb.php "最適な ツール"
Counting words in: 最適な ツール
Array
(
    [0] => 最適な 
    [1] => ツール
)
Word count: 2

~~Note: I had to add 2 spaces between characters to get a correct count~~ Fixed by setting mb_regex_encoding() & mb_internal_encoding() to UTF-8
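
If space-separated counting is all you need, a slightly more forgiving variant is to split on runs of any Unicode whitespace and drop empty pieces, so that repeated spaces or the full-width ideographic space (U+3000) do not inflate the count. This is a sketch using preg_split() instead of mb_split(); the function name whitespace_word_count() is made up for the example.

```php
<?php
// Hypothetical helper: counts runs of non-whitespace characters.
function whitespace_word_count(string $text): int
{
    // \s covers ordinary whitespace; \p{Z} adds Unicode separators such as
    // the ideographic space U+3000. PREG_SPLIT_NO_EMPTY drops empty pieces.
    $parts = preg_split('/[\s\p{Z}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? 0 : count($parts);
}

var_dump(whitespace_word_count('foo   bar'));             // int(2)
var_dump(whitespace_word_count("最適な\u{3000}ツール"));  // int(2)
var_dump(whitespace_word_count('最適なツール'));          // int(1): no separators to split on
```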

However, in Chinese the concept of space-separated "words" doesn't exist (and the same may be true of Japanese in some cases), so you may never get a meaningful result this way.

You may need to write an algorithm that uses a dictionary to determine which groups of characters form a "word".
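
As a rough illustration of that idea, here is a sketch of forward maximum matching, one simple dictionary-based approach: at each position it takes the longest dictionary entry that matches and falls back to a single character when nothing matches. The function name, the $maxLen parameter, and the three-entry dictionary are made up for the example. A real dictionary would need many thousands of entries, and greedy matching can still pick the wrong boundaries.

```php
<?php
// Hypothetical helper: greedy longest-match segmentation against a word list.
// Requires the mbstring extension.
function max_match_segment(string $text, array $dictionary, int $maxLen = 8): array
{
    $lookup = array_flip($dictionary); // word => index, for fast isset() checks
    $length = mb_strlen($text, 'UTF-8');
    $tokens = [];
    $pos = 0;

    while ($pos < $length) {
        $matched = null;

        // Try the longest candidate first, then shrink one character at a time.
        for ($size = min($maxLen, $length - $pos); $size >= 1; $size--) {
            $candidate = mb_substr($text, $pos, $size, 'UTF-8');
            if (isset($lookup[$candidate])) {
                $matched = $candidate;
                break;
            }
        }

        // Unknown text: emit a single character so the loop always advances.
        if ($matched === null) {
            $matched = mb_substr($text, $pos, 1, 'UTF-8');
        }

        $tokens[] = $matched;
        $pos += mb_strlen($matched, 'UTF-8');
    }

    return $tokens;
}

// Toy dictionary, for illustration only.
$tokens = max_match_segment('最適なツール', ['最適', 'な', 'ツール']);
print_r($tokens);          // 最適, な, ツール
var_dump(count($tokens));  // int(3)
```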

edited Jun 18 '12 at 14:29

answered Jun 18 '12 at 14:23

Boris Guéry


*"I had to add 2 spaces..."* - Well yes, that is exactly the problem the OP is trying to solve. And Japanese does not usually contain spaces. -1 – deceze Jun 18 '12 at 14:27
@deceze, well doubling the space count is not a problem, however it has been solved by setting the `mb_regex_encoding()` and `mb_internal_encoding()` to UTF-8 – Boris Guéry Jun 18 '12 at 14:30

UTF-8 has nothing to do with the problem at hand. The problem is that Japanese (and other languages) does not have word separators, so you cannot simply `mb_split` it. – deceze Jun 18 '12 at 14:34
@deceze, did you read the text in bold? The function assumes that a word is a group of characters separated by a space. It is not intended to count words semantically speaking. – Boris Guéry Jun 18 '12 at 14:41
