How to count number of "words" in Chinese/Japanese content in Javascript

Question

I'm trying to write a method that counts the number of words in Chinese and Japanese content. The count should exclude special characters, punctuation, and whitespace.

I tried creating a regex for each locale and counting the matches. I also looked for existing regexes online, but none of them seem to work. My approach:

function countWords(text, locale) {
  let wordCount = 0;
  
  // Set the word boundary based on the locale
  let wordBoundary = '\\b';
  
  if (locale === 'ja') {
    // Japanese word boundary
    wordBoundary = '[\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Han}ー]+';
  } else if (locale === 'zh') {
    // Chinese word boundary
    wordBoundary = '[\\p{Script=Han}]+';
  }
  
  const regex = new RegExp(wordBoundary, 'gu');
  const matches = text.matchAll(regex);
  
  for (const match of matches) {
    wordCount++;
  }
  
  return wordCount;
}

I thought this should work, but when I compare the result with the word count in MS Word, the numbers differ.

Word seems to just count the number of characters, so if you just remove the `+` quantifier in your regex, the output should match Word's. — Sweeper, Jun 23 '23 at 06:54
`.matchAll()` and `for (const ... of ...) {}` are unnecessary if you just want the number of matches. Just access `.length` directly: `const wordCount = text.match(regex)?.length ?? 0`. — InSync, Jun 23 '23 at 07:13
I do not think regex is a good way to solve such a problem. In any case, there should be libraries that implement Unicode text segmentation: https://www.unicode.org/reports/tr29/ — Giacomo Catenazzi, Jun 23 '23 at 07:56
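
Putting the first two comments together, a minimal sketch might look like the following. The function name `countCjkCharacters` is made up for illustration, and whether the result matches Word for every input is an assumption that has not been verified here.

```javascript
// Count individual CJK characters rather than runs of them.
// The character classes come from the question, minus the `+` quantifier.
function countCjkCharacters(text, locale) {
  const pattern = locale === 'ja'
    ? /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}ー]/gu
    : /\p{Script=Han}/gu;
  // With the `g` flag, `match` returns every match, or null if there are none.
  return text.match(pattern)?.length ?? 0;
}

console.log(countCjkCharacters('そのテキストには何語含まれていますか？', 'ja')); // 18
console.log(countCjkCharacters('该文本包含多少个单词？', 'zh')); // 10
```

The full-width question mark is not part of any of the listed scripts, so it is left out of both counts.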

score 0 · Answer 1 · answered Jun 23 '23 at 08:53

Well, I did a similar type of thing in Python.

Instead of completely depending on regular expressions, you can use existing language processing libraries that provide better word segmentation algorithms specifically designed for Chinese and Japanese. Here are a couple of popular libraries you can consider:

For Chinese: Jieba (结巴分词) is a widely used Chinese text segmentation library for Python. It provides efficient word segmentation for Chinese text. Because it is a Python library, using it from JavaScript would require a JavaScript port or a WebAssembly build rather than the Python package itself.
For Japanese: MeCab (めかぶ) is a popular Japanese morphological analyzer and part-of-speech tagger, written in C++. It can efficiently segment Japanese text into words. You can try compiling it to WebAssembly with a tool like Emscripten to use MeCab within your JavaScript code.

Here's an example of how you can modify your code to use the Jieba library for Chinese word segmentation:

// Hypothetical import: assumes a JavaScript port or WebAssembly build of Jieba
// is available under this name and exposes a cut() function returning an array
const Jieba = require('jieba');

function countWords(text, locale) {
  let wordCount = 0;
  
  if (locale === 'zh') {
    // Use Jieba for Chinese word segmentation
    const words = Jieba.cut(text);
    wordCount = words.length;
  } else {
    // For other languages, use a simplified word count method
    const words = text.split(/\s+/).filter(word => word.trim() !== '');
    wordCount = words.length;
  }
  
  return wordCount;
}

Please note that integrating Jieba or MeCab into JavaScript might require additional setup steps, such as compiling the libraries for the web or using pre-compiled versions specifically built for JavaScript environments.
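
One detail to watch with a dictionary-based segmenter: the token list it returns may include punctuation and whitespace tokens, which the question wants excluded, so taking `words.length` directly could overcount. A minimal sketch of filtering them out follows. `countWordTokens` and the stub `fakeSegment` are hypothetical names; the stub stands in for a real Jieba or MeCab binding and simply returns a fixed token list.

```javascript
// A token counts as a word if it contains at least one letter or digit.
const HAS_LETTER_OR_DIGIT = /[\p{L}\p{N}]/u;

// `segment` is any function that turns text into an array of string tokens.
function countWordTokens(segment, text) {
  return segment(text).filter(token => HAS_LETTER_OR_DIGIT.test(token)).length;
}

// Hypothetical stub segmenter, used only to make the example runnable.
const fakeSegment = () => ['该', '文本', '包含', '多少', '个', '单词', '？'];

console.log(countWordTokens(fakeSegment, '该文本包含多少个单词？')); // 6
```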

Peter Seliger · Accepted Answer · 2023-06-23T15:23:56.433

One possible approach is to segment the text with the segment method of an Intl.Segmenter instance and count the segments that are word-like.

Each segment item has properties such as ...

{ segment: 'words', index: 9, input: 'How many words ...', isWordLike: true }

... so to get the total word count, one can reduce the array of segment items, adding 1 for every item whose isWordLike value is true ...

function countWords(text, locale) {
  return [
    ...new Intl.Segmenter(locale, { granularity: 'word' })
      .segment(text)
  ]
  .reduce((wordCount, { isWordLike }) =>
    wordCount + Number(isWordLike), 0
  );
}

console.log(
  "countWords('How many words does the text contain?', 'en') ?..",
  countWords('How many words does the text contain?', 'en'),
);
console.log(
  "countWords('Combien de mots contient ce texte ?', 'fr') ?..",
  countWords('Combien de mots contient ce texte ?', 'fr'),
);

console.log(
  "countWords('そのテキストには何語含まれていますか？', 'ja') ?..",
  countWords('そのテキストには何語含まれていますか？', 'ja'),
);
console.log(
  "countWords('该文本包含多少个单词？', 'zh') ?..",
  countWords('该文本包含多少个单词？', 'zh'),
);
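
To see why a particular count comes out the way it does, it can help to look at the word-like segments themselves. A minimal sketch, where `listWords` is a name made up for illustration. The exact segmentation of Japanese and Chinese text depends on the dictionary data shipped with the runtime, so no output is shown for those.

```javascript
// Return the word-like segments instead of only counting them.
function listWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text))
    .filter(({ isWordLike }) => isWordLike)
    .map(({ segment }) => segment);
}

console.log(listWords('How many words does the text contain?', 'en'));
// [ 'How', 'many', 'words', 'does', 'the', 'text', 'contain' ]

console.log(listWords('そのテキストには何語含まれていますか？', 'ja'));
console.log(listWords('该文本包含多少个单词？', 'zh'));
```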


Note ... at the time of this answer (June 2023), Firefox did not yet implement Intl.Segmenter, so check current browser support before relying on it.
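
For environments without Intl.Segmenter, one option is to feature-detect it and fall back to something cruder. The sketch below is an assumption rather than part of either answer: `countWordsWithFallback` and `fallbackCount` are hypothetical names, and the fallback counts CJK characters for ja/zh (as suggested in the comments on the question) and whitespace-separated tokens otherwise, so its numbers will generally differ from the segmenter's.

```javascript
const CJK_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/gu;

// Crude fallback: per-character count for ja/zh, whitespace split otherwise.
function fallbackCount(text, locale) {
  if (/^(ja|zh)\b/.test(locale)) {
    return text.match(CJK_CHAR)?.length ?? 0;
  }
  return text.split(/\s+/).filter(Boolean).length;
}

function countWordsWithFallback(text, locale) {
  // Use Intl.Segmenter only when the runtime provides it.
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
    let count = 0;
    for (const { isWordLike } of segmenter.segment(text)) {
      if (isWordLike) count++;
    }
    return count;
  }
  return fallbackCount(text, locale);
}

console.log(countWordsWithFallback('How many words does the text contain?', 'en')); // 7
console.log(fallbackCount('该文本包含多少个单词？', 'zh')); // 10
```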
