substring utf-8 characters with alphanumeric character up to 10 words

Question

I have a problem with getting a substring of this string:

GMOクラウドの芦田です。前回、OpenSocialに対応したSNSの「OpenPNE」をインストールしたので、今回はソーシャルアプリを作ってOpenPNE上で公開してみます。また、作ったアプリをmixiアプリとしてmixiにも登録してみましょう。

I want to display the text up to the nth character, or at least avoid displaying broken or incomplete words.

At first I tried:

$content = "GMOクラウドの芦田です。前回、OpenSocialに対応したSNSの「OpenPNE」をインストールしたので、今回はソーシャルアプリを作ってOpenPNE上で公開してみます。また、作ったアプリをmixiアプリとしてmixiにも登録してみましょう。";
$content = mb_substr($content, 0, 10, 'UTF-8');

but it results in:

GMOクラウドの芦田です。前回、OpenSo

The last word is not complete.

I also tried using a regex:

$content = "GMOクラウドの芦田です。前回、OpenSocialに対応したSNSの「OpenPNE」をインストールしたので、今回はソーシャルアプリを作ってOpenPNE上で公開してみます。また、作ったアプリをmixiアプリとしてmixiにも登録してみましょう。";
if (preg_match('/^.{1,40}\b/s', $content, $match))
{
    print_r($match);
}

which resulted in:

Array ( [0] =>GMO )

What can I do to get something like

GMOクラウドの芦田です。前回、OpenSocial

The last word should be complete. Is there an mb_ function in PHP that I could use to accomplish this?

"Word boundary" in Japanese is not a trivial concept... Do you just not want *latin* words to be broken up, or does that apply to the Japanese too? — deceze, Nov 07 '12 at 08:46
@specialscope With the mb_ string functions of PHP, it should at least not break the word. — Jayson O., Nov 07 '12 at 09:12
The question is, is *"GMOクラウドの芦"* acceptable, or does it need to consider **Japanese word boundaries** as well and return *"GMOクラウドの芦田"*? — deceze, Nov 07 '12 at 09:14
**That is the question!** Do you need to? Because then the answer will be a lot more complex. Or don't you care about that? Then the answer will be simpler. — deceze, Nov 07 '12 at 09:22
If your requirement is breaking the text into words, you need to use one of the morphological analyzers for Japanese, for example ChaSen. — specialscope, Nov 07 '12 at 09:54
Well, that's complicated. I will tell my client how difficult it is and what disadvantages may come with doing this. Thanks, guys. — Jayson O., Nov 08 '12 at 01:25

score 0 · Answer 1 · answered Jun 08 '13 at 11:30

You need a morphological analysis tool such as MeCab to convert the string into an array of words. MeCab can be used from the command line or through a PHP extension. If you use Homebrew, install mecab and mecab-ipadic.
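As a rough sketch of the command-line route, the array of words could be obtained by piping the text through `mecab -Owakati`, which prints the tokens separated by spaces. The helper names `parse_wakati` and `mecab_words` are hypothetical, and the sketch assumes that `mecab` is on the PATH with a UTF-8 dictionary installed. The real segmentation depends on the dictionary and will probably be finer than a hand-written list (for example, punctuation usually becomes its own token).

```php
<?php
/**
 * Split the space-separated output of `mecab -Owakati` into tokens.
 * Note: spaces that were part of the original text cannot be told apart
 * from token separators in this output format.
 */
function parse_wakati(string $output): array
{
    return preg_split('/[ \r\n]+/', trim($output), -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * Hypothetical helper: run the mecab CLI on $text and return its tokens.
 * Assumes `mecab` is installed and reads UTF-8 text from stdin.
 * Returns an empty array if the process cannot be started.
 */
function mecab_words(string $text): array
{
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
    $proc = proc_open('mecab -Owakati', $spec, $pipes);
    if (!is_resource($proc)) {
        return [];
    }
    // Feed the text through stdin to avoid shell-escaping multibyte input.
    fwrite($pipes[0], $text . "\n");
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);

    return parse_wakati((string) $output);
}
```

With MeCab available, `$words = mecab_words($content);` would then take the place of a hard-coded array.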

$words = [
  'GMO', 'クラウド', 'の', '芦田', 'です。', '前回、', 
  'OpenSocial', 'に', '対応した', 'SNS'
];

$max = 26;

$ret = '';
$i = 0;

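// Append whole words while the result stays within $max characters.
// The isset() check stops the loop cleanly if the words run out first.
while (isset($words[$i]) && mb_strlen($ret . $words[$i], 'UTF-8') <= $max) {
  $ret .= $words[$i];
  $i += 1;
}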

var_dump(
  mb_strlen($ret, 'UTF-8'),
  'GMOクラウドの芦田です。前回、OpenSocial' === $ret
);
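If only Latin words need to stay intact, which is the simpler case raised in the comments, a morphological analyzer may not be necessary. The following is a minimal sketch under that assumption: it cuts at the character limit and, when the cut lands inside a run of ASCII letters or digits, extends the result to the end of that run. Japanese text is still cut wherever the limit falls. The function name `truncate_keep_latin_words` is made up for illustration.

```php
<?php
/**
 * Return the first $limit characters of $text. If the cut falls inside a
 * run of ASCII letters/digits, extend the result to the end of that run.
 * The result can therefore be longer than $limit.
 */
function truncate_keep_latin_words(string $text, int $limit): string
{
    $limit = max(0, $limit);
    // .{0,N}   : up to N characters (the u modifier makes "." match whole
    //            UTF-8 characters, s lets it match newlines too)
    // (?<=...) : only extend if the last kept character is alphanumeric
    $pattern = '/^.{0,' . $limit . '}(?:(?<=[A-Za-z0-9])[A-Za-z0-9]*)?/us';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : '';
}

$content = "GMOクラウドの芦田です。前回、OpenSocialに対応したSNSの「OpenPNE」をインストールした";

echo truncate_keep_latin_words($content, 22), "\n";
// prints: GMOクラウドの芦田です。前回、OpenSocial
```

Because the cut is extended rather than pulled back, the output may exceed the limit by the remaining length of the Latin word. Keeping Japanese words intact as well would still need the tokenized approach above.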
