0

I'm looking for a Chinese word segmentation function or library for PHP.

Chinese text has no spaces between words, which affects full-text search.

For example:

$_GET['text'] = "中文分詞搜尋";
$text = $_GET['text']; // the user's input
// Desired result after passing $text through a Chinese segmentation function:
// $text = "中文 分詞 搜尋";
Ben
  • What do you want to achieve? Search by each keyword? You mentioned *fulltext search*, do you mean you're going to search the keyword in database? Please provide more information. – Raptor Mar 17 '14 at 02:19
  • What I need is to break the string into words separated by spaces. Unlike English, Chinese has no spaces within a sentence; e.g. "I love PHP" would be written as "IlovePHP" in Chinese. – Ben Mar 17 '14 at 02:21
  • Yes, I want to search for keywords in a database. E.g. if the user types "I love PHP" in English, full-text search will search for those keywords (I, love, PHP). – Ben Mar 17 '14 at 02:22
  • You can use `mb_split`. See the question I marked as duplicate (Note: I know Japanese != Chinese, but they are in the same category: multibyte characters) – Raptor Mar 17 '14 at 02:35
  • I'm not looking for explode. The user's input doesn't contain spaces, so I need a dictionary-based function that works out that 中文, 分詞 and 搜尋 are 3 words and adds spaces between them. – Ben Mar 17 '14 at 02:37
  • Guys - he isn't asking how to split on chars, he wants to split on Chinese words. – Danack Mar 17 '14 at 02:45
  • Stemming for Chinese may provide better results? https://code.google.com/p/sphinx-for-chinese/ – RCNeil Mar 17 '14 at 03:05

2 Answers

1

It's extremely easy to find such libraries if you ask Google.

For performance, the core algorithm is usually implemented in a native language such as C/C++.

There is also one based on a RESTful API (with a PHP interface):

A pure PHP implementation (may be slow):

An online web service with a PHP client driver.

0

Try:

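<?php
$str = '蚂蚁学院,欢迎您的光临!';

// Split a multibyte string into an array of single characters.
function mbstringtoarray($str, $charset) {
  $array = array();
  // Pass $charset to every mb_* call so length and offsets agree.
  $strlen = mb_strlen($str, $charset);
  for ($i = 0; $i < $strlen; $i++) {
    $array[] = mb_substr($str, $i, 1, $charset);
  }
  return $array;
}

// The charset must match the real encoding of $str:
// "UTF-8" for a UTF-8 source file, "gbk" only if the file is saved as GBK.
$arr = mbstringtoarray($str, "UTF-8");
print_r($arr);
?>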

The output will be:

Array
(
  [0] => 蚂
  [1] => 蚁
  [2] => 学
  [3] => 院
  [4] => ,
  [5] => 欢
  [6] => 迎
  [7] => 您
  [8] => 的
  [9] => 光
  [10] => 临
  [11] => !
)
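
As a side note, the same per-character split can probably be written more compactly. This is only a sketch, and it assumes the string is valid UTF-8; the variable name `$chars` is just for illustration:

```php
<?php
$str = '蚂蚁学院,欢迎您的光临!';

// The "u" modifier makes PCRE treat the subject as UTF-8, so the empty
// pattern matches between characters rather than between bytes.
// PREG_SPLIT_NO_EMPTY drops the empty strings at the start and end.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($chars);
```

This still yields single characters only, not words.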

Splitting character by character cannot divide the text into actual words, such as: 蚂蚁/学院/欢迎/您/的/光临

If you need that, you will need a separate table (a dictionary) that stores these meaningful words, because PHP won't recognize them by default.
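
To illustrate the dictionary idea, here is a minimal sketch using forward maximum matching, which is one simple dictionary-based approach (not necessarily what any particular library does). The `$dictionary` array and the function name `segmentForwardMaxMatch` are hypothetical; in practice the word list would come from a database table or a word-list file, and the text is assumed to be UTF-8:

```php
<?php
// Hypothetical, tiny dictionary for illustration only.
$dictionary = array('中文', '分詞', '搜尋', '蚂蚁', '学院', '欢迎', '光临');

// Forward maximum matching: at each position take the longest dictionary
// word that starts there; fall back to a single character if none matches.
function segmentForwardMaxMatch($text, array $dictionary, $charset = 'UTF-8') {
    // Use the words as array keys for fast lookups, and find the longest word.
    $lookup = array_fill_keys($dictionary, true);
    $maxLen = 1;
    foreach ($dictionary as $word) {
        $maxLen = max($maxLen, mb_strlen($word, $charset));
    }

    $words = array();
    $length = mb_strlen($text, $charset);
    $pos = 0;
    while ($pos < $length) {
        // Single-character fallback (also covers punctuation).
        $matched = mb_substr($text, $pos, 1, $charset);
        // Try the longest candidate first, then shrink it.
        for ($len = min($maxLen, $length - $pos); $len > 1; $len--) {
            $candidate = mb_substr($text, $pos, $len, $charset);
            if (isset($lookup[$candidate])) {
                $matched = $candidate;
                break;
            }
        }
        $words[] = $matched;
        $pos += mb_strlen($matched, $charset);
    }
    return $words;
}

// Prints: 中文 分詞 搜尋
echo implode(' ', segmentForwardMaxMatch('中文分詞搜尋', $dictionary)), "\n";
// Prints: 蚂蚁/学院/,/欢迎/您/的/光临/!
echo implode('/', segmentForwardMaxMatch('蚂蚁学院,欢迎您的光临!', $dictionary)), "\n";
```

The quality of the result depends entirely on the dictionary: words that are missing from it fall back to single characters, and greedy matching can pick the wrong split for ambiguous text, which is why dedicated segmentation libraries are usually the better choice.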

Teddybugs