Split a sentence into separate words

Question

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼怎么走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

try to find the first two characters of the sentence in the database (主楼),
if 主楼 is actually a word and it's in the database the script will try to find first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.
try to do it with the rest of the characters.

I don't really like this approach, because to analyze even a small text it would query the database too many times.

Are there any other solutions to this?

Just curious, if there are no spaces in Chinese, why do you need to -presumably- display them *with* spaces? What is the use case here? Hmm - sorry, read question again, obviously some word dictionary lookup, never mind. — Wim, Jan 26 '10 at 15:08
To add to what Wim says, if it's just for the purpose of line-wrapping, don't bother: as far as I know, you can wrap your line at any character whatsoever. — C. K. Young, Jan 26 '10 at 15:09
The application needs to find separate words and provide them with pinyin (chinese transcription). — Peterim, Jan 26 '10 at 15:10
Do you query the db after each letter? Or only for the first letter in a word? — Kevin, Jan 26 '10 at 15:17
If the first two characters are words, the first three characters are words, and the third character is a word by itself, how will you know which one to use? — sprugman, Jan 26 '10 at 15:52
Well, it doesn't have to be perfect and actually it can't be perfect, it's a machine translation, not a human =) — Peterim, Jan 26 '10 at 16:01

score 6 · Accepted Answer · answered Jan 27 '10 at 19:14

Thanks to everyone for you help!

After a little research I've found some working tools (having in mind all your suggestions), that's why I'm answering my own question.

A PHP class (http://www.phpclasses.org/browse/package/2431.html)
A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)
A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)
There are some other solutions availabe if you try searching baidu.com for "中文分词"

Sincerely,

Equ

score 2 · Answer 2 · answered Jan 26 '10 at 15:08

2

You might want to consider using a trie data structure. You first construct the trie from the dictionary then searching for valid words will be much faster. The advantage is determining if you are at the end of a word or need to continue looking for longer words is very fast.

answered Jan 26 '10 at 15:08

Vincent Ramdhanie

102,349
23
137
192

It also can be implemented as Regex (you know: automata, finite state machines, regular language etc.). After compilation, most implementation of regex will build kind-of trie. – Yaakov Shoham Jan 26 '10 at 15:12
3

@Y. Shoham: Some say Regex makes you toast in the morning. – Wim Jan 26 '10 at 15:17

webmagnets · Answer 3 · 2012-11-15T15:49:16.780

1

Another one that works well is http://www.itgrass.com/phpanalysis/index.html

Its the only one that I found that works properly with utf-8. The rest only worked for me in gb18030, which caused tons of issues later on down the line. I thought I was going to have to start over, but this one saved me a lot of time.

edited Nov 15 '12 at 15:49

answered Nov 15 '12 at 12:52

webmagnets

2,266
3
33
60

Wim · Answer 4 · 2010-01-26T16:28:03.727

You have the input text, sentence, paragraph whatever. So yes, your processing of it will need to query against your DB for each check.

With decent indexing on the word column though, you shouldn't have too many problems.

Having said that, how big is this dictionary? After all, you would only need the words, not their definitions to check whether it's a valid word. So if at all possible (depending on the size), having a huge memory map/hashtable/dictionary with just keys (the actual words) may be an option and would be quick as lightning.

At 15 million words, say average 7 characters @ 2 bytes each works out around the 200 Megabytes mark. Not too crazy.

Edit: At 'only' 1 million words, you're looking at around just over 13 Megabytes, say 15 with some overhead. That's a no-brainer I would say.

At the moment there are only 300 000 words, at maximum there will be 1 million. So I guess it is an option =) — Peterim, Jan 26 '10 at 15:50
In that case, I would *definitely* throw all the words in a dictionary object in memory. — Wim, Jan 26 '10 at 16:26

score 1 · Answer 5 · answered May 29 '10 at 13:03

I do realize that the chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search the longest word w starting with the ith character, then start again for the i+length(w)th character.

Here's a Python implementation:

#!/usr/bin/env python
# encoding: utf-8

import re
import unicodedata
import codecs

class ChineseDict:

    def __init__(self,lines,rex):
        self.words = set(rex.match(line).group(1) for line in lines if not line.startswith("#"))
        self.maxWordLength = max(map(len,self.words))

    def segmentation(self,text):
        result = []
        previousIsSticky = False
        i = 0
        while i < len(text):
            for j in range(i+self.maxWordLength,i,-1):
                s = text[i:j]
                if s in self.words:
                    break
            sticky = len(s)==1 and unicodedata.category(s)!="Lo"
            if previousIsSticky or (result and sticky):
                result[-1] += s
            else:
                result.append(s)
            previousIsSticky = sticky
            i = j
        return u" | ".join(result)

    def genWords(self,text):
        i = 0
        while i < len(text):
            for j in range(i+self.maxWordLength,i,-1):
                s = text[i:j]
                if s in self.words:
                    yield s
                    break
            i = j


if __name__=="__main__":
    cedict = ChineseDict(codecs.open("cedict_ts.u8",'r','utf-8'),re.compile(r"(?u)^.+? (.+?) .+"))
    text = u"""33. 你可以叫我夏尔
    戴高乐将军和夫人在科隆贝双教堂村过周末。星期日早晨，伊冯娜无意中走进浴室，正巧将军在洗盆浴。她感到非常意外，不禁大叫一声：“我的上帝！”
    戴高乐于是转过身，看见妻子因惊魂未定而站立在门口。他继续用香皂擦身，不紧不慢地说：“伊冯娜，你知道，如果是我们之间的隐私，你可以叫我夏尔，用不着叫我上帝……”
    """
    print cedict.segmentation(text)
    print u" | ".join(cedict.genWords(text))

The last part uses a copy of the CCEDICT dictionary to segment a (simplified) chinese text in two flavours (resp., with and without non-word characters):

33. 你 | 可以 | 叫 | 我 | 夏 | 尔
    戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末。星期日 | 早晨，伊 | 冯 | 娜 | 无意中 | 走进 | 浴室，正巧 | 将军 | 在 | 洗 | 盆浴。她 | 感到 | 非常 | 意外，不禁 | 大 | 叫 | 一声：“我的 | 上帝！”
    戴高乐 | 于是 | 转 | 过 | 身，看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口。他 | 继续 | 用 | 香皂 | 擦 | 身，不 | 紧 | 不 | 慢 | 地 | 说：“伊 | 冯 | 娜，你 | 知道，如果 | 是 | 我们 | 之间 | 的 | 隐私，你 | 可以 | 叫 | 我 | 夏 | 尔，用不着 | 叫 | 我 | 上帝……”

你 | 可以 | 叫 | 我 | 夏 | 尔 | 戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末 | 星期日 | 早晨 | 伊 | 冯 | 娜 | 无意中 | 走进 | 浴室 | 正巧 | 将军 | 在 | 洗 | 盆浴 | 她 | 感到 | 非常 | 意外 | 不禁 | 大 | 叫 | 一声 | 我的 | 上帝 | 戴高乐 | 于是 | 转 | 过 | 身 | 看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口 | 他 | 继续 | 用 | 香皂 | 擦 | 身 | 不 | 紧 | 不 | 慢 | 地 | 说 | 伊 | 冯 | 娜 | 你 | 知道 | 如果 | 是 | 我们 | 之间 | 的 | 隐私 | 你 | 可以 | 叫 | 我 | 夏 | 尔 | 用不着 | 叫 | 我 | 上帝

score 0 · Answer 6 · answered Jan 26 '10 at 15:05

0

Well, if you have a database with all words and there is no other way to get those word involved I think you are forced to re-query the database.

answered Jan 26 '10 at 15:05

Younes

4,825
4
39
66

score 0 · Answer 7 · answered Jan 26 '10 at 15:12

0

To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?

answered Jan 26 '10 at 15:12

Rob Fonseca-Ensor

15,510
44
57

The issue seems to be that for the checks, you'd need to query the dictionary...just so you know where to break up the sentence. – Wim Jan 26 '10 at 15:16
Unfortunately, this should be done automatically by the application. A user inputs a sentence and gets the separate words with pinyin. – Peterim Jan 26 '10 at 15:19

score 0 · Answer 8 · answered Jan 26 '10 at 16:41

(using ABCDE to represent Chinese characters for simplicity)

Let's say you've got the 'sentence' ABCDE input, and your dictionary contains these words that start with A: AB, ABC, AC, AE, and ABB. And presume that the word CDE exists, but DE, nor E do not.

When parsing the input sentence, going left to right, the script pulls the first character A. Instead of querying the database to see if A is a word, query the database to pull all words that start with A.

Loop through those results, grabbing the next few characters from the input string to get a proper comparison:

AB  ?= AB : True
ABC ?= ABC: True
AC  ?= AB : False
AE  ?= AB : False
ABB ?= ABC: False

At this point the program forks down the two 'true' branches it found. On the first, it presumes AB is the first word, and tries to find C-starting words. CDE is found, so that branch is possible. Down the other branch, ABC is the first word, but DE is not possible, so that branch is invalid, meaning the first must be the true interpretation.

I think this method minimized the number of calls to the database (though it might return larger sets from the database, as you're fetching sets of words all starting with the same character). If your database were indexed for this sort of searching, I think this would work better than going letter-by letter. Looking at this whole process now, and the other answers, I think this is actually a trie structure (assuming the character searched for is the root of a tree), as another poster had suggested. Well, here's an implementation of that idea!

score 0 · Answer 9 · edited Apr 01 '19 at 11:18

A good and fast way to segment Chinese text is based on Maximum Matching Segmentation, which is basically will test different length of words to see which combination of segmentation is most likely. It takes in a list of all possible words to do so.

Read more about it here: http://technology.chtsai.org/mmseg/

That's the method I use in my 读者 (DuZhe) Text Analyzer ( http://duzhe.aaginskiy.com ). I don't use a database, actually I pre-load a list of words into an array which does take up about ~2MB of RAM, but executes very quickly.

If you are looking into using lexical segmentation over statistical (though statistical method can be as accurate as ~97% according to some research), a very good segmentation tool is ADSOtrans that can be found here: http://www.adsotrans.com

It uses a database but has a lot of redundant tables to speed up the segmentation. You can also provide grammatical definitions to assist the segmentation.

score -1 · Answer 10 · answered Jan 26 '10 at 17:32

This is a fairly standard task in computational linguistics. It goes by the name "tokenization" or "word segmentation." Try searching for "chinese word segmentation" or "chinese tokenization" and you'll find several tools that have been made to do this task, as well as papers about research systems to do it.

To do this well, you typically will need to use a statistical model built by running a machine learning system on a fairly large training corpus. Several of the systems you can find on the web come with pre-trained models.

Yaakov Shoham · Answer 11 · 2010-01-26T15:21:21.860

-3

You can build very very long Regular Expression.

Edit: I meant to build it automatically with script from the DB. Not to write it by hand.

edited Jan 26 '10 at 15:21

answered Jan 26 '10 at 15:07

Yaakov Shoham

10,182
7
37
45

What do you think a regex would look like that filters a complete dictionary?? – Younes Jan 26 '10 at 15:07
1

@Younes : very very long? ... and ugly – blank Jan 26 '10 at 15:09
@Eineki: After compilation, most implementation of regex will build kind-of-trie-automata, as Vincent Ramdhanie suggested. – Yaakov Shoham Jan 26 '10 at 15:15
@Bedwyr: That's what i meant ;) – Younes Jan 26 '10 at 15:56

Split a sentence into separate words

11 Answers11

Linked

Related