I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:
1. Try to find the first two characters of the sentence in the database (主楼).
2. If 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.
3. Try to do the same with the rest of the characters.
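To make the steps above concrete, here is a minimal Python sketch of the idea. The names `segment` and `is_word` are made up for illustration: `is_word` stands in for the database lookup, and the tiny `dictionary` set is only sample data. Falling back to a single character when no longer match is found is an assumption, since the steps above don't say what happens in that case.

```python
from typing import Callable, List


def segment(sentence: str, is_word: Callable[[str], bool]) -> List[str]:
    """Split `sentence` by growing each candidate while it is still a known word."""
    words = []
    start = 0
    while start < len(sentence):
        # Assumption: a single character is accepted as a word if nothing longer matches.
        end = start + 1
        # Try 2 characters, then 3, and so on; stop at the first candidate
        # that is not in the dictionary.
        while end < len(sentence) and is_word(sentence[start:end + 1]):
            end += 1
        words.append(sentence[start:end])
        start = end
    return words


# Hypothetical stand-in for the dictionary table in the database.
dictionary = {"主楼", "怎么", "走"}

print(segment("主楼怎么走", dictionary.__contains__))
# ['主楼', '怎么', '走']
```

Each call to `is_word` corresponds to one database query, so even this five-character sentence needs four lookups (主楼, 主楼怎, 怎么, 怎么走).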
I don't really like this approach, because to analyze even a small text it would query the database too many times.
Are there any other solutions to this?