Building Chinese-English dictionary - how to detect which characters form words?

Question

I'm trying to build an application in Rails that will help users read Chinese text. If a user clicks on a Chinese character, they'd get information about the pronunciation and meaning.

I got this to work using a Chinese-English dictionary database. However, I'm not sure how to detect whether a character stands alone or is part of a longer word. For example: I have the text 我是铁公鸡 and the user clicks on the character 公, which on its own means "public", but the app should highlight 铁公鸡 and show "miser". So a character can stand alone or form a word with the characters around it.

What's an efficient way to detect what word the character forms? I was thinking of checking the target character and its neighbors against the database and choosing the longest combination that can be found. Any other ideas?
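To make the idea concrete, here is a minimal sketch of the "longest combination around the clicked character" approach. The in-memory `DICTIONARY` set, the `MAX_WORD_LENGTH` bound, and the method name `longest_word_around` are all assumptions for illustration; in the real app the lookup would be a database query.

```ruby
require "set"

# Hypothetical in-memory stand-in for the dictionary table.
DICTIONARY = Set.new(%w[我 是 铁 公 鸡 公鸡 铁公鸡])
MAX_WORD_LENGTH = 4 # assumed upper bound on the length of an entry

# Returns [start, length] of the longest dictionary entry that covers
# the character at `index`, or [index, 1] if nothing longer matches.
def longest_word_around(text, index, dictionary = DICTIONARY, max_len = MAX_WORD_LENGTH)
  chars = text.chars
  max_len.downto(2) do |len|
    # Every window of `len` characters that still contains `index`.
    first = [index - len + 1, 0].max
    last = [index, chars.length - len].min
    (first..last).each do |start|
      return [start, len] if dictionary.include?(chars[start, len].join)
    end
  end
  [index, 1]
end

start, len = longest_word_around("我是铁公鸡", 3)
puts "我是铁公鸡"[start, len] # the span to highlight for a click on 公
```

This tries every window of up to `MAX_WORD_LENGTH` characters that contains the clicked position, longest first, so it costs a handful of lookups per click.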

Awesome problem to solve. I imagine you have to highlight both instances to show the possible meanings. So, instead of trying to decide what to show, you show the user all possibilities: A single symbol as a word, or a word made up by several surrounding symbols. — Mohamad, Dec 18 '14 at 15:32
Nice problem, indeed. I would probably go for a dedicated [full-text search engine](http://stackoverflow.com/questions/47656/how-do-i-do-full-text-searching-in-ruby-on-rails) because these are specialized for such use cases, but I must admit that I don't know if any engine supports Chinese well. — m_x, Dec 18 '14 at 16:21

score 1 · Answer 1 · answered Dec 18 '14 at 18:59

The method I use at pin1yin1.com is to start from the first character, find the longest string of characters that exists in the dictionary (I use CEDICT), then call that a word and start over with the following character. That mimics the sequential way in which we read or hear words, and in practice it tends to get it right.
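A rough sketch of this greedy left-to-right matching in Ruby follows. It is not the actual pin1yin1.com code; the toy `WORDS` set stands in for the CEDICT headwords, and `segment` is a name made up for this example.

```ruby
require "set"

# Toy word list standing in for CEDICT headwords (hypothetical data).
WORDS = Set.new(%w[我 是 铁 公 鸡 公鸡 铁公鸡])
LONGEST_ENTRY = WORDS.map(&:length).max

# Greedy left-to-right segmentation: at each position take the longest
# dictionary match, falling back to a single character if none is found.
def segment(text, words = WORDS, longest = LONGEST_ENTRY)
  chars = text.chars
  result = []
  pos = 0
  while pos < chars.length
    len = [longest, chars.length - pos].min
    # Shrink the candidate until it is in the dictionary or one character long.
    len -= 1 while len > 1 && !words.include?(chars[pos, len].join)
    result << chars[pos, len].join
    pos += len
  end
  result
end

p segment("我是铁公鸡")
```

Characters that are missing from the dictionary simply come out as one-character words, so the loop always advances.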

It's also easy to do this efficiently with a typical index, since you can quickly retrieve all the entries starting with a character or two, then loop over them looking for the longest match. For your application I would recommend backing up 10 or 20 characters, then identifying the words sequentially the way I do until you find the word that contains the selected character.
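Here is a small sketch of that lookup for a clicked character, under some assumptions: `ENTRIES` is toy data, the `BY_FIRST_CHAR` hash imitates what an indexed prefix query on the dictionary table would return, and `longest_match_at` and `word_at` are hypothetical names.

```ruby
# Hypothetical entries; in the app these would come from an indexed
# prefix query on the dictionary table rather than an in-memory hash.
ENTRIES = %w[我 是 铁 公 鸡 公鸡 铁公鸡]
BY_FIRST_CHAR = ENTRIES.group_by { |word| word[0] }

# Longest entry that matches the text at `pos` (the single character if none).
def longest_match_at(text, pos)
  candidates = BY_FIRST_CHAR.fetch(text[pos], [])
  best = candidates.select { |word| text[pos, word.length] == word }.max_by(&:length)
  best || text[pos]
end

# Backs up `backup` characters, identifies words sequentially, and returns
# [start, word] for the word that covers the clicked `index`.
def word_at(text, index, backup: 10)
  raise ArgumentError, "index out of range" unless (0...text.length).cover?(index)

  pos = [index - backup, 0].max
  loop do
    word = longest_match_at(text, pos)
    return [pos, word] if index < pos + word.length
    pos += word.length
  end
end

p word_at("我是铁公鸡", 3)
```

The back-up distance matters: in this toy example, `word_at("我是铁公鸡", 3, backup: 0)` starts matching at 公 itself and returns 公鸡 instead of 铁公鸡, which is why starting well before the clicked character is the safer choice.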

Thanks, I will try doing that. — Szymon Borucki, Dec 18 '14 at 19:21

score 1 · Answer 2 · answered Dec 22 '14 at 22:07

You need a Chinese segmenter. There are many types of Chinese segmenters, including HMM (hidden Markov model), CRF (conditional random field), and MM (maximum matching) segmenters; pdg137 is using MM segmentation. If you search for Chinese segmentation, you can find open source programs that use these different strategies.

You should check out Stanford's Chinese segmentation tool. It's done pretty well in segmentation competitions.

Incidentally, I've already created a website that does what you describe.

score 0 · Answer 3 · answered Dec 22 '14 at 07:03

This guy seems to have figured it out: http://www.sitepoint.com/efficient-chinese-search-elasticsearch/ (he uses Elasticsearch and a plugin for Asian languages).

Szymon Borucki
