Regex for Matching Pinyin

Question

I'm looking for a regular expression that correctly matches valid pinyin (e.g. "sheng", "sou") while rejecting invalid pinyin (e.g. "shong", "sei"). Most of the regexes in the top Google results match invalid pinyin in some cases.

Obviously, no matter what approach one takes, this will be a monster regex, and I'm especially interested in the different approaches one could take to solve this problem. For example, "Optimizing a regular expression to parse chinese pinyin" uses lookbehinds.

A table of valid pinyin can be found here: http://pinyin.info/rules/initials_finals.html

Nice catch. "Sou" is valid, so I changed the second one to "sei", which is invalid pinyin. — stevendaniels, Jun 04 '14 at 03:59
Great question. For practical applications, a lookup table has several advantages over a regex. — Aaron Brick, Dec 11 '17 at 21:37
A really lazy brute-force solution would be to take [column 6 "Hanyu Pinyin" (or another column depending on your needs) from here](https://en.wikipedia.org/wiki/Comparison_of_Chinese_transcription_systems), and then replace every vowel with itself and its tonal accents (e.g. `a` → `[aāáǎà]`). This would include some false positives (e.g. `yào` is valid but `yaò` is not). Then separate the syllables by `|` and voilà! — gnucchi, Jun 01 '19 at 00:53
For accented pinyin a rough and fuzzy approach would be using `.*[ĀāÁáǍǎÀàĒēÉéĚěÈèĪīÍíǏǐÌìŌōÓóǑǒÒòŪūÚúǓǔÙùÜüǗǘǙǚǛǜ«»⸢⸣⸤⸥]+.*` — of course, this would also match any non-english character sets such as German, French, Hungarian etc. The proper way would be using the chars to build a replacement map to preprocess the data before applying *stevendaniels'* answer. — ccpizza, Jun 16 '20 at 22:18
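The replacement-map idea from the last comment could be sketched roughly as below. This is only an illustration, not code from any commenter: the function name `strip_tones` is made up, and mapping `ü` to `v` is an assumption chosen to match the `v` convention used by the accepted answer's regex.

```python
import unicodedata

# Combining marks for the four tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(text: str) -> str:
    """Remove pinyin tone marks and write u-umlaut as 'v' (hypothetical helper)."""
    # NFD splits e.g. 'ē' into 'e' + combining macron, and 'ǚ' into
    # 'u' + combining diaeresis + combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # What remains of 'ü' is 'u' + combining diaeresis; map it to 'v'.
    toneless = toneless.replace("u\u0308", "v").replace("U\u0308", "V")
    # Recompose anything unrelated that was decomposed along the way.
    return unicodedata.normalize("NFC", toneless)


print(strip_tones("shēng"))
print(strip_tones("nǚ"))
```

The output of this step is plain ASCII for pinyin input, so it can be fed to a regex or lookup table that only knows unaccented syllables. It does not check where the tone mark was placed, so `yaò` is normalized just like `yào`.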

stevendaniels · Accepted Answer · 2014-05-10T02:40:15.503

11

I went for a regex that grouped smaller regexes by the pinyin's initial (usually the first letter). So, the first group includes all "b", "p" and "m" sounds, then "f", then "d" and "t", etc.

This approach seems easy to read and should be easy to edit (if it needs corrections or additions). I also added exceptions to the beginning of groups in order to improve readability.

([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)

Here's the Debuggex example I created.

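For anyone who wants to try the expression from Python, here is a minimal sketch. The pattern is copied unchanged from above; the names `PINYIN_SYLLABLE` and `is_pinyin_syllable` are made up for this example, and the use of `re.VERBOSE` plus `fullmatch` is just one way to keep the line breaks and test a single whole syllable.

```python
import re

# The regex from the answer, kept on its original lines. re.VERBOSE ignores
# the newlines, and the outer (?: ... ) keeps the alternation together.
PINYIN_SYLLABLE = re.compile(r"""
(?:
([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
)
""", re.VERBOSE)


def is_pinyin_syllable(text: str) -> bool:
    """True if the whole string matches the pattern (hypothetical helper)."""
    return PINYIN_SYLLABLE.fullmatch(text) is not None


for candidate in ["sheng", "sou", "shong", "sei"]:
    print(candidate, is_pinyin_syllable(candidate))
```

Using `fullmatch` matters here: with a plain search, an invalid string such as "sei" would still produce a partial match on "se".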

edited May 10 '14 at 02:40

answered Dec 23 '13 at 02:35

stevendaniels


Hmm for some reason I can't seem to get it to match "shi", "zhi", "zi", "si" etc? – redshift5 Mar 20 '14 at 04:21

I have modified your expression to include the missing "i" matchers: https://www.debuggex.com/r/JG_eVfJIoxGtkmQ_ – redshift5 Mar 20 '14 at 04:54

Added capital letter matching and made a minor fix so that it matches "er". – stevendaniels May 10 '14 at 02:41

capital matching can be achieved using the `i` flag – Édouard Lopez Jun 03 '14 at 14:22

True, but technically, capital letters are only valid for the first character of a word, and only if that word is a proper noun or at the beginning of a sentence. – stevendaniels Jun 04 '14 at 04:02
isn't it also missing `nü, lü, nüe, lüe`? – Kai Carver Aug 30 '17 at 17:04
ah, never mind, it uses `v` instead of `ü`, so `nv, lv, nve, lve`. – Kai Carver Aug 30 '17 at 17:10

score 2 · Answer 2 · answered Jun 03 '14 at 14:55

I would use a combination approach that is not solely regex.

Check for valid pinyin:

1. Grab a word.
2. Grab letters from the beginning of the word as long as they are consonants. This separates the initial sound from the final sound.
3. Check that the initial and the final are each valid...
4. ...and if so, see if their combination is allowed (via a table like this, but the entries are simply 1's and 0's).
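The steps above can be sketched in Python as follows. Everything here is illustrative: the names `ALLOWED`, `split_syllable` and `is_valid` are made up, and the table is a deliberately tiny, incomplete subset (only `sh`, `s`, and syllables with no initial), so a real version would need the full initial/final table. A set per initial stands in for the row of 1's and 0's.

```python
# Letters treated as vowels; 'v' stands in for u-umlaut (an assumption
# borrowed from the other answer's convention).
VOWELS = set("aeiouv")

# Tiny illustrative subset of the initial -> allowed finals table.
# Incomplete on purpose; initials that are missing are treated as invalid.
ALLOWED = {
    "sh": {"a", "e", "i", "u", "ai", "ei", "ao", "ou", "an", "en", "ang", "eng"},
    "s": {"a", "e", "i", "u", "ai", "ao", "ou", "an", "en", "ang", "eng", "ong"},
    "": {"a", "o", "e", "ai", "ao", "ou", "an", "en", "ang", "er"},
}


def split_syllable(word: str) -> tuple[str, str]:
    """Split into (initial, final) by taking leading consonants as the initial."""
    word = word.lower()
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i], word[i:]


def is_valid(word: str) -> bool:
    """Look the (initial, final) pair up in the table."""
    initial, final = split_syllable(word)
    return final in ALLOWED.get(initial, set())


for candidate in ["sheng", "sou", "shong", "sei"]:
    print(candidate, split_syllable(candidate), is_valid(candidate))
```

Checking the pair in one lookup covers steps 3 and 4 together: an unknown initial or an unknown final simply fails the membership test.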
