When expressed in normal form D (*) (canonical decomposition), the four pinyin tones use the following Unicode combining marks:
- COMBINING MACRON ('\u0304') for tone 1
- COMBINING ACUTE ACCENT ('\u0301') for tone 2
- COMBINING CARON ('\u030c') for tone 3
- COMBINING GRAVE ACCENT ('\u0300') for tone 4
That means that automatic processing in Python is almost trivial: you normalize your Unicode string into normal form D and replace each of the combining characters above with the digit of its tone. The digit then sits directly after the vowel that carried the mark.
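Code could be:
import unicodedata

def to_tone_number(s):
    # Map each combining tone mark (code point) to its tone digit.
    table = {0x304: ord('1'), 0x301: ord('2'), 0x30c: ord('3'),
             0x300: ord('4')}
    # Decompose first so every tone mark is a separate code point.
    return unicodedata.normalize('NFD', s).translate(table)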
You can then use:
>>> print(to_tone_number('''gēge
nǎinai
wàipó'''))
ge1ge
na3inai
wa4ipo2
in Python 3, or in Python 2:
>>> print(to_tone_number(u'''g\u0113ge
n\u01ceinai
w\xe0ip\xf3'''))
ge1ge
na3inai
wa4ipo2
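To see why the translation table works, you can inspect the decomposition directly. The following is only a minimal sketch for Python 3; the helper name describe_nfd is hypothetical and exists purely for illustration. It lists the Unicode name of each code point that NFD normalization produces.

```python
import unicodedata

def describe_nfd(s):
    """Return the Unicode names of the code points in the NFD form of s."""
    decomposed = unicodedata.normalize('NFD', s)
    return [unicodedata.name(ch) for ch in decomposed]

# Each toned vowel splits into a base letter followed by one combining mark.
for vowel in ('\u0101', '\u00e1', '\u01ce', '\u00e0'):  # ā á ǎ à
    print(describe_nfd(vowel))
```

Each printed list should contain the base letter followed by one of the four combining marks listed at the top, which is exactly what the table in to_tone_number replaces.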
(*) Refs: