Given a file like this:
# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, @/
A片 A片 [A pian4] /adult movie/pornography/
I want to build a json object that:
- skip lines that starts with
#
- breaks lines into 4 parts
- tradition character (spans from start
^
until the next space) - simplified character (spans from the first space to the second)
- pinyin (spans between the square brackets
[...]
) - the gloss space between the first
/
till the last/
(note there are cases where there can be slashes within the gloss, e.g./adult movie/pornography/
- tradition character (spans from start
I am currently doing it as such:
>>> for line in text.split('\n'):
... if line.startswith('#'): continue;
... line = line.strip()
... simple, _, line = line.partition(' ')
... trad, _, line = line.partition(' ')
... print simple, trad
...
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片
To get the [...]
, I had to do:
>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> simple, _, line = line.partition(' ')
>>> trad, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'
And to find the /.../
, I had to do:
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'
How do I use regex groups to catch all of them at once which doing multiple partitions/splits/findall?