How to capture all regex groups in one regex?

Question

Given a file like this:

# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, @/
A片 A片 [A pian4] /adult movie/pornography/

I want to build a JSON object that:

skips lines that start with #
breaks each remaining line into 4 parts
1. traditional characters (from the start ^ until the first space)
2. simplified characters (from the first space to the second)
3. pinyin (the text between the square brackets [...])
4. the gloss, spanning from the first / to the last / (note that there can be slashes within the gloss, e.g. /adult movie/pornography/)

I am currently doing it as such:

>>> for line in text.split('\n'):
...     if line.startswith('#'): continue
...     line = line.strip()
...     trad, _, line = line.partition(' ')
...     simple, _, line = line.partition(' ')
...     print(trad, simple)
... 
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片

To get the [...], I had to do:

>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> trad, _, line = line.partition(' ')
>>> simple, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'

And to find the /.../, I had to do:

>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'

How do I use regex groups to capture all of them at once, without doing multiple partitions/splits/findall calls?

I am late to the party, so I will put it as a comment: https://regex101.com/r/uO0yS1/1 – rock321987, Apr 18 '16 at 06:51

fedorqui · Accepted Answer · 2016-04-18T06:54:33.123

I would extract the info with a single regular expression instead. This way, you can capture the blocks in groups and then handle them as desired:

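import re

# One pattern, four capture groups: first word, second word, pinyin, gloss.
pattern = re.compile(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$")

with open("myfile", encoding="utf-8") as f:
    for line in f:
        if line.startswith('#'):
            continue  # skip comment lines
        m = pattern.search(line.rstrip('\n'))
        if m:
            print(m.groups())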

This regular expression splits the string into the following groups:

^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
  ^^^^^   ^^^^^     ^^^^^       ^^
   1)      2)        3)         4)

That is:

the first word.
the second word.
the text within [ and ].
the text from / up to the / before the end of the line.

It returns:

('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, @')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
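
Since the question asks for a JSON object, here is a minimal sketch of how the four groups could be turned into one. The group names, the parse_entry helper, and the choice to split the gloss into a list on "/" are my own assumptions for illustration, not part of the answer above:

```python
import json
import re

# Same structure as the pattern above, with named groups (the names are illustrative).
ENTRY = re.compile(
    r"^(?P<traditional>\S+) (?P<simplified>\S+) \[(?P<pinyin>[^\]]*)\] /(?P<gloss>.*)/$"
)

def parse_entry(line):
    """Return a dict for one dictionary line, or None for comments and non-matching lines."""
    if line.startswith("#"):
        return None
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    # Assumption: each slash-separated sense becomes one list item.
    entry["gloss"] = entry["gloss"].split("/")
    return entry

sample = "A片 A片 [A pian4] /adult movie/pornography/"
print(json.dumps(parse_entry(sample), ensure_ascii=False))
```

ensure_ascii=False keeps the Chinese characters readable in the output instead of escaping them as \uXXXX sequences.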

Roland Illig · Answer 2 · 2016-04-18T20:36:43.530

3

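import re

# Four capture groups: two whitespace-delimited words, the bracketed pinyin,
# and the gloss between the outer slashes.
p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()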

See https://docs.python.org/2/howto/regex.html#grouping for more details.
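
A small sketch of the same pattern applied to several lines, assuming the lines are already in a list. The third sample line has made-up spacing (double space, tab) just to show that \s+ tolerates it. Note that the comment line is skipped here only because it does not have the word/word/[...]/... shape, so an explicit startswith('#') check would still be safer on real data:

```python
import re

# Same pattern as in the answer, compiled once and reused for every line.
p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")

lines = [
    "# http://cc-cedict.org/wiki/",
    'A咖 A咖 [A ka1] /class "A"/top grade/',
    "A圈兒  A圈儿\t[A quan1 r5]   /at symbol, @/",  # made-up spacing for illustration
]

rows = []
for line in lines:
    m = p.match(line)
    if m:
        rows.append(m.groups())

for row in rows:
    print(row)
```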

edited Apr 18 '16 at 20:36

answered Apr 18 '16 at 06:48

1

Or for brevity `simple, trad, pinyin, gloss = m.groups()` – tripleee Apr 18 '16 at 08:27
Thanks, I don’t write Python on a regular basis, so I’m glad that my answer was useful at all. :) – Roland Illig Apr 18 '16 at 20:36

score 2 · Answer 3 · answered Apr 18 '16 at 06:44

This might help:

preg = re.compile(r'^(?<!#)(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)

with open('your_file') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())

Take a look here for a detailed explanation of the used regular expression.
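
As far as I can tell, re.MULTILINE only changes where ^ and $ may match, so it mostly pays off when the pattern is run over the whole text rather than line by line. Below is a sketch of that usage with finditer. I dropped the (?<!#) lookbehind in this version: directly after ^ the previous character is either nothing or a newline, so it cannot fail, and \w+ already refuses to match a leading "#". The sample text and variable names are only for illustration:

```python
import re

# The answer's pattern without the lookbehind; \w+ already rejects a leading "#".
preg = re.compile(r'^(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$', re.MULTILINE | re.UNICODE)

text = (
    "# http://cc-cedict.org/wiki/\n"
    "A A [A] /(slang) (Tw) to steal/\n"
    "AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/\n"
)

entries = []
for m in preg.finditer(text):
    trad, simp, pinyin, gloss = m.groups()
    # The third group includes the square brackets, so strip them here.
    entries.append((trad, simp, pinyin.strip('[]'), gloss))

for entry in entries:
    print(entry)
```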

AKS · Answer 4 · 2016-04-18T07:00:35.107

1

I created the following regex to match all four groups:

REGEX DEMO

^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)

This assumes there is exactly one whitespace character between the groups; if there can be more, add a quantifier (for example \s+ instead of \s).
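
A quick sketch of what this pattern captures, using one line from the question. Groups 3 and 4 keep their delimiters, so they are stripped afterwards. The second pattern is a hypothetical variant for runs of whitespace: it swaps (.*) for (\S+) in the first two groups, because a greedy (.*) followed by \s+ can leave a trailing space inside the first group. The spaced sample line is made up:

```python
import re

# The answer's pattern, written without the optional backslashes before "/".
single_space = re.compile(r"^(.*)\s(.*)\s(\[.*\])\s(/.*/)")

line = "A片 A片 [A pian4] /adult movie/pornography/"
trad, simp, pinyin, gloss = single_space.match(line).groups()
# pinyin and gloss still carry their delimiters at this point.
pinyin = pinyin.strip("[]")
gloss = gloss.strip("/")

# Hypothetical variant for runs of whitespace: \S+ keeps the words free of stray spaces.
multi_space = re.compile(r"^(\S+)\s+(\S+)\s+(\[.*\])\s+(/.*/)")
spaced = "A片   A片  [A pian4]   /adult movie/pornography/"
groups = multi_space.match(spaced).groups()

print(trad, simp, pinyin, gloss)
print(groups)
```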

Here is a demo of how this works with python with the lines provided in the question:

IDEONE DEMO

edited Apr 18 '16 at 07:00

answered Apr 18 '16 at 06:48

I assume it would be a good learning for everyone if downvoters also leave relevant comments. – AKS Apr 18 '16 at 09:17
