Using Japanese delimiters in re package

Question

I have the following text and want to extract '- あらたなるきぼう' which is between '(' and the Japanese character '、'

st1='『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'

I used two regex methods to extract what I wanted but neither of them worked.

  # -*- coding: utf-8 -*-
  import re
  st1='『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'
  m1 = re.search('\(([^、]*).*、.*\)',st1)
  m2 = re.search('\((.*?)、.+?\)',st1).group(1)

Any idea what I am doing wrong?

Of course I could use the split method, first on '、' and then on '('. But that is ugly and not robust, and for some reason it does not split on '(':

st1.split('、')[0].split('(')

Try using codepoints in the regex (if Python supports that). — , Apr 03 '14 at 16:59
It is often worth testing regular expressions with e.g. http://regex101.com/#python, which can show you exactly what you're capturing (or not) — jonrsharpe, Apr 03 '14 at 17:03

score 3 · Accepted Answer · answered Apr 03 '14 at 17:00

The first character is:

（

not:

(

These are distinct characters. The first is the FULLWIDTH LEFT PARENTHESIS (U+FF08). The second is the normal ASCII open parenthesis (U+0028).
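
One way to confirm that two look-alike characters really differ is to print their code points and Unicode names. This is a minimal sketch assuming Python 3 and only the standard unicodedata module; the variable names are illustrative:

```python
import unicodedata

fullwidth = '（'    # the opening parenthesis that actually appears in st1
ascii_paren = '('   # the opening parenthesis used in the question's regex

# Print the code point and official Unicode name of each character.
for ch in (fullwidth, ascii_paren):
    print('U+%04X' % ord(ch), unicodedata.name(ch))

# The two characters compare unequal, so a pattern written with one
# cannot match the other.
same = fullwidth == ascii_paren
print(same)
```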

You must use a unicode string with the right unicode character to get a match:

>>> st1=u'『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'
>>> import re
>>> re.search(u'（([^、]*).*、.*\)',st1)
<_sre.SRE_Match object at 0x103717738>
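
To get the captured text rather than just a match object, call group(1) on the result. The sketch below assumes Python 3, where every str is already Unicode and the u prefix is optional. The simplified pattern and the names pattern and extracted are illustrative, not the answerer's code:

```python
import re

st1 = '『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'

# Accept either the fullwidth or the ASCII opening parenthesis,
# then capture everything up to the first ideographic comma.
pattern = re.compile(r'[（(]([^、]*)、')

m = pattern.search(st1)
# strip() drops the space that follows the parenthesis in the source text.
extracted = m.group(1).strip() if m else None
print(extracted)
```

Allowing both parenthesis forms in the character class keeps the pattern working if the input mixes fullwidth and ASCII punctuation.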

answered Apr 03 '14 at 17:00

jterrace


I know I can simplify, but I intentionally left it as close to OP's regex as possible. – jterrace Apr 03 '14 at 17:10
Nice observation on the （, could have figured it out myself. Thanks – user1848018 Apr 04 '14 at 18:57

score 0 · Answer 2 · edited May 23 '17 at 12:05

If you're using Python 2.x, try making the regex a unicode string with the 'u' prefix. Since it's a regex, it's also good practice to make the pattern a raw string with the 'r' prefix (in Python 2 the two prefixes combine as ur'...'). Also, putting your entire pattern in parentheses is superfluous.

Refer to the docs or the linked answer.
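
As a rough illustration of the raw-string advice, the sketch below assumes Python 3, where the 'u' prefix is not needed. It shows that a raw string passes the backslash through to the regex engine unchanged, so the pattern does not need doubled backslashes. The variable names are illustrative:

```python
import re

st1 = '『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'

# Both spellings produce the same two-character pattern: backslash + ')'.
raw_close = r'\)'
escaped_close = '\\)'
print(raw_close == escaped_close)

# No outer parentheses around the whole pattern: group(0) already
# holds the full match, so only the wanted part needs a group.
m = re.search(r'（([^、]*)、.*\)', st1)
captured = m.group(1) if m else None
print(repr(captured))
```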

edited May 23 '17 at 12:05

Community


answered Apr 03 '14 at 16:59

onsy

