3

I have the following text and want to extract '- あらたなるきぼう', which is between '(' and the Japanese character '、':

st1='『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'

I used two regex methods to extract what I wanted but neither of them worked.

    # -*- coding: utf-8 -*-
    import re
    st1='『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'
    m1 = re.search('\(([^、]*).*、.*\)',st1)
    m2 = re.search('\((.*?)、.+?\)',st1).group(1)

Any idea what I am doing wrong?

Of course I could use the split method, first on '、' and then on '('. But first, it is ugly and not robust, and second, for some reason it does not split on '(':

st1.split('、')[0].split('(')
jonrsharpe
user1848018
  • Try using codepoints in the regex (if Python supports that). –  Apr 03 '14 at 16:59
  • It is often worth testing regular expressions with e.g. http://regex101.com/#python, which can show you exactly what you're capturing (or not) – jonrsharpe Apr 03 '14 at 17:03

2 Answers

3

The first character is:

（

not:

(

These are distinct characters. The first is the FULLWIDTH LEFT PARENTHESIS (U+FF08). The second is the normal ASCII open parenthesis (U+0028).
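
One way to check which of the two characters a string really contains is to print each character's code point and Unicode name. This is a minimal sketch, assuming Python 3 and the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return 'U+%04X %s' % (ord(ch), unicodedata.name(ch))

# '\uff08' is the parenthesis that appears in the Japanese text.
print(describe('\uff08'))  # U+FF08 FULLWIDTH LEFT PARENTHESIS
# '(' is the parenthesis typed in the regex and in split('(').
print(describe('('))       # U+0028 LEFT PARENTHESIS
```

The same mismatch explains why `split('(')` in the question does not split the string: there is no ASCII opening parenthesis in it.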

You must use a Unicode string with the right Unicode character to get a match:

>>> st1=u'『スター・ウォーズ エピソード4/新たなる希望』（ - あらたなるきぼう、Star Wars Episode IV: A New Hope)'
>>> import re
>>> re.search(u'（([^、]*).*、.*\)',st1)
<_sre.SRE_Match object at 0x103717738>
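
For readers on Python 3, where every `str` is already Unicode, the same idea might look like the sketch below. It assumes the opening parenthesis in the text is U+FF08 (written as an escape so it cannot be confused with the ASCII one), and the character class also accepts an ASCII `(` in case the input varies. The names `pattern` and `result` are illustrative, not from the original answer.

```python
import re

# \uff08 is FULLWIDTH LEFT PARENTHESIS; spelled as an escape to make it visible.
st1 = '『スター・ウォーズ エピソード4/新たなる希望』\uff08 - あらたなるきぼう、Star Wars Episode IV: A New Hope)'

# Match either kind of opening parenthesis, then capture everything
# up to (but not including) the first '、'.
pattern = re.compile(r'[(\uff08]([^、]*)、')

match = pattern.search(st1)
# Guard against None so a failed match does not raise AttributeError.
result = match.group(1) if match else None
print(repr(result))
```

The captured group keeps the leading space that follows the parenthesis; call `result.strip()` if that is not wanted.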
jterrace
0

If you're using Python 2.x, try making the regex string a unicode string with the 'u' prefix. Since it's a regex, it's also good practice to make the pattern a raw string with the 'r' prefix. Also, putting your entire pattern in parentheses is superfluous.
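
To illustrate why the raw-string prefix matters, here is a small sketch assuming Python 3. The sample `text` and the variable names are made up for the example. Without `r`, Python itself interprets `\b` as a backspace character before the regex engine ever sees it.

```python
import re

text = 'A New Hope'

# Without r, '\b' becomes a single backspace character (U+0008),
# so the pattern looks for backspace + 'New' and finds nothing.
plain = re.search('\bNew', text)

# With r, the backslash reaches the regex engine, which reads \b
# as a word boundary.
raw = re.search(r'\bNew', text)

print(plain)         # None
print(raw.group(0))  # New
```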

Refer to the doc or this answer.

Community
onsy