I have this Python 3 regular expression code that I do not understand. I would appreciate help explaining exactly what it does, with a few examples. The code is:

# encoding=utf-8
import re
newline = re.sub(r'\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)', r'\1 ', newline)
TJ1

1 Answer

Here is your regular expression:

\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)

and here is a visualization:

[Figure: regular expression visualization (Debuggex demo)]

Your replacement is r'\1 ', which means: replace each match with the contents of group 1 followed by a single space. I don't read Farsi, but here is a simpler example with the same structure:

\s+((a|b)[./?]*)

[Figure: regular expression visualization (Debuggex demo)]

So let's execute some code:

>>> import re
>>> newline = '     a?    b?        a.'
>>> re.sub(r'\s+((a|b)[./?]*)', r'\1 ', newline)
'a? b? a. '

This consumes the whitespace run in front of each matched token (the leading \s+) and replaces the whole match with group 1 followed by one space (r'\1 ').
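
To make the group numbering concrete, here is a minimal sketch using the simplified pattern. Groups are numbered by the position of their opening parenthesis, so group 1 is the outer pair (letter plus trailing punctuation) and group 2 is the inner (a|b). The sample string and the variable names are made up for illustration.

```python
import re

# Simplified pattern from the answer: group 1 is the outer parentheses,
# group 2 is the inner (a|b) alternation.
pattern = re.compile(r'\s+((a|b)[./?]*)')

text = 'x   a?   b..'  # hypothetical sample input

match = pattern.search(text)
whole = match.group(0)   # whitespace run plus the token
group1 = match.group(1)  # token with punctuation, no leading whitespace
group2 = match.group(2)  # just the letter

# Replace every match with group 1 followed by a single space.
result = pattern.sub(r'\1 ', text)

print(repr(whole))   # '   a?'
print(repr(group1))  # 'a?'
print(repr(group2))  # 'a'
print(repr(result))  # 'xa? b.. '
```

Note that all whitespace before a match is dropped, so the token ends up attached to whatever precedes it ('x' and 'a?' become 'xa?'), with the single space added after it.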

dnozay
  • Thanks for the answer. Based on the above figure, isn't the first group Group 1? Then what you said becomes confusing. Can you elaborate a little more please? For example if I have: `newline = 'رفته اند'`, what should I get after running the code? – TJ1 Jan 23 '14 at 12:56
  • Provided your `newline` is unicode, you'd get the same thing, because there are no extra spaces. `r'\1 '` keeps everything in group 1, which does not include the leading spaces. – dnozay Jan 23 '14 at 20:36
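
The example from the comments can be checked directly. The sketch below runs the original pattern on 'رفته اند' and on a hypothetical variant with several spaces between the two words. One detail worth noting, as far as I can tell from how Python tries alternatives left to right: the first alternative, (زا(ی)?)?, is optional and can match the empty string, so for these inputs group 1 comes out empty and the substitution simply collapses the whitespace run to one space.

```python
import re

# Original pattern from the question, copied unchanged.
pattern = re.compile(r'\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)')

single = 'رفته اند'                    # input from the comment
several = single.replace(' ', '    ')  # hypothetical variant with extra spaces

match = pattern.search(single)
group1 = match.group(1)  # empty here: the optional first alternative matched nothing

out_single = pattern.sub(r'\1 ', single)
out_several = pattern.sub(r'\1 ', several)

print(repr(group1))           # ''
print(out_single == single)   # True: one space in, one space out
print(out_several == single)  # True: the run of spaces collapses to one
```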