I'm trying to segment a paragraph into sentences. I selected '.', '?' and '!' as the segmentation symbols. I tried:

format = r'((! )|(. )|(? ))'
delimiter = re.compile(format)
s = delimiter.split(line)

but it gives me sre_constants.error: unexpected end of pattern

I also tried

format = [r'(! )',r'(? )',r'(. )']
delimiter = re.compile(r'|'.join(format))

It also causes an error.

What's wrong with my method?

hippietrail
ChuNan

1 Answer

. (wildcard) and ? (the "zero or one" quantifier) are special regex characters; you need to escape them with a backslash to use them literally.
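As a rough illustration of the escaping fix, here is a minimal sketch of the question's alternation with `.` and `?` escaped. The sample string is made up for the example, and the capturing groups are dropped on purpose, since `re.split` would otherwise return the captured delimiters as part of the result.

```python
import re

# '.' and '?' are escaped so they match literally; '!' is not special.
# No capturing groups, so the delimiters are not returned by split().
delimiter = re.compile(r'! |\. |\? ')

line = "Hello there. How are you? Fine! Bye."  # hypothetical sample input
sentences = delimiter.split(line)
print(sentences)  # ['Hello there', 'How are you', 'Fine', 'Bye.']
```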

However, in your case it would be much simpler to use a character class (inside which these characters aren't special anymore):

s = re.split(r'[!.?] ', line)

A character class [...] stands for "one character, any of the ones included inside the character class".
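A small sketch of the character-class approach in use may help. The helper name `split_sentences` and the sample text are assumptions for illustration only. The second call is a variant not in the answer above: it uses a lookbehind so that the punctuation stays attached to each sentence.

```python
import re

def split_sentences(paragraph):
    """Hypothetical helper: split on '!', '.' or '?' followed by a space."""
    return re.split(r'[!.?] ', paragraph)

# Hypothetical sample; note there is no space after the closing quote.
text = '"What do you mean?"said Mary. I see! Fine.'

parts = split_sentences(text)
print(parts)  # ['"What do you mean?"said Mary', 'I see', 'Fine.']

# Variant (an assumption, not from the answer): split on a space that is
# preceded by sentence punctuation, so the punctuation is kept.
kept = re.split(r'(?<=[!.?]) ', text)
print(kept)  # ['"What do you mean?"said Mary.', 'I see!', 'Fine.']
```

Because the pattern requires a space after the punctuation, the `?` inside the quotation does not trigger a split, and no piece starts with a leading space.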

Robin
  • Thank you for your answer. In my case, I would include the space too. There is usually a space after each sentence, so if I use [.?!] directly, (i) every sentence I print has a space at the front, and (ii) "What do you mean?"said by Mary is segmented into two sentences instead of one. – ChuNan Apr 17 '14 at 14:57
  • Saw your update. It works. Thanks a lot! Will accept in 3 min as required :) – ChuNan Apr 17 '14 at 14:59
  • @ChuNan: Updated indeed. Glad I could help – Robin Apr 17 '14 at 15:02
  • Upvoting for short and sweet. I have been noticing your fine regex style. Yes, load up these character classes! `[[?*+.-]` :) – zx81 May 05 '14 at 23:12