I have a string of text that I want to split into individual sentences. The text has a lot of subtleties (abbreviations, for example) that make it difficult to split correctly, and I cannot use the nltk library. My current regex does the best job of all those I have tried, but it misses sentences that start on a new line (implying a new paragraph). Is there an easy way to modify the current expression so that it also splits on a newline?

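import re

# Read the whole file; the context manager closes it afterwards.
with open('data.txt', 'r') as file:
    text = file.read()

# Raw string so the regex escapes are not treated as string escapes.
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)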

The current regexp is as follows:

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

I would essentially need to modify the expression to also split when there is a new line.

  • `text = text.replace("\n","")` but there are no newlines ... –  Mar 29 '19 at 18:00
  • Can you post the input data you're working on? I'm not sure what those negative lookbehind sequences are supposed to make your pattern avoid. I'm guessing you were getting false positives on abbreviations or something. – CAustin Mar 29 '19 at 18:00
  • Try `\n|(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s`, if Python supports a lookbehind like `(?<=\.|\?)`. It's funny, your exact regex was posted by someone yesterday. Don't become a new user just to repost the same question... –  Mar 29 '19 at 18:02
  • Sorry, I deleted the `text = text.replace("\n","")` line. Yes, I used someone else's regex that I found on Stack Overflow. I did not make a new account to repost or anything. – Ashley Peedikaparambil Mar 29 '19 at 18:11
  • See, I just suggested prepending a `\n|` alternation because your current regex basically splits on whitespace `\s`, and a newline is one kind of whitespace. The only difference is that the `\n` is not qualified with assertions. –  Mar 29 '19 at 19:21
  • No one dinged this as a duplicate? https://stackoverflow.com/questions/22042948/split-string-using-a-newline-delimiter-with-python https://stackoverflow.com/questions/13169725/how-to-convert-a-string-that-has-newline-characters-in-it-into-a-list-in-python Or for not giving enough info? Based on the question, this has been asked several times before, with or without regex. I'm just asking. Is there more info? – FailSafe Mar 29 '19 at 19:24
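
The `\n|` alternation suggested in the comments can be sketched roughly as follows. This is only an illustration under some assumptions: the helper name `split_sentences` is hypothetical, `\n+` is used instead of a single `\n` so that blank lines between paragraphs collapse into one split point, and empty pieces are filtered out afterwards. The sample text is made up, since the real input was not posted.

```python
import re

# Hypothetical helper: the question's pattern with an unconditional
# newline alternative placed in front of it.
SENTENCE_BREAK = re.compile(
    r'\n+'                    # assumption: any run of newlines always splits
    r'|'
    r'(?<!\w\.\w.)'           # not after something shaped like "e.g."
    r'(?<![A-Z][a-z]\.)'      # not after something shaped like "Mr."
    r'(?<=\.|\?)'             # only after a period or question mark
    r'\s'                     # the whitespace that gets consumed
)


def split_sentences(text):
    """Split text on sentence-ending whitespace and on newlines."""
    pieces = SENTENCE_BREAK.split(text)
    # Drop empty strings left by adjacent split points such as ". \n".
    return [p.strip() for p in pieces if p.strip()]


sample = "Mr. Smith went home. He slept\nA new paragraph starts here. Is it e.g. fine? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

Because the newline branch carries no lookbehind assertions, a line that ends without a period or question mark (a heading, for instance) is still split off, which the original pattern would not do.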