I have running text with many sentences, and a regular expression that extracts the sentences terminated by a period, question mark, or exclamation mark. The end of a sentence must be followed by the beginning of the next sentence (whitespace, tabs, or newlines, then a capital letter or a digit). I read the text into a string stored in data and call the regex:
import re
basic_pat = re.compile(r"[(']?\w.+[)']?[?.!](?=\s+[A-Z\d])")
result = basic_pat.findall(data)
This regex seems to work as long as abbreviations are not taken into account. The text may also contain chapter headings that do not end with a period. For example:
This is the first chapter
Here is the first sentence. Here is the second sentence.Here ids the third sent. Here is the fourth sent...
Is it possible to have one regex that extracts only the chapter headings and another that extracts the sentences? A chapter heading is a standalone line of text without a period. Regular sentences may span several lines, so a sentence can also contain a line that does not end with a period. Can regex tell the two cases (chapter vs. sentence) apart?
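To make the distinction concrete, here is a minimal sketch of the kind of split I have in mind. It rests on assumptions that may not hold for real text: a chapter line contains no '.', '?' or '!', it starts either at the beginning of the text or right after a line that ended a sentence, and the line after it starts with a capital letter or digit. The names heading_pat and sentence_pat and the sample text are only illustrative, and sentence_pat is a non-greedy variation of my regex rather than the original. Abbreviations are still not handled.

```python
import re

# Illustrative sample (an assumption, not my real data): two chapter lines,
# plus one sentence that wraps onto the next line.
data = (
    "This is the first chapter\n"
    "Here is the first sentence. Here is a second sentence that\n"
    "continues on the next line. Is this the third?\n"
    "This is the second chapter\n"
    "Here is the last sentence."
)

# Assumed heading rule: the line begins at the start of the text or right
# after a line ending in . ? or !, contains no terminator itself, and the
# following line begins with a capital letter or digit.
heading_pat = re.compile(r"(?:\A|(?<=[.?!]\n))[^\n.?!]+(?=\n[A-Z\d])")

# Non-greedy sentence pattern; DOTALL lets a sentence span several lines.
# A sentence ends at a terminator followed by whitespace and a capital or
# digit, or at the end of the text.
sentence_pat = re.compile(r"\w.*?[.?!]+(?=\s+[A-Z\d]|\s*\Z)", re.DOTALL)

headings = heading_pat.findall(data)

# Remove the heading lines first, then extract sentences from what is left.
body = heading_pat.sub("", data)
sentences = [" ".join(s.split()) for s in sentence_pat.findall(body)]

print(headings)
print(sentences)
```

A wrapped sentence whose continuation line happens to start with a capital letter (a proper noun, for instance) would defeat the heading rule, which is exactly the ambiguity I am asking about.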