Regex to differentiate between sentences and chapter text

Question

I have a (running) text with many sentences. I have a regular expression that extracts the sentences terminated by a period, question mark, or exclamation mark. The end of a sentence must be followed by the beginning of the next sentence (whitespace, tabs, or newlines, then a capital letter or a digit). I read a string stored in data and call the regex:

import re

basic_pat = re.compile(r"[(']?\w.+[)']?[?.!](?=\s+[A-Z\d])")
result = basic_pat.findall(data)

This regex seems to work as long as abbreviations are not taken into consideration. The text may also contain chapter texts that do not end with a period. For example:

This is the first chapter
Here is the first sentence. Here is the second sentence.Here ids the third sent. Here is the fourth sent...

My question is whether it is possible to have one regex that reads only the chapter texts and another regex that reads the sentences. The chapter texts are loose text on a single line without a period. Regular sentences may span several lines, so a sentence can also leave text on a line that has no period. Is it possible to differentiate the two situations (chapter vs. sentences) with regex?

Don’t forget quotation marks. – tchrist Nov 06 '11 at 21:19

score 3 · Answer 1 · answered Nov 06 '11 at 20:19

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Actually, what you should do is use two regular expressions (now you'll have four problems).

First, go through and break up the text into alternating chapter-headers and not-chapter-headings. Then examine each not-chapter-heading for sentences, paragraphs, and what have you.
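A minimal sketch of that two-pass idea in Python, under assumptions that are not in the question: headings are taken to be one-line blocks set off by blank lines that do not end in a period, question mark, or exclamation mark, and sentences are split with a deliberately naive pattern. The names `split_blocks` and `split_sentences` are hypothetical helpers chosen for illustration.

```python
import re

# Assumption: blocks are separated by one or more blank lines.
BLOCK_SEP = re.compile(r"\n\s*\n")
# Naive sentence boundary: whitespace after . ? ! and before a capital or digit.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+(?=[A-Z\d])")

def split_blocks(text):
    """First pass: return (kind, content) pairs, kind is 'heading' or 'body'."""
    blocks = []
    for block in BLOCK_SEP.split(text.strip()):
        block = block.strip()
        if not block:
            continue
        # Assumption: a heading is a single line with no sentence terminator.
        is_heading = "\n" not in block and block[-1] not in ".?!"
        blocks.append(("heading" if is_heading else "body", block))
    return blocks

def split_sentences(body):
    """Second pass: join wrapped lines, then split on the naive boundary."""
    return SENTENCE_END.split(" ".join(body.split()))

sample = (
    "This is the first chapter\n"
    "\n"
    "Here is the first sentence. Here is the\n"
    "second sentence. Is this the third? Yes!\n"
)

for kind, content in split_blocks(sample):
    if kind == "heading":
        print("HEADING:", content)
    else:
        for sentence in split_sentences(content):
            print("SENTENCE:", sentence)
```

Both heuristics are easy to defeat: a heading that ends in an abbreviation such as "St." is classified as body text, and an abbreviation in the middle of a sentence followed by a capitalized word is treated as a sentence boundary.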

How would you break up the following:

Visiting Leipzig, Chapter One: Thomaskirchhof St.

The Bach Museum is on Thomaskirchhof opposite St. Thomas's Church. van Beethoven doesn't have a museum anywhere in Leipzig.

Processing natural language is devilishly difficult. God did a thorough job when He destroyed the Tower of Babel.

answered Nov 06 '11 at 20:19

Michael Lorton

You *do* use regexes for this; the difference is you don’t use *only* regexes. You are going to have to use machine learning to model likely sentence endings to sort out your streets from your saints. Also, the “van” in Beethoven’s name is spurious. Look it up. No well-formed English sentence is permitted to start with a lowercase letter. However, that doesn’t help you parse things that don’t meet that definition. – tchrist Nov 06 '11 at 21:14
@tchrist -- you mean, on Wikipedia: [http://en.wikipedia.org/wiki/Ludwig_van_Beethoven](http://en.wikipedia.org/wiki/Ludwig_van_Beethoven) ? – Michael Lorton Nov 06 '11 at 21:15
