7

Detecting sentence boundaries in text seems hard. Punctuation marks such as . ! ? can be used to delimit sentences, but that is not very accurate because of ambiguous words and abbreviations such as U.S.A, Prof. or Dr. I am studying the TPerlRegEx library and the Regular Expressions Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects a sentence.

What would be a reasonably accurate expression using TPerlRegEx in Delphi?

Thanks

Shai
Warren

3 Answers

6

First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:

He said: "It's OK!"

Is it one sentence or two? A general answer is irrelevant. Decide whether you want to interpret it as one sentence or two, and proceed accordingly.

Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
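As a rough illustration of that character scan, here is a minimal Python sketch (Python rather than Delphi purely for brevity; the function name `split_sentences` is made up for this example). It assumes a sentence ends at `.`, `!` or `?` only when the next character is whitespace or the text ends, which is just one possible definition.

```python
def split_sentences(text):
    """Split text at '.', '!' or '?' when followed by whitespace or end of text."""
    terminators = ".!?"
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in terminators:
            continue
        at_end = i + 1 == len(text)
        if at_end or text[i + 1].isspace():
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The periods inside U.S.A are skipped because none of them is followed by whitespace, but an abbreviation such as "Dr." followed by a space is still treated as a sentence end.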

For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary, perhaps editable by your users, since each language will have its own set of common abbreviations.
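A dictionary check of that kind might look like the following Python sketch. The `ABBREVIATIONS` set and the function name are hypothetical placeholders; a real list would be loaded per language and be user-editable. For simplicity the sketch works on whitespace-separated tokens, so it normalises runs of whitespace to single spaces and does not treat a token ending in a closing quote as a sentence end.

```python
# Hypothetical starter set; in practice this would be loaded per language
# and be editable by users.
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "jr."}

def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Split on tokens ending in . ! ? unless the token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in abbreviations
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    # Whatever is left over becomes the final sentence.
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Note the trade-off: any listed abbreviation that genuinely ends a sentence causes that sentence to be merged with the next one.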

Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.

In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about

Most people called him Professor Jones, but to me he was simply The Prof.

Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be

Most people called him Professor Jones, but to me he was simply Prof. Bill.
Marek Jedliński
  • +1 I've got to add: whatever your implementation choice (regex or explicitly coded), you will want to build up a suite of test paragraphs. For each paragraph, you know how many sentences should be reported. You'll often find that trying to implement a new rule will lead to a break in existing rules. – Disillusioned Apr 20 '11 at 20:23
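A test suite along those lines could be sketched in Python as below. Everything here is illustrative: `naive_split` is only a stand-in for whatever splitter is being developed, and the expected counts in `CASES` reflect one particular definition of a sentence.

```python
import re

def naive_split(text):
    # Placeholder splitter under test: break after . ! ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Each case pairs a paragraph with the number of sentences we decided it contains.
CASES = [
    ("It rained. We stayed in.", 2),
    ("Is it late? Yes! Go home.", 3),
    ("He lives in the U.S.A and likes it.", 1),
    ("Dr. Smith arrived.", 1),  # the naive splitter gets this one wrong
]

def run_suite(splitter, cases=CASES):
    """Return the cases for which the splitter reports the wrong sentence count."""
    failures = []
    for paragraph, expected in cases:
        actual = len(splitter(paragraph))
        if actual != expected:
            failures.append((paragraph, expected, actual))
    return failures
```

Re-running `run_suite` after every rule change shows immediately whether a new rule has broken a paragraph that used to pass.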
1

Check my tutorial here: http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can easily be rewritten as regular expressions plus some imperative code.

yura
0

It would be wise to use an NLP library with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP, and it can be used in Visual Studio with SharpNLP.

This approach has several advantages. For example, consider the input

Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.

If you are using a regex split, for example

 string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

Then the above input will be split as

Prof.

Jessica is a wonderful woman.

She is a native of U.S.A.

She is married to Mr.

Jacob Jr.

However, the desired output is

Prof. Jessica is a wonderful woman.

She is a native of U.S.A. She is married to Mr. Jacob Jr.
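To experiment with the regex approach outside .NET, here is a Python port of the same pattern (the names `SPLIT_PATTERN` and `regex_split` are made up for this sketch). The pattern can only break at whitespace that follows a sentence-ending character and precedes an uppercase letter, so abbreviations followed by a space and a capitalised word, such as Prof. and Mr., are exactly the cases it gets wrong.

```python
import re

# Same pattern as the C# example: break at whitespace that follows
# [quote/letter/digit][. ! ?] and precedes an uppercase letter.
SPLIT_PATTERN = re.compile(r"""(?<=['"A-Za-z0-9][.!?])\s+(?=[A-Z])""")

def regex_split(text):
    """Split text into candidate sentences using the lookaround pattern."""
    return SPLIT_PATTERN.split(text)
```

Calling `regex_split` on the sample input above is a quick way to see where the pattern over-splits.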

This kind of context-aware sentence split is hard to achieve with a regex alone; the trained models from the OpenNLP project handle it much better. The method is as simple as this:

// Directory that contains EnglishSD.nbin.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;

private string[] SplitSentences(string paragraph)
{
    // Load the model lazily, on first use.
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }

    return mSentenceDetector.SentenceDetect(paragraph);
}

Here mModelPath is the path of the directory containing the nbin file.

The type of mSentenceDetector comes from the OpenNLP DLL.

You can get the desired output by

string[] sentences = SplitSentences(text);

For more detail, see this article I have written on integrating SharpNLP with your application in Visual Studio to make use of the NLP tools.

Community