
How can I match a sentence of the form "Hello world" or "Hello World"? The sentence may contain "- / digit 0-9". Any information would be very helpful. Thank you.

Tapas Bose
  • How is the first one (`"Hello world"`) a sentence? There's no punctuation. – Matt Ball Apr 05 '11 at 14:25
  • @baba You're right haha. I fixed it. – sawa Apr 05 '11 at 14:33
  • You wrote: `may contain "- / digit 0-9"`? No letters allowed? The question is confusing... – user85421 Apr 05 '11 at 14:36
  • @Matt Ball It's a fair bet this isn't a natural language question, and a 'sentence' in regular expression theory is any sequence of input characters which belongs to the 'language' accepted by the regular expression. – Pete Kirkham Apr 05 '11 at 14:56
  • Actually, I found this to be a pretty challenging question! (See the test data from my answer.) Matching a last sentence having no punctuation makes it a bit trickier. – ridgerunner Apr 05 '11 at 15:45

2 Answers


This regex does a pretty good job. My definition of a sentence: it begins with a character that is neither whitespace nor ending punctuation, and it ends with a period, exclamation point, or question mark (or at the end of the string). A closing quote may follow the ending punctuation.

[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String subjectString = 
        "This is a sentence. " +
        "So is \"this\"! And is \"this?\" " +
        "This is 'stackoverflow.com!' " +
        "Hello World";
        // COMMENTS (free-spacing) mode ignores pattern whitespace and allows # comments.
        Pattern re = Pattern.compile(
            "# Match a sentence ending in punctuation or EOS.\n" +
            "[^.!?\\s]    # First char is non-punct, non-ws\n" +
            "[^.!?]*      # Greedily consume up to punctuation.\n" +
            "(?:          # Group for unrolling the loop.\n" +
            "  [.!?]      # (special) inner punctuation ok if\n" +
            "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n" +
            "  [^.!?]*    # Greedily consume up to punctuation.\n" +
            ")*           # Zero or more (special normal*)\n" +
            "[.!?]?       # Optional ending punctuation.\n" +
            "['\"]?       # Optional closing quote.\n" +
            "(?=\\s|$)", 
            Pattern.MULTILINE | Pattern.COMMENTS);
        Matcher reMatcher = re.matcher(subjectString);
        while (reMatcher.find()) {
            System.out.println(reMatcher.group());
        } 
    }
}

Here is the output:

This is a sentence.
So is "this"!
And is "this?"
This is 'stackoverflow.com!'
Hello World

Matching all of these correctly (with the last sentence having no ending punctuation) turns out to be not as easy as it seems!

ridgerunner
  • Shouldn't a sentence start with a capital letter, if it is starting with a letter? 1 of 100 examples starts with an uppercase letter, but with no letter at all. – user unknown Apr 05 '11 at 15:23
  • @user unknown: Maybe. But a sentence can be whatever you want to define it to be. My definition is stated above. For example, a sentence may begin with the name of program variable which starts with a lowercase letter. – ridgerunner Apr 05 '11 at 15:41
  • Thank you. Actually my question was incomplete, as I wrote it in a hurry. I should have stated what I meant by a sentence. Your help is really appreciated. Thanks again. – Tapas Bose Apr 05 '11 at 15:56
  • `x` should be quoted at the beginning of the sentence. :) – user unknown Apr 05 '11 at 16:04
  • @Tapas Bose: You can easily change the part of the regex which matches the first char. If you need it to start with a capital letter, change the `[^.!?\\s]` to just `[A-Z]`. Glad to be of help! – ridgerunner Apr 05 '11 at 16:06
  • @ridgerunner, can you please give another regex that excludes an incomplete sentence, i.e. "Hello World" in this case, and that treats abbreviations as part of the sentence? Currently, any abbreviation (like Prof. or Mr.) is matched as a separate sentence, which breaks a complete sentence into multiple sentences. – Amit Kumar Gupta Apr 20 '13 at 17:41
  • @articlestack Take a look at my answer to another similar question: [php sentence boundaries detection](http://stackoverflow.com/a/5844564/433790). That one handles Mr., Mrs., Dr., etc. – ridgerunner Apr 21 '13 at 14:47
  • @ridgerunner, I have already seen it, but it gives me an error. I have done it the old-style way, but I was looking for a smart solution like your answer. – Amit Kumar Gupta Apr 21 '13 at 15:04
  • How would you modify this so that it only captures a sentence that contains a given word? I.e., just grab sentences that have "Hello". – Patrick Dec 06 '16 at 15:44
  • @Patrick - If even possible, the resulting regex would be extremely complex. The obvious way to handle this would be to first parse the sentences, then look inside each sentence one at a time for the given word. – ridgerunner Dec 08 '16 at 15:20
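
The two-step approach described in the last comment (parse the sentences first, then search each one for the word) might look roughly like the sketch below. This is not the answerer's code: the class name `SentenceFilter` and the method `sentencesContaining` are hypothetical, and the whole-word match via `\b` is an assumption, since the comment does not say how the word should be matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class SentenceFilter {
    // Same sentence pattern as in the answer above, written on one line.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Step 1: split the text into sentences.
    // Step 2: keep only the sentences that contain the given word.
    static List<String> sentencesContaining(String text, String word) {
        // Assumption: match the word literally, as a whole word.
        Pattern wordRe = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        List<String> hits = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String sentence = m.group();
            if (wordRe.matcher(sentence).find()) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "This is a sentence. Hello there! Say Hello to him. Hello World";
        for (String s : sentencesContaining(text, "Hello")) {
            System.out.println(s);
        }
    }
}
```

With this input, the first sentence is skipped and the other three are printed. Keeping the word test outside the sentence regex means either pattern can be changed without touching the other.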

If by sentence you mean something that ends with a punctuation mark, try this: (.*?)[.?!]

Explanation:

  • .* matches any run of characters (by default, excluding line terminators). Adding a ? makes it non-greedy, so it matches the shortest string possible
  • [.?!] matches any one of the three punctuation marks
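
To see how the lazy pattern behaves in practice, here is a minimal Java sketch. The class name `LazySentenceDemo`, the method `matches`, and the sample text are all made up for illustration; only the pattern itself comes from this answer. Each match is trimmed because the lazy group also picks up the space after the previous sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class LazySentenceDemo {
    // The pattern from this answer: lazily match up to the first '.', '?' or '!'.
    private static final Pattern LAZY = Pattern.compile("(.*?)[.?!]");

    static List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LAZY.matcher(text);
        while (m.find()) {
            // Trim the leading space left over from the previous match.
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "This is a sentence. Visit stackoverflow.com! Hello World";
        for (String s : matches(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

On this sample the pattern stops at every punctuation mark, so "stackoverflow.com!" is split into "Visit stackoverflow." and "com!", and the trailing "Hello World" is not matched at all because it has no ending punctuation. That may be acceptable for simple input, but it is worth knowing before relying on the pattern.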
krookedking