16

I'm trying to write a regex pattern that matches any sentence beginning with one or more tabs and/or spaces. For example, I want the pattern to match " hello there I like regex!", but I'm scratching my head over how to match the words after "hello". So far I have this:

    String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
    Pattern PATTERN = Pattern.compile(REGEX);
    Matcher m = PATTERN.matcher("         asdsada  adf adfah.");
    if (m.matches()) {
        System.out.println("hurray!");
    }

Any help would be appreciated. Thanks.

Jeroen Vannevel
user1923
  • 2
    How do you define a sentence? Is it a string of characters ending with a punctuation mark, or do you have a stricter definition? – Taylor Hx Dec 02 '13 at 04:18
  • My sentence must start with either one or more whitespaces/tabs. (tabs and spaces can be bunched together before any non-whitespace phrase of characters appears). Each word after the first must be separated by a whitespace. And yes, the sentence must end with a punctuation. – user1923 Dec 02 '13 at 04:19
  • 1
    @user1923 Your example sentence doesn't end in a period. – Steve P. Dec 02 '13 at 04:22
  • ^ please look at the stricter definition i posed above. – user1923 Dec 02 '13 at 04:23

8 Answers

33
String regex = "^\\s+[A-Za-z,;'\"\\s]+[.?!]$";

^ means "begins with"
\\s means white space
+ means 1 or more
[A-Za-z,;'"\\s] means any letter, ,, ;, ', ", or whitespace character
$ means "ends with"
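
To make the explanation above concrete, here is a minimal sketch that runs the same pattern through `java.util.regex`. The class name `LeadingWhitespaceSentence` and the helper `isSentence` are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer above.
final class LeadingWhitespaceSentence {
    // Leading whitespace, then letters / , ; ' " / whitespace, then one terminator.
    private static final Pattern SENTENCE =
            Pattern.compile("^\\s+[A-Za-z,;'\"\\s]+[.?!]$");

    // matches() requires the whole input to fit the pattern.
    static boolean isSentence(String input) {
        return SENTENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSentence("    hello there I like regex!")); // true
        System.out.println(isSentence("hello there."));                  // false: no leading whitespace
        System.out.println(isSentence("  I met Dr. House."));            // false: '.' is not allowed mid-sentence
    }
}
```

The last call shows the limitation raised in the comments: a period inside the sentence is not part of the character class, so abbreviations make the match fail.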

Steve P.
  • 1
    Thanks for your reply, but the regex you typed up won't compile. When I put it into eclipse, it gives me this error: "Syntax error on tokens, ( expected instead". Do you know how to fix that without messing up your code? – user1923 Dec 02 '13 at 04:29
  • @user1923 fixed. Sorry, missed escaping `"`. – Steve P. Dec 02 '13 at 04:31
  • 6
    Note: This regex does not scale. If you have sentences with "M.D." or "Mrs. Smith" in them, it will not work. – Eric Uldall Jul 09 '15 at 16:01
  • 1
    This solution doesn't seem to parse this example correctly: "I have the chance to meet Dr. House. He was with Mr. Home." – Dũng Trần Trung Apr 21 '16 at 02:40
  • The flaw in this approach is in defining a set of character classes that one deems to be the only valid components of a sentence structure. People have already noticed the issue with a non-terminating period, but even digits are excluded despite being institutionally advised in English-language prose (subject to context-specific exceptions) to use numerals to represent all (integer) numbers greater than ten, and only to spell out zero through ten. A sentence can also end with a quotation mark, and in some formal styles of documents, a parenthesis, a square bracket or a line-break character. – CJK Jun 14 '19 at 12:29
30

An example regex to match sentences by the definition "A sentence is a series of characters, starting with at least one whitespace character, that ends in one of ., ! or ?" is as follows:

\s+[^.!?]*[.!?]

Regular expression visualization

Note that newline characters will also be included in this match.
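
As a rough sketch of how this pattern behaves in Java, the snippet below collects every match with `Matcher.find()`. The class name `WhitespaceSentenceFinder` and the method `findSentences` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer above.
final class WhitespaceSentenceFinder {
    private static final Pattern SENTENCE = Pattern.compile("\\s+[^.!?]*[.!?]");

    // Returns every non-overlapping match, scanning left to right.
    static List<String> findSentences(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The third match begins with "\n ", since \s also matches newlines.
        for (String s : findSentences(" Hello there. How are you?\n Fine!")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Because the pattern is not anchored, `find()` starts a match at the first whitespace it sees: for the input "Hello there." the only match is " there.".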

Taylor Hx
  • You don't need to escape `.` in character classes. Also, this doesn't guarantee anything about where this pattern occurs in the string. – Steve P. Dec 02 '13 at 04:30
  • That is not a good definition. What if we have a decimal number or a name initial in the sentence? – wiktus239 Nov 13 '14 at 10:01
  • @wiktus239 You're right, it's not the best definition. Steve P's definition is better and thus his answer was accepted. – Taylor Hx Nov 13 '14 at 23:11
  • 1
    @TaylorHx But where did you get this nice regex visualization? – fdrv Mar 16 '16 at 11:59
  • What was the source for that definition, and has the author ever read, say, a book ? I would go as far to say that this not only fails as a definition, but doesn't even come satisfactorily close to an approximation, given that two out of its three assertions are demonstrably false in non-trivial cases both individually and together. – CJK Jun 14 '19 at 12:15
  • This also doesn't solve for "..." – mattgabor May 25 '22 at 18:14
3

A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:

\b[^.!?]+[.!?]+

https://regex101.com/r/7DdyM1/1

This gives fairly accurate results. However, it does not handle fractional numbers. For example, this sentence will be interpreted as two sentences:

The value of PI is 3.141...
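
A small sketch of that split, assuming the pattern is applied with `Matcher.find()` in Java (the class name `BoundarySentenceFinder` and method `findSentences` are invented for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer above.
final class BoundarySentenceFinder {
    private static final Pattern SENTENCE = Pattern.compile("\\b[^.!?]+[.!?]+");

    static List<String> findSentences(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Prints [Hi there!, How are you?]
        System.out.println(findSentences("Hi there! How are you?"));
        // Prints [The value of PI is 3., 141...]: the decimal point ends the first match.
        System.out.println(findSentences("The value of PI is 3.141..."));
    }
}
```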
l33t
2

If you are looking to match all strings starting with whitespace, you can try the regular expression "^\s+.*".

This tool could help you to test your regular expression efficiently.

http://www.rubular.com/

Ashish
1

Based upon what you desire and asked for, the following will work.

String s  = "    hello there I like regex!";
Pattern p = Pattern.compile("^\\s+[a-zA-Z\\s]+[.?!]$");
Matcher m = p.matcher(s); 
if (m.matches()) {
    System.out.println("hurray!");
}

See working demo

hwnd
  • Why are you using a look-ahead here? – Steve P. Dec 02 '13 at 04:34
  • It wasn't incorrect, there just didn't seem to be a use for it. Wasn't sure if you thought it was an optimization or something. In some cases using an atomic group can be an optimization, but not in this case (I think). – Steve P. Dec 02 '13 at 04:38
1
String regex = "(?<=^|(\\.|!|\\?) |\n|\t|\r|\r\n) *\\(?[A-Z][^.!?]*((\\.|!|\\?)(?! |\n|\r|\r\n)[^.!?]*)*(\\.|!|\\?)(?= |\n|\r|\r\n)";

This matches any sentence that follows the definition 'a sentence starts with a capital letter and ends with a dot'.

1

The regex pattern below matches sentences in a paragraph.

Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");

Reference: https://devsought.com/regex-pattern-to-match-sentence

John Kyalo
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 10 '22 at 13:56
0

This pattern also takes abbreviations into account, assuming that the next sentence begins with a capital letter:

  ((?:[A-ZΆ-Ω0-9][\S\s]+?)+?[a-zά-ω0-9][.!?;]+)(?= [A-ZΆ-Ω0-9]|$)

It also includes the Greek character range. Test here.
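
Here is a hedged sketch of this pattern in Java. The class name `AbbreviationAwareFinder` and method `findSentences` are hypothetical, and the Greek ranges are written as `\u` escapes (Ά = U+0386, Ω = U+03A9, ά = U+03AC, ω = U+03C9) purely to sidestep source-encoding issues; that rewrite is my assumption, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the pattern from the answer above.
final class AbbreviationAwareFinder {
    private static final Pattern SENTENCE = Pattern.compile(
            "((?:[A-Z\\u0386-\\u03A90-9][\\S\\s]+?)+?[a-z\\u03AC-\\u03C90-9][.!?;]+)"
            + "(?= [A-Z\\u0386-\\u03A90-9]|$)");

    static List<String> findSentences(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        // "approx." is followed by a lowercase word, so it does not end the sentence.
        // Prints [It costs approx. five dollars., Next one!]
        System.out.println(findSentences("It costs approx. five dollars. Next one!"));
    }
}
```

Note that the lookahead only checks the character after the space, so an abbreviation followed by a capitalized word (for example "Dr. House") would still be treated as a sentence end.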

anefeletos