0

I'm using the Scanner class in Java to go through a text file and extract each sentence. I'm passing the following regex to my Scanner's useDelimiter method:

Pattern.compile("[\\w]*[\\.|?|!][\\s]")

This currently seems to work, but it leaves the whitespace at the end of the sentence. Is there an easy way to match the whitespace at the end but not include it in the result?

I realize this is probably an easy question, but I've never used regex before, so go easy :)

hippietrail
Gary

2 Answers

5

Try this:

"(?<=[.!?])\\s+"

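This uses a lookbehind to match whitespace (\\s+) only where it is preceded by one of [.!?], so the punctuation stays with the sentence and only the whitespace is consumed as the delimiter.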
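As a minimal sketch of how this delimiter behaves with Scanner, the example below reads from an in-memory string rather than a file. The class name SentenceScanner, the helper method sentences, and the sample text are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class SentenceScanner {
    // Hypothetical helper: collects each sentence, keeping its closing punctuation.
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // Delimiter: a run of whitespace that directly follows '.', '!' or '?'.
            // The lookbehind checks the punctuation without consuming it.
            scanner.useDelimiter("(?<=[.!?])\\s+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input (an assumption); note the double space before the last sentence.
        String text = "The quick brown fox jumps. Here is another sentence!  Is this the third?";
        for (String s : sentences(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Each token ends with its punctuation and carries no trailing whitespace. To read from a file instead, the Scanner could be constructed from a java.io.File.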


If you want to remove the punctuation as well, just include it as part of the match:

"[.!?]+\\s+"

This will split "ORLY!?!? LOL" into "ORLY" and "LOL".
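A small sketch of that second pattern used as a Scanner delimiter follows. The class name PunctuationStripper and the method split are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class PunctuationStripper {
    // Hypothetical helper: splits on punctuation plus whitespace, dropping both.
    public static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // Delimiter: one or more of '.', '!', '?' followed by whitespace.
            scanner.useDelimiter("[.!?]+\\s+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("ORLY!?!? LOL")); // prints [ORLY, LOL]
    }
}
```

Because the delimiter requires whitespace after the punctuation, a final sentence with nothing after its period keeps that period: split("Hi. Bye.") yields "Hi" and "Bye.". If that matters, the last token can be trimmed afterwards.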

polygenelubricants
  • this only matches words, but does not stop at the end of a sentence. thanks for trying though! – Gary Apr 16 '10 at 01:48
  • @Gary: sorry, now fixed. Try again. – polygenelubricants Apr 16 '10 at 01:51
  • that does everything but remove the period at the end! is there an easy way to remove the period with regex or should i just alter the string afterward? Edit: forgot to say that I was also wanting to ignore commas, should i do this in regex or manually? – Gary Apr 16 '10 at 01:56
  • What do you mean by ignore commas? Right now this regex doesn't consider commas as sentence delimiters. Do you want it to? – polygenelubricants Apr 16 '10 at 01:59
  • Nevermind, on further thought: it probably isn't the job of this regex to do that. Thanks a lot for your help :) – Gary Apr 16 '10 at 02:01
0

What you're looking for is a positive lookahead. This should do it:

Pattern.compile("\\w*[.?!](?=\\s)")
Wolph
  • Thanks for your help but that didn't seem to work.. My original one produced the following with two sentences (note the spaces at the end): "The quick brown fox jumps over the lazy " "Here is another sentence that will go in the test " Yours seemed to produce the following: "The quick brown fox jumps over the lazy " " Here is another sentence that will go in the test " – Gary Apr 16 '10 at 01:38
  • Just realised that the last word is also going missing, any idea why? – Gary Apr 16 '10 at 01:39
  • @WoLpH: Shouldn't that be Pattern.compile("\\w*[.?!](?=\\s)"), given that there are different semantics for expressions inside character classes as opposed to normal? – ig0774 Apr 16 '10 at 01:41
  • Indeed ig0774, I'll change it. – Wolph Apr 16 '10 at 01:59
  • @Gary: try the revised version. The original regex had a few flaws – Wolph Apr 16 '10 at 02:01