I have a very large text. The goal is to put a space before each dot that ends a sentence, but not before dots in abbreviations, times, dates, and so on. I'm doing it like this:

    String regex = "[a-z](\\.)\\s";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    if(matcher.find())
        text = text.replace(matcher.group(1), " " + matcher.group(1));

The result contains not only the intended "The end of sentence . Next sentence . ", but also things like "Some numeric info 16 . 15 shouldn't match this regex . ".

Antos
  • Can you give a whole sentence as an example and highlight what you would like to match and what you would like not to match? If I understand you correctly you want to match all dots but those which are followed by a whitespace? – Korgen Feb 25 '16 at 15:46
  • http://stackoverflow.com/questions/20320719/constructing-regex-pattern-to-match-sentence – Adam Feb 25 '16 at 15:54
  • @Korgen The text could be something like this: "That cat weighs 5.7 kilos. Quite medium cat." I want to match the dots in "kilos." and "cat." and turn them into "kilos . " and "cat . ". My regex does it to "5 . 7" as well. – Antos Feb 25 '16 at 16:18
  • @Antos your regex works correctly: it finds the dot after `kilos`. But `String#replace()` has no idea that you need to replace *just that particular dot*, so it does it for *all* dots in the text. – Alex Salauyou Feb 25 '16 at 16:20
  • @SashaSalauyou Yes, good point. Right now I'm trying to change it. Thank you. – Antos Feb 25 '16 at 16:24
  • @SashaSalauyou replaceAll() does it the wrong way, and I can't understand the start() and end() methods. Now I'm using `text = text.replace(matcher.group(), matcher.group().charAt(0) + " . ");` but some dots remain unchanged. – Antos Feb 25 '16 at 16:30
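
To illustrate the point made in the comments about `String#replace()`, here is a minimal sketch, not the original poster's code. It assumes a sample sentence modeled on the one in the comments, with a trailing space added so that the question's pattern can also match the last dot. The class and method names (`ReplaceScopeDemo`, `literalReplace`, `positionalReplace`) are hypothetical. The first method mirrors the question's approach; the second shows one possible way to use `start()` and `end()` so that only the matched dots are touched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceScopeDemo {
    // Same pattern as in the question: a lowercase letter, a dot, then whitespace.
    private static final Pattern SENTENCE_DOT = Pattern.compile("[a-z](\\.)\\s");

    // Mirrors the question's code: group(1) is just ".", so replace()
    // rewrites every dot in the text, including the one in "5.7".
    static String literalReplace(String text) {
        Matcher matcher = SENTENCE_DOT.matcher(text);
        if (matcher.find()) {
            text = text.replace(matcher.group(1), " " + matcher.group(1));
        }
        return text;
    }

    // Inserts a space only at the positions the regex actually matched.
    // start(1) and end(1) are the offsets of the captured dot.
    static String positionalReplace(String text) {
        Matcher matcher = SENTENCE_DOT.matcher(text);
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;
        while (matcher.find()) {
            out.append(text, copiedUpTo, matcher.start(1)); // text before the matched dot
            out.append(" .");                               // the dot, now preceded by a space
            copiedUpTo = matcher.end(1);                    // continue right after the dot
        }
        out.append(text, copiedUpTo, text.length());        // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "That cat weighs 5.7 kilos. Quite medium cat. ";
        System.out.println(literalReplace(text));
        System.out.println(positionalReplace(text));
    }
}
```

The positional version never searches the text for a literal ".", so dots that the regex did not match (such as the one in "5.7") are copied through unchanged.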

1 Answer

I'd suggest using Matcher#replaceAll() for this:

    Pattern regex = Pattern.compile("([a-z])\\.(\\s|$)");
    // $1 is the letter, $2 is the whitespace (or empty at end of input)
    text = regex.matcher(text).replaceAll("$1 .$2");

The same thing using lookbehind (?<=):

    Pattern regex = Pattern.compile("(?<=[a-z])\\.(\\s|$)");
    // $1 is now the whitespace (or empty at end of input)
    text = regex.matcher(text).replaceAll(" .$1");

Alex Salauyou
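
For readers who want to try the two patterns from the answer, the following is a small self-contained sketch. The class and method names (`SentenceDotSpacing`, `spaceWithGroups`, `spaceWithLookbehind`) and the sample sentence are assumptions made for illustration; only the two regular expressions and replacement strings are taken from the answer.

```java
import java.util.regex.Pattern;

class SentenceDotSpacing {
    // Capturing-group variant: $1 is the letter, $2 the whitespace (or empty at end of input).
    private static final Pattern WITH_GROUPS = Pattern.compile("([a-z])\\.(\\s|$)");

    // Lookbehind variant: the letter is checked but not consumed, so only
    // the trailing whitespace (or empty string) is captured as $1.
    private static final Pattern WITH_LOOKBEHIND = Pattern.compile("(?<=[a-z])\\.(\\s|$)");

    static String spaceWithGroups(String text) {
        return WITH_GROUPS.matcher(text).replaceAll("$1 .$2");
    }

    static String spaceWithLookbehind(String text) {
        return WITH_LOOKBEHIND.matcher(text).replaceAll(" .$1");
    }

    public static void main(String[] args) {
        String text = "That cat weighs 5.7 kilos. Quite medium cat.";
        System.out.println(spaceWithGroups(text));
        System.out.println(spaceWithLookbehind(text));
    }
}
```

Both variants leave "5.7" alone because the dot there is preceded by a digit, not a lowercase letter. Note that an abbreviation ending in a lowercase letter and followed by whitespace (for example the last dot of "e.g. ") would still match, so these patterns alone do not cover the abbreviation case from the question.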