Whitespace Matching Regex - Java

Question

The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

The aim is to replace every pair of consecutive whitespace characters with a single space. However, this does not actually work.

Am I having a grave misunderstanding of regexes or the term "whitespace"?

String has a replaceAll function that will save you a few lines of code. http://download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html — Zach L, Jan 19 '11 at 02:05
It isn’t your misunderstanding, but Java’s. Try splitting a string like `"abc \xA0 def \x85 xyz"` to see what I mean: there are only three fields there. — tchrist, Apr 11 '11 at 15:15
Did you try "\\s+"? With this you replace one or more whitespace characters with a single space. — hrzafer, May 05 '13 at 12:33
I've been wondering for over an hour why my \\s split is not splitting over whitespace. Thanks a million! — Marcin, May 18 '14 at 00:28

score 218 · Answer 1 · edited Jul 10 '19 at 15:45

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
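As a rough sketch of that last step, the character class can be plugged into replaceAll like this. The class name `UnicodeWhitespace`, the helper `collapse`, and the sample text are made up for illustration; the code points are the same 26 listed above, just packed onto fewer lines.

```java
// Hypothetical demo class; only the regex escapes come from the answer above.
class UnicodeWhitespace {
    // The same 26 code points as the list above, written as regex escapes.
    static final String WHITESPACE_CHARS =
          "\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0"
        + "\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005"
        + "\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F"
        + "\\u205F\\u3000";

    static final String WHITESPACE_CHARCLASS = "[" + WHITESPACE_CHARS + "]";

    // Replaces every run of one or more listed characters with a single space.
    static String collapse(String text) {
        return text.replaceAll(WHITESPACE_CHARCLASS + "+", " ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE followed by EM SPACE between the two words
        String text = "abc\u00A0\u2003def";
        System.out.println(collapse(text)); // abc def
    }
}
```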

Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

@Glenn: CASE_INSENSITIVE does not affect whether the charclass abbreviations work on ASCII vs Unicode. — tchrist, Jan 19 '11 at 14:07
this is really old. is it correct that this was fixed in java7 with the UNICODE_CHARACTER_CLASS flag? (or using (?U)) — kritzikratzi, Jul 27 '14 at 23:41
@tchrist can you help me with this? I wrote a Java regex this way: abc def(\\sthe) ghi. I want the following patterns to be recognised: "abc def the ghi" and "abc def ghi". But only the second pattern is being recognized. What am I missing? — AV94, Sep 14 '15 at 14:19
@tchrist If this is fixed in java 7+, could you update the answer with the now-correct way to do this? — beerbajay, Dec 29 '15 at 14:04
I included this one in the list as well: "\\u200B" // ZERO-WIDTH SPACE — JM Lord, Jun 01 '16 at 12:59
With Java 7+ you can do: "(?U)\s" to run the regex with Unicode Technical Standard conformance. Or you can make the UNICODE_CHARACTER_CLASS flag true when creating the pattern. Here's the doc: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS — Didier A., Sep 05 '16 at 00:28
Thank you, for some reason when I tried scraping data Java would not recognize the space. This worked flawlessly! — tomSurge, Nov 16 '16 at 18:10
This is overly complicated in today's Java. See my answer below. Use \p{Zs}. — Robert, Oct 24 '19 at 11:44

score 46 · Accepted Answer · edited May 31 '20 at 09:06

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
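For reference, a complete version of the fix might look like the sketch below. The class name `CollapseSpaces`, the helper `collapsePairs`, and the sample input are invented for illustration; only `Pattern`, `Matcher`, and `replaceAll()` come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample text are illustrative only.
class CollapseSpaces {
    private static final Pattern TWO_WHITESPACE = Pattern.compile("\\s\\s");

    // Returns a new string in which each pair of whitespace characters becomes one space.
    static String collapsePairs(String modLine) {
        Matcher matcher = TWO_WHITESPACE.matcher(modLine);
        // replaceAll() does not modify modLine; it returns the rewritten text.
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "one  two  three";
        String result = collapsePairs(modLine);
        System.out.println(result);   // one two three
        System.out.println(modLine);  // the original string is unchanged
    }
}
```

No `while (matcher.find())` loop is needed here, because `replaceAll()` resets the matcher and scans the whole input itself.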

edited May 31 '20 at 09:06

Neuron

answered Jan 19 '11 at 02:02

Raph Levien

Gah. I feel like the biggest idiot on earth. Neither I nor two other people seemed to notice that. I guess the stupidest little errors throw us off sometimes, eh? – Jan 19 '11 at 02:09
So true! I guess that happens with the best of them – saibharath Sep 19 '14 at 15:02
What if I need to check whether the text has white space? – Gilberto Ibarra Aug 05 '16 at 22:35
Per my answer below use \p{Zs} instead of \s if you want to match unicode whitespace. – Robert Oct 24 '19 at 11:45

score 18 · Answer 3 · edited Jun 11 '13 at 16:11

For Java (not PHP, not JavaScript, not any other language):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
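A small sketch of how this behaves, assuming the result is captured (strings are immutable). The class name `JavaSpaceCharDemo` and the helper `collapse` are hypothetical. `\p{javaSpaceChar}` follows `Character.isSpaceChar()`, which covers the Unicode separator categories but not control characters such as tab.

```java
// Hypothetical demo class; only the regex itself comes from the answer above.
class JavaSpaceCharDemo {
    // Replaces every run of two or more space characters with one plain space.
    static String collapse(String txt) {
        // The rewritten text is the return value; txt itself is not changed.
        return txt.replaceAll("\\p{javaSpaceChar}{2,}", " ");
    }

    public static void main(String[] args) {
        // A plain space followed by a NO-BREAK SPACE between the letters
        System.out.println(collapse("a \u00A0b"));            // a b
        // The class is backed by Character.isSpaceChar()
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println(Character.isSpaceChar('\t'));      // false: tab is a control character
    }
}
```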

edited Jun 11 '13 at 16:11

answered Jun 11 '13 at 10:27

surfealokesea

Strings are immutable, thus you have to assign the result to something, such as 'txt = txt.replaceAll()' I did not vote-down your answer, but that might be why someone else did so. – Enwired Oct 04 '13 at 20:26
I know replaceAll returns a string; the important thing for Java programmers is \\p{javaSpaceChar}. – surfealokesea Oct 06 '13 at 17:53
The original question made the mistake of not assigning the new string to a variable. Pointing out that mistake is thus the most important point of the answer. – Enwired Oct 07 '13 at 16:17
This totally solved my problem in Groovy! Finally! Been trying every regex I could find that would match all white space including NON-BREAK-SPACE (ASCII 160)!!! – Piko Nov 21 '17 at 18:22

score 11 · Answer 4 · edited Aug 21 '21 at 11:24

Java has evolved since this issue was first brought up. You can match all manner of Unicode space characters by using the \p{Zs} category.

Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

Also worth knowing: if you've used the trim() string function, take a look at the newer strip(), stripLeading(), and stripTrailing() functions on strings (added in Java 11). They can help you trim off all sorts of squirrely white space characters. For more information on which white space characters are included, see Java's Character.isWhitespace() function.
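The difference between trim() and strip() can be seen in a short sketch like the one below, which assumes Java 11 or later. The class name `StripDemo` and the helper `normalize` are made up for the example.

```java
// Hypothetical demo class; requires Java 11+ for String.strip().
class StripDemo {
    // Collapses runs of Unicode space separators, then strips both ends.
    static String normalize(String txt) {
        return txt.replaceAll("\\p{Zs}+", " ").strip();
    }

    public static void main(String[] args) {
        // EM SPACE on both ends of the word
        String padded = "\u2003hello\u2003";
        System.out.println(padded.trim().length());   // 7: trim() only removes characters up to U+0020
        System.out.println(padded.strip().length());  // 5: strip() follows Character.isWhitespace()
        // NO-BREAK SPACE and EM SPACE between the letters
        System.out.println(normalize("a\u00A0\u2003b")); // a b
    }
}
```

Note that Character.isWhitespace() excludes the non-breaking spaces, so strip() alone leaves a leading or trailing NO-BREAK SPACE in place; the \p{Zs} replacement in `normalize` turns it into a plain space first.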

edited Aug 21 '21 at 11:24

answered Oct 24 '19 at 11:43

Robert

As a heads up, this does not match newlines, but [this](https://stackoverflow.com/a/69144404/1858327) does. – Captain Man Mar 24 '23 at 21:40
@CaptainMan the answer you reference leaves out a small note from the JavaDoc: "Specifying this flag may impose a performance penalty." To avoid that performance hit I would suggest `\p{Zl}` for line separators and `\p{Zp}` for paragraph separators. – Robert Mar 25 '23 at 17:04
Expanded: `txt.replaceAll("(\\p{Zs}|\\p{Zl}|\\p{Zp})+", " ");` to replace all sorts of separators with a single space character. – Robert Mar 25 '23 at 17:05

score 6 · Answer 5 · answered Nov 03 '14 at 12:01

When I sent a question to the RegexBuddy (a regex development application) forum, I got a more exact reply to my Java \s question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."

`[\s\p{Z}]` omits Unicode "next line" character U+0085. Use `[\s\u0085\p{Z}]`. — Robert Tupelo-Schneck, Sep 22 '15 at 15:14
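The character class suggested in the comment above can be tried out with a sketch along these lines. The class name `AllWhitespaceDemo`, the constant, and the helper are hypothetical; inside a Java string literal the regex backslashes have to be doubled.

```java
// Hypothetical demo class; only the character class comes from the thread.
class AllWhitespaceDemo {
    // ASCII whitespace, NEXT LINE (U+0085), and every Unicode separator (Zs, Zl, Zp)
    static final String ANY_WHITESPACE = "[\\s\\u0085\\p{Z}]";

    // Replaces every run of matching characters with one plain space.
    static String collapse(String txt) {
        return txt.replaceAll(ANY_WHITESPACE + "+", " ");
    }

    public static void main(String[] args) {
        // NEXT LINE, NO-BREAK SPACE, tab, and LINE SEPARATOR between the letters
        String txt = "a\u0085\u00A0\t\u2028b";
        System.out.println(collapse(txt)); // a b
    }
}
```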

score 5 · Answer 6 · answered Jan 19 '11 at 02:01

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

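Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
// replaceAll() resets the matcher and returns the input unchanged when nothing
// matches, so no find() guard is needed (a guard would leave result empty).
String result = matcher.replaceAll(" ");

System.out.println(result);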

score 4 · Answer 7 · edited Jan 29 '20 at 10:00

For your purpose you can use this snippet:

import org.apache.commons.lang3.StringUtils;

StringUtils.normalizeSpace(string);

This collapses each run of whitespace into a single space and also strips leading and trailing whitespace.

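String sampleString = "Hello    world!";
// Strings are immutable, so keep the value that replaceAll() returns.
String pairsReplaced = sampleString.replaceAll("\\s{2}", " ");  // replaces each pair of consecutive whitespace characters
String runsReplaced = sampleString.replaceAll("\\s{2,}", " "); // replaces each run of two or more whitespace characters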
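To make the difference between the two quantifiers concrete, here is a small sketch. The class and method names are invented for the example, and the sample string has four spaces between the words.

```java
// Hypothetical demo class comparing {2} with {2,}.
class QuantifierDemo {
    // Each pair of whitespace characters becomes one space.
    static String collapsePairs(String s) {
        return s.replaceAll("\\s{2}", " ");
    }

    // Each run of two or more whitespace characters becomes one space.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        String sampleString = "Hello    world!"; // four spaces
        System.out.println(collapsePairs(sampleString)); // two spaces remain between the words
        System.out.println(collapseRuns(sampleString));  // Hello world!
    }
}
```

With `{2}` a run of four spaces is treated as two separate pairs, so two spaces are left behind; `{2,}` consumes the whole run in one match.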

score 4 · Answer 8 · answered Sep 11 '21 at 15:32

To match any whitespace character, you can use

Pattern whitespace = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

The Pattern.UNICODE_CHARACTER_CLASS option "enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".
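As an illustration, the compiled pattern could be reused like this. The class name `UnicodeClassDemo` and both helper methods are hypothetical; the second pattern is compiled without the flag only to show the contrast.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the flag and the regex come from the answer above.
class UnicodeClassDemo {
    // With the flag, \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS = Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);
    // Without the flag, \s only covers the ASCII whitespace characters.
    static final Pattern ASCII_WS = Pattern.compile("\\s+");

    static String collapse(String text) {
        return UNICODE_WS.matcher(text).replaceAll(" ");
    }

    static String collapseAscii(String text) {
        return ASCII_WS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Two NO-BREAK SPACE characters between the letters
        String text = "a\u00A0\u00A0b";
        System.out.println(collapse(text));                   // a b
        System.out.println(collapseAscii(text).equals(text)); // true: nothing matched
    }
}
```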

The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace/remove all Unicode whitespaces in Java with regex, you can use

String result = text.replaceAll("(?U)\\s+", ""); // removes all whitespaces
String result = text.replaceAll("(?U)\\s", "-"); // replaces each single whitespace with -
String result = text.replaceAll("(?U)\\s+", "-"); // replaces chunks of one or more consecutive whitespaces with a single -
String result = text.replaceAll("(?U)\\G\\s", "-"); // replaces each single whitespace at the start of string with -

See the Java demo online:

String text = "\u00A0 \u00A0\tStart reading\u00A0here..."; // \u00A0 - non-breaking space
System.out.println("Text: '" + text + "'"); // => Text: '       Start reading here...'
System.out.println(text.replaceAll("(?U)\\s+", "")); // => Startreadinghere...
System.out.println(text.replaceAll("(?U)\\s", "-")); // => ----Start-reading-here...
System.out.println(text.replaceAll("(?U)\\s+", "-")); // => -Start-reading-here...
System.out.println(text.replaceAll("(?U)\\G\\s", "-")); // => ----Start reading here...

score 3 · Answer 9 · answered Sep 15 '11 at 12:51

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(modLine);

// Keep going until no pair of whitespace characters is left
while (matcher.find())
{
    // Update your original search text with the result of the replace
    modLine = matcher.replaceAll(" ");
    // Reset the matcher to look at this "new" text
    matcher = whitespace.matcher(modLine);
}

answered Sep 15 '11 at 12:51

Mike

Mike, while I appreciate you taking the time to answer, this question has been solved several months ago. There is no need to answer questions as old as this. – Sep 15 '11 at 14:53
If someone can show a different, better solution, answering old questions is perfectly legit. – james.garriss Apr 27 '15 at 14:09

score 0 · Answer 10 · answered Nov 07 '22 at 20:22

You can use something simpler:

String out = in.replaceAll(" {2}", " ");
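One thing to keep in mind is that this pattern only matches the literal space character, and only in pairs. A quick sketch (class and method names are made up for the example):

```java
// Hypothetical demo class for the literal-space pattern.
class LiteralSpaceDemo {
    // Replaces each pair of literal U+0020 spaces with one space.
    static String collapseLiteral(String in) {
        return in.replaceAll(" {2}", " ");
    }

    public static void main(String[] args) {
        System.out.println(collapseLiteral("a  b"));                      // a b
        // Tabs are not literal spaces, so the input comes back unchanged
        System.out.println(collapseLiteral("a\t\tb").equals("a\t\tb"));   // true
        // Three spaces: the first pair is replaced, so two spaces remain
        System.out.println(collapseLiteral("a   b").length());            // 4
    }
}
```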

answered Nov 07 '22 at 20:22

Bokili Production

score -3 · Answer 11 · answered Jan 19 '11 at 04:10

Use of whitespace in RE is a pain, but I believe it works. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use RE (uncomment the println() to view how the matcher is breaking up the String), here is some sample code:

import java.util.regex.*;

public class Two21WS {
    private String  str = "";
    private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces

    public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                    if (startNext == 0)
                            sb.append (s.substring (0, matcher.start()));
                    else
                            sb.append (s.substring (startNext, matcher.start()));
                    sb.append (" ");
                    startNext = matcher.end();
                    //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                    //                      ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
    }

    public String toString () {
            return str;
    }

    public static void main (String[] args) {
            String tester = " a    b      cdef     gh  ij   kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
    }
}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"

WTF!? Why would you want to do all that when you can just call `replaceAll()` instead? — Alan Moore, Jan 19 '11 at 11:47
