I have a String like "12 345 678" and I wanted to remove the whitespace (so that I can convert it to an int). So I did the usual: myString = myString.replaceAll("\\s", "");, but what a surprise! It did nothing; the space was still there.

When I investigated further, I figured out that this space character is of type Character.SPACE_SEPARATOR (Character.getType(myString.charAt(<positionOfSpaceChar>))).

What I don't get is why this obvious space character (from Unicode category Zs, http://www.fileformat.info/info/unicode/category/Zs/list.htm) isn't recognized as whitespace (not even by Character.isWhitespace(char)).

Reading through the Java API docs hasn't been helpful (so far).

Note: In the end, I just want to remove that character, and I will probably find a way to do it, but I'm really interested in an explanation of why it behaves like this. Thanks

rax
  • Strings are immutable. Did you assign the return value of the method to another String variable? – camickr Jun 17 '13 at 03:50
  • Yes. The problem is that those spaces aren't just "ordinary spaces". They are probably non-breaking spaces or something like that. I found a solution to my problem here: http://stackoverflow.com/questions/1060570/why-is-non-breaking-space-not-a-whitespace-character-in-java `replaceAll("\\p{javaSpaceChar}", "_")` But I couldn't find a satisfying explanation there of why it works this way... – rax Jun 17 '13 at 03:51
  • The Javadoc for `java.util.regex.Pattern` states that `\s` means `[ \t\n\x0B\f\r]`, and the Javadoc for `java.lang.Character.isWhitespace` states that it does not include non-breaking spaces. – ruakh Jun 17 '13 at 03:58

1 Answer

Your problem is that \s is defined as [ \t\n\x0B\f\r]. What you want to use is \p{javaWhitespace}, which is defined as all characters for which java.lang.Character.isWhitespace() is true.

Not sure if it applies in this case, but note that a non-breaking space is not considered whitespace. Character.SPACE_SEPARATOR is generally whitespace, but '\u00A0', '\u2007', '\u202F' are not included because they are non-breaking. If you want to include non-breaking spaces, then include those 3 characters explicitly in addition to \p{javaWhitespace}. It's kind of a pain, but that's the way it is.
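To make the difference concrete, here is a minimal sketch of the three classifications side by side. The class name `SpaceClasses`, the helper `stripSpaces`, and the sample string are made up for illustration. It assumes the separators in the input are NO-BREAK SPACE (U+00A0), which the question does not confirm.

```java
import java.util.regex.Pattern;

public class SpaceClasses {
    // \p{javaWhitespace} plus the three non-breaking spaces it leaves out.
    static final Pattern ANY_SPACE =
            Pattern.compile("[\\p{javaWhitespace}\\u00A0\\u2007\\u202F]");

    // Hypothetical helper: removes ordinary and non-breaking spaces alike.
    static String stripSpaces(String s) {
        return ANY_SPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed input: digit groups separated by NO-BREAK SPACE.
        String grouped = "12\u00A0345\u00A0678";

        // Category Zs, so isSpaceChar accepts it, but isWhitespace does not.
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println(Character.isWhitespace('\u00A0')); // false

        // The default \s class does not match it, so nothing is removed.
        System.out.println(grouped.replaceAll("\\s", "").length()); // 11

        // The explicit character class removes it.
        System.out.println(stripSpaces(grouped)); // 12345678
    }
}
```

`\p{javaSpaceChar}`, mentioned in the comments on the question, corresponds to `Character.isSpaceChar` and would also match these three characters, since all of them are in category Zs.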

Actually, in your specific case of converting to int, I'd recommend:

myString = myString.replaceAll("\\D", "");

to strip out everything that is not a digit.
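As a rough sketch of that approach, the snippet below strips the non-digits and then parses the result. The class and method names (`DigitsOnly`, `parseGroupedInt`) are hypothetical, and it assumes the input is a non-negative number: `\D` also removes a leading minus sign, and a string with no digits at all makes `Integer.parseInt` throw `NumberFormatException`.

```java
public class DigitsOnly {
    // Hypothetical helper: drops every non-digit, then parses what is left.
    // Assumes a non-negative value that fits in an int.
    static int parseGroupedInt(String s) {
        // Strings are immutable, so the result of replaceAll must be kept.
        String digits = s.replaceAll("\\D", "");
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Works whether the separator is an ordinary or a non-breaking space.
        System.out.println(parseGroupedInt("12 345 678"));           // 12345678
        System.out.println(parseGroupedInt("12\u00A0345\u00A0678")); // 12345678
    }
}
```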

Old Pro
  • Thanks for the explanation. The weird thing is that `Character.isWhitespace(char)` returned false (when I tried it), but `\p{javaWhitespace}` behaves correctly in my case (this was the solution I went for). But using \D is much better. I wonder how I could miss such an obvious solution. Thanks! – rax Jun 17 '13 at 05:13