47

I have an input file that I need to process, discarding all whitespace, including the non-breaking space U+00A0, aka &nbsp; (you can produce it in Notepad by holding Alt and typing 0 1 6 0 on the keyboard's numeric pad), and any other form of whitespace. I have tried String.trim(), but it doesn't trim U+00A0.

Do I need to explicitly check for U+00A0 and then trim(), or is there an easy way to trim all kinds of whitespace in Java?

Abhishek

5 Answers

77

While U+00A0 (&nbsp;) is a non-breaking space (a space that is not meant to be treated as ordinary whitespace), you can trim a string while preserving every U+00A0 inside it with a simple regex:

string.replaceAll("(^\\h*)|(\\h*$)","")
  • \h is a horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]

If you are using a pre-JDK 8 version, you need to use the explicit list of characters instead of \h.
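
As a minimal sketch of both variants, the following compares the \h pattern with the explicit character class listed above. The class and method names (HorizontalTrim, trimJava8, trimLegacy) are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

public class HorizontalTrim {
    // Explicit equivalent of \h, copied from the character list above
    // (intended for JDKs older than 8, where \h is not available).
    private static final String H =
            "[ \\t\\xA0\\u1680\\u180e\\u2000-\\u200a\\u202f\\u205f\\u3000]";

    // Leading or trailing run of horizontal whitespace, Java 8+ form.
    private static final Pattern EDGES_JAVA8 =
            Pattern.compile("(^\\h*)|(\\h*$)");

    // Same pattern with the explicit character class substituted for \h.
    private static final Pattern EDGES_LEGACY =
            Pattern.compile("(^" + H + "*)|(" + H + "*$)");

    static String trimJava8(String s) {
        return EDGES_JAVA8.matcher(s).replaceAll("");
    }

    static String trimLegacy(String s) {
        return EDGES_LEGACY.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Leading NBSP, space, tab; inner NBSP; trailing EM SPACE and space.
        String s = "\u00A0 \tfoo\u00A0bar\u2003 ";
        // Both calls should leave only "foo", the inner NBSP, and "bar".
        System.out.println("[" + trimJava8(s) + "]");
        System.out.println("[" + trimLegacy(s) + "]");
    }
}
```

Precompiling the patterns is only a convenience for repeated calls; string.replaceAll(...) with the same regex behaves the same way.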

Cfx
  • 1
    This is the cleanest and most general solution so far. Worth mentioning that `\h` is only available since Java 8 but in earlier versions, you can use the explicit range given in your answer. – 5gon12eder Feb 03 '15 at 09:55
  • That's brilliant! Exactly a one-liner which will take care of all kinds of spaces. – Abhishek Feb 03 '15 at 09:55
  • One thing that might be helpful to know is that these have a Unicode Classification of Space Separator. I like this page as a reference to what's included, as the Unicode official stuff is a bit dry: [Space Separators](http://www.fontspace.com/unicode/category/space-separator) – Steve Jun 30 '16 at 10:33
  • 1
    Why do you think `\u00A0` is not removed? Please see the [javadoc](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). It states that this char is included in `\h` and therefore removed. – Cfx Mar 14 '17 at 10:12
  • 1
    @Roland `Cfx` answered it for me :) – Abhishek Mar 14 '17 at 13:35
  • 1
    @Abhishek My bad, I didn't realize that `\xA0` == `'\u00A0'`. – Roland Mar 14 '17 at 14:00
  • And what about \r\n, \r and \n? They are not handled by your answer. – Mattew Eon Aug 09 '21 at 08:26
  • 1
    Well, when \h is used for horizontal whitespace, one could guess that \v would be used for vertical whitespace ([\n\x0B\f\r\x85\u2028\u2029]). Or you would just look in the [javadoc](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). **;-)** – Cfx Aug 19 '21 at 08:26
48

trim() does not treat U+00A0 as whitespace, so it won't be trimmed. But you can simply replace() that character with a space and then call trim(), so you keep the spaces that are 'inside' your string.

string = string.replace('\u00A0',' ').trim()

There are three non-breaking whitespace characters that are excluded from the Character.isWhitespace() method: \u00A0, \u2007, and \u202F, so you probably want to replace those too.
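
A small sketch of that idea, covering all three characters. The class name NbspTrim and the helper trimIncludingNbsp are hypothetical names chosen for this example.

```java
public class NbspTrim {
    // The three non-breaking spaces mentioned above.
    private static final char[] NON_BREAKING = {'\u00A0', '\u2007', '\u202F'};

    static String trimIncludingNbsp(String s) {
        for (char c : NON_BREAKING) {
            // Note: this also turns non-breaking spaces inside the string
            // into ordinary spaces, not just the ones at the edges.
            s = s.replace(c, ' ');
        }
        return s.trim();
    }

    public static void main(String[] args) {
        for (char c : NON_BREAKING) {
            System.out.println(Character.isWhitespace(c)); // prints false each time
        }
        System.out.println("[" + trimIncludingNbsp("\u00A0\u2007 foo bar\u202F") + "]"); // prints [foo bar]
    }
}
```

If the inner non-breaking spaces must survive, a regex that only touches the ends of the string is the safer choice.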

Rob Audenaerde
  • 1
    It worked!! Thanks :) I assume, I need to handle all the whitespaces (http://en.wikipedia.org/wiki/Whitespace_character) explicitly & one-by-one, right? – Abhishek Feb 03 '15 at 09:47
  • 1
    `trim()` will take care of all the characters that are listed as java whitespace, so you don't need to add all whitespace characters. See here: http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-char- – Rob Audenaerde Feb 03 '15 at 09:50
  • But this will change the "inner" NBSPs to normal spaces, which might not be what you want. "&nbsp;foo&nbsp;35&nbsp;" will become "foo 35", not "foo&nbsp;35", which would be expected of trim. – Viktor Mellgren May 04 '21 at 09:07
  • 1
    @ViktorMellgren yes, but OP asked for: I've input an input file which I need to process and *discard all the white-spaces* – Rob Audenaerde May 05 '21 at 07:38
14

You can try this:

string.replaceAll("\\p{Z}","");

From https://www.regular-expressions.info/unicode.html:

\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
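
One caveat worth illustrating: tab, carriage return, and newline are control characters in Unicode terms, not separators, so \p{Z} alone leaves them in place. The sketch below (SeparatorStrip and its methods are illustrative names, not an existing API) shows the difference and one possible way to cover both groups by combining \p{Z} with \s.

```java
public class SeparatorStrip {
    // Removes only Unicode separators (space, NBSP, EM SPACE, ...).
    static String removeSeparators(String s) {
        return s.replaceAll("\\p{Z}", "");
    }

    // Removes separators plus the ASCII whitespace matched by \s
    // (space, tab, newline, vertical tab, form feed, carriage return).
    static String removeAllWhitespace(String s) {
        return s.replaceAll("[\\p{Z}\\s]+", "");
    }

    public static void main(String[] args) {
        String s = "a\u00A0b c\td\n";
        // The tab and newline survive the first call but not the second.
        System.out.println(removeSeparators(s).equals("abc\td\n"));  // prints true
        System.out.println(removeAllWhitespace(s).equals("abcd"));   // prints true
    }
}
```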

logbasex
4

If you happen to use Apache Commons Lang then you can use strip and add all the characters you want.

import org.apache.commons.lang3.StringUtils;  // assuming Commons Lang 3

// Characters to strip from both ends; add any others you need.
final String STRIPPED_CHARS = " \t\u00A0\u1680\u180e\u2000\u200a\u202f\u205f\u3000";

String s = "\u3000 \tThis str contains a non-breaking\u00A0space and a\ttab. ";
s = StringUtils.strip(s, STRIPPED_CHARS);
System.out.println(s);  // Gives : "This str contains a non-breaking space and a    tab."
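
If Commons Lang is not on the classpath, the same idea can be approximated in plain Java. This is only a sketch of the technique, not the library's implementation; CustomStrip and its strip method are names invented for the example, and it assumes all strip characters are in the Basic Multilingual Plane.

```java
public class CustomStrip {
    // Strips any character contained in stripChars from both ends of s,
    // leaving characters inside the string untouched.
    static String strip(String s, String stripChars) {
        int start = 0;
        int end = s.length();
        while (start < end && stripChars.indexOf(s.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && stripChars.indexOf(s.charAt(end - 1)) >= 0) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String chars = " \t\u00A0\u3000";
        String s = "\u3000 \tThis\u00A0str\t ";
        // The inner NBSP is kept; only the edges are stripped.
        System.out.println("[" + strip(s, chars) + "]");
    }
}
```
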
ForguesR
3

You could do it with a guava CharMatcher, for example:

CharMatcher.anyOf("\r\n\t \u00A0").trimFrom(input);
CharMatcher.whitespace().trimFrom(input);

See also this nice reference on whitespaces definition