14

I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky: I want to strip all whitespace except the space character, ' '.

In Java, the regex I have tried is [\s-[ ]], but it also strips out ' '.

UPDATE:

Here is the particular string that I am attempting to strip spaces from:

project team                manage key

Note: the characters in question are those between "team" and "manage". They appear as a long run of spaces when editing this post but render as a single space in view mode.

ColinD
Ryan Delucchi
  • You can replace all spaces with a character you know won't be present, remove all whitespaces and change the special character back to a space. – Peter Lawrey Feb 04 '11 at 23:31
  • True (this trick actually already occurred to me) and I suspect it would work but would require three replacements instead of one. – Ryan Delucchi Feb 04 '11 at 23:42
  • So... uh... you would want your output string to read `project teammanage key`? – CanSpice Feb 05 '11 at 00:53
  • You'd better tell us what character it is. We see only a lot of spaces. It may be anything. Note that `\s` in Java doesn't cover all Unicode spaces, see my comment below and http://spreadsheets.google.com/pub?key=pd8dAQyHbdewRsnE5x5GzKQ – maaartinus Feb 05 '11 at 00:56
  • In the debugger: this character is showing as "32" – Ryan Delucchi Feb 05 '11 at 01:08
  • As a decimal value, 32 *is* a space (in Unicode and ASCII); as a hex value it's the character '2'. So I think you are confused w/r/t what's between team and manage. – Lawrence Dol Feb 05 '11 at 01:34
  • Ok, folks, I stand corrected. There was some messed up application behavior masking the underlying issue. Specifically, these **were** spaces, but for some reason they were displaying as one space in some instances and multiple spaces in others. So yes: Issue resolved now. Thanks folks. – Ryan Delucchi Feb 06 '11 at 06:57
  • Unfortunately, the current Java definition of `\s` is for ASCII only, not for Unicode, Java’s native character set. Therefore its definition of `\S` is also wrong for Unicode. However, this is [comparatively easily fixed](http://stackoverflow.com/questions/4731055/whitespace-matching-regex-java/4731164#4731164). – tchrist Apr 16 '11 at 20:31

4 Answers

35

Try using this regular expression:

[^\S ]+

It's a bit confusing to read because of the double negative. The character class [\S ] matches the characters you want to keep, i.e. either a space or any non-whitespace character. The negated character class [^\S ] therefore matches exactly the characters you want to remove.
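For illustration, here is a minimal Java sketch of that pattern in use. The class name `StripNonSpaceWhitespace`, the `strip` helper and the sample string are made up for the example; note the doubled backslash that a Java string literal needs.

```java
public class StripNonSpaceWhitespace {
    // Removes every character that \s matches except the plain space (U+0020).
    static String strip(String input) {
        return input.replaceAll("[^\\S ]+", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample containing a tab, a line feed and a carriage return.
        String sample = "project team\t\n manage\r key";
        System.out.println(strip(sample)); // prints "project team manage key"
    }
}
```

Ordinary spaces are never matched, so a string whose only whitespace is spaces comes back unchanged.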

Mark Byers
  • Good thought, but this didn't actually work. This seems logically equivalent to my original attempt (which uses subtraction). I'm beginning to think that I need to specify ranges and/or specific characters to strip, which is unfortunate. – Ryan Delucchi Feb 04 '11 at 23:47
  • I should note that if I do a replace all on "\s" it strips out all the whitespace but it also strips out ' ', which is too aggressive. – Ryan Delucchi Feb 04 '11 at 23:49
  • It must work.... and it does. Try `System.out.println("\t aaa \t\n".replaceAll("[^\\S ]", "").getBytes());`. – maaartinus Feb 04 '11 at 23:55
  • @Mark Byers Yes, I used uppercase "S" and I am aware of the need for double-backslashes. @maaartinus: the whitespace I am trying to remove here are not tabs ... honestly I'm not sure what the exact whitespace chars are, but I **do** know that these are not tabs or new lines **and** "\\s" is able to strip them out. – Ryan Delucchi Feb 05 '11 at 00:28
  • @Ryan Delucchi: Can you print the string out to a file and then copy and paste it into your question so that we can see exactly what characters you are trying to remove? – Mark Byers Feb 05 '11 at 00:31
  • `I'm not sure what the exact whitespace chars are` how about hex-dumping the string so you can find out what exactly they are (and tell us)? – Stephen P Feb 05 '11 at 00:56
  • Use e.g. `getBytes("utf-8")` and output them as numbers for finding out what it is. – maaartinus Feb 05 '11 at 00:57
  • @maaartinus: I just did a getBytes("utf-8") and I get the value: 32 – Ryan Delucchi Feb 05 '11 at 01:11
  • @ColinD: I wonder if these spaces are being translated. Because, whatever space-character I am dealing with "\\s" matches it but can't be distinguished from a regular space ... in RegEx at least. – Ryan Delucchi Feb 05 '11 at 01:21
  • @Ryan: Besides space, \s only matches character tab, line tab, form feed, line feed and carriage return. – ColinD Feb 05 '11 at 03:26
  • All the answers here were useful, but I ended up using this one. – Ryan Delucchi Feb 06 '11 at 06:58
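
The suggestions above to dump the string so you can see what it really contains could be sketched like this. It is only an illustration with a hypothetical class `CodePointDump` and helper `dump`, and it assumes Java 8 or later for `String.codePoints()`. Each code point is printed in hex, so a plain space shows up as U+0020 (decimal 32), a tab as U+0009 and a no-break space as U+00A0.

```java
public class CodePointDump {
    // Formats every code point of the string as U+XXXX, separated by single spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: a space, a tab and a no-break space between two letters.
        System.out.println(dump("a \t\u00A0b")); // prints "U+0061 U+0020 U+0009 U+00A0 U+0062"
    }
}
```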
7

Using a Guava CharMatcher:

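// import com.google.common.base.CharMatcher;
String text = ...
// Recent Guava releases expose this as CharMatcher.whitespace();
// older ones used the CharMatcher.WHITESPACE constant.
String stripped = CharMatcher.whitespace().and(CharMatcher.isNot(' '))
    .removeFrom(text);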

If you actually just want that trimmed from the start and end of the string (like String.trim()) you'd use trimFrom rather than removeFrom.

ColinD
  • The definition of whitespace here differs from the one used by \s. It is better (closer to Unicode standard). – maaartinus Feb 04 '11 at 23:52
3

Java has no subtraction of character classes; if it did, you could write [\s--[ ]] (note the double dash). You can always simulate set subtraction by intersecting with the complement, so

[\s&&[^ ]]

should work. It's no better than [^\S ]+ from the first answer, but the principle is different and it's good to know both.

maaartinus
1

I solved it with this:

anyString.replace(/[\f\t\n\v\r]+/g, '');

This is a JavaScript regex literal that simply lists the ASCII whitespace characters other than the blank: form feed, tab, new line, vertical tab and carriage return. Unlike \s, it does not cover Unicode whitespace such as the no-break space.
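
Since the question is about Java and the snippet above is written in JavaScript syntax, here is a rough Java counterpart, given only as a sketch with hypothetical names. It spells the vertical tab as `\x0B` because, as far as I know, `\v` in Java 8+ regexes stands for a whole class of vertical whitespace rather than just the vertical tab.

```java
public class ExplicitWhitespaceList {
    // Tab, line feed, vertical tab (\x0B), form feed and carriage return;
    // the space character is deliberately left out of the list.
    static String strip(String input) {
        return input.replaceAll("[\\t\\n\\x0B\\f\\r]+", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample with a tab and a CR/LF pair before a kept space.
        System.out.println(strip("project team\t\r\n manage key")); // prints "project team manage key"
    }
}
```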

seawave_23