4

I wrote some RegEx to play with spaces in strings, and it works beautifully, except when I come across this character: " " instead of " ". You probably think I'm crazy, but apparently they're different. Check out this RegEx app (oddly enough, the weird space often crashes it):

When I use the weird space:

[Screenshot of the RegEx app: matches when using the weird space]

When I use a normal space:

[Screenshot of the RegEx app: matches when using a normal space]

As you can see, there are many more spaces detected here, but it doesn't detect the weird spaces.

What is this space? How do I get rid of it?

Mick MacCallum
Doug Smith
  • The spaces in your post *are* the same. Can you post the correct (presumably unicode) one? – Carl Norum Jul 31 '13 at 22:31
  • You will have to view the raw text in hex mode to see the ASCII number. – Black Frog Jul 31 '13 at 22:32
  • There are many ASCII characters that have no visible glyph (they essentially look like a space) but are not the actual space character (ASCII 32). – Lochemage Jul 31 '13 at 22:36
  • I would guess it's some type of Unicode Space like non-breaking space (U+00A0). – Jeffery Thomas Jul 31 '13 at 22:44
  • Looks like you have some word-processed prose there. Is it possible the weird spaces you are seeing are non-breaking spaces? Maybe your word processor or text layout app has a special space character that will not be replaced by a line end, e.g. the phrase "add up to a movie star" might be split over two lines, but the weird space says the line break should not happen between the "to" and the "a". I don't know what app you used to create the text, but maybe it has a setting to not include these non-breaking spaces. Oops, Jeffery Thomas entered the same reply a few seconds before me. – Derek Knight Jul 31 '13 at 22:45

4 Answers

2

Unicode has a lot of different space characters. The space you posted in your question -- in both the title and the body -- is a regular ASCII space, good old U+0020.

If you want to check exactly what you've copied onto your clipboard, you can run the command pbpaste(1) on Mac OS X. For example, if you copied a non-breaking space (U+00A0), you could identify it like so:

# Write pasteboard contents to stdout, convert from UTF-8 to UTF-32 for easy
# code point identification, then hex dump the contents
$ pbpaste | iconv -f utf-8 -t utf-32be | hexdump -C
00000000  00 00 00 a0                                       |....|
00000004
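
If you're not on a Mac, or the text is already in a string, roughly the same check can be sketched in Python (used here only for illustration; it is not what the question is written in). The helper name `describe_non_ascii` and the sample string are made up for this example. It lists the code point and Unicode name of every non-ASCII character:

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    found = []
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            # Fall back to a placeholder for characters without a Unicode name.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Hypothetical sample: a no-break space hiding between "to" and "a".
sample = "add up to\u00a0a movie star"
for entry in describe_non_ascii(sample):
    print(entry)  # (9, 'U+00A0', 'NO-BREAK SPACE')
```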

Depending on the regex engine you're using, the \s character class may not match all of them. If you want to be sure to match the space character you have, include it explicitly in your character class, e.g. [\s<YOURSPACEHERE>], where <YOURSPACEHERE> is copied and pasted from the character you want to match.
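
As a rough illustration of the explicit character class, here is a Python sketch. It is an assumption-laden stand-in: Python's `re.ASCII` flag is used only to imitate an engine whose \s is ASCII-only, and your engine may behave differently. The escape `\u00a0` is used instead of a pasted character so the space stays visible in the source:

```python
import re

NBSP = "\u00a0"

# re.ASCII limits \s to [ \t\n\r\f\v], imitating an engine with ASCII-only \s.
ascii_only = re.compile(r"\s", re.ASCII)

# Same class, with the no-break space added explicitly.
explicit = re.compile(r"[\s\u00a0]", re.ASCII)

text = f"to{NBSP}a star"  # one no-break space, one regular space
print(len(ascii_only.findall(text)))  # 1
print(len(explicit.findall(text)))    # 2
```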

Adam Rosenfield
1

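Try "\p{Z}" for your regular expression. It's the Unicode general category for separators: space separators (including the non-breaking space), line separators, and paragraph separators. Note that it does not cover control characters such as tab or newline, which \s handles.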

See: NSRegularExpression and Unicode Regular Expressions.


Just as a test of my answer, I constructed the following unit test.

- (void)testPattern
{
    // U+00A0 (no-break space) sits between the two words
    NSString *string = @"xxx\u00A0yyy";
    NSString *pattern = @"\\p{Z}";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];

    NSUInteger number = [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, [string length])];
    // Cast so both arguments have the same type; STAssertEquals fails on a type mismatch
    STAssertEquals(number, (NSUInteger)1, @"");
}
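
To see which characters the separator category actually covers, here is a small Python sketch. Python's built-in `re` module does not support \p{Z}, so this approximates it with `unicodedata.category`. The function name `count_separators` is made up for the example:

```python
import unicodedata


def count_separators(text):
    """Count characters whose general category is Zs, Zl, or Zp."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("Z"))


print(count_separators("xxx\u00a0yyy"))  # 1: the no-break space is Zs
print(count_separators("a b\tc"))        # 1: the space counts, the tab (Cc) does not
```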
Jeffery Thomas
0

They're probably non-breaking spaces, seeing as all the lines end with spaces that are matched by \s rather than these mystery spaces. Try matching \xA0 (U+00A0).

Adi Inbar
0

You can match Unicode characters with \x{NNNN}, where NNNN is the hexadecimal code point of the character. See the ICU User Guide.
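
For the "how do I get rid of it" part, here is a minimal Python sketch of matching by code point and replacing with a regular space. Note the syntax difference: Python's `re` spells the escape `\uNNNN` rather than ICU's \x{NNNN}. The function name `replace_nbsp` is made up, and U+00A0 is an assumption about which character you actually have:

```python
import re

# Python spells the code point escape as \uNNNN; ICU's \x{NNNN} form is not accepted by re.
NBSP_PATTERN = re.compile(r"\u00a0")


def replace_nbsp(text):
    """Replace every no-break space (U+00A0) with a regular space (U+0020)."""
    return NBSP_PATTERN.sub(" ", text)


print(replace_nbsp("add up to\u00a0a movie star") == "add up to a movie star")  # True
```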

phbardon