How do I check for this odd space character - " " in Objective-C?

Question

I wrote some RegEx to play with spaces in strings, and it works beautifully, except when I come across this character: " " instead of " ". You probably think I'm crazy, but apparently they're different. Check out this RegEx app (oddly enough, the weird space often crashes it):

When I use the weird space:

[screenshot: RegEx app with the weird space]

When I use a normal space:

[screenshot: RegEx app with a normal space]

As you can see, there are many more spaces detected here, but it doesn't detect the weird spaces.

What is this space? How do I get rid of it?

The spaces in your post *are* the same. Can you post the correct (presumably Unicode) one? – Carl Norum, Jul 31 '13 at 22:31
You will have to view the raw text in hex mode to see the character codes. – Black Frog, Jul 31 '13 at 22:32
There are many ASCII characters that have no visible glyph (they essentially look like a space) but are not the actual space character (ASCII 32). – Lochemage, Jul 31 '13 at 22:36
I would guess it's some type of Unicode space, such as a non-breaking space (U+00A0). – Jeffery Thomas, Jul 31 '13 at 22:44
Looks like you have some word-processed prose there. Is it possible the weird spaces you are seeing are non-breaking spaces? Maybe your word processor or text layout app has a special space character that will not be replaced by a line end. For example, the phrase "add up to a movie star" might be split over two lines, but the weird space says the line break should not happen between the "to" and the "a". I don't know what app you used to create the text, but maybe it has a setting to not include these non-breaking spaces. Oops, Jeffery Thomas entered the same reply a few seconds before me. – Derek Knight, Jul 31 '13 at 22:45

score 2 · Accepted Answer · answered Jul 31 '13 at 22:43

Unicode has a lot of different space characters. The space you posted in your question -- in both the title and the body -- is a regular ASCII space, good old U+0020.

If you want to check exactly what you've copied onto your clipboard, you can run the command pbpaste(1) on Mac OS X. For example, if you copied a non-breaking space (U+00A0), you could identify it like so:

# Write pasteboard contents to stdout, convert from UTF-8 to UTF-32 for easy
# code point identification, then hex dump the contents
$ pbpaste | iconv -f utf-8 -t utf-32be | hexdump -C
00000000  00 00 00 a0                                       |....|
00000004
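
The same check can also be done from inside Objective-C. The following is only a sketch: `CodeUnitsOfString` is a hypothetical helper name, not a Foundation API, and it lists UTF-16 code units, so a character outside the Basic Multilingual Plane would show up as two entries.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): returns every UTF-16 code
// unit of `string` formatted as "U+XXXX", so invisible characters stand out.
static NSArray *CodeUnitsOfString(NSString *string) {
    NSMutableArray *units = [NSMutableArray array];
    for (NSUInteger i = 0; i < string.length; i++) {
        unichar c = [string characterAtIndex:i];
        [units addObject:[NSString stringWithFormat:@"U+%04X", (unsigned int)c]];
    }
    return units;
}
```

Logging `CodeUnitsOfString(suspectString)` with `NSLog` should make a U+00A0 easy to tell apart from a U+0020.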

Depending on the regex engine you're using, it may not match all of these space characters, especially if you rely on the \s character class. If you want to be sure to match the space character you have, then include it explicitly in your character class, e.g. [\s<YOURSPACEHERE>], where <YOURSPACEHERE> is copy+pasted from the character you want to match.
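
As a rough illustration of the explicit character class suggested above, here is a sketch using NSRegularExpression. `CountSpaces` is a hypothetical helper name, and the non-breaking space is written as `\u00A0` in the string literal instead of being pasted in. One caveat: the ICU documentation, as far as I can tell, already defines `\s` as `[\t\n\f\r\p{Z}]`, so with NSRegularExpression the extra character may be redundant; it matters more for engines whose `\s` is ASCII-only.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: counts the characters matched by a class that lists
// U+00A0 explicitly next to \s.
static NSUInteger CountSpaces(NSString *text) {
    // "\\s" reaches the regex engine as \s; "\u00A0" is expanded by the
    // compiler into a literal non-breaking space inside the class.
    NSString *pattern = @"[\\s\u00A0]";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}
```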

It is indeed an a0 according to that command, thank you. I'll try the [\s...] suggestion. :) — Doug Smith, Jul 31 '13 at 22:50
And hey, Xcode even seems to have a special symbol for it. http://i.imgur.com/a6G4gQO.png — Doug Smith, Jul 31 '13 at 22:52

score 1 · Answer 2 · Jeffery Thomas · 2013-07-31T23:05:55.923

Try "\p{Z}" for your regular expression. It's the Unicode property for the separator category (space, line, and paragraph separators), which covers the non-breaking space and other invisible separators.

See: NSRegularExpression and Unicode Regular Expressions.

Just as a test of my answer, I constructed the following unit test.

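- (void)testPattern
{
    // U+00A0 (non-breaking space) sits between the two runs of letters.
    NSString *string = @"xxx\u00A0yyy";
    // \p{Z} matches any character in the Unicode separator category.
    NSString *pattern = @"\\p{Z}";
    NSError *error = nil;
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    STAssertNotNil(regex, @"Pattern failed to compile: %@", error);

    NSUInteger number = [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, [string length])];
    // Cast the expected value so both arguments have the same type on 64-bit.
    STAssertEquals(number, (NSUInteger)1, @"Expected exactly one separator match");
}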
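
To actually get rid of the character, which is what the question asks, the same pattern can drive a replacement. This is a sketch only: `NormalizeSpaces` is a hypothetical helper name, and replacing every match with a plain space is an assumption about what "getting rid of it" should mean. Note that \p{Z} also covers the line and paragraph separators (U+2028, U+2029), so those would be turned into spaces too.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces every Unicode separator character
// (including U+00A0) with an ordinary ASCII space.
static NSString *NormalizeSpaces(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\p{Z}"
                                                  options:0
                                                    error:NULL];
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, text.length)
                                      withTemplate:@" "];
}
```

If only the non-breaking space is a concern, `[text stringByReplacingOccurrencesOfString:@"\u00A0" withString:@" "]` would be a narrower alternative that avoids regular expressions entirely.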

edited Jul 31 '13 at 23:05

answered Jul 31 '13 at 22:40

Jeffery Thomas

Switch away from PHP regular expressions. – Jeffery Thomas Jul 31 '13 at 22:52
Sorry not a complete thought, inside Objective-C, is what I meant. – Jeffery Thomas Jul 31 '13 at 23:10

score 0 · Answer 3 · answered Jul 31 '13 at 22:45

They're probably non-breaking spaces, seeing as all the lines end with spaces that are matched by \s rather than these mystery spaces. Try matching \xA0.

answered Jul 31 '13 at 22:45

Adi Inbar

Apologies, it is A0, but for whatever reason the app couldn't find it. – Doug Smith Jul 31 '13 at 22:53

score 0 · Answer 4 · answered Apr 12 '14 at 08:06

You can match Unicode characters with \x{NNNN}, where NNNN is the hexadecimal code point of the character. See ICU User Guide.
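
A minimal sketch of that escape with NSRegularExpression, assuming the character in question is U+00A0; `ContainsNoBreakSpace` is a hypothetical helper name used only for illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: reports whether `text` contains a non-breaking space,
// using the \x{NNNN} escape with the code point written in hexadecimal.
static BOOL ContainsNoBreakSpace(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\x{00A0}"
                                                  options:0
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text
                          options:0
                            range:NSMakeRange(0, text.length)];
    return match != nil;
}
```

The backslash is doubled in the string literal so that the regex engine, not the compiler, interprets the escape.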

answered Apr 12 '14 at 08:06

phbardon
