5

How can I enumerate NSString by pulling each unichar out of it? I can use characterAtIndex but that is slower than doing it by an incrementing unichar*. I didn't see anything in Apple's documentation that didn't require copying the string into a second buffer.

Something like this would be ideal:

for (unichar c in string) { ... }

or

unichar* ptr = (unichar*)string;
jjxtra
  • If you're so worried about performance, you'd be better using NSData and accessing the byte array of that. – joerick Apr 17 '12 at 21:04
  • It turns out that CFString actually has a way to do this, in CFStringGetCharactersPtr... – Richard J. Ross III Apr 17 '12 at 21:11
  • "... but that is going to be slower than ..." - this is called **premature optimization**. You are making assumptions about performance before you even know whether performance is going to be a problem. You should implement it the obvious way (using `characterAtIndex`) and optimize it only if you have performance problems. – Sulthan Jul 31 '13 at 16:08
  • Already tested and found it was slower, updated question to denote that. – jjxtra Jul 31 '13 at 16:54

6 Answers

11

You can speed up -characterAtIndex: by converting it to its IMP form first:

NSString *str = @"This is a test";

NSUInteger len = [str length]; // only calling [str length] once speeds up the process as well
SEL sel = @selector(characterAtIndex:);

// using typeof to save my fingers from typing more
unichar (*charAtIdx)(id, SEL, NSUInteger) = (typeof(charAtIdx)) [str methodForSelector:sel];

for (NSUInteger i = 0; i < len; i++) {
    unichar c = charAtIdx(str, sel, i);
    // do something with c
    NSLog(@"%C", c);
}

EDIT: It appears that the CFString Reference contains the following method:

const UniChar *CFStringGetCharactersPtr(CFStringRef theString);

This means you can do the following:

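CFStringRef cfString = (__bridge CFStringRef) theString;
CFIndex length = CFStringGetLength(cfString);
const UniChar *chars = CFStringGetCharactersPtr(cfString);

// chars is NULL when the internal storage cannot be exposed directly, and the
// buffer is not guaranteed to be NUL-terminated, so loop over the length.
if (chars != NULL) {
    for (CFIndex i = 0; i < length; i++) {
        // do something with chars[i]
    }
}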

If you don't want to allocate memory for copying the buffer, this is the way to go.
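
Since CFStringGetCharactersPtr can return NULL, a copying fallback is usually needed. The following is a minimal sketch of that pattern, not a definitive implementation: VisitCharacters is a hypothetical helper name (it is not part of Foundation or Core Foundation), and it assumes that copying the whole string with getCharacters:range: is acceptable when the fast path is unavailable.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Foundation API): calls `visit` once per UTF-16 unit.
static void VisitCharacters(NSString *string, void (^visit)(unichar c, NSUInteger index)) {
    CFStringRef cfString = (__bridge CFStringRef)string;
    NSUInteger length = (NSUInteger)CFStringGetLength(cfString);
    if (length == 0) {
        return;
    }

    // Fast path: borrow the internal UTF-16 storage when the string exposes it.
    const UniChar *chars = CFStringGetCharactersPtr(cfString);
    unichar *copy = NULL;
    if (chars == NULL) {
        // Slow path: the storage is not directly accessible, so copy it out.
        copy = malloc(length * sizeof(unichar));
        if (copy == NULL) {
            return;
        }
        [string getCharacters:copy range:NSMakeRange(0, length)];
        chars = copy;
    }

    // The buffer is not NUL-terminated, so the loop is bounded by length.
    for (NSUInteger i = 0; i < length; i++) {
        visit(chars[i], i);
    }

    free(copy); // free(NULL) is a no-op on the fast path
}
```

The borrowed pointer is only used while `string` is alive and unmodified inside the helper, which is why the sketch does not return it to the caller.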

Richard J. Ross III
  • Good find, but from the Return Value section: "A pointer to a buffer of Unicode character, or NULL if the internal storage of theString does not allow this to be returned efficiently". This would be fastest, but still needs a backup just in case. – ughoavgfhw Apr 17 '12 at 21:23
  • Brilliant, I didn't think of using CF... API, but that was a great idea. Works superbly. – jjxtra Apr 17 '12 at 21:33
  • @ughoavgfhw true, very true, it does need a backup. But for what the OP wanted, this should work fine. – Richard J. Ross III Apr 17 '12 at 21:34
  • Ended up making a category to do it, with fallback to create a backup buffer. Thanks for the answer! – jjxtra Apr 17 '12 at 21:36
4

Your only option is to copy the characters into a new buffer. This is because the NSString class does not guarantee that there is an internal buffer you can use. The best way to do this is to use the getCharacters:range: method.

NSUInteger i, length = [string length];
unichar *buffer = malloc(sizeof(unichar) * length);
NSRange range = {0,length};
[string getCharacters:buffer range:range];
for(i = 0; i < length; ++i) {
    unichar c = buffer[i];
}
free(buffer); // the buffer was malloc'ed above, so release it when done

If you are using potentially very long strings, it would be better to allocate a fixed size buffer and enumerate the string in chunks (this is actually how fast enumeration works).

ughoavgfhw
  • Hmmm. I wonder if characterAtIndex is faster given that it doesn't have to copy the memory... thoughts? – jjxtra Apr 17 '12 at 20:57
  • It's possible, but unlikely. The overhead of calling a method for each character will quickly pass the overhead of writing to memory as the size of the buffer increases. Unless of course you are using a custom NSString class which doesn't provide an optimized `getCharacters:range:` method. – ughoavgfhw Apr 17 '12 at 21:00
  • @PsychoDad I would think that using `-characterAtIndex:` *could* be faster, if you bypassed the overhead of the objc runtime, and simply used a C-function. – Richard J. Ross III Apr 17 '12 at 21:06
1

I created a block-style enumeration method that uses getCharacters:range: with a fixed-size buffer, as per ughoavgfhw's suggestion in his answer. It avoids the situation where CFStringGetCharactersPtr returns NULL, and it doesn't have to malloc a large buffer. You can drop it into an NSString category, or modify it to take a string as a parameter if you like.

-(void)enumerateCharactersWithBlock:(void (^)(unichar, NSUInteger, BOOL *))block
{
    const NSInteger bufferSize = 16;
    const NSInteger length = [self length];
    unichar buffer[bufferSize];
    NSInteger bufferLoops = (length - 1) / bufferSize + 1;
    BOOL stop = NO;
    for (int i = 0; i < bufferLoops; i++) {
        NSInteger bufferOffset = i * bufferSize;
        NSInteger charsInBuffer = MIN(length - bufferOffset, bufferSize);
        [self getCharacters:buffer range:NSMakeRange(bufferOffset, charsInBuffer)];
        for (int j = 0; j < charsInBuffer; j++) {
            block(buffer[j], j + bufferOffset, &stop);
            if (stop) {
                return;
            }
        }
    }
}
Aaron
  • True, but like I said, this handles the case where CFStringGetCharactersPtr returns null. – Aaron Feb 20 '14 at 22:29
1

The fastest reliable way to enumerate characters in an NSString I know of is to use this relatively little-known Core Foundation gem hidden in plain sight (CFString.h).

NSString *string = <#initialize your string#>;
NSUInteger stringLength = string.length;
CFStringInlineBuffer buf;
CFStringInitInlineBuffer((__bridge CFStringRef) string, &buf, (CFRange) { 0, stringLength });

for (NSUInteger charIndex = 0; charIndex < stringLength; charIndex++) {
    unichar c = CFStringGetCharacterFromInlineBuffer(&buf, charIndex);
}

If you look at the source code of these inline functions, CFStringInitInlineBuffer() and CFStringGetCharacterFromInlineBuffer(), you'll see that they handle all the nasty details like CFStringGetCharactersPtr() returning NULL, CFStringGetCStringPtr() returning NULL, defaulting to slower CFStringGetCharacters() and caching the characters in a C array for fastest access possible. This API really deserves more publicity.

The caveat is that if you initialize the CFStringInlineBuffer at a non-zero offset, you should pass a relative character index to CFStringGetCharacterFromInlineBuffer(), as stated in the header comments:

The next two functions allow fast access to the contents of a string, assuming you are doing sequential or localized accesses. To use, call CFStringInitInlineBuffer() with a CFStringInlineBuffer (on the stack, say), and a range in the string to look at. Then call CFStringGetCharacterFromInlineBuffer() as many times as you want, with a index into that range (relative to the start of that range). These are INLINE functions and will end up calling CFString only once in a while, to fill a buffer. CFStringGetCharacterFromInlineBuffer() returns 0 if a location outside the original range is specified.
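
To make the relative-index caveat concrete, here is a small sketch that scans only a sub-range of a string. CountCharacterInRange is a hypothetical helper name chosen for illustration; the sketch assumes the caller passes a range that lies within the string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts occurrences of `target` within `range` of `string`.
static NSUInteger CountCharacterInRange(NSString *string, unichar target, NSRange range) {
    CFStringInlineBuffer buf;
    CFStringInitInlineBuffer((__bridge CFStringRef)string, &buf,
                             CFRangeMake((CFIndex)range.location, (CFIndex)range.length));

    NSUInteger count = 0;
    // The index runs from 0 to range.length - 1, relative to range.location,
    // not from range.location to NSMaxRange(range) - 1.
    for (NSUInteger i = 0; i < range.length; i++) {
        if (CFStringGetCharacterFromInlineBuffer(&buf, (CFIndex)i) == target) {
            count++;
        }
    }
    return count;
}
```

For example, calling it on @"banana" with NSMakeRange(2, 4) examines "nana", so index 0 passed to CFStringGetCharacterFromInlineBuffer() refers to the character at position 2 of the original string.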

Costique
0

This will work:

const char *s = [string UTF8String]; // UTF8String returns a const char *
for (const char *t = s; *t; t++)
  /* use as */ *t;

[Edit] And if you really need Unicode characters, then you have no option but to use length and characterAtIndex:. From the documentation:

The NSString class has two primitive methods—length and characterAtIndex:—that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.

So your code would be:

  for (NSUInteger index = 0; index < string.length; index++)
    {
      unichar c = [string characterAtIndex: index];
      /* ... */
    }

[edit 2]

Also, don't forget that NSString is 'toll-free bridged' to CFString, and thus all the non-Objective-C, straight C-code interface functions are usable. The relevant one would be CFStringGetCharacterAtIndex.
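
As a rough illustration of the toll-free bridging route, the loop could look like the sketch below. CountSpaces is a hypothetical helper name used only for this example; note that this still makes one function call per character, so whether it is faster than characterAtIndex: in practice is something to measure rather than assume.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts space characters using only the C-level CFString API.
static NSUInteger CountSpaces(NSString *string) {
    CFStringRef cfString = (__bridge CFStringRef)string;
    CFIndex length = CFStringGetLength(cfString);

    NSUInteger spaces = 0;
    for (CFIndex i = 0; i < length; i++) {
        // CFStringGetCharacterAtIndex returns the UTF-16 unit at index i.
        if (CFStringGetCharacterAtIndex(cfString, i) == ' ') {
            spaces++;
        }
    }
    return spaces;
}
```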

GoZoner
  • That only works for unicode code points less than 128. As soon as you encounter a high bit character, it'll break. Also, it's very likely to be creating a second copy of the data, which the asker was trying to avoid. – grahamparks Apr 17 '12 at 20:55
  • I assume this requires copying utf-8 bytes somehow? Where does that pointer live? Is NSString utf-8 underneath? – jjxtra Apr 17 '12 at 20:55
  • The C string is created. Documentation for UTF8String: _The returned C string is automatically freed just as a returned object would be released; you should copy the C string if it needs to store it outside of the autorelease context in which the C string is created._ – GoZoner Apr 17 '12 at 21:19
0

I don't think you can do this. NSString is an abstract interface to a multitude of classes that make no guarantees about the internal storage of the character data, so it's entirely possible there is no character array to get a pointer to.

If neither of the options mentioned in your question is suitable for your app, I'd recommend either creating your own string class for this purpose, or using raw malloc'ed unichar arrays instead of string objects.

grahamparks