26

I'm trying to compare names without any punctuation, spaces, accents etc. At the moment I am doing the following:

-(NSString*) prepareString:(NSString*)a {
    //remove any accents and punctuation;
    a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];

    a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
    a=[a lowercaseString];
    return a;
}

However, I need to do this for hundreds of strings and I need to make this more efficient. Any ideas?

dandan78

13 Answers

81
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
Peter N Lewis
  • Just logged the contents of the `letterCharacterSet` - it seems to contain accents - here is a 20-character snippet: `opqrstuvwxyzªµºÀÁÂÃÄ`. Here is the code I used: https://gist.github.com/rsaunders100/6160147 – Robert Aug 05 '13 at 22:31
  • And in `Swift`, where `componentsJoinedByString` exists but in a different form: `let finish = "".join(start.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet))` – Aviel Gross Oct 04 '14 at 14:32
  • Excellent! I was comparing file names to strings and for example é fell through. The remedy is to create a set with only what you want: `let name = "".join(theString.componentsSeparatedByCharactersInSet(NSCharacterSet(charactersInString: "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM").invertedSet))` – Simpa May 17 '15 at 22:18
39

Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.

The reason why this is important is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you stripped out all accented letters, as some of these solutions may do, you'll end up with “dn”, so those strings will compare as equal.

So, you should decompose them first, so that you can strip the accents and leave the letters.

*Example from German. Thanks to Joris Weimar for providing it.
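A minimal sketch of the decompose-then-strip idea, assuming ARC and a hypothetical helper name (`NormalizedKey` is not from the answer). One caveat worth labeling: Apple documents `letterCharacterSet` and `alphanumericCharacterSet` as including the Unicode mark categories, so filtering with those sets may leave the combining accents in place. This sketch therefore uses an explicit ASCII allow-list, which is an assumption about what "letters" means for the comparison.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decompose accented letters, then keep only
// ASCII letters and digits, lowercased.
static NSString *NormalizedKey(NSString *input) {
    // Split each accented letter into base letter + combining mark,
    // e.g. "ä" becomes "a" followed by U+0308.
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];

    // Explicit allow-list, so combining marks and punctuation are both dropped.
    NSCharacterSet *allowed = [NSCharacterSet characterSetWithCharactersInString:
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"];

    // Splitting on everything outside the allow-list and rejoining removes it.
    NSArray<NSString *> *parts =
        [decomposed componentsSeparatedByCharactersInSet:[allowed invertedSet]];
    return [[parts componentsJoinedByString:@""] lowercaseString];
}
```

With this helper, `NormalizedKey(@"dän")` gives `dan` and `NormalizedKey(@"dün")` gives `dun`, so the two words stay distinct instead of both collapsing to `dn`.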

Peter Hosey
  • I think Peter is trying to demonstrate 2 words with the same letters and different accents. :-) – Quinn Taylor Aug 06 '09 at 13:34
  • Funny German example. :D It's not German (Danish is "dänisch" in German), but it's still a nice example for outlining the problem. http://dict.leo.org/#/search=Danish – Daniel S. Aug 27 '13 at 14:44
  • So the common misunderstanding in English is assuming that those are in fact the same letter with different accents. In English they are often perceived as such, but with the proper locale consideration those are different letters in other locales. That's the inherent problem with this question. It's a naive and wrong approach to sorting. – uchuugaka Dec 10 '13 at 06:00
15

On a similar question, Ole Begemann suggests using stringByFoldingWithOptions: and I believe this is the best solution here:

NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];

Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.
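As a rough sketch of the fixed-locale variant, assuming ARC: the helper name `FoldedName` and the choice of the `en_US_POSIX` identifier are illustrative assumptions, not something the answer prescribes. Adding `NSCaseInsensitiveSearch` folds case in the same call, which may save a separate `lowercaseString` pass.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: fold away diacritics and case using a fixed locale,
// so the result does not depend on the user's settings.
static NSString *FoldedName(NSString *name) {
    // Assumed locale choice; any fixed locale gives machine-independent results.
    NSLocale *fixedLocale = [NSLocale localeWithLocaleIdentifier:@"en_US_POSIX"];
    NSStringCompareOptions options =
        NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    return [name stringByFoldingWithOptions:options locale:fixedLocale];
}
```

For example, `FoldedName(@"ÁlgeBra")` yields `algebra`. Folding does not remove spaces or punctuation, so those still need a separate step.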

Community
Sophie Alpert
7

One important clarification to BillyTheKid18756's answer (Luiz corrected this, but the explanation of his code did not make it obvious):

Do not use stringWithCString as the second step when removing accents. It can append unwanted characters to the end of your string, because the NSData is not NULL-terminated and stringWithCString expects a NULL-terminated buffer. If you do use it, append a NULL byte to your NSData first, as Luiz did in his code.

I think a simpler fix is to replace:

NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];

with:

NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

Starting from BillyTheKid18756's code, here is the complete corrected version:

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
7

If you are trying to compare strings, use one of these methods. Don't try to change data.

- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale

You NEED to consider the user's locale to do things right with strings, particularly things like names. In most languages, characters like ä and å have nothing in common beyond looking similar. They are distinct characters with their own meaning, and the actual rules and semantics differ from locale to locale.

The correct way to compare and sort strings is by considering the user's locale. Anything else is naive, wrong and very 1990s. Stop doing it.
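A minimal sketch of comparing without modifying either string, assuming ARC. `NamesMatch` is a hypothetical helper; whether to include `NSDiacriticInsensitiveSearch` at all is exactly the locale-dependent decision described above, so treat the option mask as an assumption to adjust.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compare two names in place, ignoring case and
// (by assumption) diacritics, under the rules of the given locale.
static BOOL NamesMatch(NSString *a, NSString *b, NSLocale *locale) {
    NSStringCompareOptions options =
        NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    // The range covers the whole receiver; no intermediate strings are created.
    NSComparisonResult result = [a compare:b
                                   options:options
                                     range:NSMakeRange(0, a.length)
                                    locale:locale];
    return result == NSOrderedSame;
}
```

A typical call would be `NamesMatch(@"José", @"jose", [NSLocale currentLocale])`. Punctuation and whitespace are still significant here, so this covers only the accent and case part of the original question.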

If you are trying to pass data to a system that cannot support non-ASCII, well, this is just a wrong thing to do. Pass it as data blobs.

https://developer.apple.com/library/ios/documentation/cocoa/Conceptual/Strings/Articles/SearchingStrings.html

In addition, normalize your strings first (see Peter Hosey's post), either by precomposing or by decomposing. Basically, pick one normalized form and use it consistently.

- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
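To illustrate why picking one form matters, here is a small sketch assuming ARC; the helper name `EqualAfterNormalization` is hypothetical. The same visible letter can be stored as one code unit or as a base letter plus a combining mark, and a literal comparison such as `isEqualToString:` treats those as different strings.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: bring both strings to the same normalized form
// (precomposed here, by assumption) before a literal comparison.
static BOOL EqualAfterNormalization(NSString *a, NSString *b) {
    NSString *na = [a precomposedStringWithCanonicalMapping];
    NSString *nb = [b precomposedStringWithCanonicalMapping];
    return [na isEqualToString:nb];
}
```

For instance, `@"\u00e9"` has length 1 and `@"e\u0301"` has length 2. They are not equal under `isEqualToString:`, but `EqualAfterNormalization` reports them as equal.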

No, it's not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision making. (and a bit of non-English language experience helps)

uchuugaka
  • I totally agree. Simple replace or regex doesn't make sense if you know other languages. The code should never contain language specific characters directly like an array of characters to replace etc. if it is not natively supported, try to find a library. Fortunately, obj c comes with good support for localization. – Edgar Jan 23 '15 at 20:20
  • Some of the best language support in an API. – uchuugaka Jan 24 '15 at 05:07
4

Consider using the RegexKit framework. You could do something like:

NSString *searchString      = @"This is neat.";
NSString *regexString       = @"[\\W]"; // backslash doubled so the regex engine receives \W
NSString *replaceWithString = @"";
NSString *replacedString    = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];

NSLog (@"%@", replacedString);
//... Thisisneat
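If adding a third-party framework is not an option, Foundation in newer SDKs can do the same thing with the `NSRegularExpressionSearch` option. This is a sketch under that assumption, with ARC and a hypothetical helper name; note the doubled backslash needed inside an Objective-C string literal.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: remove every non-word character in a single pass
// using Foundation's built-in regular expression support.
static NSString *StripNonWordCharacters(NSString *input) {
    return [input stringByReplacingOccurrencesOfString:@"\\W"
                                            withString:@""
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, input.length)];
}
```

`StripNonWordCharacters(@"This is neat.")` returns `Thisisneat`. Keep in mind that `\W` treats digits and the underscore as word characters, so they are kept.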
Alex Reynolds
  • How do I use regex to remove all punctuation without having several statements? I'm trying to avoid going over the string several times. –  Aug 05 '09 at 09:59
  • You only need to go over the original string once. The regex ("regular expression") removes all punctuation at once, replacing all non-alphanumeric characters with a blank (""). – Alex Reynolds Aug 05 '09 at 11:33
4

Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).

You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could simplify having to remove/replace accents, so you can focus on removing punctuation, whitespace, etc.

If you must use an approach like you presented in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range: — that will be much more efficient than creating tons of nearly-identical autoreleased strings. It could be that just reducing the number of allocations will boost performance "enough" for the time being.
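The mutable-string suggestion might look like the following sketch, assuming ARC. `PreparedString` is a hypothetical name, and the list of characters is taken from the question; accent removal is left out to keep the focus on the in-place replacement.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: lowercase once, then delete unwanted characters
// in place instead of creating a new string for each replacement.
static NSString *PreparedString(NSString *input) {
    NSMutableString *result = [[input lowercaseString] mutableCopy];
    for (NSString *unwanted in @[@" ", @"'", @"`", @"-", @"_"]) {
        // The range is recomputed each time because the string shrinks.
        [result replaceOccurrencesOfString:unwanted
                                withString:@""
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, result.length)];
    }
    return [result copy];
}
```

For example, `PreparedString(@"Mary-Ann O'Neil")` returns `maryannoneil`. The string is still scanned once per unwanted character, so this reduces allocations rather than passes.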

Quinn Taylor
4

To give a complete example by combining the answers from Luiz and Peter, adding a few lines, you get the code below.

The code does the following:

  1. Create a set of accepted characters
  2. Turn accented letters into normal letters
  3. Remove characters not in the set

Objective-C

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Create set of accepted characters
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Build the string from the data directly: the bytes are not NULL-terminated, so stringWithCString: is unsafe here
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

// Remove characters not in the set
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];

Swift (2.2) example

let text = "BûvérÈ!@$&%^&(*^(_()-*/48"

// Create set of accepted characters
let acceptedCharacters = NSMutableCharacterSet()
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
acceptedCharacters.addCharactersInString(" _-.!")

// Turn accented letters into normal letters (optional)
let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)

// Remove characters not in the set
let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
let output = components.joinWithSeparator("")

Output

The output for both examples would be: BuverE!_-48

Vegard
3

Just bumped into this, maybe it's too late, but here is what worked for me:

// text is the input string, and this just removes accents from the letters

// lossy encoding turns accented letters into normal letters;
// dataUsingEncoding: returns immutable NSData, so take a mutable copy
NSMutableData *sanitizedData = [[[text dataUsingEncoding:NSASCIIStringEncoding
                                   allowLossyConversion:YES] mutableCopy] autorelease];

// increase length by 1 adds a 0 byte (increaseLengthBy 
// guarantees to fill the new space with 0s), effectively turning 
// sanitizedData into a c-string
[sanitizedData increaseLengthBy:1];

// now we just create a string with the c-string in sanitizedData
// (stringWithCString: is deprecated, so pass the encoding explicitly)
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
  • Just a note that this does work, but with one minor tweak: `dataUsingEncoding` returns NSData, not NSMutableData, so you have to do `[[[text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] mutableCopy] autorelease]` – Matt Rix Aug 19 '11 at 20:04
  • This also will remove all non-ASCII letters like in 'жопень' – Mike Keskinov Jun 05 '12 at 17:44
  • Awesome! You made my day man. Since stringWithCString is deprecated, you must use stringWithCString:encoding instead. I used NSASCIIStringEncoding as well and it worked fine! – DZenBot Aug 30 '12 at 17:46
  • [sanitizedData increaseLengthBy:1]; is crashing the app – Ilker Baltaci Oct 18 '12 at 08:20
1
@interface NSString (Filtering)
// Returns a copy of the receiver with every character in charSet removed.
- (NSString *)stringByFilteringCharacters:(NSCharacterSet *)charSet;
@end

@implementation NSString (Filtering)
- (NSString *)stringByFilteringCharacters:(NSCharacterSet *)charSet {
    NSMutableString *mutString = [NSMutableString stringWithCapacity:[self length]];
    for (NSUInteger i = 0; i < [self length]; i++) {
        // unichar, not char: characterAtIndex: returns a 16-bit UTF-16 unit
        unichar c = [self characterAtIndex:i];
        // %C formats a unichar; %c would truncate non-ASCII characters
        if (![charSet characterIsMember:c]) [mutString appendFormat:@"%C", c];
    }
    return [NSString stringWithString:mutString];
}
@end
lorean
  • I like your answer, but I adapted it to work a little differently, with a string of allowed characters instead of a disallowed character set. – ElmerCat Jan 10 '15 at 05:10
1

These answers didn't work as expected for me. Specifically, decomposedStringWithCanonicalMapping didn't strip accents/umlauts as I'd expected.

Here's a variation on what I used that answers the brief:

// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: @"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;
Tricky
0

Peter's Solution in Swift:

let newString = oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")

Example:

let oldString = "Jo_ - h !. nn y"
// "Jo_ - h !. nn y"
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet)
// ["Jo", "h", "nn", "y"]
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")
// "Johnny"
David Arenburg
Babac
-1

I wanted to filter out everything except letters and numbers, so I adapted Lorean's implementation of a Category on NSString to work a little differently. In this example, you specify a string with only the characters you want to keep, and everything else is filtered out:

@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end


@implementation NSString (PraxCategories)

+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }

- (NSString*)stringByKeepingOnlyLettersAndNumbers {
    return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}

- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
    NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
    NSMutableString * mutableString = @"".mutableCopy;
    for (NSUInteger i = 0; i < [self length]; i++){
        // unichar, not char: characterAtIndex: returns a 16-bit UTF-16 unit
        unichar character = [self characterAtIndex:i];
        if([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];
    }
    return mutableString.copy;
}

@end

Once you've made your Categories, using them is trivial, and you can use them on any NSString:

NSString *string = someStringValueThatYouWantToFilter;

string = [string stringByKeepingOnlyLettersAndNumbers];

Or, for example, if you wanted to get rid of everything except vowels:

string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];

If you're still learning Objective-C and aren't using Categories, I encourage you to try them out. They're the best place to put things like this because it gives more functionality to all objects of the class you Categorize.

Categories simplify and encapsulate the code you're adding, making it easy to reuse on all of your projects. It's a great feature of Objective-C!

ElmerCat