
I'm converting a Russian (or any other language) string to a good-looking Latin string to use in a URL like example.com/obezd-pedestala.

I use this code:

// buffer is an NSMutableString holding the text to convert; it is transformed in place
CFMutableStringRef bufferRef = (__bridge CFMutableStringRef)buffer;
CFStringTransform(bufferRef, NULL, kCFStringTransformToLatin, false);
CFStringTransform(bufferRef, NULL, kCFStringTransformStripCombiningMarks, false);
CFStringTransform(bufferRef, NULL, kCFStringTransformStripDiacritics, false);

If I pass a string like Объезд пьедестала in buffer, I get Obʺezd pʹedestala. The letter ъ is replaced by ʺ and ь is replaced by ʹ.

I could use stringByAddingPercentEscapesUsingEncoding to get a valid URL, of course, but that is not the good-looking URL I want.

How can I remove those quote-like marks, and any other non-letter characters, from the resulting string?

ksoftware

1 Answer


The docs for CFStringTransform() note that it can take "any valid ICU transform ID defined in the ICU User Guide for Transforms". From that and a bit of knowledge about Unicode categories, I came up with the following, which will strip such odd characters from the string:

CFStringTransform(bufferRef, NULL, CFSTR("[^[:Latin:][:space:][:number:]] Remove"), false);

Apparently, kCFStringTransformToLatin does not leave only characters in the Latin category. The above transform removes any character which is not in the union of the Latin, space, and number categories. You could customize that further with different character sets if you have different needs.

Ken Thomases
  • Wow, `CFStringTransform` is really powerful. Thank you. Is there a list of available transliterator identifiers? – ksoftware Dec 30 '14 at 05:34
  • How can I leave numbers in my string for example? – ksoftware Dec 30 '14 at 05:48
  • Oh. I originally had a solution to leave numbers, but I reworked things and forgot that. I'll edit my answer. – Ken Thomases Dec 30 '14 at 07:11
  • In my answer, the only "transliterator" is `Remove`. For others, see [here](http://userguide.icu-project.org/transforms/general#TOC-ICU-Transliterators) (includes both the general transforms and the script transforms). The other part, in the brackets, is a filter controlling which characters are affected. I'm using [Unicode character properties, including categories](https://en.wikipedia.org/wiki/Unicode_character_property). – Ken Thomases Dec 30 '14 at 07:24
  • Have you managed to get a list of available transliterator IDs using the Core Foundation API? Since some uses of a transliterator raise an exception before the string is even passed to the transliterator engine, such a list could be a way to validate an ID. – ugene Jul 09 '15 at 14:44
  • @ugene, I don't understand. You're getting exceptions when you use certain transliterator IDs which are listed in the Unicode docs? I don't know of another list. You wouldn't normally need to construct a transform dynamically, so just test what you want to use and find out if you can at development time. – Ken Thomases Jul 09 '15 at 15:31
  • Ken, thanks for the reply. I was just asking about a list of available IDs. You are saying it is in the Unicode docs? I can probably generate a list via the icu4c library from the dedicated .dat file, but that could change between iOS versions. http://userguide.icu-project.org/transforms/general#TOC-Using-Transliterators One use for such a list is the possibility of checking a basic transliteration ID for validity before the user enters any text to transliterate. I agree that the ID is an internal string and shouldn't be displayed to the user, since it isn't in a user-friendly locale. – ugene Jul 09 '15 at 16:00