
I am exploring how to use CFStringTransform to transliterate Hebrew text, and I am stuck on a few inconsistencies: letters that should be pronounced differently are transliterated identically, and some special cases are not taken into account by Apple's algorithm.

Kaf (כּ → K) vs Khaf (כ → Ḵ)

  • כִּי ("because")

    let string = NSMutableString(string: "כִּי")
    CFStringTransform(string, nil, kCFStringTransformLatinHebrew, true)
    print(string) // prints "ki̇y"
    
  • שָׁכָחְתִּי ("I forgot")

    let string = NSMutableString(string: "שָׁכָחְתִּי")
    CFStringTransform(string, nil, kCFStringTransformLatinHebrew, true)
    print(string) // prints "şá̌káẖĕţi̇y" instead of "şá̌ḵáẖĕţi̇y"
    

While the kaf in כִּי is pronounced like an English K, the khaf in שָׁכָחְתִּי is pronounced like the ch in loch or Bach and is typically transliterated as CH, KH or Ḵ. However, CFStringTransform transliterates both letters as K.

Pei (פּ → P) vs Fei (פ → F)

  • פַּרְעֹה ("pharaoh")

    let string = NSMutableString(string: "פַּרְעֹה")
    CFStringTransform(string, nil, kCFStringTransformLatinHebrew, true)
    print(string) // prints "pȧrĕʻòh"
    
  • יוֹסֵף ("Joseph")

    let string = NSMutableString(string: "יוֹסֵף")
    CFStringTransform(string, nil, kCFStringTransformLatinHebrew, true)
    print(string) // prints "ywòsép" instead of "ywòséf"
    

While the pei in פַּרְעֹה is pronounced like an English P (and should be transliterated accordingly), the trailing fei in יוֹסֵף is pronounced like an F (and should be transliterated accordingly). However, CFStringTransform transliterates both as P.
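One possible workaround for the two cases above, outside CFStringTransform itself, is to inspect the pointed Hebrew source and check whether each kaf or pe carries a dagesh (U+05BC). The following is only a minimal sketch under that assumption: the names `stopAndFricative` and `kafPeReadings(in:)` are hypothetical, and the result would still have to be merged into the Latin output by hand. It cannot help with unpointed text.

```swift
// Hypothetical helper, not part of CFStringTransform.
let dagesh: Unicode.Scalar = "\u{05BC}"

// Base letters (including final forms) mapped to their reading
// with a dagesh (hard) and without one (soft).
let stopAndFricative: [Unicode.Scalar: (hard: String, soft: String)] = [
    "\u{05DB}": (hard: "k", soft: "ḵ"),  // kaf
    "\u{05DA}": (hard: "k", soft: "ḵ"),  // final kaf
    "\u{05E4}": (hard: "p", soft: "f"),  // pe
    "\u{05E3}": (hard: "p", soft: "f"),  // final pe
]

/// Returns one reading per kaf or pe found in `word`, in order of appearance.
func kafPeReadings(in word: String) -> [String] {
    let scalars = Array(word.unicodeScalars)
    var readings: [String] = []
    for (index, scalar) in scalars.enumerated() {
        guard let pair = stopAndFricative[scalar] else { continue }
        // Collect the combining marks attached to this letter. The dagesh may
        // sit before or after the vowel point, depending on normalization.
        let marks = scalars[(index + 1)...].prefix(while: {
            $0.properties.generalCategory == .nonspacingMark
        })
        readings.append(marks.contains(dagesh) ? pair.hard : pair.soft)
    }
    return readings
}

// כִּי: kaf + hiriq + dagesh + yod
print(kafPeReadings(in: "\u{05DB}\u{05B4}\u{05BC}\u{05D9}"))                  // ["k"]
// יוֹסֵף: yod + vav + holam + samekh + tsere + final pe
print(kafPeReadings(in: "\u{05D9}\u{05D5}\u{05B9}\u{05E1}\u{05B5}\u{05E3}"))  // ["f"]
```

The scan looks at every combining mark after the base letter rather than only the next scalar, because a vowel point can sit between the letter and its dagesh.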

Trailing consonants with pataḥ g'nuva

From the article on Hebrew vocalization in the English Wikipedia:

A patach on a letters ח, ע, ה at the end of a word is sounded before the letter, and not after. Thus, נֹחַ (Noah) is pronounced /ˈno.ax/. This only occurs at the ends of words and only with patach and ח, ע, and הּ (that is, ה with a dot (mappiq) in it). This is sometimes called a patach ganuv, or "stolen" patach (more formally, "furtive patach"), since the sound "steals" an imaginary epenthetic consonant to make the extra syllable.

However:

  • תַפּוּחַ ("apple")

    let string = NSMutableString(string: "תַפּוּחַ")
    CFStringTransform(string, nil, kCFStringTransformLatinHebrew, true)
    print(string) // prints "ţaṗẇẖa" instead of "ţaṗẇaẖ"
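The rule quoted above can also be checked on the Hebrew source before or after calling CFStringTransform. The sketch below is an assumption-laden workaround, not a way of changing the transform: `hasFurtivePatah(_:)` and `applyingFurtivePatah(to:source:)` are hypothetical names, the input is assumed to be a single pointed word with no trailing punctuation, and the final vowel in the Latin output is assumed to be a single character.

```swift
let patah: Unicode.Scalar = "\u{05B7}"
let mappiq: Unicode.Scalar = "\u{05BC}"  // same code point as dagesh
let het: Unicode.Scalar = "\u{05D7}"
let ayin: Unicode.Scalar = "\u{05E2}"
let he: Unicode.Scalar = "\u{05D4}"

/// True when the last letter of `word` is ח or ע with a patah,
/// or ה with both a patah and a mappiq.
func hasFurtivePatah(_ word: String) -> Bool {
    let scalars = Array(word.unicodeScalars)
    // Index of the last base (non-mark) scalar in the word.
    guard let lastBase = scalars.lastIndex(where: {
        $0.properties.generalCategory != .nonspacingMark
    }) else { return false }
    let marks = scalars[(lastBase + 1)...]
    guard marks.contains(patah) else { return false }
    switch scalars[lastBase] {
    case het, ayin: return true
    case he: return marks.contains(mappiq)
    default: return false
    }
}

/// Swaps the last two characters of `latin` when `hebrew` ends in a furtive patah.
func applyingFurtivePatah(to latin: String, source hebrew: String) -> String {
    guard hasFurtivePatah(hebrew), latin.count >= 2 else { return latin }
    var characters = Array(latin)
    characters.swapAt(characters.count - 1, characters.count - 2)
    return String(characters)
}

// תַפּוּחַ: tav + patah + pe + dagesh + vav + dagesh + het + patah
let tapuach = "\u{05EA}\u{05B7}\u{05E4}\u{05BC}\u{05D5}\u{05BC}\u{05D7}\u{05B7}"
print(applyingFurtivePatah(to: "ţaṗẇẖa", source: tapuach))  // ţaṗẇaẖ
```

Swapping whole `Character` values rather than Unicode scalars keeps a letter such as ẖ together with its combining mark.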
    

Q: How can I change the behavior of CFStringTransform to account for these three cases?

According to the reference for CFMutableString, the transform: parameter of CFStringTransform is:

A CFString object that identifies the transformation to apply. For a list of valid values, see Transform Identifiers for CFStringTransform. On OS X v10.4 and later, you can also use any valid ICU transform ID defined in the ICU User Guide for Transforms.
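As an illustration of the quoted passage, an ICU transform ID can be passed as a plain string in place of one of Apple's constants. This is a sketch only: `transliterate(_:withICUTransform:)` is a hypothetical wrapper, and the IDs `"Hebrew-Latin"` and `"Hebrew-Latin; Latin-ASCII"` are assumed to be available in the ICU version that ships with the system. It shows how to select an existing transform, not how to supply custom rules.

```swift
import Foundation

/// Applies the ICU transform named by `id` and returns nil if the
/// transform could not be applied (for example, an unknown ID).
func transliterate(_ text: String, withICUTransform id: String) -> String? {
    let buffer = NSMutableString(string: text)
    // reverse is false because the ID already names the direction.
    let succeeded = CFStringTransform(buffer, nil, id as CFString, false)
    return succeeded ? (buffer as String) : nil
}

// Assumed to behave like kCFStringTransformLatinHebrew with reverse set to true.
let plain = transliterate("יוֹסֵף", withICUTransform: "Hebrew-Latin")

// A compound ID chains transforms; here the diacritics are stripped afterwards.
let ascii = transliterate("יוֹסֵף", withICUTransform: "Hebrew-Latin; Latin-ASCII")

print(plain ?? "transform failed", ascii ?? "transform failed")
```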

The documentation suggests that the rules for ICU transforms are flexible enough to be customized. There is even a rule editor accessible from their playground. However, although I have found a Stack Overflow question that deals with something tangentially similar, I cannot find a clearly documented way of customizing the rules for right-to-left (RTL) languages.

catalandres

1 Answer


As far as I can tell through using it in Chinese and Japanese, CFStringTransform works token-by-token rather than taking the word as a whole (which may be multiple tokens long) into account. Thus, transliteration will be limited/incorrect when multiple tokens need to be taken into account. I hope I am right in presuming that this is the root of the problem for your issues in Hebrew as well (please advise).

I have found transliteration to be much more accurate in such cases when I use a CFStringTokenizer to tokenise the text and then get the Latin transliteration of each token in turn with the CFStringTokenizerCopyCurrentTokenAttribute() function. An example of this (in Objective-C, I think, although the syntax in this case is quite similar to Swift) is given here.
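A rough Swift sketch of that tokenizer approach might look like the following. The function name `latinTranscriptions(of:)` is hypothetical, the current locale is used as a default assumption, and whether the Latin transcription attribute gives better results for Hebrew than CFStringTransform is untested here.

```swift
import Foundation

/// Returns the Latin transcription of each word token that has one.
func latinTranscriptions(of text: String) -> [String] {
    let cfText = text as CFString
    let range = CFRangeMake(0, CFStringGetLength(cfText))
    // Tokenize by word, using the user's current locale.
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, range,
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent())

    var results: [String] = []
    // An empty token type means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        let attribute = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, kCFStringTokenizerAttributeLatinTranscription)
        // The attribute is nil when no transcription is available for a token.
        if let latin = attribute as? String {
            results.append(latin)
        }
    }
    return results
}

print(latinTranscriptions(of: "שָׁכָחְתִּי כִּי"))
```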

Jamie Birch