I need to convert a plain-text UTF-8 document from a right-to-left (Arabic-script) language to Latin script. Unfortunately, it isn't as simple as a character-for-character transliteration.
For example, the "a" in the right-to-left script (ا) becomes either "a" or "ә", depending on which other letters the word contains.
In words that contain a g, k, e, or hamza (گ،ك،ە، ء),
I need to change every a, o, i, u (ا،و،ى،ۇ) to the Latin ә, ѳ, i, ü (the "soft" vowels).
eg. سالەم becomes sәlêm, ءۇي becomes üy, سوزمەن becomes sѳzmên
In words without a g, k, e, or hamza (گ،ك،ە، ء),
the a, o, i, u change to the plain Latin characters a, o, i, u (the "hard" vowels).
eg. الما becomes alma, ۇل becomes ul, ورتا becomes orta.
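The soft/hard rule above can be sketched in Python as a per-word table choice. This is only a sketch: the names SOFT_TRIGGERS, HARD_VOWELS, SOFT_VOWELS and vowel_table are made up for illustration, and it assumes the high hamza (U+0674) counts as a hamza alongside U+0621. The Arabic letters are written as escapes so the code is not scrambled by right-to-left rendering.

```python
# Letters that mark a word as "soft": g, k, e, hamza.
# U+0674 (high hamza) is included as an assumption, next to U+0621.
SOFT_TRIGGERS = {"\u06af", "\u0643", "\u06d5", "\u0621", "\u0674"}

# The four vowels a, o, i, u (U+0627, U+0648, U+0649, U+06C7) in both readings.
HARD_VOWELS = {"\u0627": "a", "\u0648": "o", "\u0649": "i", "\u06c7": "u"}
SOFT_VOWELS = {"\u0627": "ә", "\u0648": "ѳ", "\u0649": "i", "\u06c7": "ü"}


def vowel_table(word):
    """Return the soft vowel table if the word has a trigger letter, else the hard one."""
    if any(ch in SOFT_TRIGGERS for ch in word):
        return SOFT_VOWELS
    return HARD_VOWELS
```

The decision is made once per word, so every vowel in that word gets the same treatment.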
In essence,
the g, k, e, or hamza acts as a pronunciation guide in the Arabic script.
In Latin, then, I need two different sets of vowels, chosen according to the original word in the Arabic script.
I was thinking I might need to handle the "soft"-vowel words as step one, then run a separate find-and-replace on the rest of the document. But how do I carry out a find-and-replace like this in Perl or Python?
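Instead of two passes, one possible approach in Python is a single re.sub call whose replacement is a function, so the soft/hard decision is made word by word. The following is a rough sketch, not a complete transliterator: the consonant table only covers letters that appear in the examples in this post, anything unknown is passed through unchanged, the hamza is dropped, U+0674 is assumed to count as a hamza, and all names (CONSONANTS, transliterate_word, transliterate) are invented for illustration.

```python
import re

# Letters that make a word "soft": g, k, e, hamza (U+0621 and, by assumption, U+0674).
SOFT_TRIGGERS = {"\u06af", "\u0643", "\u06d5", "\u0621", "\u0674"}

# a, o, i, u in their hard and soft Latin forms.
HARD_VOWELS = {"\u0627": "a", "\u0648": "o", "\u0649": "i", "\u06c7": "u"}
SOFT_VOWELS = {"\u0627": "ә", "\u0648": "ѳ", "\u0649": "i", "\u06c7": "ü"}

# Only the letters seen in the examples; extend this for the full alphabet.
CONSONANTS = {
    "\u0633": "s", "\u0644": "l", "\u0645": "m", "\u0646": "n",
    "\u0631": "r", "\u062a": "t", "\u0632": "z", "\u064a": "y",
    "\u0634": "x", "\u0642": "қ", "\u06ad": "ng",
    "\u06af": "g", "\u0643": "k", "\u06d5": "ê",
    "\u0621": "", "\u0674": "",  # hamza only marks softness, so it is dropped
}

# \w+ matches runs of letters, including Arabic-script letters, in Python 3.
WORD = re.compile(r"\w+")


def transliterate_word(match):
    """Pick the vowel table for this word, then map it letter by letter."""
    word = match.group(0)
    soft = any(ch in SOFT_TRIGGERS for ch in word)
    table = {**CONSONANTS, **(SOFT_VOWELS if soft else HARD_VOWELS)}
    return "".join(table.get(ch, ch) for ch in word)


def transliterate(text):
    """Transliterate every word in text; spaces and punctuation are left alone."""
    return WORD.sub(transliterate_word, text)
```

To run it over a file, the text can be read and written with open(path, encoding="utf-8") and passed through transliterate() as one string.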
Here is a Unicode example: \U+0633\U+0627\U+0644\U+06D5\U+0645 \U+0648\U+0631\U+062A\U+0627 \U+0674\U+06C7\U+064A \U+0633\U+0648\U+0632\U+0645\U+06D5\U+0646 \U+0627\U+0644\U+0645\U+0627 \U+06C7\U+0644 \U+0645\U+06D5\U+0646\U+0649\U+06AD \U+0627\U+062A\U+0649\U+0645 \U+0634\U+0627\U+0644\U+0642\U+0627\U+0631.
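The \U+XXXX notation used in the sample is not something Python parses by itself, so to test with it the tokens first have to be turned into real characters. A small sketch, assuming every token is a backslash, "U+", and four to six hex digits; decode_codepoints is a hypothetical helper name, and only the first two sample words are used here.

```python
import re

# One token: a literal backslash, "U+", then 4 to 6 hex digits.
TOKEN = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})")


def decode_codepoints(text):
    # Replace each token with the character whose code point it names.
    return TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw string, so the backslashes reach the regex untouched.
sample = r"\U+0633\U+0627\U+0644\U+06D5\U+0645 \U+0648\U+0631\U+062A\U+0627"
decoded = decode_codepoints(sample)
```

Anything that is not a token, such as the spaces between words, is left as it is.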
It should come out looking like: "sәlêm orta üy sѳzmên alma ul mêning atim xalқar". (Note: the letter ڭ, which is U+06AD, actually ends up as two letters, n + g, to make an "-ng" sound.) It shouldn't look like "salêm orta uy sozmên alma ul mêning atim xalқar", nor like "sәlêm ѳrtә üy sѳzmên әlmә ül mêning әtim xәlқәr".
Many thanks for any help.