I need to convert a plain-text UTF-8 document from a right-to-left (RTL) language to a Latin script. Unfortunately, it isn't as easy as a character-for-character transliteration.
For example, the "a" in the RTL script (ا) can be either "a" or "ә" depending on the word composition.

In words with a g, k, e, or hamza (گ،ك،ە، ء)
I need to change all the a, o, i, u (ا،و،ى،ۇ) to Latin ә, ѳ, i, ü (called "soft" vowels).
eg. سالەم becomes sәlêm, ءۇي becomes üy, سوزمەن becomes sѳzmên

In words without a g, k, e, or hamza (گ،ك،ە، ء)
the a, o, i, u change to Latin characters, a, o, i, u (called "hard" vowels).
eg. الما becomes alma, ۇل becomes ul, ورتا becomes orta.

In essence,
the g, k, e, or hamza act as a pronunciation guide in the Arabic script.
In Latin, then, I need two different sets of vowels, depending on the original word in the Arabic script.

I was thinking I might need to handle the "soft" vowel words as step one, then do a separate Find and Replace on the rest of the document. But how do I conduct a Find and Replace like this with Perl or Python?

Here is a unicode example: \U+0633\U+0627\U+0644\U+06D5\U+0645 \U+0648\U+0631\U+062A\U+0627 \U+0674\U+06C7\U+064A \U+0633\U+0648\U+0632\U+0645\U+06D5\U+0645 \U+0627\U+0644\U+0645\U+0627 \U+06C7\U+0644 \U+0645\U+06D5\U+0646\U+0649\U+06AD \U+0627\U+062A\U+0649\U+0645 \U+0634\U+0627\U+0644\U+0642\U+0627\U+0631.

It should come out looking like: "sәlêm orta üy sѳzmên alma ul mêning atim xalқar". (NOTE: the letter ڭ, which is U+06AD, actually ends up as two letters, n+g, to make an "-ng" sound.) It shouldn't look like "salêm orta uy sozmên alma ul mêning atim xalқar", nor "sәlêm ѳrtә üy sѳzmên әlmә ül mêning әtim xәlқәr".

Many thanks for any help.

Shane
  • Have you tried using regular expressions? – Benjamin Toueg Jan 30 '13 at 10:07
  • I have been trying regular expressions (regex, correct?), but for the life of me I can't figure out how to structure the query properly. In this sense regex, or perl, or python, or even some other solution, would be great. – Shane Jan 30 '13 at 10:11
  • A small snippet composed of the words in the question (plus some extras): .سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار – Shane Jan 30 '13 at 10:14
  • Sorry, I can't even make out the different characters... – nhahtdh Jan 30 '13 at 10:17
  • Could you give us the unicode string for it, like `u'\x80'`? – pradyunsg Jan 30 '13 at 10:20
  • Sure, but how should I format it? Straight hex would just be U+06C7U+0644, or would that be \U+06C7\U+0644? You, @padilla, seem to be asking for Python formatting, which would be u"\u06C7"u"\u0644". – Shane Jan 30 '13 at 10:59
  • Simply put, all you need is to evaluate the consonants before the vowels and translate the vowels accordingly. Is that correct? – inhan Jan 30 '13 at 11:11
  • @inhan, close, but some words don't have a consonant before the vowels, some are only vowels marked by a hamza. – Shane Jan 30 '13 at 11:50
  • But still, there aren't 100s of possibilities, I guess..? – inhan Jan 30 '13 at 11:54
  • no, not 100s. There are a few exceptions (under 15), but the gist of it is, if there is a g, k, e or hamza in the word then all vowels are "soft". Otherwise, they are "hard". The challenge is that a given vowel in the arabic script can be either a soft or hard vowel in a Latin or Cyrillic script. – Shane Jan 30 '13 at 12:47
  • What is the translation of "سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار" to English? – jfs Jan 30 '13 at 14:20
  • @J.F.Sebastian : Doesn't translate to anything meaningful... at least I can't make sense of any word. I'd say that someone's randomly jabbed at the keyboard. – Zaid Jan 30 '13 at 17:24
  • @J.F.Sebastian The example isn't a proper sentence, just random words chosen for the spelling to be examples. But, word for word, "Greetings middle home 'with a word' apple son 'my name is Infinity'" – Shane Jan 30 '13 at 17:36
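
For anyone who wants to rebuild the sample text from the code-point listing in the question, a minimal Python 3 sketch along these lines could turn the `\U+XXXX` notation into characters. The function name `decode_codepoints` is made up for illustration and is not part of any library.

```python
import re

def decode_codepoints(spec):
    """Replace each U+XXXX token (with or without a leading backslash) by its character.

    Anything that is not such a token, for example the spaces between words,
    is left untouched.
    """
    return re.sub(
        r'\\?U\+([0-9A-Fa-f]{4,6})',
        lambda m: chr(int(m.group(1), 16)),
        spec,
    )

# Two words taken from the listing in the question
print(decode_codepoints(r'\U+06C7\U+0644 \U+0627\U+0644\U+0645\U+0627'))
```

Decoding the full listing this way also makes it easy to check which code points a document really uses before building translation tables for it.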

4 Answers


You can build your own translation tables that map character ordinals to their replacements; you need a separate table for each set of vowels. This is only a partial example, but it should give you an idea of how to do it.

Note that you would need to extend the translation tables to cover the other characters. You can also translate one Arabic character to multiple Latin ones where needed. Comparing the output with your request, every character that is in the translation tables comes out correctly.

import re

s1 = {u'ء',u'ە',u'ك',u'گ'} # g, k, e, hamza

t1 = {ord(u'ا'):u'ә',  # first case
      ord(u'و'):u'ѳ',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'ü',
      ord(u'ڭ'):u'ng'} # with double

t2 = {ord(u'ا'):u'a',  # second case
      ord(u'و'):u'o',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'u',
      ord(u'ڭ'):u'ng'} # with double

def subst(word):    
    if any(c in s1 for c in word):
        return word.translate(t1)
    else:
        return word.translate(t2)

s = u'سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار'

print(re.sub(r'(\S+)', lambda m: subst(m.group(1)), s))  # Python 3 syntax

# output:    سәلەم oرتa ءüي سѳزمەن aلمa uل مەنing aتiم شaلقaر

# requested: sәlêm orta üy sѳzmên alma ul mêning atim xalқar
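
To see how the partial tables above might grow into a full pass over the sample, here is a minimal Python 3 sketch. The consonant mappings are borrowed from the Perl answer on this page and cover only the letters that occur in the sample sentence. The names `SOFT_MARKERS`, `OTHER_LETTERS`, `transliterate_word` and `transliterate` are made up for illustration, and the handful of exception words mentioned in the comments is not handled.

```python
import re

# Letters that mark a word as "soft": hamza, e, k, g
SOFT_MARKERS = set('ءەكگ')

SOFT_VOWELS = {'ا': 'ә', 'و': 'ѳ', 'ى': 'i', 'ۇ': 'ü'}
HARD_VOWELS = {'ا': 'a', 'و': 'o', 'ى': 'i', 'ۇ': 'u'}

# Assumption: only the letters seen in the sample sentence, not a full alphabet.
# The hamza is dropped, and ڭ expands to two Latin letters.
OTHER_LETTERS = {
    'س': 's', 'ل': 'l', 'ە': 'ê', 'م': 'm', 'ر': 'r', 'ت': 't',
    'ز': 'z', 'ن': 'n', 'ش': 'x', 'ق': 'қ', 'ي': 'y', 'ڭ': 'ng',
    'ء': '',
}

def transliterate_word(word):
    """Pick the vowel set for this word, then translate every letter."""
    vowels = SOFT_VOWELS if SOFT_MARKERS & set(word) else HARD_VOWELS
    table = str.maketrans({**OTHER_LETTERS, **vowels})
    return word.translate(table)

def transliterate(text):
    """Apply transliterate_word to each run of non-whitespace characters."""
    return re.sub(r'\S+', lambda m: transliterate_word(m.group()), text)

sample = 'سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار'
print(transliterate(sample))
```

Characters without a table entry pass through unchanged, so gaps in `OTHER_LETTERS` show up as leftover Arabic letters in the output rather than as errors.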
root
  • How would I use this? Do I put it into a file, name it, and run it from the terminal (I am on a Mac)? – Shane Jan 30 '13 at 11:47
  • @Shane -- After you extend it to a fully functional program, yes. But I think it would need quite some additional work. – root Jan 30 '13 at 11:51

Command:

$ echo سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار | ./arabic-to-latin

Output:

sәlêm orta üy sѳzmên alma ul mêning atim xalқar

To use files instead of stdin/stdout:

$ ./arabic-to-latin input_file_with_arabic_text_in_utf8 >output_latin_in_utf8

Where arabic-to-latin file:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
#XXX normalization

sub replace_word {
    my ($word) = @_;
    $_ = $word;
    if (/ء|ە|ك|گ/) { # g, k, e, or hamza in the word
        tr/اوىۇ/әѳiü/; # soft
    } else {
        tr/اوىۇ/aoiu/; # hard
    }
    tr/سلەمرتزنشق/slêmrtznxқ/;
    s/ءüي/üy/g;
    s/ڭ/ng/g;
    $_;
}

while (my $line = <>) {
    $line =~ s/(\w+)/replace_word($1)/ge;
    print $line;
}

To make arabic-to-latin file executable:

$ chmod +x ./arabic-to-latin
jfs
  • Amazing. This one did the trick. Thank you for your help! I can even tweak as need be! – Shane Jan 30 '13 at 14:34
  • To convert punctuation, would I place that in a manner like s/ءüي/üy/g;, for example, s/،/,/g; s/؟/?/g; s/؛/;/g; And how to capitalize a word at the start of a sentence? MUCH thanks! – Shane Jan 30 '13 at 14:54
  • @Shane: `replace_word()` never receives punctuation. You could put `$line =~ s/from/to/g;` before `print $line;` in this case. Unicode makes even simple tasks such as capitalizations complicated. Splitting a text into sentences depends on language and how precise you'd like to be. You could try: [`$formatted = autoformat $rawtext, { case => 'sentence' };`](http://search.cpan.org/~dconway/Text-Autoformat-1.669002/lib/Text/Autoformat.pm) – jfs Jan 30 '13 at 16:15
  • I see. What you've offered is excellent, and a great base for me to jump off into learning more about this. At the moment it isn't a big deal to just open the converted file in Mellel and do a Find and Replace on punctuation and such. Thank you. – Shane Jan 30 '13 at 16:56
  • Does `replace_word $1` work here? It should work but then again the syntax may be different in the replace expression. – Nissa Sep 22 '16 at 14:42
  • @SomePerson that syntax would obscure the fact that it is a function call. Perl sometimes allows to omit parentheses for a function call but it would hurt readability here (I don't know whether it even works here—you could try but it is not a good idea to use that syntax here). – jfs Sep 22 '16 at 17:20
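
The punctuation replacement discussed in these comments is a plain character-for-character mapping, so in Python 3 it could be sketched with `str.translate`. The three pairs are the ones named in the comment above; the name `convert_punctuation` is an assumption made for illustration.

```python
# Arabic comma, question mark and semicolon mapped to their Latin counterparts
PUNCTUATION = str.maketrans({'،': ',', '؟': '?', '؛': ';'})

def convert_punctuation(line):
    """Replace Arabic punctuation marks; every other character is kept as is."""
    return line.translate(PUNCTUATION)

print(convert_punctuation('alma، orta؟'))
```

Because the mapping does not depend on the surrounding word, it can run on the whole line either before or after the per-word vowel pass.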

I don't speak Perl or Python (or Arabic), but this is the basic idea you could use (shown in JavaScript, but it should translate to any language that supports replace with callbacks):

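//replace [a-z] with the proper unicode range for arabic
//the g flag makes replace visit every word, not just the first one
input.replace(/[a-z]+/g, function(word){
  //replace `[gkeh]` with their arabic equivalents
  if(/[gkeh]/.test(word)){
    return word.replace(/./g, function(c){
      //characters without a table entry are kept as they are
      return c in withSoftVowels ? withSoftVowels[c] : c
    })
  }else{
    return word.replace(/./g, function(c){
      return c in withHardVowels ? withHardVowels[c] : c
    })
  }
})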

That is, split the input to words, then replace each symbol within that word using one of two translation tables based on whether that word contains a specific character. A regex can be used for both, or you can split by word boundaries and do a replacement within words (while using an equivalent of indexOf for branching).

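Here's the approach without callbacks (JavaScript strings are immutable, so each word is split into an array of characters first):

var words = input.split(' ');
var table;
for(var i=0; i<words.length; i++){
  //replace `[gkeh]` with their arabic equivalents
  if(/[gkeh]/.test(words[i])){
    table = softTable;
  }else{
    table = hardTable;
  }
  //work on an array of characters, then join it back into a word
  var chars = words[i].split('');
  for(var j=0; j<chars.length; j++){
    if(chars[j] in table){
      chars[j]=table[chars[j]];
    }
  }
  words[i] = chars.join('');
}
return words.join(' ');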
John Dvorak
  • Python does not have callbacks but you can simply do a `text.split()` to obtain a sequence of words and then iterate over the sequence. – Bakuriu Jan 30 '13 at 10:45
  • @Bakuriu Python does have callbacks. It even works for the [`re.sub`](http://docs.python.org/2/library/re.html#re.sub) function. – bikeshedder Jan 30 '13 at 10:49
  • How would I use this? Do I put it into a file, name it, and run it from the terminal (I am on a Mac)? – Shane Jan 30 '13 at 11:48
  • @Shane This is written in javascript. You need to translate this to either python or perl, add file reading/writing routines and do the indicated character class replacements, as well as prepare the `softTable` and `hardTable`. Only a person that knows python _and arabic_ would be able to write you a ready-made script including the conversion tables. My intention was to show you the logic. – John Dvorak Jan 30 '13 at 11:54
  • @JanDvorak I understand, much thanks. I can google for what you mention and see what I can develop. – Shane Jan 30 '13 at 12:01

This Python code is based on the one from Jan Dvorak and should provide a starting point:

import re
import codecs

def replace_word(word):
    # The Latin letters are placeholders: swap [gkeh] and the vowels
    # for their Arabic counterparts before using this on real text.
    if re.search(r'[gkeh]', word):
        # soft vowels
        word = word.replace(u'a', u'ә')
        word = word.replace(u'o', u'ѳ')
        word = word.replace(u'i', u'i')
        word = word.replace(u'u', u'ü')
    else:
        # hard vowels
        word = word.replace(u'a', u'a')
        word = word.replace(u'o', u'o')
        word = word.replace(u'i', u'i')
        word = word.replace(u'u', u'u')
    return word

# open the input for reading ('w' would truncate the file)
with codecs.open('input.txt', 'r', 'utf-8') as fh:
    text = fh.read()

output = re.sub(r'(\S+)', lambda m: replace_word(m.group(1)), text)

with codecs.open('output.txt', 'w', 'utf-8') as fh:
    fh.write(output)
bikeshedder
  • Would I copy this into a file name it convertor.py, then run "python convertor.py" from the terminal? How do I use this? – Shane Jan 30 '13 at 11:18
  • @Shane you still need to convert this to use arabic as a source and add replacements for consonants (and replace `[gkeh]` and the vowels with their arabic counterparts). – John Dvorak Jan 30 '13 at 11:58
  • @JanDvorak I was thinking that if this worked as a file I can run from the command line (with needed tweaks), then I would then open the document in a word processor like Mellel and do a Find and Replace for the rest of the characters, which would be character-character at that point. – Shane Jan 30 '13 at 12:03
  • @Shane If you want to transliterate first, then replace vowels, you can, if the information of what should be replaced isn't lost with the transliteration. – John Dvorak Jan 30 '13 at 12:06