7

I have multiple .txt files containing text in one alphabet, and I want to transliterate the text into another alphabet. Some characters of alphabet1 map 1:1 to those of alphabet2 (e.g. a becomes e), whereas others map 1:2 (e.g. x becomes ch).

I would like to do this using a simple script for the Linux shell.

With tr or sed I can convert 1:1 characters:

sed 'y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/'

a becomes n, b becomes o, et cetera (a Caesar cipher, I think).

But how can I deal with 1:2 characters?

4 Answers

5

Using Awk:

#!/usr/bin/awk -f
BEGIN {
    FS = OFS = ""
    table["a"] = "e"
    table["x"] = "ch"
    # and so on...
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in table) {
            $i = table[$i]
        }
    }
}
1

Usage:

awk -f script.awk file

Test:

# echo "the quick brown fox jumps over the lazy dog" | awk -f script.awk
the quick brown foch jumps over the lezy dog
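
Since the question mentions multiple .txt files, here is a minimal sketch of a loop that runs the script over a whole directory. The helper name `translit_dir` and the directory layout are assumptions for illustration, not part of the answer; it writes each result to a separate output directory so the originals are left untouched.

```shell
# Hypothetical helper: transliterate every .txt file in directory $1 into
# directory $2, using the awk program file given as $3.
translit_dir() {
    src=$1 dst=$2 prog=$3
    mkdir -p "$dst" || return 1
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue              # nothing matched the glob
        # ${f##*/} strips the directory part, keeping only the file name
        awk -f "$prog" "$f" > "$dst/${f##*/}" || return 1
    done
}

# Example call (placeholder paths):
# translit_dir ./texts ./transliterated ./script.awk
```
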
konsolebox
  • 1
    Perfect! Thanks very much! –  Aug 16 '14 at 09:23
  • 1
    +1 but rather than populating the table explicitly, do this to save some redundant coding: `split("a e x ch ...",t,/ /); for (i=1; i in t; i+=2) table[t[i]] = t[i+1]`. – Ed Morton Aug 17 '14 at 06:24
  • @EdMorton: thanks, but I couldn't make it work; in any case, I actually _like_ the idea of populating the table explicitly (see my comment to @TomFenech) –  Aug 17 '14 at 11:00
  • @mus_siluanus if you tell us in what way you "couldn't make it work" we can help you. Even if you don't use this now, it is the common awk idiom for populating arrays with initial values so you probably will want to do it at some point. If you prefer you can have 2 arrays populated one about the other. I'll add an answer so I can show you how that looks formatted. – Ed Morton Aug 17 '14 at 14:17
5

Not an answer, just to show a briefer, idiomatic way to populate the table[] array from @konsolebox's answer as discussed in the related comments:

BEGIN {
    split("a  e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

This makes the mapping easy to read: each char in the first split() maps to the char(s) directly below it in the second. For any other mapping you just need to change the strings in the two split() calls, not 26-ish explicit assignments to table[].

You can even create a general script to do mappings and just pass in the old and new strings as variables:

BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

then in shell anything like this:

old="a  e b"
new="x ch o"
awk -v o="$old" -v n="$new" -f script.awk file

and you can protect yourself from your own mistakes populating the strings, e.g.:

BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}

Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this but your call of course.
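
To see the checks fire, here is a small sketch that wraps a trimmed copy of the BEGIN block above in a shell function and feeds it deliberately wrong lists. The function name `check_map` and the sample strings are made up for the demonstration.

```shell
# Hypothetical wrapper around the validation logic; $1 = old chars, $2 = new.
check_map() {
    awk -v o="$1" -v n="$2" '
    BEGIN {
        numOld = split(o, old)
        numNew = split(n, new)
        if (numOld != numNew) {
            printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
            exit 1
        }
        for (i = 1; i <= numOld; i++) {
            if (old[i] in table) {
                printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
                exit 1
            }
            table[old[i]] = new[i]
        }
    }'
}

check_map "a e b" "x ch"   || echo "rejected: list lengths differ"
check_map "a e a" "x ch o" || echo "rejected: duplicate source character"
check_map "a e b" "x ch o" && echo "accepted"
```

The first two calls print an ERROR line on stderr and return a non-zero status; only the third list pair passes.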

Here's one complete solution as discussed in the comments below

BEGIN {
    numOld = split("a  e b", old)
    numNew = split("x ch o", new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }

    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}

I renamed the table array to map just because IMHO that better represents the purpose of the array.

Save the above in a file script.awk and run it as awk -f script.awk inputfile
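
One portability note: splitting each record into single characters with an empty FS works in gawk and mawk, but POSIX leaves that behavior unspecified. If that matters, a sketch along these lines walks the line with substr() instead; the function name `translit` and the two sample mappings are assumptions for illustration.

```shell
# Hypothetical filter: reads stdin, writes the transliterated text to stdout.
translit() {
    awk '
    BEGIN { map["a"] = "e"; map["x"] = "ch" }   # sample mappings only
    {
        out = ""
        for (i = 1; i <= length($0); ++i) {
            c = substr($0, i, 1)        # i-th character of the line
            if (c in map) c = map[c]    # replace it if a mapping exists
            out = out c
        }
        print out
    }'
}

printf 'lazy fox\n' | translit    # prints: lezy foch
```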

Ed Morton
  • I tried your codes again but they give no output; maybe I miss something? What I did: copied the code in a new file called script.awk; run the script as suggested. I get neither errors nor output. –  Aug 17 '14 at 16:54
  • I just showed how to populate the mapping table differently, you still need the rest of the script @konsolebox posted to actually do something with that mapping. Hang on and I'll update it with a complete solution. – Ed Morton Aug 17 '14 at 16:57
  • Now it outputs the same text of input. I copied your new code in a new file, then in the shell I did: echo "ae" | awk -f script.awk. Output was: ae –  Aug 17 '14 at 17:16
  • I forgot to add in the setting of FS and OFS when I put together the complete solution, updated now. – Ed Morton Aug 17 '14 at 17:29
  • 1
    Now it works! Thank you very much; I like its ability to check for errors –  Aug 17 '14 at 18:28
2

This can be done quite concisely using a Perl one-liner:

perl -pe '%h=(a=>"xy",c=>"z"); s/(.)/defined $h{$1} ? $h{$1} : $1/eg'

or equivalently (thanks jaypal):

perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg'

%h is a hash containing the characters (keys) and their substitutions (values). s is the substitution command (as in sed). The g modifier means that the substitution is global and the e means that the replacement part is evaluated as an expression. It captures each character one by one and substitutes them with the value in the hash if it exists, otherwise keeps the original value. The -p switch means that each line in the input is automatically printed.

Testing it out:

$ perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg' <<<"abc"
xybz
Tom Fenech
  • Thank you very much! I like the idea of using a one-liner. But I prefer @konsolebox's script because for long lists of substitutions (as in transliterations) his approach gives a cleaner view of what I'm doing... sort of a beautiful embedded character map... –  Aug 17 '14 at 10:49
  • @glenn thanks for the edit - I assume that the double quote in the middle of `a=">xy"` was a typo? It seemed to be working in the first instance, which I guess is just a symptom of using a one-liner. – Tom Fenech Aug 17 '14 at 17:43
  • Exactly for both points. With `use strict`, one would see `Bareword "z" not allowed while "strict subs" in use` – glenn jackman Aug 17 '14 at 17:53
  • 1
    @TomFenech Can be reduced to `perl -pe'%h=(a=>"xy",b=>"z");s|(.)|$h{$1}//=$1|eg' <<<"abc"`. [//=](http://perldoc.perl.org/perlop.html#Logical-Defined-Or) was introduced after 5.8 so should work unless using ancient `perl`. – jaypal singh Aug 17 '14 at 19:17
1

Using sed.

Write a file transliterate.sed containing:

s/a/e/g
s/x/ch/g

and then run from your command line to get the transliterated output.txt from input.txt:

sed -f transliterate.sed input.txt > output.txt

If you need this more often, consider adding #!/bin/sed -f as the first line and making your file executable with chmod 744 transliterate.sed, as described on the Wikipedia page for sed.
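
One thing to check with this approach: sed runs the s commands in order on the same line, so the output of one rule can be matched again by a later rule. The sketch below uses two made-up rules (a becomes e, e becomes i) to show the effect, next to a single-pass awk loop like the one in the other answers.

```shell
# Chained rules: the "e" produced by the first rule is rewritten by the second.
printf 'a e\n' | sed -e 's/a/e/g' -e 's/e/i/g'
# prints "i i"

# Single pass: each input character is looked up exactly once.
# Relies on an empty FS splitting into characters (gawk and mawk do this).
printf 'a e\n' | awk '
    BEGIN { FS = OFS = ""; map["a"] = "e"; map["e"] = "i" }
    { for (i = 1; i <= NF; ++i) if ($i in map) $i = map[$i]; print }'
# prints "e i" on those awks
```

If none of your replacement strings contain characters that are themselves on the left-hand side of another rule, the sed script behaves the same as the single-pass version.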

asdf