I'm looking for an efficient data structure/algorithm for storing and searching a transliteration-based word lookup (like Google's: http://www.google.com/transliterate/ , but I'm not trying to use the Google Transliteration API). Unfortunately, the natural language I'm working with doesn't have any soundex implementation, so I'm on my own.
For an open source project I'm currently using plain arrays to store the word list and dynamically generating a regular expression (based on user input) to match against them. It works fine, but the regular expression is more powerful and resource-intensive than I need. For example, I'm afraid this solution will drain too much battery if I port it to handheld devices, since searching over thousands of words with a regular expression is too costly.
There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is what @Dialecticus suggests:
I want to transliterate from Language1, which has 3 characters (a, b, c), to Language2, which has 6 characters (p, q, r, x, y, z). Because the two languages differ in the number of characters they possess and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume this is our phonetic associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a list of valid words for Language2, stored in plain arrays:
...
px
qy
...
If the user types ac, the possible combinations after transliteration step 1 become px, py, pz, qx, qy, qz. In step 2 we have to do another search in the valid word list and eliminate every one of them except px and qy.
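Here's a minimal Python sketch of that two-step approach, with toy data standing in for the real tables (the names TABLE, VALID, and transliterate are just mine, for illustration):

```python
from itertools import product

# Toy data mirroring the example above -- the real tables are much bigger.
TABLE = {"a": "pq", "b": "r", "c": "xyz"}  # Language1 char -> Language2 candidates
VALID = {"px", "qy"}                       # valid Language2 words, kept in a set

def transliterate(word):
    # Step 1: take the Cartesian product of each character's candidates,
    # so "ac" expands to px, py, pz, qx, qy, qz.
    candidates = ("".join(combo) for combo in product(*(TABLE[ch] for ch in word)))
    # Step 2: keep only the candidates that are real Language2 words.
    return [c for c in candidates if c in VALID]

print(transliterate("ac"))  # -> ['px', 'qy']
```

Keeping the valid words in a set makes step 2 a constant-time membership test per candidate, but step 1 still enumerates every combination.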
What I'm doing currently is not that different from the above approach. Instead of generating all possible combinations from the transliteration table, I build a regular expression, [pq][xyz], and match it against my valid word list, which yields px and qy.
I'm eager to know whether there is a better method than this.