I'm looking for an efficient data structure/algorithm for storing and searching a transliteration-based word lookup (like Google's: http://www.google.com/transliterate/ , but I'm not trying to use the Google Transliteration API). Unfortunately, the natural language I'm working with doesn't have any soundex implementation, so I'm on my own.
For an open source project I'm currently using plain arrays to store the word list and dynamically generating a regular expression (based on user input) to match against them. It works fine, but the regular expression is more powerful and resource-intensive than I need. For example, I'm afraid this solution will drain too much battery if I port it to handheld devices, since searching over thousands of words with a regular expression is too costly.
There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is what @Dialecticus suggests:
I want to transliterate from Language1, which has 3 characters (a, b, c), to Language2, which has 6 characters (p, q, r, x, y, z). Because the two languages differ in the number of characters they possess and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume this is our phonetic associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a list of valid words for Language2, stored in plain arrays:
...
px
qy
...
If the user types ac, the possible combinations after transliteration step 1 become px, py, pz, qx, qy, qz. In step 2 we have to do another search in the valid word list and eliminate every one of them except px and qy.
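Here's a minimal Python sketch of that two-step approach, with toy data standing in for the real tables (the names TABLE, VALID, and transliterate are just mine, for illustration):

```python
from itertools import product

# Toy data mirroring the example above -- the real tables are much bigger.
TABLE = {"a": "pq", "b": "r", "c": "xyz"}  # Language1 char -> Language2 candidates
VALID = {"px", "qy"}                       # valid Language2 words, kept in a set

def transliterate(word):
    # Step 1: take the Cartesian product of each character's candidates,
    # so "ac" expands to px, py, pz, qx, qy, qz.
    candidates = ("".join(combo) for combo in product(*(TABLE[ch] for ch in word)))
    # Step 2: keep only the candidates that are real Language2 words.
    return [c for c in candidates if c in VALID]

print(transliterate("ac"))  # -> ['px', 'qy']
```

Keeping the valid words in a set makes step 2 a constant-time membership test per candidate, but step 1 still enumerates every combination.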
What I'm doing currently is not that different from the above approach. Instead of generating all possible combinations from the transliteration table, I build a regular expression, [pq][xyz], and match it against my valid word list, which yields px and qy.
I'm eager to know whether there is a better method than this.