I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations but when it comes to words like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shining Intel i5 logo on him, this is to my taste too naive solution.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (returns a word on the lowest index with a matching prefix) that is very useful in this case. I start with a first character, search for it's prefix existence in a dictionary and if it's there, I stack it up for the next character that will be tested appended to all of these stacked up sequences. This way I'm propagating only those strings that lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
prefixes = ['']
for c in word:
# each character maps to itself
subchars = [c]
# and some diacritical siblings if they exist
if cmap.has_key(c):
subchars += cmap[c]
# build a list of matching prefixes for the next round
prefixes = [p+s for s in subchars
for p in prefixes
if dictionary.psearch(p+s)>0]
return prefixes
This technique gives very good results but could it be even better? Or is there a technique that doesn't need the character mapping as in this case? I'm not sure this is relevant but the dictionary I'm using isn't sorted by any collate rules so the sequence is 'a', 'z', 'á' not 'a', 'á', 'z' as one could expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.