
I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.

First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:

cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }

Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but for a word like 'nepredvidatelny' (unpredictable) you get 13824 permutations. Even though my laptop has a shining Intel i5 logo on it, this is too naive a solution for my taste.
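The naive expansion can be sketched with itertools.product. The maps below are illustrative reconstructions of the question's cz_map, not the asker's actual tables:

```python
from itertools import product

# Brute-force expansion: each position expands to the character itself
# plus its diacritical siblings, then we take the Cartesian product.
def all_variants(word, cmap):
    choices = [(c,) + cmap.get(c, ()) for c in word]
    return [''.join(p) for p in product(*choices)]

cz_map = {'a': ('á',), 'e': ('é', 'ě')}
print(all_variants('lama', cz_map))
# ['lama', 'lamá', 'láma', 'lámá']

# with a full Czech map, the blow-up from the question is visible:
cz_full = {'a': ('á',), 'c': ('č',), 'd': ('ď',), 'e': ('é', 'ě'),
           'i': ('í',), 'n': ('ň',), 'o': ('ó',), 'r': ('ř',),
           's': ('š',), 't': ('ť',), 'u': ('ú', 'ů'), 'y': ('ý',),
           'z': ('ž',)}
print(len(all_variants('nepredvidatelny', cz_full)))  # 13824
```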

Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the lowest index of a word with a matching prefix) that is very useful in this case. I start with the first character, check whether it exists as a prefix in the dictionary, and if it does, I keep it for the next round, where the next character is appended to all of the kept prefixes. This way I propagate only those strings that can still lead to a match. Here's the code:

def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and some diacritical siblings if they exist
        if c in cmap:
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p+s for s in subchars
                        for p in prefixes
                        if dictionary.psearch(p+s)>0]
    return prefixes

This technique gives very good results, but could it be even better? Or is there a technique that doesn't need the character mapping as this one does? I'm not sure this is relevant, but the dictionary I'm using isn't sorted by any collation rules, so the sequence is 'a', 'z', 'á', not 'a', 'á', 'z' as one might expect.
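For reference, here is a self-contained sketch of the whole approach. WordList is a hypothetical bisect-based stand-in for my dictionary class; I'm assuming psearch returns a positive (1-based) index when some entry starts with the prefix and 0 otherwise, which is one way to make the `> 0` test in the code above work:

```python
import bisect

class WordList:
    """Hypothetical stand-in for the dictionary described above:
    a sorted list plus a bisect-based prefix search."""
    def __init__(self, words):
        self.words = sorted(words)

    def psearch(self, prefix):
        # 1-based index of the first entry with this prefix, 0 if none
        i = bisect.bisect_left(self.words, prefix)
        if i < len(self.words) and self.words[i].startswith(prefix):
            return i + 1
        return 0

def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        subchars = [c]            # each character maps to itself
        if c in cmap:             # plus any diacritical siblings
            subchars += cmap[c]
        # keep only prefixes that still match something
        prefixes = [p + s for s in subchars
                          for p in prefixes
                          if dictionary.psearch(p + s) > 0]
    return prefixes

cz_map = {'i': ('í',), 'r': ('ř',), 'z': ('ž',)}
words = WordList(['kříž', 'lama', 'los'])
print(dia_search('kriz', cz_map, words))  # ['kříž']
```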

Thanks for all comments.

EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.

plebuch

3 Answers


Using only the standard library (`str.maketrans` and `str.translate`) you could do this:

intab = "řížéě"  # ...add all the other characters
outtab = "rizee" # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)

strg = "abc kříž def "
print(strg.translate(transtab))  # abc kriz def

This is for Python 3.

For Python 2 you'd need to:

from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
hiro protagonist
  • This could be used to create a copy of my word list without diacritics. The thing is I cannot keep two copies, only the original one with diacritics. If you have a database that contains 3 million items then duplicating it is not a way to go. – plebuch Jan 29 '17 at 17:29
  • at some point you need to do this (or a similar) translation. be it to create an index in a database or something else... `translate` should be more efficient than just using a `dict`. (3 million items in a db should be perfectly handleable). – hiro protagonist Jan 29 '17 at 21:26
  • At which point do you mean? I presented a solution that works with the original copy and my question was - is there anything more efficient? Duplicating a database of a size of hundreds of MB is something I don't see as more efficient. Another flaw about `translate` is it wouldn't allow me to map German 'ß' to 'ss', or would it? Anyway, `translate` vs `dict` isn't really relevant here because it's used only to hold the mapping which is just a few characters in almost every alphabet that uses latin. In case of Czech it's only 15 diacritic characters that need to be mapped. – plebuch Jan 29 '17 at 23:07
  • you talk about 'database'. maybe you should clarify: what kind of db? would [this answer](http://stackoverflow.com/a/3304596/4954037) help? some dbs can take care of your issue. and you are correct: `translate` can not handle things like 'ß' (one char into two)... – hiro protagonist Jan 30 '17 at 07:35
  • but you are right: my solution does not exactly do what you like and there is probably no remedy for that... – hiro protagonist Jan 30 '17 at 07:48
  • I guess I should have been more specific about the 'database' of words I'm using. I used the term 'database' for something general that stores items, not relating to an actual relational database model. In my case it's a class that has all items (strings in my case) stored in a list and provides an interface for looking up entries. As I mentioned in the original question - it implements a variant of binary search for quick lookups. Using an actual sql database here would be a huge overkill in my case. – plebuch Jan 30 '17 at 10:17
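One note on the 'ß' discussion above: in Python 3, `str.maketrans` also accepts a single dict argument, and its values may be whole strings, so one-to-many replacements like 'ß' → 'ss' do work there:

```python
# dict form of str.maketrans: values can be strings of any length,
# so a single character may be replaced by several characters
transtab = str.maketrans({'ß': 'ss', 'á': 'a', 'é': 'e'})
print("Straße málé".translate(transtab))  # Strasse male
```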

Have a look at Unidecode, which you can use to convert diacritics into the closest ASCII representation, e.g. unidecode(u'kříž').
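For example (Unidecode is a third-party package, installed with `pip install unidecode`):

```python
from unidecode import unidecode

# transliterate to the closest ASCII representation
print(unidecode('kříž'))  # kriz
```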

Harry

As has been suggested, what you want to do is translate your Unicode words (containing diacritics) to the closest plain 26-letter ASCII version.

One way of implementing this would be to create a second list of words (of the same size as the original) with the corresponding translations. Then you do the query on the translated list, and once you have a match, look up the corresponding location in the original list.

Or, in case you can alter the original list, you can translate everything in place and strip duplicates.
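A minimal sketch of the two-list idea, with an illustrative (incomplete) translation table and a bisect-based lookup:

```python
import bisect

# strip table: each diacritic maps to its plain sibling (subset only)
strip = str.maketrans('áéěíóřšžůúý', 'aeeiorszuuy')

original = sorted(['kříž', 'lama', 'los'])
# second list: (stripped word, index into the original list)
pairs = sorted((w.translate(strip), i) for i, w in enumerate(original))
keys = [k for k, _ in pairs]

def search(query):
    # all stripped entries equal to the query, mapped back to originals
    lo = bisect.bisect_left(keys, query)
    hi = bisect.bisect_right(keys, query)
    return [original[i] for _, i in pairs[lo:hi]]

print(search('kriz'))  # ['kříž']
```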

Pablo Arias