How can I add accent-insensitive search to the following snippet from the Django docs:

>>> from django.contrib.postgres.search import TrigramSimilarity
>>> Author.objects.create(name='Katy Stevens')
>>> Author.objects.create(name='Stephen Keats')
>>> test = 'Katie Stephens'
>>> Author.objects.annotate(
...     similarity=TrigramSimilarity('name', test),
... ).filter(similarity__gt=0.3).order_by('-similarity')
[<Author: Katy Stevens>, <Author: Stephen Keats>]

How could this match test = 'Kâtié Stéphèns'?

Private

3 Answers

There is the unaccent lookup:

The unaccent lookup allows you to perform accent-insensitive lookups using a dedicated PostgreSQL extension.

Also, the aggregation section of the Django docs says the following:

When specifying the field to be aggregated in an aggregate function, Django will allow you to use the same double underscore notation that is used when referring to related fields in filters. Django will then handle any table joins that are required to retrieve and aggregate the related value.


Derived from the above:

You can use the trigram_similar lookup, combined with unaccent, then annotate on the result:

Author.objects.filter(
    name__unaccent__trigram_similar=test
).annotate(
    similarity=TrigramSimilarity('name__unaccent', test),
).filter(similarity__gt=0.3).order_by('-similarity')

OR

if you want to keep it as close as possible to the original sample (and avoid running one potentially slow filter after another):

Author.objects.annotate(
    similarity=TrigramSimilarity('name__unaccent', test),
).filter(similarity__gt=0.3).order_by('-similarity')

Both queries only work in Django 1.10 or later.


EDIT:

Although the above should work, @Private reports that it raises this error:

Cannot resolve keyword 'unaccent' into field. Join on 'unaccented' not permitted.

This may be a bug, or unaccent may not be intended to work that way. The following code runs without the error:

Author.objects.filter(
    name__unaccent__trigram_similar=test
).annotate(
    similarity=TrigramSimilarity('name', test),
).filter(similarity__gt=0.3).order_by('-similarity')
John Moutafis
  • That makes sense, but could you show me how to use that in the sample code in my question? – Private May 08 '17 at 10:18
  • Thanks, the example in the docs, however, does not start with filter, but with annotate: `objects.annotate( ... similarity=TrigramSimilarity('name', test), ... ).filter(similarity__gt=0.3).order_by('-similarity')` – Private May 08 '17 at 10:29
  • Yes but the example on the docs is general, and your problem is more specific, the code sample I provided above, does what you want! – John Moutafis May 08 '17 at 10:32
  • Not completely: this only works in one way. If you have a search that is unaccented, it will not find its accented relatives. – Private May 08 '17 at 10:55
  • To be more precise: `'name'` is used in the similarity scoring, but not in its unaccented form. – Private May 08 '17 at 11:03
  • you can use lookups inside `TrigramSimilarity`'s expression as well, have a look at my edit – John Moutafis May 08 '17 at 11:08
  • That gives me: `Cannot resolve keyword 'unaccent' into field. Join on 'unaccented' not permitted.` – Private May 08 '17 at 15:28
  • Have you followed the `unaccent` activation instructions on this link https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/lookups/#unaccent ? – John Moutafis May 09 '17 at 07:37
  • I have. It seems this is a bug? – Private May 09 '17 at 08:43
  • Or it is not intended to work this way. you can report it and let me know as well, it seems interesting! I will edit my answer with the initial code that "mostly" works. – John Moutafis May 09 '17 at 08:49
  • Hey, @Private I am doing a pass through my answers and I was wondering about any news on the subject... Was my answer helpful? – John Moutafis Feb 09 '18 at 10:08
  • To be honest I don't remember. I don't think your answer solved it completely, but it did put me on the right track. I'll accept. – Private Feb 10 '18 at 22:24
  • Digging deeper into the actual numbers: your edit seems correct superficially, but it's actually just filtering without accents and then computing the similarity with accents again (which is wrong). Going to post the right answer... – juan Isaza Jul 06 '23 at 04:12
In case it's useful to someone: the solutions above didn't work for me.

I had to do this:

import unicodedata

from django.contrib.postgres.search import TrigramSimilarity
from django.db.models import Transform


class Unaccent2(Transform):
    # Wraps PostgreSQL's UNACCENT() so it can be applied to a column.
    function = "UNACCENT"
    lookup_name = "unaccent"


def remove_accents(input_str):
    # Decompose accented characters, then drop everything that is not ASCII.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii.decode()


# q is assumed to hold the search string, e.g. 'Kâtié Stéphèns'.
qs = Author.objects.annotate(
    similarity=TrigramSimilarity(Unaccent2('name'), remove_accents(q)),
)
qs = qs.filter(similarity__gt=0.3).order_by('-similarity')
Mario C.

Use the unaccent extension together with the trigram similarity (pg_trgm) extension. Both can be installed in PostgreSQL by running a migration; see:

https://stevenwithph.medium.com/installing-postgres-extensions-with-django-migration-files-462669984bc5
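
As a rough sketch of what such a migration might look like: the app label myapp, the file name, and the 0001_initial dependency are all hypothetical and need to be adapted to your project. TrigramExtension and UnaccentExtension are the operations Django ships in django.contrib.postgres.operations. Creating extensions usually requires sufficient database privileges.

```python
# Hypothetical file: myapp/migrations/0002_postgres_extensions.py
from django.contrib.postgres.operations import TrigramExtension, UnaccentExtension
from django.db import migrations


class Migration(migrations.Migration):
    # Assumption: the app already has an initial migration named 0001_initial.
    dependencies = [
        ("myapp", "0001_initial"),
    ]

    operations = [
        TrigramExtension(),   # enables pg_trgm (trigram similarity)
        UnaccentExtension(),  # enables unaccent
    ]
```

Running python manage.py migrate would then enable both extensions in the database the project is connected to.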

Then:

from django.contrib.postgres.search import TrigramSimilarity
from django.contrib.postgres.lookups import Unaccent

Author.objects.annotate(
    similarity=TrigramSimilarity(Unaccent('name'), test),
).filter(similarity__gt=0.3).order_by('-similarity')

Note that when applying unaccent to an expression inside annotate(), you should use the Unaccent function rather than the __unaccent notation.
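
The query above unaccents the name column in the database, but the search string is passed as is. If the search string can contain accents too, as with 'Kâtié Stéphèns' in the question, one option is to normalize it in Python before building the query. A minimal sketch, where strip_accents is a hypothetical helper name and not part of Django:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFKD splits characters such as 'é' into 'e' plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


test = strip_accents("Kâtié Stéphèns")
print(test)
```

The resulting test value can then be passed to TrigramSimilarity in place of the raw search string. Unlike encoding to ASCII, this approach only drops combining marks, so characters from non-Latin scripts are left in place.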

juan Isaza