
Suppose I have the following table:

import uuid

from django.db.models import CharField, Model, UUIDField
from django.utils.translation import ugettext_lazy as _


class Word(Model):
    id = UUIDField(primary_key=True,
                   default=uuid.uuid4,
                   editable=False)
    name = CharField(_('Word'),
                     db_index=True,
                     max_length=250)

This table contains around 3 million rows, mostly Cyrillic words.

What I want to accomplish is to search for a predefined list of values. The problem is that the words in the table are accented.

The table's content looks like:

мо́рква
буря́к
я́блуко

But I query without accents:

words = ['морква', 'буряк', 'яблуко']

import operator
from functools import reduce
from django.db.models import F, Func, Q, Value

Word.objects.annotate(name_without_accent=Func(
    F('name'), Value('[%s]' % ACCENTS), Value(''),  # ACCENTS is defined elsewhere
    function='regexp_replace',
)).filter(reduce(operator.or_, [Q(name_without_accent__icontains=w) for w in words], Q()))

The problem is that this query runs very long on such a big table. The words list can also contain more elements (around 20-30).

Is there any way to improve the overall performance of querying such a big table? Improve the Django query? Tune Postgres? Use some external tools?


1 Answer


Try using `unicodedata.normalize('NFD', word)` instead of the regexp. And if you run a lot of such queries, it would be reasonable to add an extra field with unaccented versions of the words and build an index on that field, because in your situation `db_index` doesn't help at all; it just slows the table down with reindexing.
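For example, a minimal sketch of the idea (the `strip_accents` helper and the `name_unaccented` field are illustrative names, not from the question):

import unicodedata

def strip_accents(s):
    # NFD decomposes 'о́' into 'о' plus a combining acute accent (U+0301);
    # dropping combining marks (category 'Mn') leaves the bare letters.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

class Word(Model):
    ...  # id and name as in the question
    name_unaccented = CharField(max_length=250, db_index=True, editable=False)

    def save(self, *args, **kwargs):
        # Keep the unaccented copy in sync on every save.
        self.name_unaccented = strip_accents(self.name)
        super(Word, self).save(*args, **kwargs)

Existing rows need a one-off backfill; after that the filter can run directly against `name_unaccented` (e.g. `Q(name_unaccented__icontains=w)`) without a per-row `regexp_replace`.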

  • Where should I use `unicodedata.normalize('NFD', word)`? – Paul R Jun 20 '17 at 20:46
  • In the annotation field – vZ10 Jun 21 '17 at 03:09
  • Forget my previous comment about the annotation ((( It was too early in the morning for answering, sorry ((( I think the best solution would be to create an additional field or even an additional temporary table – vZ10 Jun 21 '17 at 07:27
  • Did as you suggested, but `contains` is still very slow. How can I improve it? The query runs 7 seconds (instead of 65 as before). – Paul R Jun 21 '17 at 22:14
  • 7 seconds even after indexing that field? By the way, if you index it using Django migrations, check in the db whether it really creates the index. Sometimes Django indexing has no effect. – vZ10 Jun 22 '17 at 05:38
  • I used the scheme above, just to test that it works: `Word.objects.filter(reduce(operator.or_, [Q(name__contains=w) for w in words], Q()))` – Paul R Jun 22 '17 at 07:00
  • I didn't get it: did you add an additional field with unaccented words? – vZ10 Jun 22 '17 at 09:02
  • No. I just tested whether plain field comparison would be faster than the regex. It is faster, but still slow. – Paul R Jun 22 '17 at 09:13
  • Tell me more about your use case: why do you use Q objects and `contains`? Can the 'name' field hold a longer string of which only a part is 'буряк'? – vZ10 Jun 22 '17 at 09:49
  • Installed a GIN index (https://stackoverflow.com/questions/43793987/how-to-create-gin-index-in-django-migration). No effect. – Paul R Jun 22 '17 at 12:07
  • Yes. I copied the query from Django, because it contains several words: `SELECT "word_ukword"."id", "word_ukword"."name" FROM "word_ukword" WHERE ("word_ukword"."name"::text LIKE '%морква%' OR "word_ukword"."name"::text LIKE '%моркві%' OR "word_ukword"."name"::text LIKE '%моркви%' OR "word_ukword"."name"::text LIKE '%моркво%' OR "word_ukword"."name"::text LIKE '%морквою%' OR "word_ukword"."name"::text LIKE '%моркву%')` – Paul R Jun 22 '17 at 12:40
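For `LIKE '%…%'` searches like the query in the last comment, the plain B-tree index that `db_index=True` creates cannot be used at all; to make a GIN index help here, it has to be a trigram index from PostgreSQL's `pg_trgm` extension. A minimal sketch of a raw-SQL migration, assuming the `word_ukword` table from the query above (the migration dependency and index name are illustrative):

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('word', '0001_initial')]  # hypothetical

    operations = [
        # pg_trgm provides the trigram operator classes used below.
        migrations.RunSQL('CREATE EXTENSION IF NOT EXISTS pg_trgm;'),
        # A GIN trigram index lets Postgres answer LIKE/ILIKE '%...%'
        # patterns from the index instead of a sequential scan.
        migrations.RunSQL(
            'CREATE INDEX word_ukword_name_trgm_idx ON word_ukword '
            'USING gin (name gin_trgm_ops);',
            reverse_sql='DROP INDEX word_ukword_name_trgm_idx;',
        ),
    ]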