2

After solving the problem I asked about in this question, I am trying to optimize performance of the FTS using indexes. I issued on my db the command:

CREATE INDEX my_table_idx ON my_table USING gin(to_tsvector('italian', very_important_field), to_tsvector('italian', also_important_field), to_tsvector('italian', not_so_important_field), to_tsvector('italian', not_important_field), to_tsvector('italian', tags));

Then I edited my model's Meta class as follows:

class MyEntry(models.Model):
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'my_table'
        indexes = [
            GinIndex(
                fields=['very_important_field', 'also_important_field', 'not_so_important_field', 'not_important_field', 'tags'],
                name='my_table_idx'
            )
        ]

But nothing seems to have changed. The lookup takes exactly the same amount of time as before.

This is the lookup script:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# other unrelated stuff here
vector = SearchVector("very_important_field", weight="A") + \
             SearchVector("tags", weight="A") + \
             SearchVector("also_important_field", weight="B") + \
             SearchVector("not_so_important_field", weight="C") + \
             SearchVector("not_important_field", weight="D")
query = SearchQuery(search_string, config="italian")
rank = SearchRank(vector, query, weights=[0.4, 0.6, 0.8, 1.0]). # D, C, B, A
full_text_search_qs = MyEntry.objects.annotate(rank=rank).filter(rank__gte=0.4).order_by("-rank")

What am I doing wrong?

Edit:

The above lookup is wrapped in a function I use a decorator on to time. The function actually returns a list, like this:

@timeit
def search(search_string):
    # the above code here
    qs = list(full_text_search_qs)
    return qs

Might this be the problem, maybe?

Biagio Distefano
  • 130
  • 1
  • 13

2 Answers2

4

You need to add a SearchVectorField to your MyEntry, update it from your actual text fields and then perform the search on this field. However, the update can only be performed after the record has been saved to the database.

Essentially:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField

class MyEntry(models.Model):
    # The fields that contain the raw data.
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    # The field we actually going to search.
    # Must be null=True because we cannot set it immediately during create()
    search_vector = SearchVectorField(editable=False, null=True)  

    class Meta:
        # The search index pointing to our actual search field.
        indexes = [GinIndex(fields=["search_vector"])]

Then you can create the plain instance as usual, for example:

# Does not set MyEntry.search_vector yet.
my_entry = MyEntry.objects.create(
    very_important_field="something very important",  # Fake Italien text ;-)
    also_important_field="something different but equally important"
    not_so_important_field="this one matters less"
    not_important_field="we don't care are about that one at all"
    tags="things, stuff, whatever"

Now that the entry exists in the database, you can update the search_vector field using all kinds of options. For example weight to specify the importance and config to use one of the default language configurations. You can also completely omit fields you don't want to search:

# Update search vector on existing database record.
my_entry.search_vector = (
    SearchVector("very_important_field", weight="A", config="italien")
    + SearchVector("also_important_field", weight="A", config="italien")
    + SearchVector("not_so_important_field", weight="C", config="italien")
    + SearchVector("tags", weight="B", config="italien")
)
my_entry.save()

Manually updating the search_vector field every time some of the text fields change can be error prone, so you might consider adding an SQL trigger to do that for you using a Django migration. For an example on how to do that see for instance a blog article on Full-text Search with Django and PostgreSQL.

To actually search in MyEntry using the index you need to filter and rank by your search_vector field. The config for the SearchQuery should match the one of the SearchVector above (to use the same stopword, stemming etc).

For example:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.core.exceptions import ValidationError
from django.db.models import F, QuerySet

search_query = SearchQuery("important", search_type="websearch", config="italien")
search_rank = SearchRank(F("search_vector"), search_query)
my_entries_found = (
    MyEntry.objects.annotate(rank=search_rank)
    .filter(search_vector=search_query)  # Perform full text search on index.
    .order_by("-rank")  # Yield most relevant entries first.
)
roskakori
  • 3,139
  • 1
  • 30
  • 29
0

I'm not sure but according to postgresql documentation (https://www.postgresql.org/docs/9.5/static/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX):

Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index will be used only with the same configuration used to create the index entries.

I don't know what configuration django uses but you can try to remove first argument

S-Man
  • 22,521
  • 7
  • 40
  • 63
Kanarsky
  • 162
  • 1
  • 10
  • What do you mean by removing the first argument? Where should I remove it? If I remove the argument in to_tsvector() I won't be able to search in the correct language. – Biagio Distefano Feb 13 '18 at 20:15
  • Do you use this language argument as a config parametr in searchquery? – Kanarsky Feb 14 '18 at 11:28
  • Also, you can try to store ts_vector data as an additional field in your table, create index on that field and use it in your search queries. And how do you evaluate the amount of time? – Kanarsky Feb 14 '18 at 11:41
  • I edited the question with some more info about the timing. I'll try to add a field as you suggested and come back with a feedback. If the problem is the fact that I turn the QuerySet into a list I'll have another major problem, because I'm using graphql and there's a bug for which I must return a list when annotating and using order_by("-rank") – Biagio Distefano Feb 14 '18 at 14:42
  • And yes, the language argument is used both when I create the index and when I make the query. I tried playing with it but with no effect. – Biagio Distefano Feb 14 '18 at 14:46
  • You can always try to send query directly using pgadmin or psql – Kanarsky Feb 15 '18 at 11:55