
I have two models, Author and Book, which are related via M2M (one author can have many books; one book can have many authors).

Often we need to query and match records for ingests using text strings across both models, e.g. "JRR Tolkien - Return of the King", when unique identifiers are not available.

I would like to test whether using SearchVectorField with GIN indexes can improve full-text search response times, but since the search query will be SearchVector(author__name, book__title), it seems that both models need a SearchVectorField added.

This becomes more complicated when each table needs updating, since it appears Postgres triggers need to be set up on both tables, which might make updating anything untenable.
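For context, a minimal sketch of what one such trigger could look like, wrapped in a Django migration. The app label, migration dependency, table name (myapp_author), and column names here are assumptions for illustration, not the actual schema:

from django.db import migrations


class Migration(migrations.Migration):
    # Assumed to run after the migration that added Author.search
    dependencies = [("myapp", "0002_author_search")]

    operations = [
        migrations.RunSQL(
            # Keep Author.search in sync with Author.name on every write,
            # using Postgres's built-in tsvector_update_trigger helper.
            sql="""
                CREATE TRIGGER author_search_update
                BEFORE INSERT OR UPDATE ON myapp_author
                FOR EACH ROW EXECUTE PROCEDURE
                    tsvector_update_trigger(search, 'pg_catalog.english', name);
            """,
            reverse_sql="DROP TRIGGER IF EXISTS author_search_update ON myapp_author;",
        ),
    ]

A second, near-identical trigger would be needed on the Book table, which is the maintenance burden described above.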

Question

What is the modern best practice in Django for adopting vectorised full-text search methods where M2M-related models are concerned? Should the SearchVectorField be placed on the through table, or on each model? How should triggers be applied?

I've been searching for guides on this specifically, but no one seems to mention M2Ms when talking about SearchVectorFields. I did find this old question.

Also, if Postgres is really not the way forward in modern Django, I'd gladly take direction on something better suited/supported/documented. In our case, we are using Postgres 11.6.

Repro

from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Author(models.Model):
    name = models.CharField(max_length=100, unique=True)
    main_titles = models.ManyToManyField(
        "Book",
        through="BookMainAuthor",
        related_name="main_authors",
    )
    search = SearchVectorField(null=True)

    class Meta:
        # GIN index so the search vector can be queried efficiently
        indexes = [GinIndex(fields=["search"])]

class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)."""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        unique_together = ["book", "author"]

class Book(models.Model):
    title = models.CharField(max_length=100, unique=True)
    search = SearchVectorField(null=True)

    class Meta:
        # GIN index so the search vector can be queried efficiently
        indexes = [GinIndex(fields=["search"])]

Exploring indexing the M2M Through table

Exploring Yevgeniy Kosmak's answer below, this is a simple way to index the string permutations of the through table for Book.title and Author.name.

Performing a search using the SearchVectorField is fast and a little more effective for some titles that have multiple authors.
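As a rough sketch, the vector itself can be (re)populated in bulk with a single update(), since SearchVector here only needs to reference a local column (book_full_name is the denormalised column from the answer below):

from django.contrib.postgres.search import SearchVector

# Rebuild the vector for every row from the local denormalised column;
# SearchVector inside update() can't reference joined fields.
BookMainAuthor.objects.update(
    search=SearchVector("book_full_name", config="english")
)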

However, when trying to use SearchRank, things slow down dramatically:

BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).order_by("-rank").explain(analyze=True)

"Gather Merge  (cost=394088.44..489923.26 rows=821384 width=227) (actual time=8569.729..8812.096 rows=989307 loops=1)
Workers Planned: 2
Workers Launched: 2
  ->  Sort  (cost=393088.41..394115.14 rows=410692 width=227) (actual time=8559.074..8605.681 rows=329769 loops=3)
        Sort Key: (ts_rank(to_tsvector(COALESCE((search_vector)::text, ''::text)), plainto_tsquery('JRR Tolkien - Return of the King'::text), 6)) DESC
        Sort Method: external merge  Disk: 77144kB
 – 

Worker 0:  Sort Method: external merge  Disk: 76920kB
        Worker 1:  Sort Method: external merge  Disk: 76720kB
        ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.589..8378.569 rows=329769 loops=3)
Planning Time: 0.369 ms
Execution Time: 8840.139 ms"

Omitting the sort only saves ~500 ms:

BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).explain(analyze=True)

Gather  (cost=1000.00..364517.21 rows=985661 width=227) (actual time=0.605..8282.976 rows=989307 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.356..8187.242 rows=329769 loops=3)
Planning Time: 0.039 ms
Execution Time: 8306.799 ms

However, I noticed that the following changes dramatically improve query execution time (~17x), sorting included:

  1. Add an F expression as the first argument to SearchRank (instead of the field name in quotes, which is what the documentation directs you to do)
  2. Add a config kwarg to the SearchQuery

Comparing the plans, the first query re-runs to_tsvector over a text cast of the column for every row (see the Sort Key above), while this one uses the stored tsvector directly:

BookMainAuthor.objects.annotate(rank=SearchRank(F("search"), SearchQuery("JRR Tolkien - Return of the King", config="english"))).order_by("-rank").explain(analyze=True)

Gather Merge  (cost=304240.66..403077.76 rows=847116 width=223) (actual time=336.654..559.367 rows=989307 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=303240.63..304299.53 rows=423558 width=223) (actual time=334.970..373.282 rows=329769 loops=3)
        Sort Key: (ts_rank(search_vector, '''jrr'' & ''tolkien'' & ''return'' & ''king'''::tsquery)) DESC
        Sort Method: external merge  Disk: 75192kB
        Worker 0:  Sort Method: external merge  Disk: 76672kB
        Worker 1:  Sort Method: external merge  Disk: 76976kB
        ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..173893.48 rows=423558 width=223) (actual time=0.014..211.007 rows=329769 loops=3)
Planning Time: 0.059 ms
Execution Time: 584.402 ms

lys
  • Could you explain the hardest goal of this task? For now, it seems to me that the structure you suggested would work just fine, provided you _separately_ select `Author`s and `Book`s and only then join them. What difficulty exactly are you afraid of? – Yevgeniy Kosmak Dec 22 '21 at 16:36
  • Not a matter of fear, it’s a matter of avoiding this: `FieldError: Joined field references are not permitted in this query`. Do you mean separately query them and then join after? That’s basically what I’ve done already – lys Dec 22 '21 at 17:49
  • Can’t the through table be indexed? Seems like this would be the best way to optimise the search across m2m permutations – lys Dec 22 '21 at 17:51
  • The idea is to use the search vector across strings that match in fields in *both* tables – lys Dec 22 '21 at 17:54

1 Answer


Finally got it. I suppose you need to search with a query containing both the author's name and the book's name at the same time, and you wouldn't be able to split the query so as to look in the Book table for the "book" part and in the Author table for the "author" part.

Yep, making an index over fields from separate tables is impossible with PostgreSQL. I don't see this as a weakness of PostgreSQL; it's just a very unusual case in which you really need such an index. In most cases there are other solutions that are no worse in terms of efficiency. Of course, you can always look at Elasticsearch if for some reason you are sure it's necessary.

I'd advise the following approach. You can give BookMainAuthor this structure:

class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)."""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)
    book_full_name = models.CharField(max_length=200)
    search = SearchVectorField(null=True)

    class Meta:
        unique_together = ["book", "author"]

As it seems to me, it shouldn't be any trouble to maintain the book_full_name field, which would contain both the author's and the book's names with an appropriate separator. Everything else is a textbook case.
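For illustration, a sketch of how both columns could be populated in bulk. A plain update() can't reference joined fields (it raises the `FieldError: Joined field references are not permitted in this query` from the comments above), so one workaround, as the comments below describe, is a correlated Subquery with OuterRef:

from django.contrib.postgres.search import SearchVector
from django.db.models import OuterRef, Subquery, Value
from django.db.models.functions import Concat

# Build "Author Name - Book Title" per row; the joins to Author and
# Book happen inside the correlated subquery, not in update() itself.
full_name = (
    BookMainAuthor.objects.filter(pk=OuterRef("pk"))
    .annotate(full=Concat("author__name", Value(" - "), "book__title"))
    .values("full")[:1]
)
BookMainAuthor.objects.update(book_full_name=Subquery(full_name))

# Then refresh the search vector from the now-local column.
BookMainAuthor.objects.update(
    search=SearchVector("book_full_name", config="english")
)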

From my experience, if the BookMainAuthor table contains no more than 10M entries, everything will be just fine on an average single server (for example, like the AX161 from here).

Yevgeniy Kosmak
  • Interesting - I had wondered if something like this would work. Unfortunately, it's not performant. Applying an index to this table doesn't do anything, Postgres opts for a parallel sequential scan every time. Because it has to scan the whole table - this isn't a great solution. This is in a table with only 990k records. – lys Dec 24 '21 at 07:26
  • Ah but adding `.values('title')` brings the execution time down to `Execution Time: 128.246 ms` ! This is 58.5x faster. Though it's still not using the GIN Index. Perhaps a composite index would be better suited here... – lys Dec 24 '21 at 08:03
  • I’m pretty sure you have some issues in the implementation. The task of finding `BookMainAuthor` entries surely can, and should, be solved with a `GIN` index. Could you please share the ORM code, the produced SQL query, and `EXPLAIN ANALYZE` of that query on your machine? I might help you with narrowing down the issue. – Yevgeniy Kosmak Dec 24 '21 at 08:19
  • Great, sure thing. Firstly I had to alter this line: `book_full_name = models.TextField(blank=True)` since we can't create a unique field that is blank or with a default. Each row is already unique. Next: I added this to `class Meta:` ``` indexes = ( GinIndex( name="BookMainAuthorIndex", fields=["search"], fastupdate=False ), ) unique_together = ["book", "author"] ``` This is updated for every object using Subquery, OuterRef and `.update` – lys Dec 24 '21 at 09:03
  • Then, to query: `BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).order_by("-rank").explain(analyze=True)` – lys Dec 24 '21 at 09:10
  • `"Gather Merge (cost=394088.44..489923.26 rows=821384 width=227) (actual time=8569.729..8812.096 rows=989307 loops=1)\n Workers Planned: 2\n Workers Launched: 2\n -> Sort (cost=393088.41..394115.14 rows=410692 width=227) (actual time=8559.074..8605.681 rows=329769 loops=3)\n Sort Key: (ts_rank(to_tsvector(COALESCE((search_vector)::text, ''::text)), plainto_tsquery('JRR Tolkien - Return of the King'::text), 6)) DESC\n Sort Method: external merge Disk: 77144kB\n` – lys Dec 24 '21 at 09:11
  • `Worker 0: Sort Method: external merge Disk: 76920kB\n Worker 1: Sort Method: external merge Disk: 76720kB\n -> Parallel Seq Scan on bookstore_bookmainauthor (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.589..8378.569 rows=329769 loops=3)\nPlanning Time: 0.369 ms\nExecution Time: 8840.139 ms"` – lys Dec 24 '21 at 09:11
  • Without the sort: `BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King")).explain(analyze=True)` ....... `'Gather (cost=1000.00..364517.21 rows=985661 width=227) (actual time=0.605..8282.976 rows=989307 loops=1)\n Workers Planned: 2\n Workers Launched: 2\n -> Parallel Seq Scan on bookstore_bookmainauthor (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.356..8187.242 rows=329769 loops=3)\nPlanning Time: 0.039 ms\nExecution Time: 8306.799 ms'` – lys Dec 24 '21 at 09:16
  • Using `.values('book')` : `BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King")).values('book').explain(analyze=True)` ...... `'Seq Scan on bookstore_bookmainauthor (cost=0.00..62274.61 rows=985661 width=45) (actual time=0.354..100.761 rows=989307 loops=1)\nPlanning Time: 0.049 ms\nExecution Time: 118.056 ms'` – lys Dec 24 '21 at 09:19
  • Figured out the error in my implementation? – lys Dec 25 '21 at 09:56
  • Lots of details posted above? – lys Dec 25 '21 at 10:03
  • Oh, I misread your message, didn’t notice the question mark. Well, I’m not sure for now. It would be much easier to read if you added all this data to your question. Also, you didn’t add the computed SQL queries, only their `EXPLAIN ANALYZE`s. – Yevgeniy Kosmak Dec 25 '21 at 10:07
  • Ah of course - let me add everything to the original post. I discovered some interesting subtleties in Django's syntax which really affect this – lys Dec 25 '21 at 22:53
  • If you edit your answer to factor in the problem with `Unique=True` on the model field I will award the bounty and mark this as answered :) – lys Dec 26 '21 at 22:51
  • Wait, so the unique constraint was the main issue? And if you remove it, everything works fast? – Yevgeniy Kosmak Dec 26 '21 at 23:10
  • No, the unique constraint doesn't work for full text, since only the PKs for each model are unique; the full text is not unique in all cases. Also, the model won't migrate without a default value for this field, so it's not a working solution. – lys Dec 26 '21 at 23:15
  • I removed it. Very grateful for the bounty despite my incomplete assistance. But I can't say I understand the situation completely, and it is quite interesting. I've just become familiar with [DB fiddle](https://www.db-fiddle.com/) services, and it would be helpful for me to play around with your case a bit more. So it would be very nice of you if you'd share your final DDLs. – Yevgeniy Kosmak Dec 26 '21 at 23:22
  • Sounds great - I'd appreciate your assistance! Let me set up a DB fiddle; I will need to think about it a little as I'm more familiar with Django than straight SQL – lys Dec 26 '21 at 23:29