0

Documents is the parent table. Paragraphs is the child table.

Users filter Documents based on various search criteria. Then I wish to annotate Documents with certain Paragraphs filtered by a text query. The same text query is used to filter Documents and rank them (SearchRank). This ranking makes it necessary to start from Documents and annotate them with Paragraphs, instead of starting from Paragraphs and grouping them by Document.

The postgresql way of concatenating one text field from multiple rows in Paragraphs would be the following:

SELECT array_to_string(
ARRAY(
SELECT paragraph.text
FROM paragraph
WHERE document id = '...'
ORDER BY paragraph.number),
', ');

I am trying to translate this into django coding.

I have tried numerous django approaches, to no avail. I can annotate 1 Paragraph. Query_sum is a Q() object built from user input.

results = Documents.filter(Query_sum)

sub_paragraphs = Paragraphs.filter(Query_sum).filter(document=OuterRef('id'))

results = results.annotate(paragraphs=Subquery(sub_paragraphs.values('text')[:1], output_field=TextField()))

Problems start when I get rid of slicing [:1].

results = results.annotate(paragraphs=Subquery(sub_paragraphs.values('text'), output_field=TextField()))

I then get the following error: "more than one row returned by a subquery used as an expression".

To fix that, I tried to use ArrayAgg and StringAgg. I made quite a mess ;-)

The Documents queryset (result) should be annotated either with a list of relevant Paragraphs (ArrayAgg), or a string of Paragraphs separated by any delimiter (StringAgg).

Any idea of how to proceed? I would be extremely grateful

Iain Shelvington
  • 31,030
  • 3
  • 31
  • 50
edyas
  • 1
  • 1
  • 2

1 Answers1

4

We can annotate and order the documents with the number of paragraphs it has that match the query by using annotate with Sum, Case and When

documents = Document.objects.annotate(
    matches=Sum(Case(
        # This could depend on the related name for the paragraph -> document relationship
        When(paragraphs__text__icontains=search_string, then=Value(1)),
        default=Value(0),
        output_field=IntegerField(),
    )))
).order_by('-matches')

Then, to get all the paragraphs that match the query for each document we an use prefetch_related. We can use a Prefetch object to filter the prefetch operation

documents = documents.prefetch_related(Prefetch(
    'paragraphs',
    queryset=Paragraph.objects.filter(text__icontains=search_string),
    to_attrs='matching_paragraphs'
))

You can then loop over the documents in ranked order and they will have an attribute "matching_paragraphs" that contains all the matching paragraphs

Iain Shelvington
  • 31,030
  • 3
  • 31
  • 50
  • I love it! The beauty of simplicity. I had to tweak the Prefetch a little, 'paragraph_set' instead of 'paragraphs' and 'to_attr' instead of 'to_attrs'. Many thanks! – edyas Jul 11 '19 at 09:51