2

Context

I am quite new to Django and I am trying to write a complex query that I think would be easily writable in raw SQL, but for which I am struggling using the ORM.

Models

I have several models named SignalValue, SignalCategory, SignalSubcategory, SignalType, SignalSubtype that have the same structure like the following model:

class MyModel(models.Model):
    id = models.BigAutoField(primary_key=True)
    name = models.CharField()
    fullname = models.CharField()

I also have explicit models that represent the relationships between the model SignalValue and the other models SignalCategory, SignalSubcategory, SignalType, SignalSubtype. Each of these relationships are named SignalValueCategory, SignalValueSubcategory, SignalValueType, SignalValueSubtype respectively. Below is the SignalValueCategory model as an example:

class SignalValueCategory(models.Model):
    signal_value = models.OneToOneField(SignalValue)
    signal_category = models.ForeignKey(SignalCategory)

Finally, I also have the two following models. ResultSignal stores all the signals related to the model Result:

class Result(models.Model):
    pass


class ResultSignal(models.Model):
    id = models.BigAutoField(primary_key=True)

    result = models.ForeignKey(
        Result
    )
    signal_value = models.ForeignKey(
        SignalValue
    )

Query

What I am trying to achieve is the following. For a given Result, I want to retrieve all the ResultSignals that belong to it, filter them to keep the ones of my interest, and annotate them with two fields that we will call filter_group_id and filter_group_name. The values of two fields are determined by the SignalValue of the given ResultSignal.

From my perspective, the easiest way to achieve this would be first to annotate the SignalValues with their corresponding filter_group_name and filter_group_id, and then to join the resulting QuerySet with the ResultSignals. However, I think that it is not possible to join two QuerySets together in Django. Consequently, I thought that we could maybe use Prefetch objects to achieve what I am trying to do, but it seems that I am unable to make it work properly.

Code

I will now describe the current state of my queries.

First, annotating the SignalValues with their corresponding filter_group_name and filter_group_id. Note that filter_aggregator in the following code is just a complex filter that allows me to select the wanted SignalValues only. group_filter is the same filter but as a list of subfilters. Additionally, filter_name_case is a conditional expression (Case() construct):

# Attribute a group_filter_id and group_filter_name for each signal
signal_filters = SignalValue.objects.filter(
    filter_aggregator
).annotate(
    filter_group_id=Window(
        expression=DenseRank(),
        order_by=group_filters
    ),
    filter_group_name=filter_name_case
)

Then, trying to join/annotate the SignalResults:

prefetch_object = Prefetch(
    lookup="signal_value",
    queryset=signal_filters,
    to_attr="test"
 )

result_signals: QuerySet = (
    last_interview_result
        .resultsignal_set
        .filter(signal_value__in=signal_values_of_interest)
        .select_related(
            'signal_value__signalvaluecategory__signal_category', 
            'signal_value__signalvaluesubcategory__signal_subcategory',
            'signal_value__signalvaluetype__signal_type',
            'signal_value__signalvaluesubtype__signal_subtype',
        )
        .prefetch_related(
            prefetch_object
        )
        .values(
            "signal_value",
            "test",
            category=F('signal_value__signalvaluecategory__signal_category__name'), 
            subcategory=F('signal_value__signalvaluesubcategory__signal_subcategory__name'),
            type=F('signal_value__signalvaluetype__signal_type__name'),
            subtype=F('signal_value__signalvaluesubtype__signal_subtype__name'),
        )
)

Normally, from my understanding, the resulting QuerySet should have a field "test" that is now available, that would contain the fields of signal_filter, the first QuerySet. However, Django complains that "test" is not found when calling .values(...) in the last part of my code: Cannot resolve keyword 'test' into field. Choices are: [...]. It is like the to_attr parameter of the Prefetch object was not taken into account at all.

Questions

  1. Did I missunderstand the functioning of annotate() and prefetch_related() functions? If not, what am I doing wrong in my code for the specified parameter to_attr to not exist in my resulting QuerySet?
  2. Is there a better way to join two QuerySets in Django or am I better off using RawSQL? An alternative way would be to switch to Pandas to make the join in-memory, but it is very often more efficient to do such transformations on the SQL side with well-designed queries.
MetaHG
  • 21
  • 3

2 Answers2

0

You're on the right path, but just missing what prefetch does.

  1. Your annotations are correct, but the "test" prefetch isn't really an attribute. You batch up the SELECT * FROM signal_value queries so you don't have to execute the select per row. Just drop the "test" annotation and you should be fine. https://docs.djangoproject.com/en/3.2/ref/models/querysets/#prefetch-related

  2. Please don't use pandas, it's definitely not necessary and is a ton of overhead. As you say yourself, it's more efficient to do the transforms on the sql side

Murdahl
  • 11
  • 4
  • I am not quite sure to understand what you mean. Do you mean that I should drop the parameter `to_attr` of `prefetch_object` and that I should then be able to access the `signal_value` custom annotated fields like so `.values([...], signal_value__filter_group_id)`? If that's the case, I have tested it, but I get a similar error where `group_filter_id` cannot be resolved. And regarding "test" not being an attribute, it looks like one to me from the doc. So I must miss something. Could you also clarify this part, please? – MetaHG Sep 22 '21 at 16:30
  • I came across a similar example today actually. Though it's not accessing the prefretch in the queryset. I think you could go to access per result_signal: `for result_signal in result_signals: test = next( iter(result_signal.test), None )` – Murdahl Oct 01 '21 at 13:09
0

From the docs on prefetch_related:

Remember that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously cached results, and retrieve data using a fresh database query.

It's not obvious but the values() call is part of these chained methods that imply a different query, and will actually cancel prefetch_related. This should work if you remove it.

Brian Destura
  • 11,487
  • 3
  • 18
  • 34
  • I have tried to remove the `.values()` call, but I still cannot access the `to_attr="test"` value after that. More precisely, I did the following after removing the call: `result_signals.first().test`. I think it should be the appropriate way to access what I want right? Regarding the chained methods that imply a different database query, the `.values()` method is supposed to be the equivalent of a simple `SELECT` clause in SQL. What's the point of using the ORM if we cannot even do that on prefetched annotated QuerySets? It does not really make sense to have to do the selection in Python. – MetaHG Sep 23 '21 at 07:34
  • `result_signals.first()` also implies another query, so what you can do in this case is `result_signals.all()[0].test` as described in the docs. About `.values()`, it also does a `group by`. I think what you can use in this case is `.only()` but I haven't tried if it will break the cache as well. – Brian Destura Sep 23 '21 at 07:45
  • So I have just checked, but it does not seem to work better :(. Still having the same error when using `result_signals.all()[0].test`, i.e., `'ResultSignal' object has no attribute 'test'`. – MetaHG Sep 23 '21 at 09:21
  • I see can you share the all the latest relevant code you have? – Brian Destura Sep 23 '21 at 09:44
  • Actually, I have put the code from my post inside a separate function to avoid modifying it until I would have found the right solution here on SO (and in the meantime as a temporary solution, I am using the mentioned alternative of Pandas in a different function, even though I am well-aware of the overhead taking place). So there is no code update since then. Let me know if there is any other information that I could give to help find the source of my issue! – MetaHG Sep 26 '21 at 19:04