2

So here are my models:

class Event(models.Model):
    user = models.ForeignKey(User, blank=True, null=True, db_index=True)
    name = models.CharField(max_length = 200, db_index=True)
    platform = models.CharField(choices = (("ios", "ios"), ("android", "android")), max_length=50)

class User(AbstractUser):
    email = models.CharField(max_length=50, null=False, blank=False, unique=True)

Event is like an analytics event, so it's very possible that I could have multiple events for one user, some with platform=ios and some with platform=android, if a user has logged in on multiple devices. I want to query to see how many users have both ios and android devices. So I wrote a query like this:

User.objects.filter(Q(event__platform="ios") & Q(event__platform="android")).count()

Which returns 0 results. I know this isn't correct. I then thought I would try to just query for iOS users:

User.objects.filter(Q(event__platform="ios")).count()

Which returned 6,717,622 results, which is unexpected because I only have 39,294 users. I'm guessing it's not counting the Users, but counting the Event instances, which seems like incorrect behavior to me. Does anyone have any insights into this problem?

Chase Roberts
  • 9,082
  • 13
  • 73
  • 131
  • The second query looks fine, try adding `.order_by()` before count to remove any default ordering and see if it works then. Default ordering defined in a model's `Meta` can sabotage you in subtle ways. – Endre Both Mar 29 '19 at 09:16
  • I added a `.order_by('user_id')` with the same results. What does work is to add `.distinct('id')`, although the query still takes a very long time. My guess is that I'm not going to get it faster without flattening my database structure. – Chase Roberts Mar 30 '19 at 12:03
  • I assume you already have an [index](https://docs.djangoproject.com/en/2.1/ref/models/fields/#db-index) on `platform`. Using integers instead of strings as suggested by Navid also helps. Finally, raw SQL that doesn't use joins but accesses the `Event` table only should speed up your queries by up to two orders of magnitude (while still not producing instant results for a table of this size). – Endre Both Apr 01 '19 at 08:07
  • 2 orders of magnitude should be perfect. – Chase Roberts Apr 03 '19 at 13:31

2 Answers2

2

You can use annotations instead:

django.db.models import Count

User.objects.all().annotate(events_count=Count('event')).filter(events_count=2)

So it will filter out any user that has two events.

You can also use chained filters:

User.objects.filter(event__platform='android').filter(event__platform='ios')

Which first filter will get all users with android platform and the second one will get the users that also have iOS platform.

Endre Both
  • 5,540
  • 1
  • 26
  • 31
Navid Zarepak
  • 4,148
  • 1
  • 12
  • 26
  • The first query annotates the counts per user. The second works fine, apologies for my earlier comment. It seems identical to the asker's first query, but it is not when not filtering x-to-many relationships like here. – Endre Both Mar 29 '19 at 08:45
  • I shouldn't be picky, but I don't particularly like this answer. Calling `User.objects.filter(event__platform='android')` causes a join and returns >6M results, when my question is why don't I get 39k results. I think the answer is that I need to use `.distinct()`, which seems contradictory to what the docs say. And then chaining it with another `.filter()` is going to cause another join which doesn't exactly return quickly when you have a table with >6M rows. – Chase Roberts Mar 30 '19 at 11:41
  • You're going to filter your results based on another table so ofc you will have joins. it's how you designed your database and this is how a database works. you can get ids for android and ios and use them which is faster than filtering by strings. you can also do the second filter on some distinced queryset which is faster. How to do optimization on these kind of queries is something beyond this question and you can open another question and start discussing that but this is the answer for your question using django orm unless you want to use raw sql which still another topic about sql. – Navid Zarepak Mar 30 '19 at 12:01
0

This is generally an answer for a queryset with two or more conditions related to children objects.

Solution: A simple solution with two subqueries is possible, even without any join:

base_subq = Event.objects.values('user_id').order_by().distinct()
user_qs = User.objects.filter(
    Q(pk__in=base_subq.filter(platform="android")) &
    Q(pk__in=base_subq.filter(platform="ios"))
)

The method .order_by() is important if the model Event has a default ordering (see it in the docs about distinct() method).


Notes:

Verify the only SQL request that will be executed: (Simplified by removing "app_" prefix.)

>>> print(str(user_qs.query))
SELECT user.id, user.email FROM user WHERE (
    user.id IN (SELECT DISTINCT U0.user_id FROM event U0 WHERE U0.platform = 'android')
    AND
    user.id IN (SELECT DISTINCT U0.user_id FROM event U0 WHERE U0.platform = 'ios')
)
  • The function Q() is used because the same condition parameter (pk__in) can not be repeated in the same filter(), but also chained filters could be used instead: .filter(...).filter(...). (The order of filter conditions is not important and it is outweighed by preferences estimated by SQL server optimizer.)
  • The temporary variable base_subq is an "alias" queryset only to don't repeat the same part of expression that is never evaluated individually.
  • One join between User (parent) and Event (child) wouldn't be a problem and a solution with one subquery is also possible, but a join with Event and Event (a join with a repeated children object or with two children objects) should by avoided by a subquery in any case. Two subqueries are nice for readability to demonstrate the symmetry of the two filter conditions.

Another solution with two nested subqueries This non symmetric solution can be faster if we know that one subquery (that we put innermost) has a much more restrictive filter than another necessary subquery with a huge set of results. (example if a number of Android users would be huge)

ios_user_ids = (Event.objects.filter(platform="ios")
                .values('user_id').order_by().distinct())
user_ids = (Event.objects.filter(platform="android", user_id__in=ios_user_ids)
            .values('user_id').order_by().distinct())
user_qs = User.objects.filter(pk__in=user_ids)

Verify how it is compiled to SQL: (simplified again by removing app_ prefix and ".)

>>> print(str(user_qs.query))
SELECT user.id, user.email FROM user 
WHERE user.id IN (
    SELECT DISTINCT V0.user_id FROM event V0
    WHERE V0.platform = 'ios' AND V0.user_id IN (
        SELECT DISTINCT U0.user_id FROM event U0
        WHERE U0.platform = 'android'
    )
)

(These solutions work also in an old Django e.g. 1.8. A special subquery function Subquery() exists since Django 1.11 for more complicated cases, but we didn't need it for this simple question.)

hynekcer
  • 14,942
  • 6
  • 61
  • 99