10

So I have a booking system. Agents (the people and organisations submitting bookings) are only allowed to make booking in the categories we assign them. Many agents can assign to the same categories. It's a simple many-to-many. Here's an idea of what the models look like:

class Category(models.Model):
    pass

class Agent(models.Model):
    categories = models.ManyToManyField('Category')

class Booking(models.Model):
    agent = models.ForeignKey('Agent')
    category = models.ForeignKey('Category')

So when a booking comes in, we dynamically allocate the category based on which are available to the agent. The agent usually doesn't specify.

Can I select Bookings where Booking.category isn't in Booking.agent.categories?

We have just noticed that —by the grace of a silly admin mistake— some agents were allowed submit Bookings to any category. It has left us with thousands of bookings in the wrong place.

I can fix this but the I can only get it to work by nesting lookups:

for agent in Agent.objects.all():
    for booking in Booking.objects.filter(agent=agent):
        if booking.category not in agent.categories.all():
            # go through the automated allocation logic again

This works but it's super-slow. It's a lot of data flying between database and Django. This isn't a one-off either. I want to periodically audit new bookings to make sure they are in the correct place. It doesn't seem impossible that another admin issue will occur so after checking the Agent database, I want to query for Bookings that aren't in their agent's categories.

Again, nested queries will work not but as our datasets grow into millions (and beyond) I'd like to do this more efficiently..

I feel like it should be possible to do this with a F() lookup, something like this:

from django.db.models import F
bad = Booking.objects.exclude(category__in=F('agent__categories'))

But this doesn't work: TypeError: 'Col' object is not iterable

I've also tried .exclude(category=F('agent__categories')) and while it's happier with the syntax there, it doesn't exclude the "correct" bookings.

What's the secret formula for doing this sort of F() query on a M2M?


To help nail down exactly what I'm after I've set up a Github repo with these models (and some data). Please use them to write the query. The current sole answer hits and issue I was seeing on my "real" data too.

git clone https://github.com/oliwarner/djangorelquerytest.git
cd djangorelquerytest
python3 -m venv venv
. ./venv/bin/activate
pip install ipython Django==1.9a1

./manage.py migrate
./manage.py shell

And in the shell, fire in:

from django.db.models import F
from querytest.models import Category, Agent, Booking
Booking.objects.exclude(agent__categories=F('category'))

Is that a bug? Is there a proper way to achieve this?

Community
  • 1
  • 1
Oli
  • 235,628
  • 64
  • 220
  • 299

6 Answers6

5

There is a chance that I might be wrong, but I think doing it in reverse should do the trick:

bad = Booking.objects.exclude(agent__categories=F('category'))

Edit

If above won't work, here is another idea. I've tried similar logic on the setup I have and it seems to work. Try adding an intermediate model for ManyToManyField:

class Category(models.Model):
    pass

class Agent(models.Model):
    categories = models.ManyToManyField('Category', through='AgentCategory')

class AgentCategory(models.Model):
    agent = models.ForeignKey(Agent, related_name='agent_category_set')
    category = models.ForeignKey(Category, related_name='agent_category_set')

class Booking(models.Model):
    agent = models.ForeignKey('Agent')
    category = models.ForeignKey('Category')

Then you can do a query:

bad = Booking.objects.exclude(agent_category_set__category=F('category'))

Of course specifying an intermediate model has it's own implications, but I am sure you can handle them.

lehins
  • 9,642
  • 2
  • 35
  • 49
  • Should it be `bad = Booking.objects.exclude(agent_category_set__contains=F('category'))` – jcfollower Sep 28 '15 at 19:56
  • @jcfollower Nope, that will be an error, since with this query you are comparing to different tables. `agent_category_set` is `AgentCategory`, while `category` is `Category` model. Moreover, it will give you: `TypeError: Related Field got invalid lookup: contains` for this query. – lehins Sep 28 '15 at 20:04
2

Solution 1:

You can find the good bookings using this query

good = Booking.objects.filter(category=F('agent__categories'))

You can check the sql query for this

print Booking.objects.filter(category=F('agent__categories')).query

So you can exclude the good bookings from all bookings. Solution is :

Booking.objects.exclude(id__in=Booking.objects.filter(category=F('agent__categories')).values('id'))

It will create a MySql nested query which is the most optimized MySql query for this problem ( as far as i know ).

This MySql query will be a little heavy as you database is huge but it will hit database only once instead of your first attempt of loops which will hit for bookings * agent_categories times.

Also, you can make the dataset less by using filtering on date if you are storing those and you have approximation when the wrong booking started.

You can use the above command periodically to check for inconsistent bookings. But i would recommend to over ride the admin form and check while booking if category is correct or not. Also you can use some javascript to add only the categories in admin form which are present for selected/logged-in agent at that time.

Solution 2:

use prefetch_related, this will reduce your time drastically because very less database hits.

read about it here : https://docs.djangoproject.com/en/1.8/ref/models/querysets/

for agent in Agent.objects.all().prefetch_related('bookings, categories'):
    for booking in Booking.objects.filter(agent=agent):
        if booking.category not in agent.categories.all():
Sachin Gupta
  • 394
  • 3
  • 12
1

Usually when dealing with m2m relationships I take the hybrid approach. I would break the problem into two parts, a python and sql part. I find this speeds up the query a lot and it doesn't required any complicated query.

The first thing you want to do is get the agent to categories mapping, then use that mapping to determine the category that is not in the assignment.

def get_agent_to_cats():
    # output { agent_id1: [ cat_id1, cat_id2, ], agent_id2: [] }
    result = defaultdict(list)

    # get the relation using the "through" model, it is more efficient
    # this is the Agent.categories mapping
    for rel in Agent.categories.through.objects.all():
        result[rel.agent_id].append(rel.category_id)
    return result


def find_bad_bookings(request):
    agent_to_cats = get_agent_to_cats()

    for (agent_id, cats) in agent_to_cats.items():
        # this will get all the bookings that NOT belong to the agent's category assignments
        bad_bookings = Booking.objects.filter(agent_id=agent_id)
                                         .exclude(category_id__in=cats)

        # at this point you can do whatever you want to the list of bad bookings
        bad_bookings.update(wrong_cat=True)            

    return HttpResponse('Bad Bookings: %s' % Booking.objects.filter(wrong_cat=True).count())

Here is a little stats when I ran the test on my server: 10,000 Agents 500 Categories 2,479,839 Agent to Category Assignments 5,000,000 Bookings

2,509,161 Bad Bookings. Total duration 149 seconds

Du D.
  • 5,062
  • 2
  • 29
  • 34
0

This might speed it up ...

for agent in Agent.objects.iterator():
    agent_categories = agent.categories.all()
    for booking in agent.bookings.iterator():
        if booking.category not in agent_categories:
            # go through the automated allocation logic again
jcfollower
  • 3,103
  • 19
  • 25
0

This may not be what you're looking for, but you can use a raw query. I don't know if it can be done entirely within the ORM, but this works in your github repo:

Booking.objects.raw("SELECT id \
                     FROM querytest_booking as booking \
                     WHERE category_id NOT IN ( \
                         SELECT category_id \
                         FROM querytest_agent_categories as agent_cats \
                         WHERE agent_cats.agent_id = booking.agent_id);")

I assume the table names will be different for you, unless your app is called querytest. But either way, this can be iterated over for you to plug your custom logic into.

ryanmrubin
  • 728
  • 4
  • 11
0

You were almost there. First, let's create two booking elements:

# b1 has a "correct" agent
b1 = Booking.objects.create(agent=Agent.objects.create(), category=Category.objects.create())
b1.agent.categories.add(b1.category)

# b2 has an incorrect agent
b2 = Booking.objects.create(agent=Agent.objects.create(), category=Category.objects.create())

Here is the queryset of all incorrect bookings (i.e: [b2]):

# The following requires a single query because
# the Django ORM is pretty smart
[b.id for b in Booking.objects.exclude(
    id__in=Booking.objects.filter(
        category__in=F('agent__categories')
    )
)]
[2]

Note that in my experience the following query does not produce any error but for some unknown reason the result is not correct either:

Booking.objects.exclude(category__in=F('agent__categories'))
[]
Régis B.
  • 10,092
  • 6
  • 54
  • 90