django taggit similar_objects very slow query

Question

I am trying to select the 3 most recently published items, with any tags similar to the current item (and some other filters too) Can't find an efficient way to do it, there are a lot of 'item' in the DB.

from taggit_autosuggest.managers import TaggableManager

class Item(models.Model):
    publish_date = DateField()
    tags = TaggableManager()
    sites = ManyToManyField(Site)

def my_view():
    ...
    current_item = #get current item
    related_items = Item.active_objects.filter(
        sites=current_site,
        id__in=[x.id for x in current_item.tags.similar_objects()]
        ).order_by('-publish_date')[:3]
    ...

But this causes quite big performance issues, from the similar_objects() method. exponentially worse the more tags the current_item has

# Query_time: 20.613503  Lock_time: 0.000182 Rows_sent: 83  Rows_examined: 7566504
SELECT `taggit_taggeditem`.`content_type_id`, `taggit_taggeditem`.`object_id`, COUNT(`taggit_taggeditem`.`id`) AS `n` FROM `taggit_taggeditem` WHERE (NOT (`taggit_taggeditem`.`object_id` = 205636  AND `taggit_taggeditem`.`content_type_id`
 = 11 ) AND (`taggit_taggeditem`.`tag_id`) IN (SELECT DISTINCT `taggit_tag`.`id` FROM `taggit_tag` INNER JOIN `taggit_taggeditem` ON ( `taggit_tag`.`id` = `taggit_taggeditem`.`tag_id` ) WHERE (`taggit_taggeditem`.`object_id` = 205636  AND
 `taggit_taggeditem`.`content_type_id` = 11 ))) GROUP BY `taggit_taggeditem`.`content_type_id`, `taggit_taggeditem`.`object_id` ORDER BY `n` DESC;

I've also tried not using the similar objects method

related_items = Item.active_objects.filter(
    sites=current_site,
    tags__in=current_item.tags.all()).exclude(slug=slug).order_by('-publish_date').distinct()[:3]
    context['tagged'] = tags.order_by('-publish_date').distinct()[:3]

which was consistently worse (some queries up to 120s, yuck)

What is the 'nice' way to do this?!

score 9 · Accepted Answer · answered Apr 01 '15 at 20:31

My hypothesis was that getting the tags, and using the tag-> item relation would be more efficient than searching all items. So we build a queryset of all TaggedItem, get the ID of all the objects, then perform our filter.

from taggit.models import TaggedItem
related_items = TaggedItem.objects.none()
for tag in current_item.tags.all():
    #build queryset of all TaggedItems
    related_items |= tag.taggit_taggeditem_items.all()

#TaggedItem doesn't have a direct link to the object, have to grab ids
ids = related_items.values_list('object_id', flat=True)
return Item.objects.filter(id__in=ids, sites=current_site).exclude(id=item.id).order_by('-publish_date')[:3]

score 0 · Answer 2 · answered Apr 01 '15 at 02:06

0

I think you could do it like this:

related_items = current_item.tags.similar_objects().filter(
    sites=current_site,
).order_by('-publish_date')[:3]

Though I think you'd have to include the logic that's behind active_objects again in that filter.

answered Apr 01 '15 at 02:06

schillingt

13,493
2
32
34

Yes, but the whole point is to avoid the similar_objects() method. Because it is the cause of the query above, which is exponentially bad the more tags the current story has – straykiwi Apr 01 '15 at 03:36
That query you gave has the ```similar_objects()``` call being used along with ```__in```. If ```similar_objects()``` returns a lot of objects that ```in``` is going to perform terribly. What I've suggested is to use joins rather than an ```in```. Give it a try and see how it works out. – schillingt Apr 01 '15 at 11:28
http://django-taggit.readthedocs.org/en/latest/api.html#TaggableManager.similar_objects The reason is because similar_objects returns a list not a queryset. If you look at the implementation, its pretty bad. I've come up with a solution that performs much much better, its not a work of art though so I won't accept it in case something better comes up – straykiwi Apr 01 '15 at 20:24

score 0 · Answer 3 · answered Mar 28 '18 at 19:34

I'm using the below (I want % overlap, hence the hocus-pocus with job_tags_len).

job = Job.objects.get(id=job_id)
job_tags = list(job.tags.names())
job_tags_len = 100 / len(job_tags)
filtered = Worker.objects.filter(tags__name__in=job_tags).annotate(overlap=Count('id') * job_tags_len).order_by('-overlap').distinct()

django taggit similar_objects very slow query

3 Answers3