1

I started to investigate why some searches in the Django admin where really slow (see here). Digging further I found that MySQL (5.1, InnoDB tables) performance vary a lot from one query to another one similar. For example:

This query (looking for 'c', 'd' and 'e' in 4 fields, 2 related) generated by Django take 89 ms and return 3093 rows:

SELECT DISTINCT `donnees_artiste`.`id`
    FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
    ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
    ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
    ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
    ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
    ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
    ON (T6.`evenement_id` = T7.`id`)

WHERE (
    (`donnees_artiste`.`nom` LIKE '%c%'
  OR `donnees_artiste`.`prenom` LIKE '%c%'
  OR `donnees_evenement`.`cote` LIKE '%c%'
  OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
  OR `donnees_artiste`.`prenom` LIKE '%d%'
  OR T5.`cote` LIKE '%d%'
  OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%e%'
  OR `donnees_artiste`.`prenom` LIKE '%e%'
  OR T7.`cote` LIKE '%e%'
  OR T7.`titre` LIKE '%e%' )
);

If I replace the 'e' by a 'k' so it's mostly the same query, it take 8720 ms (100x increase) and return 931 rows.

SELECT DISTINCT `donnees_artiste`.`id`
    FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
    ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
    ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
    ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
    ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
    ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
    ON (T6.`evenement_id` = T7.`id`)

WHERE (
    (`donnees_artiste`.`nom` LIKE '%c%'
  OR `donnees_artiste`.`prenom` LIKE '%c%'
  OR `donnees_evenement`.`cote` LIKE '%c%'
  OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
  OR `donnees_artiste`.`prenom` LIKE '%d%'
  OR T5.`cote` LIKE '%d%'
  OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%k%'
  OR `donnees_artiste`.`prenom` LIKE '%k%'
  OR T7.`cote` LIKE '%k%'
  OR T7.`titre` LIKE '%k%' )
);

Both of these query give the same EXPLAIN, so no clue there.

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY     KEY_LEN     REF     ROWS    EXTRA
1   SIMPLE  donnees_artiste     ALL     None    None    None    None    4368    Using temporary; Using filesort
1   SIMPLE  donnees_artiste_evenements  ref     artiste_id,donnees_artiste_evenements_eb99df11  artiste_id  4   mmac.donnees_artiste.id     1   Using index; Distinct
1   SIMPLE  donnees_evenement   eq_ref  PRIMARY,donnees_evenements_id_index     PRIMARY     4   mmac.donnees_artiste_evenements.evenement_id    1   Using where; Distinct
1   SIMPLE  T4  ref     artiste_id,donnees_artiste_evenements_eb99df11  artiste_id  4   mmac.donnees_artiste.id     1   Using index; Distinct
1   SIMPLE  T5  eq_ref  PRIMARY,donnees_evenements_id_index     PRIMARY     4   mmac.T4.evenement_id    1   Using where; Distinct
1   SIMPLE  T6  ref     artiste_id,donnees_artiste_evenements_eb99df11  artiste_id  4   mmac.donnees_artiste.id     1   Using index; Distinct
1   SIMPLE  T7  eq_ref  PRIMARY,donnees_evenements_id_index     PRIMARY     4   mmac.T6.evenement_id    1   Using where; Distinct

Also if I do a COUNT on the first query it take 11200 ms.

SELECT COUNT(DISTINCT `donnees_artiste`.`id`)
    FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
    ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
    ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
    ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
    ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
    ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
    ON (T6.`evenement_id` = T7.`id`)

WHERE (
    (`donnees_artiste`.`nom` LIKE '%c%'
  OR `donnees_artiste`.`prenom` LIKE '%c%'
  OR `donnees_evenement`.`cote` LIKE '%c%'
  OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
  OR `donnees_artiste`.`prenom` LIKE '%d%'
  OR T5.`cote` LIKE '%d%'
  OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%e%'
  OR `donnees_artiste`.`prenom` LIKE '%e%'
  OR T7.`cote` LIKE '%e%'
  OR T7.`titre` LIKE '%e%' )
);

My innodb_buffer_pool_size is set high. I have indexes on all relevant fields and on primary keys and I already optimized my tables.

So, why the first query is so fast and the 2 others so slow? These 3 queries are just examples. Many time I'm just changing or removing one character from a query and it made big difference on the query time. But I can't see any pattern.

UPDATE

The performance problem definitely come from how Django generate these queries. All these redundant LEFT OUTER JOIN chained together kill the performance. At the moment it's not totally clear to me if it's a bug in the Django SQL generator, a bug in how the query is built for the search field or if all that work as expected by the Django developers. I'm still investigating but there's, at least, one strange thing in the Django behavior...

If I run this query (that is not necessarily equivalent to the second one, but not far) the results come pretty fast (161 ms, no cache):

SELECT DISTINCT `donnees_artiste`.`id`
    FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
    ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
    ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)

WHERE (
    (`donnees_artiste`.`nom` LIKE '%c%'
  OR `donnees_artiste`.`prenom` LIKE '%c%'
  OR `donnees_evenement`.`cote` LIKE '%c%'
  OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
  OR `donnees_artiste`.`prenom` LIKE '%d%'
  OR `donnees_evenement`.`cote` LIKE '%d%'
  OR `donnees_evenement`.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%k%'
  OR `donnees_artiste`.`prenom` LIKE '%k%'
  OR `donnees_evenement`.`cote` LIKE '%k%'
  OR `donnees_evenement`.`titre` LIKE '%k%' )
);

SECOND UPDATE

Finally that's not a bug in Django, I'm pretty sure it's the desired behavior. The idea is, on a multi-terms search, the search of the next term is done on the subset return by the previous term so, for the related fields, all the terms don't have to be in the same row to have a match. For this, the DB have to create temporary table with each subsets and scan it. That explain why there can be a lot a variation because if the first term match only a few rows, the temporary table will be small, and search of subsequent term will be fast (because they will be done on a small table). The difference between the two queries is subtle but the Django query can return more matches in general.

Community
  • 1
  • 1
Etienne
  • 12,440
  • 5
  • 44
  • 50
  • 1
    Fast on 'e', slow on 'k' - it should be faster with letters that occur at the front of words, I imagine it would only scan as far in to the word as need be. Those middle character `LIKE` commands are rarely efficient. – Orbling Mar 12 '12 at 21:00
  • Your're right for the letters that occur at the front of words. See my comment to `newtover`. For those middle character LIKE commands, I don't think there' any alternative if you want to search arbitrary strings in text? – Etienne Mar 12 '12 at 21:32
  • InnoDB really isn't setup for reading text fields like that. If I desperately need to do it, I usually shadow the data on to a MyISAM table and use a [`FULLTEXT` index with `MATCH()`](http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html). Though that's no good for single letters, as it usually stops at 4-letter words. Server-wide setting to change it and makes it inefficient. – Orbling Mar 12 '12 at 22:07
  • If you know in advance what letters need to be scanned, do the workload beforehand, on insert, use flags for the letters you need to check if it's a small set. If it's just letters from the alphabet, you can use a long integer and set bits, then use a bitwise AND [`&`](http://dev.mysql.com/doc/refman/5.1/en/bit-functions.html#operator_bitwise-and) with the character you are looking for. Still requires an operation on each row, but far quicker than a string search. – Orbling Mar 12 '12 at 22:11

3 Answers3

2

I think, the answer is that e in most cases is located at the beginning of the scanned strings and in the first searched string, allowing to short cirquit the OR conditions, while matches for k happen in the last conditions and somewhere in the end of the string. And since there are significantly less rows with k, more strings should be full scanned without any matches.

newtover
  • 31,286
  • 11
  • 84
  • 89
  • You're right. One of my 4 fields almost start by 'E'. So, basically it means that how far MySQL have to search, the longest it will take. So, if the query return no result, this query will be definitely long, because it will have to scan everything. – Etienne Mar 12 '12 at 21:21
2

If you use LIKE pattern with leading wildcard, your query will not benefit index. Using LIKE in this manner could be highly inefficient and its execution time can vary quite a lot. WHY?

  1. The algorithm behind LIKE statement stops searching a row in case it comes across a match.
  2. In this scenario (using no index) MySQL applies some other boosting algorithms that may or may not be applicable in some cases.

Why using COUNT in your third query slows it so much?

I see you are using innoDB.

innoDB doesnt read number of rows from stored/cached value like MyISAM does (if column is NOT NULL) cause innoDB is more optimized for 'writing' than 'reading' (opposing to MyISAM). Using COUNT on innoDB table perform either full table scan or full index scan every time you do it.

Your queries do not use any index and that is may be the worst case so full table scan happens (and Yes it is as slow as it sounds).

I thought you might be interested in : MySQL Indexes

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Boris D. Teoharov
  • 2,319
  • 4
  • 30
  • 49
1

Conditions of the type:

WHERE column LIKE '%c%'

cannot use an index on column. So, these columns have to be fully scanned.

You have multiple such conditions, you're using OR between them (which assures that all these tables will be scanned). And last, you are (rather: Django is) adding DISTINCT which probably requires a final filesort before returning the results.

I can't find an explanantion for the vast difference in performance (100x). Perhaps the first query was cached. Can you try adding ORDER NY NULL in the end of the queries and time them?

The generated query is also not very well designed because it will probably end in a mini-Cartesian-Join. You are joining a base table to multiple tables are in 1-to-many relationship with the base table. That's a reason for poor performance and the query plan would help clarify that.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • See `newtover` answer for the vast difference explanation. I understand and agree that the generated query is not well design but what do you mean by 'mini-Cartesian-Join'? And by 'the query plan'? – Etienne Mar 12 '12 at 21:38
  • Cartesian join means that if for an artist there are (on average) 10 `donnees_evenement` rows, then the query (without the WHERE conditions) would return `10x10x10xNumberOfArtists = 1000xNumberOfArtists` rows. The query planner may actually generate such a plan that generates a temp table with so many rows and then checks the WHERE condistions (the complicated OR-AND conditions and DINSTINCT may not allow it to generate a better plan). – ypercubeᵀᴹ Mar 12 '12 at 21:43
  • I added the `EXPLAIN` of the 2 first queries. – Etienne Mar 12 '12 at 21:46