13

My query has a join, and it looks like it's using two indexes which makes it more complicated. I'm not sure if I can improve on this, but I thought I'd ask.

The query produces a list of records with similar keywords the record being queried.

Here's my query.

SELECT match_keywords.padid,
       COUNT(match_keywords.word) AS matching_words
FROM   keywords current_program_keywords
       INNER JOIN keywords match_keywords
         ON match_keywords.word = current_program_keywords.word
WHERE  match_keywords.word IS NOT NULL
       AND current_program_keywords.padid = 25695
GROUP  BY match_keywords.padid
ORDER  BY matching_words DESC
LIMIT  0, 11  

The EXPLAIN alt text

Word is varchar(40).

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Jules
  • 7,568
  • 14
  • 102
  • 186
  • Clarification required: you say similar but in your query you use `ON match_keywords.word = current_program_keywords.word` means both words are the same. Now, if record #25695.word = `aaa`, the query will return `n` rows that have word = `aaa`... the GROUP BY and ORDER BY does not make any sense. Is `n` the only variable you're interested in? – Salman A Jan 20 '11 at 08:16
  • Similar, year as it matches as many words as possible and provides a count of the most matches, sorting on that gives the most similar records. – Jules Jan 20 '11 at 08:18

3 Answers3

9

You can start by trying to remove the IS NOT NULL test, which is implicitly removed by COUNT on the field. It also looks like you would want to omit 25695 from match_keywords, otherwise 25695 (or other) would surely show up as the "best" match within your 11 row limit?

SELECT     match_keywords.padid,
           COUNT(match_keywords.word) AS matching_words
FROM       keywords current_program_keywords
INNER JOIN keywords match_keywords
        ON match_keywords.word = current_program_keywords.word
WHERE      current_program_keywords.padid = 25695
GROUP BY   match_keywords.padid
ORDER BY   matching_words DESC
LIMIT      0, 11

Next, consider how you would do it as a person.

  • You would to start with a padid (25695) and retrieve all the words for that padid
  • From those list of words, go back into the table again and for each matching word, get their padid's (assumed to have no duplicate on padid + word)
  • group the padid's together and count them
  • order the counts and return the highest 11

With your list of 3 separate single-column indexes, the first two steps (both involve only 2 columns) will always have to jump from index back to data to get the other column. Covering indexes may help here - create two composite indexes to test

create index ix_keyword_pw on keyword(padid, word);
create index ix_keyword_wp on keyword(word, padid);

With these composite indexes in place, you can remove the single-column indexes on padid and word since they are covered by these two.

Note: You always have to temper SELECT performance against

  • size of indexes (the more you create the more to store)
  • insert/update performance (the more indexes, the longer it takes to commit since it has to update the data, then update all indexes)
RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
  • The way you would do it as a person is not necessarily the way the optimizer will choose; I definitely remember SQL Server processing a simple query of mines, exactly the opposite way of the naive logic. – pascal Jan 27 '11 at 16:52
  • But to go further in that direction we would need a better understanding of the number of padid, word, word/padid... If I read the EXPLAIN properly, for 25695, we get 8 words, and only 5 lines matching these 8 words... which is probably not the average case. – pascal Jan 27 '11 at 16:54
  • 1
    You have an error in reasoning. `remove the IS NOT NULL test, which is implicitly removed by COUNT` is not correct, the IS NOT NULL is redundant due to inner join on `match_keywords.word = current_program_keywords.word` – David Mårtensson Feb 07 '11 at 16:14
5

Try the following... ensure index on PadID, and one on WORD. Then, by changing the order of the SELECT WHERE qualifier should optimize on the PADID of the CURRENT keyword first, then join to the others... Exclude a join to itself. Also, since you were checking on equality on the inner join to matching keywords... if the current keyword is checked for null, it should never join to a null value, thus eliminating a compare on the MATCH keywords alias as looking at every comparison as looking for NULL...

SELECT STRAIGHT_JOIN
      match_keywords.padid,
      COUNT(*) AS matching_words 
   FROM
      keywords current_program_keywords
         INNER JOIN keywords match_keywords          
            ON match_keywords.word = current_program_keywords.word 
            and match_keywords.padid <> 25695
   WHERE  
          current_program_keywords.padid = 25695
      AND current_program_keywords.word IS NOT NULL
   GROUP BY 
      match_keywords.padid 
   ORDER BY 
      matching_words DESC 
   LIMIT
      0, 11 
DRapp
  • 47,638
  • 12
  • 72
  • 142
1

You should index the following fields (check to what table corresponds)

match_keyword.padid

current_program_keywords.padid

match_keyword.words

current_program_keywords.words

Hope it helps accelerate

mariana soffer
  • 1,853
  • 12
  • 17