1

I have a problem when IN clause contains too many values. Consider this query

EXPLAIN
SELECT DISTINCT t.entry_id , t.sticky , wd.field_id_104 , t.title
FROM exp_channel_titles AS t
LEFT JOIN exp_channels ON t.channel_id = exp_channels.channel_id
LEFT JOIN exp_channel_data AS wd ON t.entry_id = wd.entry_id
LEFT JOIN exp_members AS m ON m.member_id = t.author_id
INNER JOIN exp_category_posts ON t.entry_id = exp_category_posts.entry_id
INNER JOIN exp_categories ON exp_category_posts.cat_id = exp_categories.cat_id
WHERE t.entry_id !=''
  AND t.site_id IN ('1')
  AND t.entry_date < 1610109517
  AND (t.expiration_date = 0 OR t.expiration_date > 1610109517)
  AND t.entry_id  IN ('0','649','650','651','652','653','654','655')
;

if there are few values output is following, which is ok optimization ok

but if IN ('0','649','650','651','652','653','654','655', thousand values) query run about 1 minute and explain change to this

failed

how to fix that?

UPDATE: range_optimizer_max_mem_size had already set to 0 and isn't issue enter image description here

Andy Lester
  • 91,102
  • 13
  • 100
  • 152
Oleksandr IY
  • 2,783
  • 6
  • 32
  • 53

2 Answers2

3

We have had similar problems at my company when someone runs a query with a very long list of values in an IN (...) predicate.

We found that MySQL enforces a limit on memory available to the range optimizer. If the list of values is too long, it exceeds the memory limit, and the optimizer cannot finish its analysis to see if it should use the index. So it gives up and says, "forget it! it's a table-scan for you."

We fix it by setting the MySQL Server configuration value range_optimizer_max_mem_size=0 which means there is no limit to the memory that the range optimizer can use.

This creates a risk that if someone were to run a query with a million values in the IN (...) list, it could use a lot of memory, maybe enough to kill the MySQL Server. But so far the tradeoff is preferable, to allow the optimizer to choose the index.

See documentation:


Re your comment:

Another common reason for the optimizer to choose to do a table-scan is that it calculates that your conditions match a large enough portion of the table that it's more expensive to use the index than to simply run a table-scan and examine every row.

The threshold for this isn't documented, and it depends on the implementation of the cost-based optimizer, so it might change from version to version. But my observation is that usually if your conditions match more than 20% of the table, the optimizer chooses the table-scan.

You could use an index hint to tell the optimizer to treat a table-scan as infinitely expensive, so the index is preferred to a table-scan.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • I already did that before posting this thread, that fix didn't help me – Oleksandr IY Jan 09 '21 at 18:14
  • Reply to update: I added `FROM exp_channel_titles AS t USE INDEX (entry_date,expiration_date,site_id)` but seems haven't solved the problem, only the difference explain works faster and no PRIMARY at possible_keys – Oleksandr IY Jan 09 '21 at 19:45
1

Explode-implode. This is a classic problem of an inefficient way to write a query.

  1. JOIN several tables
  2. Filter
  3. Collapse the results -- usually by GROUP BY or LIMIT, but DISTINCT has the same effect.

So... Turn the query inside out.

  1. Find the ids of the desired rows in t

  2. JOIN that to the rest of the tables.

  3. Presumably the DISTINCT will not be needed at all.

     SELECT  t2.entry_id, t2.sticky, wd.field_id_104, t2.title
         FROM ( SELECT id
              FROM  exp_channel_titles
              WHERE entry_id !=''
               AND site_id IN ('1')
               AND entry_date < 1610109517
               AND (expiration_date = 0 OR expiration_date > 1610109517)
               AND entry_id  IN ('0','649','650','651','652','653','654','655')
              ) AS t
         JOIN exp_channel_titles AS t2  USING(id)
         LEFT JOIN  exp_channels  ON t2.channel_id = exp_channels.channel_id
         LEFT JOIN  exp_channel_data AS wd  ON t2.entry_id = wd.entry_id
     ;
    

Another reformulation

Since there is only one use for md, this might be better:

SELECT  entry_id,
        sticky,
        ( SELECT wd.field_id_104
            FROM exp_channels  ON t2.channel_id = exp_channels.channel_id
            LEFT JOIN  exp_channel_data AS wd  ON t.entry_id = wd.entry_id
        ) AS field_id_104,
        title
    FROM  exp_channel_titles
    WHERE entry_id !=''
      AND site_id IN ('1')
      AND entry_date < 1610109517
      AND (expiration_date = 0 OR expiration_date > 1610109517)
      AND entry_id  IN ('0','649','650','651','652','653','654','655')
;

and have a 5-column index starting with site_id, entry_date

Other...

AND (t.expiration_date = 0 OR t.expiration_date > 1610109517)

OR is not sargeable. Can you redesign the table to avoid this OR?

Without the above reformulation, this may help:

INDEX(site_id, entry_date)

Also, get rid of these, since they seem to be totally useless:

LEFT JOIN exp_channels ON t.channel_id = exp_channels.channel_id
LEFT JOIN  exp_members AS m  ON m.member_id = t.author_id

And these may be useless:

    INNER JOIN  exp_category_posts  ON t.entry_id = exp_category_posts.entry_id
    INNER JOIN  exp_categories  ON exp_category_posts.cat_id = exp_categories.cat_id
Rick James
  • 135,179
  • 13
  • 127
  • 222