
I have a medium-large table (around 350,000 entries and growing). I need to fetch the last entry for each (dev_id, var_id) pair from the table. I can fetch them, but the query takes around 20 seconds, which is unacceptable for my purposes.

I'm running the following query on a MariaDB server:

select d.dev_id, d.var_id, d.ts, d.value
from data_table d
where d.ts > NOW() - INTERVAL 2 DAY
  and d.ts = (SELECT MAX(ts) FROM data_table
              WHERE dev_id = d.dev_id
                AND var_id = d.var_id)
ORDER BY d.dev_id

The table structure looks like this:

id  | dev_id | frame_number | var_id | value | ts
1   |    2   |      1       |    2   | 65.5  | 2019-10-10 19:56:05
2   |    3   |      5       |    4   | 23    | 2019-10-10 20:56:06
3   |    2   |      1       |    2   | 65.5  | 2019-10-10 20:59:30
.   |    .   |      .       |    .   |   .   |    .
.   |    .   |      .       |    .   |   .   |    .
.   |    .   |      .       |    .   |   .   |    .
300k|    5   |      100     |    7   | -15.23| 2020-10-10 20:59:30

I need a faster response for this kind of query, but I don't have enough experience to spot the bottleneck.

EDIT 1: I can't omit the ORDER BY, but the improvement from omitting it is small (20 seconds versus 18.5 seconds).

EDIT 2: data_table schema (screenshot)

EDIT 3 (the fix): added a composite index on (dev_id, var_id, ts). The query now takes only 0.6 seconds.
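The composite index described in EDIT 3 can be created like this (a sketch; the index name is illustrative):

```sql
-- Composite index covering the correlated subquery's lookup
-- (equality on dev_id and var_id, then MAX over ts)
ALTER TABLE data_table
    ADD INDEX idx_dev_var_ts (dev_id, var_id, ts);
```

With this index, the inner `SELECT MAX(ts)` for each (dev_id, var_id) becomes a single index seek to the end of that key range instead of a table scan.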

Mukyuu
manespgav

2 Answers


For this query:

select d.dev_id, d.var_id, d.ts, d.value
from data_table d 
where d.ts > NOW() - INTERVAL 2 DAY and
      ts = (SELECT MAX(d2.ts)
            FROM data_table d2
            WHERE d2.dev_id = d.dev_id AND d2.var_id = d.var_id
           )
ORDER BY d.dev_id;

I would recommend two indexes:

  • data_table(ts, dev_id, var_id, value)
  • data_table(dev_id, var_id, ts)

The first is a covering index for the outer query. The second is a covering index for the inner query.
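The two indexes above could be created as follows (a sketch; the index names are illustrative):

```sql
-- Covering index for the outer query: range scan on ts,
-- with dev_id, var_id, and value available from the index itself
CREATE INDEX idx_ts_dev_var_value ON data_table (ts, dev_id, var_id, value);

-- Covering index for the correlated subquery:
-- equality on (dev_id, var_id), then MAX(ts) read from the index tail
CREATE INDEX idx_dev_var_ts ON data_table (dev_id, var_id, ts);
```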

This may help -- assuming you have lots of duplicates. If not, the outer order by may be the cause of the performance issues. That would be a little harder to fix.

Gordon Linoff

I tend to blame correlated subqueries in these instances, though the indexes already suggested will help in both scenarios:

SELECT d.dev_id, d.var_id, d.ts, d.value 
FROM (
   SELECT dev_id, var_id, MAX(ts) AS ts
   FROM data_table
   WHERE ts >  NOW() - INTERVAL 2 DAY
   GROUP BY dev_id, var_id
) AS lastTS
INNER JOIN data_table AS d 
    ON lastTS.dev_id = d.dev_id AND lastTS.var_id = d.var_id AND lastTS.ts = d.ts
    -- or, alternatively, USING (dev_id, var_id, ts)
ORDER BY d.dev_id
;

Your current correlated subquery is being executed ("under the hood") separately for each row from the last two days; possibly even repeatedly for multiple recent rows with the same (dev_id, var_id). The version I am suggesting calculates the max ts for each (dev_id, var_id) seen in the last 2 days once, and then joins those results back to the table to find the full records.

If the outer query has few intermediate results, a more focused correlated subquery can be faster than a non-correlated one over a large amount of data; but if there are many intermediate results, and/or the correlation would not significantly reduce the cost compared to the non-correlated version, I find the non-correlated versions work better.
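As a side note, the same greatest-row-per-group result can be expressed with a window function, assuming MariaDB 10.2 or later (a sketch, not from the original answer):

```sql
-- Rank rows within each (dev_id, var_id) group by descending ts,
-- then keep only the newest row per group
SELECT dev_id, var_id, ts, value
FROM (
    SELECT dev_id, var_id, ts, value,
           ROW_NUMBER() OVER (PARTITION BY dev_id, var_id
                              ORDER BY ts DESC) AS rn
    FROM data_table
    WHERE ts > NOW() - INTERVAL 2 DAY
) ranked
WHERE rn = 1
ORDER BY dev_id;
```

This avoids the self-join entirely, though whether it beats the derived-table join depends on the optimizer and available indexes.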

Uueerdo