
I have a medium-large table (around 350,000 entries and growing). I need to fetch the last entry for each (dev_id, var_id) pair from the table. I can fetch them, but the query takes around 20 seconds, which is unacceptable for my purposes.

I'm running the following query on a MariaDB server:

select d.dev_id, d.var_id, d.ts, d.value
from data_table d
where d.ts > NOW() - INTERVAL 2 DAY
  and d.ts = (SELECT MAX(ts) FROM data_table
              WHERE dev_id = d.dev_id
                AND var_id = d.var_id)
ORDER BY d.dev_id

The table structure looks like this:

id  | dev_id | frame_number | var_id | value | ts
1   |    2   |      1       |    2   | 65.5  | 2019-10-10 19:56:05
2   |    3   |      5       |    4   | 23    | 2019-10-10 20:56:06
3   |    2   |      1       |    2   | 65.5  | 2019-10-10 20:59:30
.   |    .   |      .       |    .   |   .   |    .
.   |    .   |      .       |    .   |   .   |    .
.   |    .   |      .       |    .   |   .   |    .
300k|    5   |      100     |    7   | -15.23| 2020-10-10 20:59:30

I need a faster response for this kind of query, but I don't have enough experience to spot the bottleneck.

EDIT 1: I can't omit the ORDER BY, but the improvement from omitting it is small (20 seconds versus 18.5 seconds).

EDIT 2: data_table schema (screenshot)

EDIT 3 (the fix): added a composite index on (dev_id, var_id, ts). The query now takes only 0.6 seconds.
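The composite index described in EDIT 3 can be created like this (a sketch; the index name is illustrative):

```sql
-- Composite index covering the correlated subquery's lookup
-- (equality on dev_id and var_id, then MAX over ts)
ALTER TABLE data_table
    ADD INDEX idx_dev_var_ts (dev_id, var_id, ts);
```

With this index, the inner `SELECT MAX(ts)` for each (dev_id, var_id) becomes a single index seek to the end of that key range instead of a table scan.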

Mukyuu
manespgav

2 Answers


For this query:

select d.dev_id, d.var_id, d.ts, d.value
from data_table d 
where d.ts > NOW() - INTERVAL 2 DAY and
      ts = (SELECT MAX(d2.ts)
            FROM data_table d2
            WHERE d2.dev_id = d.dev_id AND d2.var_id = d.var_id
           )
ORDER BY d.dev_id;

I would recommend two indexes:

  • data_table(ts, dev_id, var_id, value)
  • data_table(dev_id, var_id, ts)

The first is a covering index for the outer query. The second is a covering index for the inner query.
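The two indexes above could be created as follows (a sketch; the index names are illustrative):

```sql
-- Covering index for the outer query: range scan on ts,
-- with dev_id, var_id, and value available from the index itself
CREATE INDEX idx_ts_dev_var_value ON data_table (ts, dev_id, var_id, value);

-- Covering index for the correlated subquery:
-- equality on (dev_id, var_id), then MAX(ts) read from the index tail
CREATE INDEX idx_dev_var_ts ON data_table (dev_id, var_id, ts);
```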

This may help -- assuming you have lots of duplicates. If not, the outer order by may be the cause of the performance issues. That would be a little harder to fix.

Gordon Linoff

I tend to blame correlated subqueries in these instances, though the indexes already suggested will help in both scenarios:

SELECT d.dev_id, d.var_id, d.ts, d.value 
FROM (
   SELECT dev_id, var_id, MAX(ts) AS ts
   FROM data_table
   WHERE ts >  NOW() - INTERVAL 2 DAY
   GROUP BY dev_id, var_id
) AS lastTS
INNER JOIN data_table AS d 
    ON lastTS.dev_id = d.dev_id AND lastTS.var_id = d.var_id AND lastTS.ts = d.ts
    -- or, alternatively, USING (dev_id, var_id, ts)
ORDER BY d.dev_id
;

Your current correlated subquery is being executed ("under the hood") separately for each row from the last two days; possibly even repeatedly for multiple recent rows with the same (dev_id, var_id). The version I am suggesting calculates the max ts for each (dev_id, var_id) seen in the last 2 days once, and then joins those results back to the table to find the full records.

If the outer query has few intermediate results, a more focused correlated subquery can be faster than a non-correlated one over a large amount of data; but if there are many intermediate results, and/or the correlation would not significantly reduce the cost compared to the non-correlated version, I find the non-correlated versions work better.
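As a side note, the same greatest-row-per-group result can be expressed with a window function, assuming MariaDB 10.2 or later (a sketch, not from the original answer):

```sql
-- Rank rows within each (dev_id, var_id) group by descending ts,
-- then keep only the newest row per group
SELECT dev_id, var_id, ts, value
FROM (
    SELECT dev_id, var_id, ts, value,
           ROW_NUMBER() OVER (PARTITION BY dev_id, var_id
                              ORDER BY ts DESC) AS rn
    FROM data_table
    WHERE ts > NOW() - INTERVAL 2 DAY
) ranked
WHERE rn = 1
ORDER BY dev_id;
```

This avoids the self-join entirely, though whether it beats the derived-table join depends on the optimizer and available indexes.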

Uueerdo