
I have a database like this (in reality there are over 30 different sKey values):

+----+------+------+---------------------+
|ID  | sKey | sVal |      timestamp      |
+----+------+------+---------------------+
| 1  | temp |   19 | 2023-07-14 20:32:06 |
| 2  | humi |   60 | 2023-07-14 20:33:06 |
| 3  | temp |   20 | 2023-07-14 20:34:06 |
| 4  | humi |   65 | 2023-07-14 20:35:06 |
| 5  | pres | 1023 | 2023-07-14 20:36:06 |
| 6  | temp |   22 | 2023-07-14 20:37:06 |
| 7  | temp |   21 | 2023-07-14 20:38:06 |
| 8  | pres | 1028 | 2023-07-14 20:39:06 |
| 9  | temp |   20 | 2023-07-14 20:40:06 |
|10  | temp |   19 | 2023-07-14 20:43:06 |  <-time glitch
|11  | pres | 1022 | 2023-07-14 20:44:06 |
|12  | temp |   19 | 2023-07-14 20:45:06 |
|13  | humi |   66 | 2023-07-14 20:46:06 |
|14  | humi |   63 | 2023-07-14 20:47:06 |
|15  | temp |   19 | 2023-07-14 20:48:06 |
|16  | pres | 1029 | 2023-07-14 20:49:06 |
|20  | temp |   19 | 2023-07-14 20:50:06 | <- ID not consecutive (deleted records)
|21  | pres | 1022 | 2023-07-14 21:01:06 |
|22  | temp |   19 | 2023-07-14 21:02:06 |
|23  | pres | 1029 | 2023-07-14 21:03:06 |
+----+------+------+---------------------+

Now I wish to get averages for each sKey (with the value in sVal). I have a working solution, but it's too slow. Currently I run 3 separate queries like this:

SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='temp'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='humi'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='pres'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)

As the database already has over 2,000,000 records, a single query takes around 3 seconds. Since the queries differ only in the WHERE sKey=... condition, I assume that combining them into a single query should be faster. So how can I improve this?

Or maybe my approach is wrong altogether. The desired result is the averaged sVal per sKey. The averaging interval can't be based on record numbers (ID), as some records may be deleted. Even where the IDs are consecutive, a reading may have been missed. So I think only an interval on the timestamp itself can be used. But I'm a novice in SQL and may be missing something.

The averaging interval (3 minutes in the example) may be any value.

eSlavko
  • Please add the DDL (`SHOW CREATE TABLE Test;`) and index stats (`SHOW INDEX FROM Test;`) to your question. How selective is your typical WHERE clause date range (rows in date range / total row count)? Is `(sKey, timestamp)` unique? How frequently do new `sKey` values get added, or are they static? – user1191247 Aug 19 '23 at 11:03

3 Answers


I don't understand why you don't group by sKey; what's wrong with

SELECT AVG(`sVal`), `skey`, `timestamp` FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`

and if MySQL does something different in grouping and allows your construct, shouldn't it be:

SELECT AVG(`sVal`), `skey`, FLOOR(TO_SECONDS(`timestamp`)/180) FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`
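If you also want each interval labelled with an actual time rather than the raw FLOOR() value, here is a minimal sketch of the same grouped query (using MIN(`timestamp`) as the bucket label is just one choice, not something the question requires):

-- Label each 3-minute bucket with its earliest timestamp; ORDER BY is optional.
SELECT `sKey`, MIN(`timestamp`) AS `bucket_start`, AVG(`sVal`) AS `avg_val`
FROM `Test`
WHERE sKey IN ('temp','humi','pres')
  AND `timestamp` BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180), `sKey`
ORDER BY bucket_start, `sKey`;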
Turo

You can use a conditional AVG().

This query will benefit from an index on timestamp; you should also test whether a combined index on sKey and timestamp increases the speed further.

SELECT 
    AVG(IF(sKey = 'temp', `sVal`, NULL)) AS `temp`,
    AVG(IF(sKey = 'humi', `sVal`, NULL)) AS `humi`,
    AVG(IF(sKey = 'pres', `sVal`, NULL)) AS `pres`,
    MIN(`timestamp`) AS `timestamp`
FROM
    `Test`
WHERE
    timestamp BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`) / 180);
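Note that the non-matching rows must yield NULL rather than 0, so that AVG() counts only the rows for that sKey in each bucket. A sketch of the same idea written with CASE, which some find more readable:

-- Equivalent form: a CASE with no ELSE yields NULL, and AVG() ignores NULLs.
SELECT
    AVG(CASE WHEN sKey = 'temp' THEN `sVal` END) AS `temp`,
    AVG(CASE WHEN sKey = 'humi' THEN `sVal` END) AS `humi`,
    AVG(CASE WHEN sKey = 'pres' THEN `sVal` END) AS `pres`,
    MIN(`timestamp`) AS `timestamp`
FROM `Test`
WHERE `timestamp` BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`) / 180);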
nbk

Assuming (timestamp, sKey) is unique, you will probably get a significant improvement by dropping that surrogate PK.

Check that (timestamp, sKey) is unique:

SELECT IF(COUNT(*) = COUNT(DISTINCT timestamp, sKey), 'unique', 'not unique')
FROM Test;

Create a new table and populate it:

CREATE TABLE `Test2` (
  `sKey` varchar(4) NOT NULL,
  `sVal` smallint unsigned NOT NULL,
  `timestamp` datetime NOT NULL,
  PRIMARY KEY (`timestamp`,`sKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO Test2
SELECT sKey, sVal, timestamp
FROM Test;
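If the copy checks out, the tables can then be swapped in a single atomic statement (the `Test_old` name here is arbitrary):

-- Swap names atomically; drop Test_old once the new table has been verified.
RENAME TABLE Test TO Test_old, Test2 TO Test;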

As mentioned in my comment previously, you have not included enough detail in your question to determine what other changes may be worthwhile. For example, if most of your queries involve doing some interval manipulation on your timestamp column, and the number of rows meeting the criteria is significant, changing the timestamp column to a BIGINT storing the Unix timestamp will further improve performance.
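A rough sketch of that change (the `ts` column name and the `INT UNSIGNED` type are illustrative; the point is grouping directly on an integer):

-- Illustrative only: add an integer Unix-timestamp column and group on it,
-- avoiding the per-row TO_SECONDS() conversion.
ALTER TABLE Test2 ADD COLUMN ts INT UNSIGNED NOT NULL;
UPDATE Test2 SET ts = UNIX_TIMESTAMP(`timestamp`);

SELECT sKey, AVG(sVal) AS avg_val, MIN(ts) AS bucket_start
FROM Test2
WHERE ts BETWEEN UNIX_TIMESTAMP('2023-07-14 20:34:06')
             AND UNIX_TIMESTAMP('2023-07-14 20:51:06')
GROUP BY ts DIV 180, sKey;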

Again, if grouping large result sets, using `a DIV b` instead of `FLOOR(a/b)` can have a surprisingly significant impact.

Have a play with these to see what I mean:

SELECT BENCHMARK(1000000, FLOOR(TO_SECONDS('2023-02-01 13:00:00')/180));
SELECT BENCHMARK(1000000, TO_SECONDS('2023-02-01 13:00:00') DIV 180);

SELECT BENCHMARK(1000000, FLOOR(1675256400/180));
SELECT BENCHMARK(1000000, 1675256400 DIV 180);

The differences will be far more significant if your MariaDB server is resource constrained.

user1191247
  • I did try benchmarking and indeed DIV is 3 times faster. About the Unix timestamp: I'd like to avoid the year 2038 problem, as I expect my app will still be in use by then. – eSlavko Aug 25 '23 at 06:41
  • Storing Unix timestamps as `INT UNSIGNED` (max value 4294967295) gives a max datetime value of `2106-02-07 06:28:15`. This is a micro-optimization to remove the need for conversion to INT for your interval grouping, but would only provide worthwhile benefit if the set being grouped is large. Dropping the surrogate PK will likely have a much greater impact on performance, as you remove the meaningless INT ID, and the secondary index becomes the clustered index. – user1191247 Aug 25 '23 at 09:42