
I have a database like this (in reality there are over 30 different sKey values):

+----+------+------+---------------------+
|ID  | sKey | sVal |      timestamp      |
+----+------+------+---------------------+
| 1  | temp |   19 | 2023-07-14 20:32:06 |
| 2  | humi |   60 | 2023-07-14 20:33:06 |
| 3  | temp |   20 | 2023-07-14 20:34:06 |
| 4  | humi |   65 | 2023-07-14 20:35:06 |
| 5  | pres | 1023 | 2023-07-14 20:36:06 |
| 6  | temp |   22 | 2023-07-14 20:37:06 |
| 7  | temp |   21 | 2023-07-14 20:38:06 |
| 8  | pres | 1028 | 2023-07-14 20:39:06 |
| 9  | temp |   20 | 2023-07-14 20:40:06 |
|10  | temp |   19 | 2023-07-14 20:43:06 |  <-time glitch
|11  | pres | 1022 | 2023-07-14 20:44:06 |
|12  | temp |   19 | 2023-07-14 20:45:06 |
|13  | humi |   66 | 2023-07-14 20:46:06 |
|14  | humi |   63 | 2023-07-14 20:47:06 |
|15  | temp |   19 | 2023-07-14 20:48:06 |
|16  | pres | 1029 | 2023-07-14 20:49:06 |
|20  | temp |   19 | 2023-07-14 20:50:06 | <- ID not consecutive (deleted records)
|21  | pres | 1022 | 2023-07-14 21:01:06 |
|22  | temp |   19 | 2023-07-14 21:02:06 |
|23  | pres | 1029 | 2023-07-14 21:03:06 |
+----+------+------+---------------------+

Now I wish to get averages for each sKey (with the value in sVal). I have a working solution, but it's too slow. Currently I run 3 separate queries like this:

SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='temp'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='humi'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='pres'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)

As the database already has over 2,000,000 records, a single query takes around 3 seconds. Since the queries differ only in the WHERE sKey=... condition, I assume that combining them into a single query should be faster. So how can I improve this?

Or maybe my approach is wrong altogether. The desired result is the averaged sVal per sKey. The averaging interval can't be based on record numbers (ID), as some records may be deleted. Even where the IDs are consecutive, a reading may have been missed. So I think only an interval on the timestamp itself can be used. But I'm a novice in SQL and may be missing something.

The averaging interval (3 minutes in the example) may be any value.

eSlavko
  • Please add the DDL (`SHOW CREATE TABLE Test;`) and index stats (`SHOW INDEX FROM Test;`) to your question. How selective is your typical WHERE clause date range (rows in date range / total row count)? Is `(sKey, timestamp)` unique? How frequently do new `sKey` values get added, or are they static? – user1191247 Aug 19 '23 at 11:03

3 Answers


I don't understand why you don't group by sKey; what's wrong with

SELECT AVG(`sVal`), `skey`, `timestamp` FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`

and if MySQL does something different in grouping and allows your construct, shouldn't it be:

SELECT AVG(`sVal`), `skey`, FLOOR(TO_SECONDS(`timestamp`)/180) FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`
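If you also want each interval labelled with an actual time rather than the raw FLOOR() value, here is a minimal sketch of the same grouped query (using MIN(`timestamp`) as the bucket label is just one choice, not something the question requires):

-- Label each 3-minute bucket with its earliest timestamp; ORDER BY is optional.
SELECT `sKey`, MIN(`timestamp`) AS `bucket_start`, AVG(`sVal`) AS `avg_val`
FROM `Test`
WHERE sKey IN ('temp','humi','pres')
  AND `timestamp` BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180), `sKey`
ORDER BY bucket_start, `sKey`;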
Turo

You can use a conditional AVG().

This query will benefit from an index on timestamp; you should also test whether a combined index on sKey and timestamp increases the speed further.

SELECT 
    AVG(IF(sKey = 'temp', `sVal`, NULL)) AS `temp`,
    AVG(IF(sKey = 'humi', `sVal`, NULL)) AS `humi`,
    AVG(IF(sKey = 'pres', `sVal`, NULL)) AS `pres`,
    MIN(`timestamp`) AS `timestamp`
FROM
    `Test`
WHERE
    timestamp BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`) / 180);
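Note that the non-matching rows must yield NULL rather than 0, so that AVG() counts only the rows for that sKey in each bucket. A sketch of the same idea written with CASE, which some find more readable:

-- Equivalent form: a CASE with no ELSE yields NULL, and AVG() ignores NULLs.
SELECT
    AVG(CASE WHEN sKey = 'temp' THEN `sVal` END) AS `temp`,
    AVG(CASE WHEN sKey = 'humi' THEN `sVal` END) AS `humi`,
    AVG(CASE WHEN sKey = 'pres' THEN `sVal` END) AS `pres`,
    MIN(`timestamp`) AS `timestamp`
FROM `Test`
WHERE `timestamp` BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`) / 180);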
nbk

Assuming (timestamp, sKey) is unique, you will probably get a significant improvement by dropping that surrogate PK.

Check that (timestamp, sKey) is unique:

SELECT IF(COUNT(*) = COUNT(DISTINCT timestamp, sKey), 'unique', 'not unique')
FROM Test;

Create a new table and populate it:

CREATE TABLE `Test2` (
  `sKey` varchar(4) NOT NULL,
  `sVal` smallint unsigned NOT NULL,
  `timestamp` datetime NOT NULL,
  PRIMARY KEY (`timestamp`,`sKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO Test2
SELECT sKey, sVal, timestamp
FROM Test;
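If the copy checks out, the tables can then be swapped in a single atomic statement (the `Test_old` name here is arbitrary):

-- Swap names atomically; drop Test_old once the new table has been verified.
RENAME TABLE Test TO Test_old, Test2 TO Test;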

As mentioned in my comment previously, you have not included enough detail in your question to determine what other changes may be worthwhile. For example, if most of your queries involve doing some interval manipulation on your timestamp column, and the number of rows meeting the criteria is significant, changing the timestamp column to a BIGINT storing the Unix timestamp will further improve performance.
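A rough sketch of that change (the `ts` column name and the `INT UNSIGNED` type are illustrative; the point is grouping directly on an integer):

-- Illustrative only: add an integer Unix-timestamp column and group on it,
-- avoiding the per-row TO_SECONDS() conversion.
ALTER TABLE Test2 ADD COLUMN ts INT UNSIGNED NOT NULL;
UPDATE Test2 SET ts = UNIX_TIMESTAMP(`timestamp`);

SELECT sKey, AVG(sVal) AS avg_val, MIN(ts) AS bucket_start
FROM Test2
WHERE ts BETWEEN UNIX_TIMESTAMP('2023-07-14 20:34:06')
             AND UNIX_TIMESTAMP('2023-07-14 20:51:06')
GROUP BY ts DIV 180, sKey;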

Again, if grouping large result sets, using `a DIV b` instead of `FLOOR(a/b)` can have a surprisingly significant impact.

Have a play with these to see what I mean:

SELECT BENCHMARK(1000000, FLOOR(TO_SECONDS('2023-02-01 13:00:00')/180));
SELECT BENCHMARK(1000000, TO_SECONDS('2023-02-01 13:00:00') DIV 180);

SELECT BENCHMARK(1000000, FLOOR(1675256400/180));
SELECT BENCHMARK(1000000, 1675256400 DIV 180);

The differences will be far more significant if your MariaDB server is resource constrained.

user1191247
  • I did try benchmarking and indeed DIV is 3 times faster. About the Unix timestamp: I'd like to avoid the year 2038 problem, as I expect my app will still be in use by then. – eSlavko Aug 25 '23 at 06:41
  • Storing Unix timestamps as `INT UNSIGNED` (max value 4294967295) gives a max datetime value of `2106-02-07 06:28:15`. This is a micro-optimization to remove the need for conversion to INT for your interval grouping, but would only provide worthwhile benefit if the set being grouped is large. Dropping the surrogate PK will likely have a much greater impact on performance, as you remove the meaningless INT ID, and the secondary index becomes the clustered index. – user1191247 Aug 25 '23 at 09:42