
The query is really simple, i.e.

SELECT 
    col1 , date_col 
FROM table USE INDEX (device_date_col) 
WHERE 
    device_id = "some_value"
    AND date_col BETWEEN "2020-03-16 00:00:00" and "2020-04-16 00:00:00" 
limit 1000000 ;

but it takes 30 to 60 seconds to return the result when it runs the first time, and after that it returns the result in under 10 seconds. Another problem is that when I change the device_id it takes a long time again. I cannot understand why this happens despite proper indexing.

We know that API Gateway has a 30-second limit, and because of it our API hits a timeout. This started happening suddenly today.

The main goal is to retrieve minutely data; that returns fewer rows but still takes a long time, i.e.

 ....
 AND col1 IS NOT NULL
 GROUP BY 
    DATE(date_col),
    HOUR(date_col), 
    MINUTE(date_col)

Below is some useful info:

  1. AWS RDS instance db.m4.large (2 vCPUs, 8 GB RAM).
  2. MySQL version 5.6.x
  3. composite index device_date_col on (device_id, date_col)
  4. using InnoDB
  5. the table has no id field (primary key)
  6. total rows in the table are 7.5 million
  7. each device has data every 3 seconds
  8. the query returns around 600k rows
  9. EXPLAIN shows the query is using the index (see the sketch after this list)
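Regarding item 9, a minimal way to check the chosen index is to prefix the query with EXPLAIN; the table and value names here are the question's placeholders:

    -- The key and rows columns of the output show the index actually
    -- chosen and the estimated number of rows examined
    EXPLAIN
    SELECT col1, date_col
    FROM `table` USE INDEX (device_date_col)
    WHERE device_id = "some_value"
      AND date_col BETWEEN "2020-03-16 00:00:00" AND "2020-04-16 00:00:00";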

UPDATE

MySQL Workbench shows that when I run the query without GROUP BY it takes 2 seconds to execute but > 30 seconds to retrieve, and when I use GROUP BY the server takes > 30 seconds to execute but 2 seconds to retrieve. I think we need more:

  • CPU, for processing the data when using GROUP BY
  • RAM, for fetching all the data without GROUP BY (see the check below)
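Before concluding that more RAM is needed, one quick check is to compare the InnoDB buffer pool size against the size of the table plus its indexes. A minimal sketch using standard MySQL status queries (no schema assumptions beyond the current database):

    -- Current buffer pool size in bytes
    SELECT @@innodb_buffer_pool_size;

    -- Approximate data + index size per table, in MB, for the current schema
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
    FROM information_schema.tables
    WHERE table_schema = DATABASE();

If the working set is much larger than the pool, a cold run must read from disk, which matches the slow-first-run pattern described above.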

The image below shows the query response without GROUP BY. Look at the Duration / Fetch time.

[Screenshot: MySQL Workbench output showing Duration / Fetch time]

danish
  • Have you tried not using the index? You are on the edge of it being useful in your case, see e.g. [here](https://stackoverflow.com/a/57347355). Try to remove `USE INDEX (device_date_col)`, or, most promising, include col1 in your index, e.g. add an index `(device_id, date_col, col1)`. That the 2nd run is faster than the first run is due to the data being in RAM, and not having to be read from disk. You cannot do much about it apart from increasing the buffer pool (which probably requires you to get more RAM), or maybe making sure all data is in memory (by using it before). – Solarflare Apr 16 '20 at 23:20
  • @Solarflare I've updated the question; can you please take a look and give feedback on my assumptions? – danish Apr 17 '20 at 06:52
  • We have to make sure that the first query does not time out, i.e. it must finish in under 30 seconds. – danish Apr 17 '20 at 06:54
  • SQL_NO_CACHE is not working as expected. Every time I run the query the time is less than the previous run, making me believe it is still showing a cached result. I also tried RESET QUERY CACHE; and FLUSH QUERY CACHE; – danish Apr 17 '20 at 06:58
  • Do you run the workbench on the same machine as your database? Otherwise the fetch time could be meaningless (or rather: a different place to optimize). So test or clarify this first. Also: have you tested the proposed index? For a test, run your query with `SELECT date_col FROM`... (so without the `col1`), to give you an idea how fast the `(device_id, date_col, col1)` index might be. – Solarflare Apr 17 '20 at 08:26
  • I also don't see a clear *"Everytime I run the query and the time is always lesser than the previous"*, e.g. for 2 and 4. While it could mean that the buffer pool is too small, it can also depend on other things that are going on in the db. You could test with longer breaks, e.g. 10min/1h between repetitions, but in general, I wouldn't focus on this (unless you can freely add more ram). Also, the `group by`-query is a *completely* different query. Focus on one step at a time. Do the proposed tests: a) run workbench on the db server (or verify) b) run `select date_col from` (without col1). – Solarflare Apr 17 '20 at 08:30
  • @Solarflare, I've created the new index as you suggested, (device_id, date_col, col1). It seems to be working fine now in both cases, with and without GROUP BY. Thanks – danish Apr 17 '20 at 10:09
  • @Solarflare can you please explain why it was not working with the previous index (device_id, date_col)? I was only using device_id and date_col in the WHERE clause, so it should have worked as expected, right? – danish Apr 17 '20 at 10:11
  • @JamesZ it happened by accident; I spent a sleepless night and my laptop was out of battery. I was just trying to update my question and missed that you had edited my post. I'm really sorry, can you please remove the downvote? – danish Apr 17 '20 at 10:16
  • Also, when working with RDS never underestimate IOPS contribution to performance (along with CPU/RAM). Are your IO metrics (R/W) well below your limits (provisioned IOPS or GP _allowance_)? – Tasos P. Apr 17 '20 at 10:41
  • I explained it in the link in my 1st comment. Basically: old index: "read row from index (date&device given). read row from table (to get value for col1). repeat 80k times". new index (or without using col1): "read row from index (all values given). repeat 80k times". If you want a version for your specific situation (although it would look 90% like my other answer) or if the answer isn't clear enough, you would need to clean up your question (or make a new one) to focus on that one query/question (e.g. "why removing col1/adding that index makes it faster", the unclear part, ...) – Solarflare Apr 17 '20 at 11:05
  • @danish Please post text results of A) SHOW CREATE TABLE table; B) current query SELECT .....; C) EXPLAIN SELECT rest of your query; D) SELECT @@version; for analysis. – Wilson Hauck Apr 18 '20 at 13:22
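For reference, the covering-index test Solarflare suggests, i.e. selecting only columns contained in the existing (device_id, date_col) index so no table rows need to be read, would look roughly like this, using the question's placeholder names:

    -- If this is fast, the per-row lookups for col1 are the bottleneck,
    -- and the (device_id, date_col, col1) index should fix it
    SELECT date_col
    FROM `table`
    WHERE device_id = "some_value"
      AND date_col BETWEEN "2020-03-16 00:00:00" AND "2020-04-16 00:00:00";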

1 Answer


(original query)

SELECT  col1 , date_col
    FROM  table USE INDEX (device_date_col)
    WHERE  device_id = "some_value"
      AND  date_col BETWEEN "2020-03-16 00:00:00"
                        AND "2020-04-16 00:00:00"
    limit  1000000 ;

Discussion of INDEX(device_id, date_col, col1)

  1. Start the index with the `=` column(s), namely device_id. This focuses the search somewhat.
  2. Within that, further focus on the date range. So, add date_col to the index. You now have the optimal index for the WHERE.
  3. Tack on all the other columns showing up anywhere in the SELECT, provided there are not too many of them and none are TEXT columns. Now you have a "covering" index, which allows the query to be performed using just the index's BTree, thereby giving a further boost in speed.
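Under the table layout described in the question, creating that covering index would look roughly like this; the table name is the question's placeholder and the index name is hypothetical:

    -- (device_id, date_col, col1) covers both the WHERE clause and the SELECT list
    ALTER TABLE `table`
        ADD INDEX device_date_col1 (device_id, date_col, col1);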

More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Other notes

  • LIMIT without ORDER BY is usually not meaningful -- you are at risk of getting a random set of rows.
  • That BETWEEN includes an extra midnight. I suggest

    AND  date_col >= "2020-03-16"
    AND  date_col  < "2020-03-16" + INTERVAL 1 MONTH
    
  • Remove the USE INDEX -- It may help today, but it could hurt tomorrow, when the data changes or the constants change.

  • LIMIT 1000000 -- This could choke some clients. Do you really need that many rows? Perhaps more processing could be done in the database?
  • Adding on the GROUP BY -- Could there be two values for col1 within some of the minutes? Which value of col1 will you get? Consider MAX(col1), ANY_VALUE(col1), or GROUP_CONCAT(DISTINCT col1).
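Putting these notes together, a revised minutely query might look like the sketch below. MAX(col1) is one deterministic choice (ANY_VALUE requires MySQL 5.7+, and the question is on 5.6), and all names are the question's placeholders:

    SELECT
        MAX(col1)        AS col1,   -- deterministic pick per minute
        DATE(date_col)   AS d,
        HOUR(date_col)   AS h,
        MINUTE(date_col) AS m
    FROM `table`                    -- no USE INDEX hint
    WHERE device_id = "some_value"
      AND date_col >= "2020-03-16"
      AND date_col <  "2020-03-16" + INTERVAL 1 MONTH
      AND col1 IS NOT NULL
    GROUP BY d, h, m;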
Rick James