| id | market_id | date       | keyword            | sku       | a   | b   | c   |
|----|-----------|------------|--------------------|-----------|-----|-----|-----|
| 1  | 1         | 2019-01-01 | some text for this | QAB-XU-VV | 3.1 | 2.4 | 3.5 |
| 2  | 2         | 2019-01-02 | some text for text | ABC-XA-VV | 2.1 | 4.1 | 1.2 |

This is a sample of table `A`.

`id` : `int primary key auto increment`
`market_id` : `int(4)`  `INDEX`
`date` : `date`   `INDEX`
`keyword`: `varchar(191)` `utf8_general_ci`
`sku`: `varchar(191)` INDEX `utf8_general_ci`
`a, b, c` : `float` 

I need to run a query like this:

SELECT
  sku,
  keyword,
  market_id,
  SUM(a),
  SUM(b),
  SUM(c)
FROM A
WHERE market_id = 2 AND date BETWEEN '2020-01-01' and '2020-02-02'
GROUP BY sku, keyword;

This query is expected to return about a hundred thousand records, and it takes over 5 minutes. I tried `LIMIT` and `OFFSET`, but there was no improvement.

Please help me. Thank you.

UPDATED:

This shows the `EXPLAIN` result (attached as a screenshot).


2 Answers


For this query:

SELECT sku, keyword, market_id,
       SUM(a), SUM(b), SUM(c)
FROM A
WHERE market_id = 2 AND
      date BETWEEN '2020-01-01' and '2020-02-02'
GROUP BY sku, keyword, market_id;

I would recommend a composite index on A(market_id, date, sku, keyword, a, b, c).

This is a covering index, so the original data pages do not need to be read. You can also use `A(market_id, date)`.
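A sketch of how those recommendations translate to DDL (the index names are my own, not from the answer):

```sql
-- Covering index: every column the query touches, in WHERE-then-GROUP BY
-- order, so InnoDB can answer the query from the index alone.
ALTER TABLE A ADD INDEX idx_cover (market_id, date, sku, keyword, a, b, c);

-- Narrower alternative: covers only the WHERE clause; the table rows are
-- still read for sku, keyword, a, b, c.
ALTER TABLE A ADD INDEX idx_market_date (market_id, date);
```

Either index replaces the single-column index on `market_id`, since a lookup on `market_id` alone can use the leftmost prefix of the composite index.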

Gordon Linoff
  • Thank you for the quick help. I will try soon but please guide me composite indexing is faster than individual indexing? so I need to remove original indexes and then create composite index again on the original table? – Alex Feb 05 '20 at 12:12
  • @AlexandrBiship . . . You can have as many indexes on the table as you need. However, I would recommend dropping the index on `market_id` and replacing it with `(market_id, date)`. – Gordon Linoff Feb 05 '20 at 12:18
  • So If I use `A(market_id, date)`, should I keep individual indexes for `sku` and `keyword` as well or drop all indexes and just use `A(market_id, date)`? – Alex Feb 05 '20 at 13:28
  • I've just tested with `A(market_id, date, sku, keyword, a, b, c)`. much improvement. count of records was about 140,000 and took time 90 seconds. `A(market_id, date) `was also similar. Any suggestions? Thank you. – Alex Feb 05 '20 at 13:35
  • This query result gets about 300,000 records so I need to use LIMIT and OFFSET. but there is no effect even if I use this. please guide me about this as well. – Alex Feb 05 '20 at 15:27
  • @AlexandrBiship . . . The `sku` and `keyword` indexes would not be used for this query, but they might be useful for other queries. If the aggregation is *resulting in* 140,000 rows, then you have a lot of data going in. MySQL has to sort a bunch of data for the `group by`. – Gordon Linoff Feb 05 '20 at 18:16
  • I see. The updated query takes time 1 min 30 seconds and `LIMIT` is no effect. Is there another way to improve query time? I just used `A(market_id, date)`. – Alex Feb 05 '20 at 18:20
  • @AlexandrBiship . . . I don't think so. You are returning a lot of data that needs to be aggregated. – Gordon Linoff Feb 05 '20 at 18:28
  • I just updated my question. Even If I use `LIMIT` to restrict the count of records, query time is same as getting results without `LIMIT`. – Alex Feb 05 '20 at 18:31
  • Btw, aggregated result can have `300,000` of records – Alex Feb 05 '20 at 18:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207302/discussion-between-alexandr-biship-and-gordon-linoff). – Alex Feb 05 '20 at 19:10

Gordon explained most things.

Note that GROUP BY and/or ORDER BY usually cause a break in the flow, thereby preventing most of the benefit of LIMIT. Well, actually there is a small benefit in LIMIT, but it is only the transmission effort and time for delivering a different-sized resultset.

As for improving further? As things stand with the current query and Gordon's "covering" index, no.

But... if this is time-series data, you could compute daily subtotals for the `SUM`s. Fetching the grand total for 33 days from those subtotals can then be made much faster, perhaps 10x faster.

Tentatively the summary table would have

PRIMARY KEY(market_id, date, sku, keyword)

and other columns `sum_a`, `sum_b`, `sum_c`; you would add new rows to the table each night.
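A sketch of that summary table and its nightly load (the table name `A_daily` and the exact column types are my assumptions):

```sql
CREATE TABLE A_daily (
  market_id INT          NOT NULL,
  date      DATE         NOT NULL,
  sku       VARCHAR(191) NOT NULL,
  keyword   VARCHAR(191) NOT NULL,
  sum_a FLOAT,
  sum_b FLOAT,
  sum_c FLOAT,
  PRIMARY KEY (market_id, date, sku, keyword)
);

-- Run each night to roll up the previous day's rows:
INSERT INTO A_daily
SELECT market_id, date, sku, keyword, SUM(a), SUM(b), SUM(c)
FROM A
WHERE date = CURDATE() - INTERVAL 1 DAY
GROUP BY market_id, date, sku, keyword;
```

The 33-day report then becomes `SUM(sum_a)` etc. over `A_daily`, which scans one pre-aggregated row per (sku, keyword, day) instead of every raw row in `A`.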

I discuss "Summary tables" in http://mysql.rjweb.org/doc.php/summarytables

For indexing tips, see http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Rick James
  • Thank you @Rick James. But I need to aggregate data according to the date range that the client will choose on the client side. In other words, the date range is not constant. – Alex Feb 06 '20 at 05:28
  • @AlexandrBiship - But if it is always some multiple of days or hours (or ...) then that would be the unit for summarizing. – Rick James Feb 06 '20 at 05:54
  • Yes. you're right. I already have base data on another table and this table(`A`) is a summary table that is already aggregated per day. So what I just need to do is how to improve speed in this current table `A`. – Alex Feb 06 '20 at 06:03
  • I used `A(sku, keyword)` indexes for `group by` columns and query time is pretty fast(`EXPLAIN` shows this is `INDEX` scanning) but if I use high `OFFSET` along with `LIMIT`(since result can be hundreds of thousand records. so pagination is needed), then query time is getting slower. – Alex Feb 06 '20 at 06:06
  • @AlexandrBiship - How many rows are returned? Is the "pagination" for a web page? What about jumping to a range of SKUs instead of paginating through the entire output? (In other words, let's look at other approaches to the speed issue.) – Rick James Feb 06 '20 at 15:29
  • Thank you @Rick James. 1. About 300,000 records. 2.Yes. it's for pagination. 3. SKU is string type and it's not comparable. Please let me know if you have any suggestions. – Alex Feb 06 '20 at 15:44
  • @AlexandrBiship - How does the end-user find the row(s) he wants? It could take a looooong time to page through 300K rows. If the user has a SKU in mind, he should be able to type all or part of it and go directly to the row(s). Please describe the "user experience" so I can focus in on what part to work on. – Rick James Feb 06 '20 at 16:11
  • The end-user wants to see all results at a time without SKU or keyword input but I think I need to go with your suggestion. @Rick James – Alex Feb 06 '20 at 16:40
  • @AlexandrBiship - All 300K results?? I have trouble imagining that. Maybe he could "download" 300K, not "display" 300K?? Let's see your `LIMIT` and `OFFSET`. – Rick James Feb 06 '20 at 16:43
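The slow high-`OFFSET` pagination described in these comments is the classic case for keyset ("seek") pagination, which Rick hints at with "jumping to a range of SKUs": the client remembers the last grouping key it received and resumes after it, instead of making MySQL recompute and discard all earlier groups. A sketch, not from either answer; the placeholder values stand in for the last row of the previous page:

```sql
-- MySQL compares row constructors lexicographically, so this skips
-- directly past everything up to and including the previous page.
SELECT sku, keyword, market_id,
       SUM(a), SUM(b), SUM(c)
FROM A
WHERE market_id = 2
  AND date BETWEEN '2020-01-01' AND '2020-02-02'
  AND (sku, keyword) > ('LAST-SEEN-SKU', 'last seen keyword')
GROUP BY sku, keyword, market_id
ORDER BY sku, keyword
LIMIT 100;
```

With an index whose leading columns match the `ORDER BY` (e.g. one starting with `sku, keyword`), each page costs roughly the same regardless of how deep into the result set it is, unlike `OFFSET`, which always re-scans from the beginning.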