
I am having a hard time optimizing a query like

SELECT RESULT_ID FROM RESULTS 
WHERE SOURCE = 1 AND GROUP=2 AND SCORE1 BETWEEN 20 AND 100 
ORDER BY SCORE2 LIMIT 450; 

on a 40-million-row InnoDB table. The query may have to sort up to 15 million rows to get the top 450. So far, I have tried:

  1. Defining indexes, but they don't help with the sort because MySQL ignores any index columns after the one used in a range condition (see the sketch after this list). Since we have a bunch of score columns, we can get range conditions on several of them, followed by sorting on a particular score and limiting the result set to the top 450.
  2. Using memory tables, but those don't perform well when sorting such large results.
  3. Sphinx, but I am not sure whether it will help with these kinds of queries.
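
For item 1, a minimal sketch of what I mean (the index name is made up; GROUP is backquoted because it is a reserved word in MySQL):

-- Composite index covering the equality predicates and the range on SCORE1.
ALTER TABLE RESULTS ADD INDEX idx_source_group_score1_score2
    (SOURCE, `GROUP`, SCORE1, SCORE2);

EXPLAIN
SELECT RESULT_ID FROM RESULTS
WHERE SOURCE = 1 AND `GROUP` = 2 AND SCORE1 BETWEEN 20 AND 100
ORDER BY SCORE2 LIMIT 450;

MySQL can use the leading columns for the equality predicates and the range on SCORE1, but it cannot also use SCORE2 for the ORDER BY because it follows the range column, so EXPLAIN typically reports a filesort.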

Also, is there any OLAP cube implementation that can optimize these kinds of queries?

Prince
  • Have you tried `tmp_table_size = really huge value` to have as much sorting as possible done in memory? Have you tried whether partitioning on source and group improves things? – C. Ramseyer Aug 11 '11 at 07:56
  • Is your data set read-only, or is it constantly changing? How often would you need to recompute the top 450? Does it need to be *exact* in real time, at any time? – Savino Sguera Aug 11 '11 at 08:07
  • @Anu: I guess you already tried with an index on `(source, group, score1)` and one on `(score2)`, right? – ypercubeᵀᴹ Aug 11 '11 at 08:26
  • @ramseyer I have played around with sort_buffer_size but that didn't help much. – Prince Aug 11 '11 at 14:43
  • @savinos existing rows are never updated but new rows get added to the result set. I always have to fetch the top 450. – Prince Aug 11 '11 at 14:43
  • Just curious: are you able to run the query? "group" is a reserved word in MySql, at least from version 5. http://dev.mysql.com/doc/mysqld-version-reference/en/mysqld-version-reference-reservedwords-5-0.html – armandino Aug 11 '11 at 23:29

3 Answers


I'd suggest creating a separate table that holds these 450 rows, recalculated every time a row is inserted or updated, and pointing your query at that table instead.

That way your query wouldn't need to browse all the rows every time.
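
A minimal sketch of that idea, assuming the filter (source, group, score range) is fixed; the table and column names are illustrative, and the rebuild could be driven by a trigger, a cron job, or application code:

-- Hypothetical pre-computed "top 450" table.
CREATE TABLE TOP_RESULTS (
  RESULT_ID BIGINT NOT NULL PRIMARY KEY,
  SCORE2    INT    NOT NULL
);

-- Rebuild whenever the underlying data changes
-- (GROUP is backquoted because it is a reserved word):
TRUNCATE TABLE TOP_RESULTS;
INSERT INTO TOP_RESULTS (RESULT_ID, SCORE2)
SELECT RESULT_ID, SCORE2
FROM RESULTS
WHERE SOURCE = 1 AND `GROUP` = 2 AND SCORE1 BETWEEN 20 AND 100
ORDER BY SCORE2
LIMIT 450;

-- Reads then become trivial:
SELECT RESULT_ID FROM TOP_RESULTS ORDER BY SCORE2 LIMIT 450;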

Robin Castlin
  • Too bad MySQL doesn't support materialized views – Phil Aug 11 '11 at 07:55
  • Recomputing everything on every insert sounds a bit heavy to me. – Savino Sguera Aug 11 '11 at 08:03
  • Store the 450th-best score to determine whether you need to update these 450 rows. With as many rows as you have, this shouldn't need to happen often, and if you're presenting a top-score list, users will only load these 450 rows. – Robin Castlin Aug 11 '11 at 13:35
  • @robin The problem is that I have a bunch of score columns and users may want to view the top 450 with range constraints specified on any combination of score columns. So it is difficult to predict how users may query this data. I need to filter and present the top 450 on the fly. – Prince Aug 11 '11 at 14:49
  • I'd imagine doing an INSERT UPDATE on an extra UNIQUE column, which stores the order as 1, 2, 3, etc. This would be achieved by using @rownum values and an INSERT UPDATE on the rows. This query could be run once a minute. Let me know if you need code samples. – Robin Castlin Aug 11 '11 at 14:54

What you are looking for, IMHO, is a way to get the top K items in a (theoretically) infinite stream of items.

I would not try to solve this directly in MySQL, as your input is a stream and not a fixed dataset. Also, given the dataset size, recomputing the top K from scratch on every insert is out of the question.

What I would do is have a compact representation of the top K that you update as new items come in. For each element, take its score, and keep a heap of the top K elements seen so far.

A bit more formally: given a data stream q1, …, qn, add qj to the heap if Score(qj) is greater than the smallest score in the heap; in that case, the element with the smallest score is evicted from the heap.
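
A minimal sketch of that eviction step, expressed as a capped MySQL table rather than an in-memory heap (the table and variable names are made up, and the "only when the table holds more than 450 rows" check is assumed to live in application code):

-- Hypothetical capped table holding the current top 450 by SCORE2.
CREATE TABLE TOP450_SCORE2 (
  RESULT_ID BIGINT NOT NULL PRIMARY KEY,
  SCORE2    INT    NOT NULL,
  KEY idx_score2 (SCORE2)
);

-- When a new result arrives, add it to the capped table...
INSERT INTO TOP450_SCORE2 (RESULT_ID, SCORE2) VALUES (@new_id, @new_score2);

-- ...then, if the table now holds more than 450 rows, evict the lowest score.
DELETE FROM TOP450_SCORE2 ORDER BY SCORE2 ASC LIMIT 1;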

Specific solution

You have several score columns, and a user may ask for the top 450 for any combination of columns, using range queries.

What I would do, conceptually, is:

  • keep the top 450 in a heap for each score column separately, using the streaming approach above
  • at query time, get the items that match the query column-wise
  • aggregate and sort the lists as needed, and cut at 450 (a sketch follows below)
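
As a sketch of the query-time step, reusing the hypothetical TOP450_SCORE2 table from above; it assumes the rows matching the user's filters are still present in that per-column top list:

-- Filter the maintained per-column top list and re-cut at 450
-- (GROUP is backquoted because it is a reserved word).
SELECT t.RESULT_ID
FROM TOP450_SCORE2 AS t
JOIN RESULTS AS r ON r.RESULT_ID = t.RESULT_ID
WHERE r.SOURCE = 1 AND r.`GROUP` = 2 AND r.SCORE1 BETWEEN 20 AND 100
ORDER BY t.SCORE2
LIMIT 450;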

Hope it helps.

Savino Sguera
  • I like your idea but the issue is that the user may want to see the top 450 scores within a specified score range. For example, one user may want to see top results from score range 50 to 100 while another from 100 to 200. So, just maintaining a list of top 450 for each score won't do it. I need an efficient way to store the entire data set and get the top 450 depending on the specified score ranges. – Prince Aug 11 '11 at 23:25
  • could you post the output of "explain analyze" then? – Savino Sguera Aug 12 '11 at 08:36

You can pre-specify common score ranges. For instance, you can create several types of ranges:

                1          2           3           4
RANGE_50  = { 0..50,    51..100,   101..150,   151..200 }
RANGE_100 = { 0..100,   101..200                        }
RANGE_200 = { 0..200                                    }

These range types can be created as columns in your table and must be kept up to date according to the value of score1.
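
A minimal sketch of how such a bucket column could be maintained and indexed (the index name is made up; GROUP is backquoted because it is a reserved word). With equality predicates on SOURCE, GROUP and RANGE_100, the index can also deliver rows in SCORE2 order, which avoids the filesort:

ALTER TABLE RESULTS ADD COLUMN RANGE_100 TINYINT;

-- Backfill / maintain from SCORE1: bucket 1 = 0..100, bucket 2 = 101..200.
UPDATE RESULTS
SET RANGE_100 = CASE
                  WHEN SCORE1 BETWEEN 0   AND 100 THEN 1
                  WHEN SCORE1 BETWEEN 101 AND 200 THEN 2
                END;

-- All leading columns are equality predicates, so SCORE2 can be read in order.
ALTER TABLE RESULTS ADD INDEX idx_source_group_range100_score2
    (SOURCE, `GROUP`, RANGE_100, SCORE2);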

Then you will be able to use queries like this:

SELECT RESULT_ID FROM RESULTS 
WHERE SOURCE = 1 AND GROUP=2 AND RANGE_100 = 2 
ORDER BY SCORE2 LIMIT 450; 
Karolis