
I have a query that takes roughly four minutes to run on a high-powered SSD server with no other notable processes running. I'd like to make it faster if possible.

The database stores a match history for a popular video game called Dota 2. In this game, ten players (five on each team) each select a "hero" and battle it out.

The intention of my query is to create a list of past matches along with how much "XP dependence" each team had, based on the heroes used. With 200,000 matches (and a 2,000,000-row matches-to-heroes relationship table) the query takes about four minutes. With 1,000,000 matches, it takes roughly 15 minutes.

I have full control of the server, so any configuration suggestions are also appreciated. Thanks for any help guys. Here are the details...

CREATE TABLE matches (
    match_id BIGINT UNSIGNED NOT NULL,
    start_time INT UNSIGNED NOT NULL,
    skill_level TINYINT NOT NULL DEFAULT -1,
    winning_team TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (match_id),
    KEY start_time (start_time),
    KEY skill_level (skill_level),
    KEY winning_team (winning_team));

CREATE TABLE heroes (
    hero_id SMALLINT UNSIGNED NOT NULL,
    name CHAR(40) NOT NULL DEFAULT '',
    faction TINYINT NOT NULL DEFAULT -1,
    primary_attribute TINYINT NOT NULL DEFAULT -1,
    group_index TINYINT NOT NULL DEFAULT -1,
    match_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
    win_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
    xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
    team_xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
    xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
    team_xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
    gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
    team_gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
    gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
    team_gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
    included TINYINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (hero_id));

CREATE TABLE matches_heroes (
    match_id BIGINT UNSIGNED NOT NULL,
    player_id INT UNSIGNED NOT NULL,
    hero_id SMALLINT UNSIGNED NOT NULL,
    xp_per_min SMALLINT UNSIGNED NOT NULL,
    gold_per_min SMALLINT UNSIGNED NOT NULL,
    position TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (match_id, hero_id),
    KEY match_id (match_id),
    KEY player_id (player_id),
    KEY hero_id (hero_id),
    KEY xp_per_min (xp_per_min),
    KEY gold_per_min (gold_per_min),
    KEY position (position));

Query

SELECT
    matches.match_id,
    SUM(CASE     
        WHEN position < 5 THEN xp_from_wins / team_xp_from_wins     
        ELSE 0    
    END) AS radiant_xp_dependence,
    SUM(CASE     
        WHEN position >= 5 THEN xp_from_wins / team_xp_from_wins     
        ELSE 0    
    END) AS dire_xp_dependence,
    winning_team   
FROM
    matches   
INNER JOIN
    matches_heroes     
        ON matches.match_id = matches_heroes.match_id   
INNER JOIN
    heroes     
        ON matches_heroes.hero_id = heroes.hero_id   
GROUP BY
    matches.match_id

Sample Results

match_id   | radiant_xp_dependence | dire_xp_dependence | winning_team

2298874871 | 1.0164                | 0.9689             | 1
2298884079 | 0.9932                | 1.0390             | 0
2298885606 | 0.9877                | 1.0015             | 1

EXPLAIN

id | select_type | table          | type   | possible_keys            | key     | key_len | ref                            | rows | Extra

1  | SIMPLE      | heroes         | ALL    | PRIMARY                  | NULL    | NULL    | NULL                           | 111  | Using temporary; Using filesort
1  | SIMPLE      | matches_heroes | ref    | PRIMARY,match_id,hero_id | hero_id | 2       | dota_2.heroes.hero_id          | 3213 |
1  | SIMPLE      | matches        | eq_ref | PRIMARY                  | PRIMARY | 8       | dota_2.matches_heroes.match_id | 1    |

Machine Specs

  • Intel Xeon E5
  • E5-1630v3 4/8t
  • 3.7 / 3.8 GHz
  • 64 GB of RAM
  • DDR4 ECC 2133 MHz
  • 2 x 480GB of SSD SOFT

Database

  • MariaDB 10.0
  • InnoDB
DaiBu
  • 529
  • 3
  • 17
  • 1
    I'm not an SQL expert (by a long stretch) and your queries look OK to me (again, underlining the fact that I'm no expert), so I'd go look here first of all : https://mariadb.com/kb/en/mariadb/configuring-mariadb-for-optimal-performance/ to make sure your server is configured properly. Might be an easy fix.. if not, it at least eliminates mis-configuration as a possible cause. – FreudianSlip Apr 17 '16 at 07:40
  • Add an index on xp_from_wins – Mihai Apr 17 '16 at 07:54
  • 1
    Can position be anything other than a number greater than, equal to, or less than 5? – Strawberry Apr 17 '16 at 08:13
  • @Mihai - I have already tried adding indexes on all the unindexed fields to make sure it wasn't that. The query took just as long. Thanks though. – DaiBu Apr 17 '16 at 08:54
  • 1
    @Strawberry - The position column indicates what team they're on and in what order. 0-4 is team A, 5-9 is team B. – DaiBu Apr 17 '16 at 08:56
  • There is little you can do, since you don't have a WHERE clause; you just select all the data. Maybe create a new column to store `xp_from_wins / team_xp_from_wins` and update it on insert with a trigger, avoiding calculations on the fly – Mihai Apr 17 '16 at 08:56
  • @Mihai - Ah, that's an idea. And is it possible that there would be better performance if I did it in chunks (WHERE start_time > 123 AND start_time < 456)? – DaiBu Apr 17 '16 at 08:59
  • Probably, but it depends on the selectivity - the percentage of rows returned compared to the total number of rows. Indexes are most useful when you need a few rows – Mihai Apr 17 '16 at 09:02
  • @Mihai - Ok, well I'll try it and see. And I always forget about triggers. How large can you go on them? There are a lot of database updates I do in PHP after the match INSERTs, but can I just add tons of SQL code as triggers or is that bad practice? It would make the process atomic also, right? – DaiBu Apr 17 '16 at 09:06
  • So the second CASE is redundant – Strawberry Apr 17 '16 at 09:11
  • 1
    As to pre-calculated `xp_from_wins / team_xp_from_wins`: I don't think that calculation takes a lot of time. It's reading all that data physically from the drive and then having to join all records with a huge intermediate result which has to be grouped and aggregated, that takes time. – Thorsten Kettner Apr 17 '16 at 09:18
  • 1
    @Strawberry - Are you sure? How else can I write it? The second one is in a separate SUM statement. I don't know how to say "everything not in the previous CASE". – DaiBu Apr 17 '16 at 09:31
  • @FreudianSlip - I checked out that page. There's a couple things on there I'm going to look into further. Thanks for the link. – DaiBu Apr 17 '16 at 09:52
  • Surely ELSE = "everything not in the previous CASE" – Strawberry Apr 17 '16 at 16:29
  • @Strawberry - I understand it to mean "everything not in this CASE." Can you rewrite the statement in the way that you're suggesting please? It must be a simple change but I'm afraid I'm not following you. – DaiBu Apr 18 '16 at 03:18
  • Nevermind. I mis-read the problem – Strawberry Apr 18 '16 at 06:23

4 Answers


In all likelihood, the main performance driver is the GROUP BY. Sometimes, in MySQL, it can be faster to use correlated subqueries. So, try writing the query like this:

SELECT m.match_id,
       (SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
        FROM matches_heroes mh INNER JOIN
             heroes h   
             ON mh.hero_id = h.hero_id
        WHERE m.match_id = mh.match_id AND mh.position < 5
       ) AS radiant_xp_dependence,
       (SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
        FROM matches_heroes mh INNER JOIN
             heroes h   
             ON mh.hero_id = h.hero_id
        WHERE m.match_id = mh.match_id AND mh.position >= 5
       ) AS dire_xp_dependence,
       m.winning_team   
FROM matches m;

Then, you want indexes on:

  • matches_heroes(match_id, position)
  • heroes(hero_id, xp_from_wins, team_xp_from_wins)

For completeness, you might want this index as well:

  • matches(match_id, winning_team)

This would be more important if you added ORDER BY match_id to the query.
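Written out as DDL, the suggested indexes could look like the following; the index names are illustrative, not prescribed by the answer:

```sql
-- Lets each correlated subquery locate a match's rows and filter
-- on position directly from the index.
CREATE INDEX idx_mh_match_position ON matches_heroes (match_id, position);

-- Covers the join to heroes plus the two columns used in the SUM.
CREATE INDEX idx_heroes_xp ON heroes (hero_id, xp_from_wins, team_xp_from_wins);

-- Optional: covers the outer query; more useful with ORDER BY match_id.
CREATE INDEX idx_matches_team ON matches (match_id, winning_team);
```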

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Ok, sure. I'm going to sleep soon, but I'll give it a run tomorrow and see if it helps. Thanks man. – DaiBu Apr 17 '16 at 12:56
  • Wow! I had already cut the time down a ton by increasing the InnoDB buffer pool size. This cut the remaining time in half, from two minutes to only one. Thanks a lot! – DaiBu Apr 18 '16 at 21:51
  • What's your final result with the query change and buffer change implemented - how much speed up is attributable to each? Would be interesting to know! – Jim Jul 14 '16 at 03:55

As has already been mentioned in a comment, there is little you can do, because you select all the data from the tables. The query looks perfect.

The one idea that comes to mind are covering indexes. With indexes containing all data needed for the query, the tables themselves don't have to be accessed anymore.

CREATE INDEX matches_quick ON matches(match_id, winning_team);

CREATE INDEX heroes_quick ON heroes(hero_id, xp_from_wins, team_xp_from_wins);

CREATE INDEX matches_heroes_quick ON matches_heroes (match_id, hero_id, position);

There is no guarantee that this will speed up your query, as you are still reading all the data, so running through the indexes may be just as much work as reading the tables. But there is a chance that the joins will be faster and there would probably be fewer physical reads. Just give it a try.
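One way to check whether the covering indexes actually get used is to run EXPLAIN after creating them and look for "Using index" in the Extra column, which means that table was served from the index alone. A simplified projection of the original query (just for reading the plan) would be:

```sql
-- "Using index" in Extra = InnoDB answered from the secondary index
-- without touching the clustered table rows.
EXPLAIN
SELECT matches.match_id, winning_team, position,
       xp_from_wins, team_xp_from_wins
FROM matches
JOIN matches_heroes ON matches.match_id = matches_heroes.match_id
JOIN heroes ON matches_heroes.hero_id = heroes.hero_id;
```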

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • Ok, I'll try that. I remember you helped me with one of my other queries and we decided it couldn't be improved either. Heh. Thanks man. – DaiBu Apr 17 '16 at 09:27
  • I'll give some other people time to answer before I accept yours, in case someone thinks of something brilliant. – DaiBu Apr 17 '16 at 09:34
  • One more thing and I hope it's an easy answer. I need to start tracking the six "items" that the players used. Creating a "matches_heroes_items" relationship table would mean 60 entries for every match. I may have a billion matches in my database someday, so is it a good idea to make a table that could potentially have 60 billion rows? Or should I just add them as columns (item_0, item_1, etc.) in the matches_heroes table? – DaiBu Apr 17 '16 at 09:45
  • Hi. My answer only covers the SQL. I'd hope, too, that someone sheds some light on possible database configuration optimizations. And yes, maybe someone even comes up with another idea for the query. As to the six items: go for the table. Otherwise simple queries like how many players used an axe or which item was used most often become cumbersome. Don't worry about billions of entries. You are usually only interested in few of them and the records would quickly be found via indexes. – Thorsten Kettner Apr 17 '16 at 10:05
  • Hi, is there any advantage if we consider the order in which tables are joined? Get done with smaller tables in initial steps and then going for the bigger tables? – I_am_Batman Apr 17 '16 at 10:12
  • Ok, sure. I guess we think of numbers like billions as being really, really large and unwieldy. I guess to a database, it's only a few more search steps down the tree. Thanks again. – DaiBu Apr 17 '16 at 10:17
  • 1
    @I_am_Batman YOu can JOIN in any order,mysql will reqrite the JOIN with the table having the smallest number of rows being first – Mihai Apr 17 '16 at 15:56
  • Thorsten, you might find it interesting that your covering indexes didn't improve anything but Gordon's restructured query along with the covering indexes he provided (a little different than yours), did. Just alerting you, in case you wanted to learn from it. – DaiBu Apr 18 '16 at 22:00
  • @DaiBu: Wow, yes, thank you, this is interesting information. – Thorsten Kettner Apr 19 '16 at 08:47
  • And also the answer I posted. It turns out that having the whole table loaded into RAM makes queries on it incredibly faster (who would've thought >_<). – DaiBu Apr 19 '16 at 10:44
  • Yes. It's a great idea you posted the settings so others can compare theirs with yours. – Thorsten Kettner Apr 19 '16 at 12:38

Waiting for another idea? :-)

Well, there is always the data warehouse approach. If you must run this query again and again and always for all matches ever played, then why not store the query results and access them later?

I suppose that matches played won't be altered, so you could access all results you computed, say, last week and only retrieve the additional results from the games since then from your real tables.

Create a table archived_results. Add a flag archived in your matches table. Then add query results to the archived_results table and set the flag to TRUE for these matches. When having to perform your query, you'd either update the archived_results table anew and only show its contents then or you'd combine archive and current:

select match_id, radiant_xp_dependence, dire_xp_dependence, winning_team
from archived_results
union all
SELECT
    matches.match_id,
    SUM(CASE     
        WHEN position < 5 THEN xp_from_wins / team_xp_from_wins     
        ELSE 0    
    END) AS radiant_xp_dependence,
...
WHERE matches.archived = FALSE
GROUP BY matches.match_id;
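A minimal sketch of the structures this answer describes; the column types are assumptions based on the sample results, not part of the answer:

```sql
-- Holds one pre-computed result row per archived match.
CREATE TABLE archived_results (
    match_id BIGINT UNSIGNED NOT NULL,
    radiant_xp_dependence DECIMAL(10,4) NOT NULL,
    dire_xp_dependence DECIMAL(10,4) NOT NULL,
    winning_team TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (match_id));

-- Flag so the live query can skip matches already archived.
ALTER TABLE matches ADD COLUMN archived TINYINT(1) NOT NULL DEFAULT 0;
```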
Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • I'm not great at SQL so I don't fully understand your query without looking UNIONs up, but I don't think storing any data is beneficial in this situation. heroes.xp_from_wins and heroes.team_xp_from_wins (the input values being used in the query) is calculated from the matches. When new matches are added, these values change. I do store the results in a file, which is then downloaded to an app. You can check it out if you're bored and have an iOS device. It's called "Inspectre" on the App Store (it's free, and no ads or anything). – DaiBu Apr 17 '16 at 12:52
  • You group by matches_id. So you get one result row per matches_id. Are the values for matches_id 1 ever being changed? I thought match 1 was played some time in the past and its values would be static now, i.e. whenever you execute the query you get the same result row for matches_id 1. Hence the idea to store this line in an archive table. You wouldn't have to compute the row for matches_id 1 anymore; it would be ready in your archives table. Only for new matches not yet in the archives table would you scan your tables, which would be rather quick. UNION ALL simply combines two result sets. – Thorsten Kettner Apr 17 '16 at 14:02
  • Yeah, let me try to explain better. The xp_from_wins and team_xp_from_wins in the heroes table represents each hero's XP sum from all matches in the database. When I add new matches or delete old ones, these values are updated. Because of this, the radiant_xp_dependence and dire_xp_dependence values change also, reflecting changes in the game's meta. In English, the query is saying, "Based on each hero's standard share of a team's XP given the current match set, how XP dependent is each team in this match?" The match set is being updated by about 10 matches per second. – DaiBu Apr 17 '16 at 15:57
  • If you're wondering what the purpose is, I use the results to determine what effect XP dependence has on win rate. It's just one of the many factors involved in predicting a team's win rate. I wasn't thinking when I told you to download the app. There's a website also, if you wanna see the end result. So much easier. http://4xmulticast.com/inspectre. – DaiBu Apr 17 '16 at 16:06

People's comments about loading whole tables into memory got me thinking. I searched for "MySQL memory allocation" and learned how to change the buffer pool size for InnoDB tables. The default is much smaller than my database, so I ramped it up to 8 GB using the innodb_buffer_pool_size directive in my.cnf. The query time dropped drastically, from 1308 seconds to only 114.
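To pick a sensible innodb_buffer_pool_size, one option is to estimate how much data plus index the pool would need to hold (information_schema figures are approximate for InnoDB, but good enough for sizing):

```sql
-- Rough total of InnoDB data + index size, in GB.
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2)
       AS approx_innodb_gb
FROM information_schema.tables
WHERE engine = 'InnoDB';
```

With 64 GB of RAM, a pool comfortably larger than this figure keeps the whole working set in memory.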

After researching more settings, my my.cnf file now looks like the following (no further speed improvements, but it should be better in other situations).

[mysqld]
bind-address=127.0.0.1
character-set-server=utf8
collation-server=utf8_general_ci
innodb_buffer_pool_size=8G
innodb_buffer_pool_dump_at_shutdown=1
innodb_buffer_pool_load_at_startup=1
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size=8M
innodb_log_file_size=64M
innodb_read_io_threads=64
innodb_write_io_threads=64

Thanks everyone for taking the time to help out. This will be a massive improvement to my website.

DaiBu
  • 529
  • 3
  • 17
  • After I posted this, I tried Gordon's suggestion. It cut the remaining time in half and directly answered the request to help me optimize the query, so I gave him the checkmark. – DaiBu Apr 18 '16 at 21:57