
OK, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to pick, from each subgroup, the records with the highest date_updated and calculate a weighted average within each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.

First, the layout of the monstrous table:

  • category
  • element_id
  • date_updated
  • value
  • weight
  • source_prefix
  • source_name

The only key here is on element_id (BTREE, ~8k unique elements).
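For reference, roughly what that layout looks like as DDL (the column types here are guesses, not the real schema; only the element_id index actually exists):

CREATE TABLE `bigbigtable` (
    `category`      VARCHAR(64),   -- assumed type
    `element_id`    INT,           -- assumed type, ~8k distinct values
    `date_updated`  DATETIME,      -- assumed type
    `value`         DOUBLE,        -- assumed type
    `weight`        DOUBLE,        -- assumed type
    `source_prefix` VARCHAR(64),   -- assumed type
    `source_name`   VARCHAR(255),  -- assumed type
    INDEX (`element_id`)           -- the only key on the table
);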

And the calculation process:

Make a hash for each group and subgroup:

CREATE TEMPORARY TABLE `temp1` (INDEX ( `subcat_hash` ))
    SELECT `category`,
           `element_id`,
           `source_prefix`,
           `source_name`,
           `date_updated`,
           `value`,
           `weight`,
           MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
           MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
    FROM `bigbigtable`
    WHERE `date_updated` <= '2009-04-28';

I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.

Find the maximum date for each subgroup:

CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
    SELECT MAX(`date_updated`) AS `maxdate`, `subcat_hash`
    FROM `temp1`
    GROUP BY `subcat_hash`;

Join temp1 with temp2 to find the weighted average values per category:

CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
    SELECT `temp1`.`element_id`,
           `temp1`.`category`,
           `temp1`.`source_prefix`,
           `temp1`.`source_name`,
           `temp1`.`date_updated`,
           AVG(`temp1`.`value`) AS `avg_value`,
           SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
    FROM `temp1`
    LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
    WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
      AND `temp1`.`date_updated` = `temp2`.`maxdate`
    GROUP BY `temp1`.`cat_hash`;

(Now that I've looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid a 900k*900k temp table.)
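Rewritten with an INNER JOIN, that last step would look something like this (untested, same columns as above):

CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
    SELECT `temp1`.`element_id`,
           `temp1`.`category`,
           `temp1`.`source_prefix`,
           `temp1`.`source_name`,
           `temp1`.`date_updated`,
           AVG(`temp1`.`value`) AS `avg_value`,
           SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`temp1`.`weight`) AS `rating`
    FROM `temp1`
    INNER JOIN `temp2`
        ON `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
       AND `temp2`.`maxdate` = `temp1`.`date_updated`
    GROUP BY `temp1`.`cat_hash`;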

Still, is there a normal way to do so?

UPD: some picture for reference:

removed dead ImageShack link

UPD: EXPLAIN for the proposed solution:

+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key        | key_len | ref                                                                                  | rows   | filtered | Extra                                        |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
|  1 | SIMPLE      | cur   | ALL  | NULL          | NULL       | NULL    | NULL                                                                                 | 893085 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | next  | ref  | prefix        | prefix     | 1074    | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id |      1 |   100.00 | Using where                                  |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+    
Kuroki Kaze

2 Answers


Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join; this certainly doesn't look like one of those cases, with a 900k-row table and some aggregates.

Based on your comment, this query might do what you are looking for:

SELECT cur.source_prefix, 
       cur.source_name, 
       cur.category, 
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated, 
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
    ON next.date_updated < '2009-05-01'
    AND next.source_prefix = cur.source_prefix 
    AND next.source_name = cur.source_name
    AND next.element_id = cur.element_id
    AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name, 
    cur.category, cur.element_id

The GROUP BY performs the calculations per source+category+element.

The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE clause keeps only the rows for which no later entry exists (next.category IS NULL). A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
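A sketch of that index, assuming the table really is called eev0 as in the query above (named prefix here to match the EXPLAIN output in the question):

CREATE INDEX `prefix` ON `eev0` (`source_prefix`, `source_name`, `element_id`, `date_updated`);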

There are many ways of filtering out old entries, but this one tends to perform reasonably well.

Andomar
  • Okay, I'll try to explain. There are measurements in this table. Each measurement has a source (identified by prefix + name) and a category. Each element can have measurements in all categories, or just in some. What I want to do is find the latest measurement for an element from a source, then calculate the weighted average per element + category. Sorry for my English, btw - not my main language :\ – Kuroki Kaze May 22 '09 at 10:34
  • Post updated. Is the date_updated *exactly* equal for all of the latest measurements? Or are they just on the same day? – Andomar May 22 '09 at 10:49
  • They're just the latest for the same source and element. They may vary. – Kuroki Kaze May 22 '09 at 10:53
  • Edited again so it looks for the latest date_updated per source+element. It then groups on all the categories that have a measurement for that particular date_updated. – Andomar May 22 '09 at 10:56
  • Added a picture to the post :) In the meantime I'll try your query, thanks :) – Kuroki Kaze May 22 '09 at 11:11
  • Updated for the cut-off date... now I'm curious how you'll get it to work :) hehe – Andomar May 22 '09 at 11:19
  • At least I always have the option of leaving the current monstrosity in place and labeling it as "Magic" for future developers :) – Kuroki Kaze May 22 '09 at 11:21
  • The answer includes rows for 2009-04-29 and 2009-04-30 that the original doesn't. :D – Jonathan Leffler May 22 '09 at 11:38
  • Ok, index created in 18 min 52.54 sec. On to the query :) – Kuroki Kaze May 22 '09 at 13:25
  • 66 hours and counting. Too long :( – Kuroki Kaze May 25 '09 at 08:37
  • Maybe it's trying a hash join and it doesn't fit into memory. Try replacing "LEFT JOIN" with "LEFT LOOP JOIN". How much memory does the server have? Maybe post the result of "EXPLAIN ". – Andomar May 25 '09 at 09:03
  • Added EXPLAIN for the solution. It seems to me it can't use indexes for the join. – Kuroki Kaze May 26 '09 at 14:12
  • Did you try the LOOP join? Can you post EXPLAIN EXTENDED and SHOW WARNINGS, like in this blog post (see the sketch after these comments)? http://www.mysqlperformanceblog.com/2006/07/24/extended-explain/ – Andomar May 26 '09 at 14:16
  • Also, there's no "LOOP JOIN" in MySQL, sorry ) – Kuroki Kaze May 26 '09 at 14:16
  • Added EXPLAIN EXTENDED. Only warning shows my query with code 1003. – Kuroki Kaze May 26 '09 at 14:30
  • I wonder why it's not using the index. I've edited the query so it doesn't contain subqueries; does this allow MySQL to use the index? Maybe copy a subset from eev0 to testeev0 for testing. – Andomar May 26 '09 at 14:34
  • Whoa, EXPLAIN for this looks good (added to post). Waiting for actual query to finish :) – Kuroki Kaze May 26 '09 at 14:47
  • Also, do all the ambiguous column names belong to the `cur` instance of the table? – Kuroki Kaze May 26 '09 at 14:49
  • Prefixed the column names; not sure what MySQL does with ambiguous names, I would expect it to throw an error. – Andomar May 26 '09 at 14:52
  • Now I'll check the resulting values - what do we have here? :) – Kuroki Kaze May 26 '09 at 15:23
  • A query like this should run in seconds, not minutes, so I'm still confused. Don't forget to verify the results for a sample! – Andomar May 26 '09 at 15:28
  • It seems to me that at this point all the data was relevant, so the date filter excluded exactly zero items (this table is filled in by a remote script). So it was an aggregate function over a 180 MB file. You think it should take seconds? – Kuroki Kaze May 26 '09 at 15:38
  • I guess it depends on the amount of memory you have, but if you have 1GB of RAM, then a query over 180 MB should run in seconds. Here I just did an aggregate over a 9 gigabyte table with 22 million rows, and it finished in less than 1 second (not joining on anything.) – Andomar May 26 '09 at 15:43
  • I'm really not into FreeBSD, but I presume the server should have at least a gig. – Kuroki Kaze May 26 '09 at 15:49
  • The second run finished in 27 minutes (the first one printed directly to the screen, the second used "CREATE TABLE"), and the results seem reasonable. Net buffer length is set to ~8k, max_allowed_packet is ~16M. I suppose this limits the memory for the server? – Kuroki Kaze May 26 '09 at 15:58
  • These sound like network settings, you'd check the memory with "ps" or "top" on the server (I guess-- I'm not into FreeBSD either!) – Andomar May 26 '09 at 20:55
  • Those are MySQL settings. I'm still curious whether I can squeeze this query into seconds (or at least under 10 minutes). Thanks for the help :) – Kuroki Kaze May 27 '09 at 09:19
  • You could try to load the data into Sql Server Express http://www.microsoft.com/express/sql/default.aspx – Andomar May 27 '09 at 10:20
  • Thanks, but I don't think I'm ready to move this project onto SQL Server :) – Kuroki Kaze May 27 '09 at 16:27
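(For reference, the EXPLAIN EXTENDED / SHOW WARNINGS combination mentioned in the comments above is just two statements run back to back:)

EXPLAIN EXTENDED SELECT ...;   -- the query being investigated goes in place of the ...
SHOW WARNINGS;                 -- the note with code 1003 shows the query as rewritten by the optimizer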

OK, so 900K rows isn't a massive table; it's reasonably big, but your queries really shouldn't be taking that long.

First things first, which of the 3 statements above is taking the most time?

The first problem I see is with your first query: your WHERE clause doesn't include an indexed column, which means it has to do a full scan of the entire table.

Create an index on the "date_updated" column, then run the query again and see what that does for you.
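Something along these lines (the index name is just an example):

CREATE INDEX `idx_date_updated` ON `bigbigtable` (`date_updated`);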

If you don't need the hashes and are only using them to avail of the dark magic, then remove them completely.

Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
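For what it's worth, a rough sketch of what that single statement might look like (untested; it replaces temp2 with a derived table of max dates and drops the hashes entirely):

SELECT b.`element_id`,
       b.`category`,
       b.`source_prefix`,
       b.`source_name`,
       b.`date_updated`,
       AVG(b.`value`) AS `avg_value`,
       SUM(b.`value` * b.`weight`) / SUM(b.`weight`) AS `rating`
FROM `bigbigtable` b
INNER JOIN (
    SELECT `category`, `element_id`, `source_prefix`, `source_name`,
           MAX(`date_updated`) AS `maxdate`
    FROM `bigbigtable`
    WHERE `date_updated` <= '2009-04-28'
    GROUP BY `category`, `element_id`, `source_prefix`, `source_name`
) latest
    ON latest.`category` = b.`category`
   AND latest.`element_id` = b.`element_id`
   AND latest.`source_prefix` = b.`source_prefix`
   AND latest.`source_name` = b.`source_name`
   AND latest.`maxdate` = b.`date_updated`
WHERE b.`date_updated` <= '2009-04-28'
GROUP BY b.`category`, b.`element_id`, b.`date_updated`;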

Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;

or

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
Glen
  • Last one. First is near instant, second is about 23 minutes. – Kuroki Kaze May 22 '09 at 10:23
  • I can remove the hashes, but then the query will take an infinite time (okay, maybe not, but I don't have that much patience, nor do the clients). I suppose these hashes can be turned into indexes somehow. – Kuroki Kaze May 22 '09 at 10:25
  • Don't think the index suggestion makes sense. An aggregate query like this will always result in a full table scan. – Andomar May 22 '09 at 10:30