
I'm running a fairly complicated SQL statement to create a summary table from a big table (38 million rows) of raw data. (I'm trying to get current, low-this-season, high-this-season, percent-the-price-has-been-1-cent-this-week/month/season into a cache table for easy querying later.)

INSERT INTO cache (`time`, name, price, low, high, week, month, season)
    SELECT
        MAX(`time`) AS `time`,
        name,
        MIN(CASE WHEN `time` = 1498511444 THEN price ELSE 999999 END) AS price,
        MIN(price) AS low,
        MAX(price) AS high,
        SUM(CASE WHEN `time` > 1497906644 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1497906644 THEN 1 ELSE 0 END) AS week,
        SUM(CASE WHEN `time` > 1480367444 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1480367444 THEN 1 ELSE 0 END) AS month,
        SUM(CASE WHEN `time` > 1493362800 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1493362800 THEN 1 ELSE 0 END) AS season
    FROM
        (SELECT
            `time`,
            name,
            MIN(price) AS price
        FROM price
        WHERE `time` > 1493362800
        GROUP BY `time`, name) AS tmp
    GROUP BY name

After adding an index on the price.time column I managed to get this down to 0.6s on local (it took 30s before). On prod (with the same index) it runs for a long time (30s+) and then fails with Errcode: 28 - No space left on device. If I watch df while it is running, I see free space slowly dwindle from 9.9G to 9.6G at about 3MB/s. Then after a few minutes free space suddenly starts dropping at 500MB/s until there is none left and the query fails. On local there's seemingly no blip in available space, although I guess it could happen so fast that my df-in-a-while-loop isn't seeing it.
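For reference, here's what I've been running to see where MySQL writes its on-disk temporary tables and to check whether the plan materializes one (these are standard MySQL variables; I'm assuming the GROUP BY is what's spilling):

```sql
-- Where MySQL writes on-disk temporary tables
-- (Errcode 28 means the filesystem holding this directory filled up)
SHOW VARIABLES LIKE 'tmpdir';

-- How big an in-memory temporary table can get before it spills to disk
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- If Extra shows "Using temporary; Using filesort", the GROUP BY
-- is materializing an intermediate temp table
EXPLAIN SELECT `time`, name, MIN(price) AS price
FROM price
WHERE `time` > 1493362800
GROUP BY `time`, name;
```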

I also get the disk-eating behavior if I try to create a table containing the results of the subquery first:

INSERT INTO initial_cache (`time`, name, price)
SELECT
    `time`,
    name,
    MIN(price) AS price
FROM price
WHERE `time` > 1493337600
GROUP BY `time`, name

Do you know why my query needs so much space to run? And why it would behave so differently on prod?

Thanks!

Thomas David Baker
  • Verify which device your /tmp is on with `cd /tmp; df -h .`; adding the `.` makes df show stats for just that directory's disk – tawman Jun 26 '17 at 18:06
  • Yup, it's on /dev/root ... Filesystem Size Used Avail Use% Mounted on /dev/root 24G 13G 9.9G 56% / – Thomas David Baker Jun 26 '17 at 18:09
  • 1
    Your `/tmp` partition is full, or at least whatever operation you're trying to do is exhausting whatever space is available there. While the operation is ongoing keep an eye on `df -h` because I bet you'll see it dwindle to zero for one partition. – tadman Jun 26 '17 at 19:24
  • Well, wow, you are correct. As the operation runs it uses about 3MB/s very slowly dwindling from 9.9G free down to 9.3G free. Then it suddenly starts dropping 500MB/s until there is no space free and errors out. Do you have any idea why my query would use so much space when running? I'll update the question. – Thomas David Baker Jun 26 '17 at 19:50
  • No we don't know why it takes so much space. That is determined by the plan the dbms uses, the distribution of the data, how accurately this is reflected by the index statistics, the available indexes and the number of rows. – symcbean Jun 26 '17 at 21:30
  • How do we find out those things @symcbean? – Thomas David Baker Jun 26 '17 at 21:33
  • The plan is displayed using explain. The metrics are less obvious. If you don't have analytics available already, google is a good place to start. – symcbean Jun 26 '17 at 21:35
  • 1
    Your query in the FROM is evaluated first, written to disk and then the outer process starts executing: If it is too much data, create a temporary table first for the query in the FROM, maybe even add an index on that temporary table, and then use the output from that to run the second part of your query – Norbert Jun 27 '17 at 08:13
  • Thanks @NorbertvanNobelen. I get the same problem doing this: INSERT INTO initial_cache (`time`, name, price) SELECT `time`, name, MIN(price) AS price FROM price WHERE `time` > 1493337600 GROUP BY `time`, name Is that the kind of thing you mean? (Updated the question with this, too.) – Thomas David Baker Jun 27 '17 at 16:05

2 Answers


Subqueries tend to use a lot of temp space once they no longer fit in memory. One part of your query, however, is a bit redundant: the checks on `time` in the outer query, given the subquery's WHERE. Rewriting without them gives this (where the `SUM(1)` is just a row count, which looks weird):

INSERT INTO cache (`time`, name, price, low, high, week, month, season)
SELECT
    MAX(`time`) AS `time`,
    name,
    MIN(price) AS price,
    MIN(price) AS low,
    MAX(price) AS high,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS week,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS month,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS season
FROM
    (SELECT
        `time`,
        name,
        MIN(price) AS price
    FROM price
    WHERE `time` > 1498442022
    GROUP BY `time`, name) AS tmp
GROUP BY name;

Which might be equivalent to:

INSERT INTO cache (`time`, name, price, low, high, week, month, season)
SELECT
    MAX(`time`) AS `time`,
    name,
    MIN(price) AS price,
    MIN(price) AS low,
    MAX(price) AS high,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS week,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS month,
    SUM(CASE WHEN price = 1 THEN 1 ELSE 0 END) / SUM(1) AS season
FROM price
WHERE `time` > 1498442022    
GROUP BY name;

However, since the rewritten outer query looks strange, I doubt this is the result you are looking for. Provide sample data and the expected result for a better answer.
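If it is the subquery's result that spills to disk, materializing it first with an index, then aggregating from that, can also help (a sketch; the engine, column types and index are assumptions, not taken from your schema):

```sql
-- Materialize the inner GROUP BY once; types and engine are guesses
CREATE TABLE initial_cache (
    `time` INT UNSIGNED NOT NULL,
    name   VARCHAR(64)  NOT NULL,
    price  INT          NOT NULL,
    KEY idx_name_time (name, `time`)
) ENGINE=InnoDB;

INSERT INTO initial_cache (`time`, name, price)
SELECT `time`, name, MIN(price)
FROM price
WHERE `time` > 1493362800
GROUP BY `time`, name;

-- The outer aggregation then reads the much smaller initial_cache
-- instead of the 38-million-row price table
```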

Norbert
  • Sorry, I should have put a real example; I was a bit lazy. In the true query `week` is now - 1 week, `month` is now - 1 month, and `season` is a specific point in time in the past. So those timestamps are necessary to get the data I want to put into the `cache` table. I'll update the question with something more accurate. – Thomas David Baker Jun 26 '17 at 20:43
  • If the queries give the same result then there's something wrong with the database design. – symcbean Jun 26 '17 at 21:33
  • Which queries do you mean by "the queries"? – Thomas David Baker Jun 26 '17 at 21:50

I didn't solve this, but I did work around it. I have the program that inserts the raw data also insert into a table shaped like the subquery's result, and then I run the outer query against that table separately. So I now have a kind of two-stage cache. For some reason this works without noticeably denting disk space.
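Roughly, the two stages look like this (a sketch; I'm assuming `initial_cache` has a unique key on (`time`, name), and the timestamps are the ones from the question):

```sql
-- Stage 1: kept up to date by the inserter program, alongside `price`;
-- one row per (`time`, name), holding the minimum price seen so far
INSERT INTO initial_cache (`time`, name, price)
VALUES (1498511444, 'some-item', 3)
ON DUPLICATE KEY UPDATE price = LEAST(price, VALUES(price));

-- Stage 2: the outer query now reads the much smaller initial_cache
INSERT INTO cache (`time`, name, price, low, high, week, month, season)
SELECT
    MAX(`time`) AS `time`,
    name,
    MIN(CASE WHEN `time` = 1498511444 THEN price ELSE 999999 END) AS price,
    MIN(price) AS low,
    MAX(price) AS high,
    SUM(CASE WHEN `time` > 1497906644 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1497906644 THEN 1 ELSE 0 END) AS week,
    SUM(CASE WHEN `time` > 1480367444 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1480367444 THEN 1 ELSE 0 END) AS month,
    SUM(CASE WHEN `time` > 1493362800 AND price = 1 THEN 1 ELSE 0 END) / SUM(CASE WHEN `time` > 1493362800 THEN 1 ELSE 0 END) AS season
FROM initial_cache
WHERE `time` > 1493362800
GROUP BY name;
```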

Thomas David Baker