0

I have three tables:

grade         (grade_id,       grade_value,       grade_date)     ~100M rows
grade_archive (grade_id,       grade_value,       grade_date)         0 rows
peer_review   (grade_id, peer_review_value, peer_review_date)      ~10M rows

I want to move all rows from table grade to grade_archive that are older than a month and are not in table peer_review.

The tables are actively used so any insert would have to be low priority to avoid interrupting existing and new processes while this runs.

Expected table rows should look something like this when done:

grade          ~10M rows
grade_archive  ~90M rows
peer_review    ~10M rows

I imagine it's something close to:

INSERT
    LOW_PRIORITY
    INTO grade_archive
        (grade_id,grade_value,grade_date)
    SELECT
        grade_id,grade_value,grade_date
    FROM
        grade
    WHERE
            grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
        AND grade_id NOT IN
            (
                SELECT grade_id FROM peer_review
            );

And then clean up the grade table by removing all rows in the archive table:

DELETE LOW_PRIORITY FROM grade WHERE grade_id IN (SELECT grade_id FROM grade_archive);

But these subselects are very slow with large tables and I'm nervous about the result. Looking for some better direction.

Ryan
  • 14,682
  • 32
  • 106
  • 179

2 Answers2

1

I have had a similar problem in the past with migrating a portion of data from a large active tables into an archive table. The approach I used (modified for your use case) is as follows:

/* Set time for calculation basis */
SET@calc_time = NOW();
/* Create empty copy of grade table */
CREATE TABLE grade_temp LIKE grade;
/* Add rows you want to save from grade into temp table */
INSERT INTO grade_temp
SELECT
    g.grade_id AS grade_id,
    g.grade_value AS grade_value,
    g.grade_date AS grade_date
FROM grade AS g
LEFT JOIN peer_review AS pr
  ON g.grade_id = pr.grade_id
WHERE
/*
To keep the record it must either have an entry in peer review
or it is less than a month old
*/
    pr.grade_id IS NOT NULL
    OR g.grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/*
Switch new temp table for active table.
This happens really fast (it is just file name switching on the system).
*/
RENAME TABLE grade TO grade_old, grade_temp TO grade;
/*
You are now taking new records into new version of grade table
and free to do your much slower operations against the grade_old table
*/
/* Delete more recent rows */
DELETE FROM grade_old
WHERE grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Delete rows that exist in peer review */
DELETE FROM grade old
WHERE grade_id IN (
    SELECT grade_id
    FROM peer_review
    WHERE grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH)
);
/*
As an alternate to the above action, you could also try deleting across join as shown below. Which is faster will likely depend upon number of records that are returned from that subquery shown above. You can try both out and see what works best
*/
DELETE go FROM grade_old AS go
INNER JOIN peer_review AS pr
  ON go.grade_id = pr.grade_id
WHERE pr.grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Add all rows from grade_old to grade_archive */
INSERT INTO grade_archive
SELECT
    grade_id,
    grade_value,
    grade_date
FROM grade_old;
/* Drop date_old table */
DROP TABLE date_old;

The key here is in getting a new version of grade table which contains only the rows needed in place as quickly as possible and then sort out what goes into the archive table after the fact. You don't want to do any mass delete operations on a table of that size. This keeps the time where you grade table is tied up on these archival operations to a minimum.

I will say however that your database schema seems like it could be optimized for this sort of operation. For example, you could have a peer review flag on your grade table that you could use for quicker filtering rather than having to filter across a join. I am actually questioning the need for that peer review table as a whole unless it has a many to one relationship with grade table (which doesn't seem to be indicated in your question). If there is only ever one peer review entry per grade_id, I would think that these columns should just be normalized into the grade table. That would greatly simplify this maintenance process.

Mike Brant
  • 70,514
  • 10
  • 99
  • 103
  • Good idea about the peer_review_flag on the grade table. If I added that, would I just skip the joins above and proceed as written or is there an even faster way? – Ryan Mar 02 '15 at 19:38
  • 1
    In that case you could just replace the where conditions related to the peer_review table and the related joins/subselects in favor of where conditions on the grade and grade_old tables. In reading through your question, I do not fully understand the need for the peer_review table at all. The number you provide around the rows in each table would seem to suggest a one to one relationship between grade and peer_review table. If that is the case, then you might be best advised to add the peer review column onto the grade table to make everything much simpler. – Mike Brant Mar 02 '15 at 19:42
1

Since NOT IN ( SELECT ... ) is miserably slow, use LEFT JOIN .. IS NULL to get the same effect:

SELECT  g.grade_id, g.grade_value, g.grade_date
    FROM  grade AS g
    LEFT JOIN  peer_review AS p USING(grade_id)
    WHERE  g.grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
      AND  gi.grade_id IS NULL ; 

No explicit tmp table needed.

Rick James
  • 135,179
  • 13
  • 127
  • 222