
I have two tables, person and person_history, to keep versioned records of a person entity.

The person table always holds the latest version of the person entity, while the person_history table keeps all versions of the person.

The person_history table is growing rapidly, since every update of person adds a new record to the history table.

The primary key of the person table is referenced as person_id from the person_history table. The version_num column tracks versioning in the history table and is bumped up by 1 on each update.
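
For reference, a simplified sketch of the two tables (only the relevant columns are shown; names and types other than id, person_id and version_num are illustrative):

CREATE TABLE person (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL                 -- illustrative attribute column
);

CREATE TABLE person_history (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id   INT UNSIGNED NOT NULL,         -- references person.id
    version_num INT UNSIGNED NOT NULL,         -- bumped by 1 on every update of person
    name        VARCHAR(100) NOT NULL,         -- snapshot of the person attributes
    UNIQUE KEY uq_person_ver (person_id, version_num),
    FOREIGN KEY (person_id) REFERENCES person (id)
);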

I wish to keep only the latest 5 records per person_id and purge the older ones.

For this I've prepared the statement below:

DELETE
FROM person_history
WHERE id IN (
    SELECT p0.id
    FROM person_history p0
    WHERE (
        SELECT COUNT(*)
        FROM person_history p1
        WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
    ) >= 5
);


This statement works, but it is very slow, and write operations are impacted while it runs.

I tried adding ORDER BY and LIMIT to the subquery so the delete happens in chunks, and formed the query below:

DELETE
FROM person_history
WHERE id IN (
    SELECT p0.id
    FROM person_history p0
    WHERE (
        SELECT COUNT(*)
        FROM person_history p1
        WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
    ) >= 5
    ORDER BY p0.id
    LIMIT 1000
);

This query fails with the error: This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'.
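
I understand the usual way around that limitation is to wrap the limited subquery in one more derived table, roughly as below, but I have not verified whether this avoids the impact on writes:

DELETE
FROM person_history
WHERE id IN (
    SELECT id FROM (
        SELECT p0.id
        FROM person_history p0
        WHERE (
            SELECT COUNT(*)
            FROM person_history p1
            WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
        ) >= 5
        ORDER BY p0.id
        LIMIT 1000
    ) x
);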

I've also tried creating a procedure, but that threw an error too:

DROP PROCEDURE IF EXISTS purge_history;

DELIMITER $$

CREATE PROCEDURE purge_history()

BEGIN   
    REPEAT
        DO SLEEP(1);
        SET @z:= (SELECT p0.id
            FROM person_history p0
            WHERE (
            SELECT COUNT(*)
            FROM person_history p1
            WHERE  p0.person_id = p1.person_id AND p0.version_num < p1.version_num
            ) >= 5 ORDER BY p0.id LIMIT 1000);
        DELETE
        FROM person_history
        WHERE id in z;
    UNTIL ROW_COUNT() = 0 END REPEAT;
END$$

DELIMITER ;

This failed with ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'z; UNTIL ROW_COUNT() = 0 END REPEAT;
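
As far as I can tell this fails because @z is a scalar user variable that cannot hold a result set, and IN z is not valid syntax in any case. Putting the DELETE itself inside the loop avoids the variable entirely; a sketch of what I was aiming for (again, not verified for the locking impact):

DROP PROCEDURE IF EXISTS purge_history;

DELIMITER $$

CREATE PROCEDURE purge_history()
BEGIN
    REPEAT
        DO SLEEP(1);
        -- delete one chunk per iteration instead of collecting ids in a variable
        DELETE
        FROM person_history
        WHERE id IN (
            SELECT id FROM (
                SELECT p0.id
                FROM person_history p0
                WHERE (
                    SELECT COUNT(*)
                    FROM person_history p1
                    WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
                ) >= 5
                ORDER BY p0.id
                LIMIT 1000
            ) x
        );
    UNTIL ROW_COUNT() = 0 END REPEAT;
END$$

DELIMITER ;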

I've tried this on MySQL 8 and MariaDB 10.9.

Please suggest an alternative to the above chunked delete query so that writes are not impacted while the delete is in progress.

1 Answer

You could do it using ROW_NUMBER() -

SELECT person_id, version_num, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
FROM person_history

and for the delete -

DELETE ph
FROM person_history ph
JOIN (
    SELECT person_id, version_num, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
    FROM person_history
) t ON ph.person_id = t.person_id
   AND ph.version_num = t.version_num
   AND t.revs > 5;

UPDATE

I have set up a test table with 1M rows. Running the SELECT version of the OP's query (retaining only the latest 3 versions rather than the suggested 5) returns 267,432 rows, but the distribution in the OP's real table is likely to be very different.
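
A comparable 1M-row test table can be generated with a recursive CTE along these lines (illustrative only; the spread of versions per person in my test data is not uniform, so this exact split will not reproduce the 267,432 figure):

-- MySQL needs the recursion depth raised first (the default is 1000):
-- SET SESSION cte_max_recursion_depth = 1000000;
CREATE TABLE person_history (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id   INT UNSIGNED NOT NULL,
    version_num INT UNSIGNED NOT NULL,
    UNIQUE KEY uq_person_ver (person_id, version_num)
);

INSERT INTO person_history (person_id, version_num)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT ((n - 1) DIV 10) + 1,     -- 100,000 people ...
       ((n - 1) MOD 10) + 1      -- ... with 10 versions each
FROM seq;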

Query 1: Original correlated sub-query

SELECT id
FROM person_history
WHERE id in (
    SELECT p0.id
    FROM person_history p0
    WHERE (
        SELECT COUNT(*)
        FROM person_history p1
        WHERE  p0.person_id = p1.person_id
        AND p0.version_num < p1.version_num
    ) >= 3
);
id | select_type        | table          | partitions | type   | possible_keys | key           | key_len | ref                    | rows   | filtered | Extra
1  | PRIMARY            | person_history | NULL       | index  | PRIMARY       | uq_person_ver | 4       | NULL                   | 998896 | 100.00   | Using index
1  | PRIMARY            | p0             | NULL       | eq_ref | PRIMARY       | PRIMARY       | 3       | test.person_history.id | 1      | 100.00   | Using where
3  | DEPENDENT SUBQUERY | p1             | NULL       | ref    | uq_person_ver | uq_person_ver | 3       | test.p0.person_id      | 3      | 33.33    | Using where; Using index

Query 2: Rewritten correlated sub-query

SELECT ph.id
FROM person_history ph
JOIN (
    SELECT p0.id
    FROM person_history p0
    JOIN person_history p1
        ON p0.person_id = p1.person_id
        AND p0.version_num < p1.version_num
    GROUP BY p0.id
    HAVING COUNT(p1.id) >= 3
) t ON ph.id = t.id;
id | select_type | table      | partitions | type   | possible_keys         | key           | key_len | ref               | rows    | filtered | Extra
1  | PRIMARY     | <derived2> | NULL       | ALL    | NULL                  | NULL          | NULL    | NULL              | 1148980 | 100.00   |
1  | PRIMARY     | ph         | NULL       | eq_ref | PRIMARY               | PRIMARY       | 3       | t.id              | 1       | 100.00   | Using index
2  | DERIVED     | p0         | NULL       | index  | PRIMARY,uq_person_ver | PRIMARY       | 3       | NULL              | 998896  | 100.00   |
2  | DERIVED     | p1         | NULL       | ref    | uq_person_ver         | uq_person_ver | 3       | test.p0.person_id | 3       | 33.33    | Using where; Using index

Query 3: ROW_NUMBER() sub-query

SELECT ph.id
FROM person_history ph
JOIN (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
    FROM person_history
) t ON ph.id = t.id
    AND t.revs > 3;
id | select_type | table          | partitions | type   | possible_keys | key           | key_len | ref  | rows   | filtered | Extra
1  | PRIMARY     | <derived2>     | NULL       | ALL    | NULL          | NULL          | NULL    | NULL | 998896 | 33.33    | Using where
1  | PRIMARY     | ph             | NULL       | eq_ref | PRIMARY       | PRIMARY       | 3       | t.id | 1      | 100.00   | Using index
2  | DERIVED     | person_history | NULL       | index  | NULL          | uq_person_ver | 4       | NULL | 998896 | 100.00   | Using index; Using filesort

Observations

#              | Query 1   | Query 2 | Query 3
Rows examined  | 6,378,752 | 534,864 | 1,267,432
Rows returned  | 267,432   | 267,432 | 267,432
Execution time | 5.95      | 3.75    | 1.10

Combining this with batching on the person_id should significantly reduce the overhead -

DELETE ph
FROM person_history ph
JOIN (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
    FROM person_history
    WHERE person_id BETWEEN 1 AND 50000
) t ON ph.id = t.id
    AND t.revs > 3;
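
To purge the whole table this way, the batched delete can be wrapped in a loop that walks person_id ranges, for example in a stored procedure. The procedure name and parameter below are just placeholders, and this sketch keeps 5 versions as asked in the question; batch size and sleep time are tuning knobs:

DROP PROCEDURE IF EXISTS purge_history_batched;

DELIMITER $$

CREATE PROCEDURE purge_history_batched(IN batch_size INT)
BEGIN
    DECLARE from_id INT UNSIGNED DEFAULT 0;
    DECLARE max_id INT UNSIGNED;

    SELECT MAX(person_id) INTO max_id FROM person_history;

    WHILE from_id < max_id DO
        -- delete everything beyond the 5 newest versions for this range of people
        DELETE ph
        FROM person_history ph
        JOIN (
            SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
            FROM person_history
            WHERE person_id BETWEEN from_id + 1 AND from_id + batch_size
        ) t ON ph.id = t.id
            AND t.revs > 5;

        SET from_id = from_id + batch_size;
        DO SLEEP(1);    -- give concurrent writes room to breathe between batches
    END WHILE;
END$$

DELIMITER ;

-- e.g. CALL purge_history_batched(50000);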

I also tried these queries against the table with the surrogate PK replaced by a PK on (person_id, version_num), but the improvement was negligible.

  • I do not think that ROW_NUMBER() together with >5 is well optimized. The `SELECT` may be the slow part of this formulation. – Rick James Nov 23 '22 at 06:36
  • @RickJames - what do you think? I know my test table is much narrower than OP's table but the inner query should be entirely index based, so it shouldn't make any difference. – user1191247 Nov 23 '22 at 13:13
  • I tried @nnichols's solution of ROW_NUMBER() with a smaller batch of 10k and it solved the problem. The CRUD operations were working fine while the DELETE was running in the background. – Shubhansh Vatsyayan Nov 27 '22 at 12:20