1

I have an extremely large table from which I would like to delete old rows. Example of table:

 | customer_id | first_purchase_date | last_purchase_date |
 |<primary key>|                     |   <index>          |

** I am using this example table for argument's sake. The table in question is not a customer table. The real table has grown to 28 GB in size over the past 2 months and is used to calculate something that requires only 2 weeks of historical data.

What I would like to do is delete customers from this table that have not purchased anything in the past year. I.e. delete from table where last_purchase_date < now() - interval 1 year;

Simpy deleting like this would be too costly on the database. I know that a partition can be used to truncate the old rows but I am not sure how to implement it effectively.

In addition, if a customer were to purchase something then that row could potentially move to a different partition by updating the last_purchase_date. Would this not be expensive as well?

Thank you in advance for any direction!

AYR
  • 1,139
  • 3
  • 14
  • 24
  • Are you sure you want to actually delete your customers? – Vitalii Strimbanu Feb 01 '17 at 15:42
  • 1
    Partinionning on date is a solution to get better results if performance is your issue... but is it your issue at all? And what is "an extremely large table"? Give some numbers. Deleting customers is rarely a good idea, moving them to an "archive" table would be better. How do you plan to delete them? will it be a yearly process (at end of 2017, delete all 2016) or a daily process (at end of each day, delete those that are < [date - 1year] ). To summarize, you really need to add more information – Thomas G Feb 01 '17 at 15:44
  • The table in question is not really a customer base – AYR Feb 01 '17 at 15:56
  • Short answer is - you can't. You can issue a `DELETE` and then I/O takes over, meaning it's as fast as I/O of your server is. There's no witty SQL or magic setting that you can use, records need to be 1) found 2) removed and this last part, the removal, isn't really done so your tablespace will actually still be occupied (I am assuming you're using InnoDB). What will happen is that deleted records will be kicked out of `innodb_buffer_pool` so you'll simply use less RAM. These strategies need to be considered in advance sadly. – Mjh Feb 01 '17 at 16:13
  • @Mjh True, I am using InnoDB. The table is never used except by the same procedure that would delete from the table. Question is if partitioning would help here... if so, how? – AYR Feb 01 '17 at 16:16
  • 1
    Welcome to the world of data and analysis, you're going to battle something we're all battling with all the time :) sadly, no one can tell you whether it would help and how. You'd have to measure. Your data is probably a bit more complex than what you wrote here, and the way to solve this problem is to think in a different direction - how to have a database that contains accurate historic data, without all the information that was used to extrapolate it. You can end up in many alleys with your potential solution. I doubt I can be of help to you with this issue. – Mjh Feb 01 '17 at 16:29

1 Answers1

2

You are right in thinking that partitioning is the way forward, because:

Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.

And if this doesn't work for you, it's still possible to

In addition, MySQL 5.7 supports explicit partition selection for queries. For example, SELECT * FROM t PARTITION (p0,p1) WHERE c < 5 selects only those rows in partitions p0 and p1 that match the WHERE condition. In this case, MySQL does not check any other partitions of table t; this can greatly speed up queries when you already know which partition or partitions you wish to examine. Partition selection is also supported for the data modification statements DELETE, INSERT, REPLACE, UPDATE, and LOAD DATA, LOAD XML.

Since you want to delete stuff based on date and not the primary key, what you need is a RANGE partition scheme.

First find the oldest date and create partitions based on that

ALTER TABLE sales
    PARTITION BY RANGE( TO_DAYS(last_purchase_date)) (
    PARTITION p0 VALUES LESS THAN (TO_DAYS('2018-12-31')),
    PARTITION p1 VALUES LESS THAN (TO_DAYS('2017-12-31')),
    PARTITION p2 VALUES LESS THAN (TO_DAYS('2016-12-31')),
    PARTITION p3 VALUES LESS THAN (TO_DAYS('2015-12-31')),
    ..
    PARTITION p10 VALUES LESS THAN MAXVALUE));

Choose an appropriate number of partitions but don't worry too much because you can always change partitions later. When you partition, you might even find that the delete step isn't really needed after all.

e4c5
  • 52,766
  • 11
  • 101
  • 134
  • Using `PARTITIONing` is an excellent way to facilitate deleting old rows. [_Here_](https://mariadb.com/kb/en/mariadb/partition-maintenance/) is my blog on such. – Rick James Feb 01 '17 at 18:59
  • @e4c5 Thank you for the thorough answer. The question is if using a partition wouldn't significantly affect the update on the date column as it would have to move the row to a different partition. – AYR Feb 01 '17 at 19:02
  • Ooops, sorry this scheme wouldn't work if the date keeps changing. It would have to be static. Can you post your full table so that we can identify a suitable column – e4c5 Feb 01 '17 at 23:40
  • @e4c5 Except for the names of the columns that is the full table. Its a table which allows to me see when a "customer" enters "learning mode" and when was the last time the row appeared. A "customer" that hasn't been around for over two weeks is not relevant to the algorithm which makes use of this table. – AYR Feb 02 '17 at 06:07
  • why don't you partition on first purchase_date? it's logical that the vast majority of cases the first purchase date will be close to the last_purchase date – e4c5 Feb 02 '17 at 06:10
  • @e4c5 Assuming that you are right and the first purchase date will be, on average, close to the last, I still cannot drop an entire partition in this case . Instead I would still be deleting by the last_purchase_date. – AYR Feb 02 '17 at 06:17
  • Deleting by the last purchase date in these partitions would still be better because you are not effecting the entire table. – e4c5 Feb 02 '17 at 06:31