
I have a large table (about 10 million rows) from which I need to delete records "older" than 10 days, according to the created_at column. I have a Python script that I run to do this. created_at is a varchar(255) and holds epoch milliseconds, e.g. 1594267202000

import mysql.connector
import sys
from mysql.connector import Error

table = sys.argv[1]
deleteDays = sys.argv[2]

sql_select_query = """SELECT COUNT(*) FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))""".format(table)
sql_delete_query = """DELETE FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) LIMIT 100""".format(table)

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='myDatabase',
                                         user='admin123',
                                         password='password123')
    cursor = connection.cursor()

        
    # initial count of rows before deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]


    while records >= 1:
        # stuck at the following line; the timeout happens here....
        cursor.execute(sql_delete_query, (deleteDays,))
        connection.commit()
        cursor.execute(sql_select_query, (deleteDays,))
        records = cursor.fetchone()[0]

    #final count of rows after deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]

    if records == 0:
        print("\nRows deleted")
    else:
        print("\nRows NOT deleted")

except mysql.connector.Error as error:
    print("Failed to delete: {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")

When I run this script and it reaches the DELETE query, it fails with:

Failed to delete: 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

I know that innodb_lock_wait_timeout is currently set to 50 seconds, and I could increase it to overcome this problem, but I'd rather not touch the timeout. I want to delete in chunks instead. Does anyone know how I can do that, using my code as an example?

Saffik
  • Do you have an index on created_at? – vestronge Jul 31 '20 at 05:22
  • What type is created_at? This would be more efficient if it was a datetime and indexed. – Schwern Jul 31 '20 at 05:27
  • created_at has no index and is a varchar(255) – Saffik Jul 31 '20 at 05:31
  • Create a MySQL stored procedure which deletes old rows and simply call it from the Python code. If the number of deleted rows is large, and this affects other users, then delete in chunks (e.g. 100 or 1000 rows per chunk). – Akina Jul 31 '20 at 05:33
  • *`WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))`* - this condition is wrong, because dividing the column value prevents the use of an index on that column. Convert it to *`WHERE created_at < 1000 * UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))`* - all arithmetic is moved to the constant side and hence is calculated only once, rather than the original form, where the calculation is performed for each separate row. Always use the unmodified column value in a condition when possible. – Akina Jul 31 '20 at 05:36
  • @Akina Another idea: Run the delete cleanup proc or query at some low volume time, such as midnight Saturday. – Tim Biegeleisen Jul 31 '20 at 05:41
  • @Akina the created_at holds linux epoch timestamp in milliseconds. i had to divide by 1000 to get it to unix_timestamp format if that makes sense. – Saffik Jul 31 '20 at 05:41
  • @TimBiegeleisen This is good solution, of course... but it is extensive. – Akina Jul 31 '20 at 05:43
  • *i had to divide by 1000 to get it to unix_timestamp format if that makes sense.* It doesn't matter whether you convert the timestamp to seconds or convert the comparison value to milliseconds; this cannot affect the comparison result. But when you convert the column value, you increase the amount of work the server must perform. – Akina Jul 31 '20 at 05:45
  • @Akina it also seems kinda inefficient to force mysql to compute the timestamp every time when you could compute the cutoff date in python once in a format matching `created_at` and just hand that out to mysql directly. – Masklinn Jul 31 '20 at 07:31
  • @Masklinn - can you give/write an example for it using my script pls? Beginner here... – Saffik Jul 31 '20 at 07:40
  • @Masklinn I do not see the reason to calculate the value on the Python side and use it on the MySQL side when MySQL can both calculate and use it. Extra parameter transferring, extra Python code, and no visible profit. – Akina Jul 31 '20 at 08:26
  • @akina there's no excess parameter transferring or python code, it replaces the existing parameter and avoids work on the server side. – Masklinn Jul 31 '20 at 09:46
  • @Masklinn *there's no excess ... python code* You say *you could compute the cutoff date in python*, which needs additional Python code, doesn't it? *avoids work on the server side* Is a single DATE_SUB() execution really work? – Akina Jul 31 '20 at 10:00
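Masklinn's idea from the comments could be sketched like this: compute the cutoff once in Python, in the same epoch-milliseconds format as created_at, and hand MySQL a plain constant to compare against. `cutoff_millis` is a hypothetical helper, not from the question's script:

```python
import time

def cutoff_millis(days):
    """Cutoff as epoch milliseconds, matching the created_at format."""
    return int((time.time() - days * 86400) * 1000)

# The WHERE clause then compares the raw column against a constant,
# e.g. (table name formatted in separately, as in the question's script):
#   DELETE FROM {table} WHERE created_at < %s LIMIT 100
# Note: as long as created_at is a varchar, MySQL still casts each row's
# value for the numeric comparison; the indexing advice in the answers
# applies either way.
cutoff = cutoff_millis(10)
```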

2 Answers


One approach here might be to use a DELETE ... LIMIT query to batch your deletes at a certain size. Assuming batches of 100 records:

DELETE
FROM yourTable
WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))
LIMIT 100;

Note that strictly speaking you should always have an ORDER BY clause when using LIMIT. What I wrote above might delete any 100 records matching the criteria for deletion.
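Wired into the question's script, the batching could look like the sketch below. `delete_in_batches` is a hypothetical helper; it relies on `cursor.rowcount` (which Connector/Python sets to the number of affected rows) to stop when a batch comes back short, so no extra COUNT(*) round trip is needed. The comparison uses the index-friendly form suggested in the comments:

```python
def delete_in_batches(cursor, connection, table, days, batch_size=100):
    """Delete matching rows in small committed batches.

    Committing after every batch keeps each transaction short, so it
    holds row locks only briefly and avoids lock-wait timeouts.
    """
    # Table names can't be bound as query parameters, so the name is
    # formatted in; only do this with a trusted table argument.
    sql = ("DELETE FROM {} "
           "WHERE created_at < 1000 * UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) "
           "LIMIT {}").format(table, batch_size)
    total = 0
    while True:
        cursor.execute(sql, (days,))
        connection.commit()
        total += cursor.rowcount
        if cursor.rowcount < batch_size:
            break  # last batch was short: nothing left to delete
    return total
```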

Tim Biegeleisen
  • Yes, I can certainly add a LIMIT 100, but my query will only execute once as-is? How do I repeatedly execute it until all rows are deleted? – Saffik Jul 31 '20 at 05:30
  • @Saffik You may loop and call delete in batches until no more records are left. Most likely, whatever SQL driver you are using would return the number of records deleted/affected, so you may use this count. – Tim Biegeleisen Jul 31 '20 at 05:40
  • I updated my question and added a while loop, but the timeout is still happening? – Saffik Jul 31 '20 at 07:26

*created_at has no index and is a varchar(255)* – Saffik 11 hours ago

There's your problem. Two of them.

It needs to be indexed to have any hope of being performant. Without an index, MySQL has to check every record in the table. With an index, it can skip straight to the ones which match.

While storing an integer as a varchar will work (MySQL will convert it for you), it's bad practice: it wastes storage, allows bad data, and is slow.

Change created_at to a bigint so that it's stored as a number, then index it.

alter table your_table modify column created_at bigint;
create index created_at_idx on your_table(created_at);

Now that created_at is an indexed bigint, your query should use the index and it should be very fast.


Note that created_at should be a datetime which stores the time at microsecond accuracy. Then you can use MySQL's date functions without having to convert.

But that would mess with your code, which expects a millisecond epoch number, so you're stuck with it. Keep it in mind for future tables.

For this table, you can add a generated created_at_datetime column to make working with dates easier. And, of course, index it.

alter table your_table add column created_at_datetime datetime generated always as (from_unixtime(created_at/1000));
create index created_at_datetime on your_table(created_at_datetime);

Then your where clause becomes much simpler.

WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)
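Slotted back into the question's script, the simplified queries might look like this sketch, where `your_table` stands in for the table name taken from sys.argv in the original code:

```python
# Hypothetical queries against the generated, indexed datetime column
# created_at_datetime (added by the ALTER TABLE above).
table = "your_table"  # placeholder for sys.argv[1]

sql_select_query = (
    "SELECT COUNT(*) FROM {} "
    "WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)"
).format(table)

sql_delete_query = (
    "DELETE FROM {} "
    "WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY) "
    "LIMIT 100"
).format(table)
```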
Schwern