
I have a huge database that processes email traffic every day. The system needs to delete some old emails every day:

Delete from EmailList(nolock) 
WHERE EmailId IN ( 
    SELECT EmailId 
    FROM Emails 
    WHERE EmailDate < DATEADD([days], -60, GETDATE())
)

It works, but the problem is that it takes a long time to finish, and the log file grows very large because of it. The log file size increases by more than 100 GB every day.

I'm thinking we can change it to

Delete from EmailList(nolock) 
WHERE EXISTS ( 
    SELECT EmailId 
    FROM Emails 
    WHERE (Emails.EmailId = EmailList.EmailId) AND 
        (EmailDate < DATEADD([days], -60, GETDATE()))
)

But other than this, is there anything we can do to improve the performance and, most of all, reduce the log file size?

  • EmailId is indexed.
urlreader
  • There is absolutely no point in using NOLOCK on a DELETE. NOLOCK is not a valid hint for DML operations. – Aaron Bertrand Jul 06 '12 at 19:19
  • "EmailId is indexed." In which table(s)? Can you post the query plan? – Mark Byers Jul 06 '12 at 19:19
  • 1
  • What is the recovery model of your db? And what is your strategy to reduce log size? – tschmit007 Jul 06 '12 at 19:30
  • @Aaron, I am kind of surprised the SQL Statement even parses correctly with that in there. I was thinking the same exact thing. – Sage Jul 06 '12 at 19:36
  • I copied the script from a DBA, and I actually do not know where the nolock comes from. The code is C#, and it has no nolock in it. Sorry for the confusion. The recovery model is simple. Frankly, I did not even think that the tempDB could increase so much in a short time. – urlreader Jul 06 '12 at 20:25

3 Answers


I've seen

GetDate()-60

style syntax perform MUCH better than

DATEADD([days], -60, GETDATE())

especially if there is an index on the date column. A few fellow DBAs and I spent quite a bit of time trying to understand WHY it would perform better, but the proof was in the pudding.
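
For illustration only, here is the question's first DELETE rewritten with that style of date arithmetic (table and column names come from the question; whether it really performs better should be verified against the execution plan):

-- Sketch only: the "- 60" arithmetic works when EmailDate is DATETIME;
-- DATE and DATETIME2 columns do not support the - operator.
DELETE FROM EmailList
WHERE EmailId IN (
    SELECT EmailId
    FROM Emails
    WHERE EmailDate < GETDATE() - 60   -- 60 days ago
);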

Another thing you might want to consider, given the volume of records I presume you have to delete, is to chunk the deletes into batches of, say, 1,000 or 10,000 records. This would probably speed up the delete process.
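
As a rough sketch of what such a batched delete could look like (table and column names are from the question; the batch size and the use of DELETE TOP are my own assumptions and should be tuned and tested):

-- Delete in small batches; each DELETE is its own transaction, so under
-- the SIMPLE recovery model the log space can be reused between batches.
DECLARE @batch INT = 10000;   -- example value only

WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) FROM EmailList
    WHERE EXISTS (
        SELECT 1
        FROM Emails
        WHERE Emails.EmailId = EmailList.EmailId
          AND Emails.EmailDate < DATEADD(DAY, -60, GETDATE())
    );

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete

    CHECKPOINT;   -- under SIMPLE recovery, lets committed log space be reused
END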

Sage
  • Will something like WHILE EXISTS (SELECT TOP 1 * FROM table WHERE ...) BEGIN SET ROWCOUNT 1000; DELETE Table WHERE ...; SET ROWCOUNT 0 END work fine? I believe a batch delete will truncate the log, but will it slow down the system (since it needs to truncate in every batch)? I developed the C# code, never thought it would be used to process huge data. thanks for any suggestions. – urlreader Jul 06 '12 at 20:32
  • If I do the delete in batches, do I need to manually truncate between batches, or will the database do it automatically? – urlreader Jul 06 '12 at 21:22
  • Truncate what? As you delete, you never truncate the table. – TomTom Jul 07 '12 at 07:09

[EDIT]:

Regarding @TomTom's comment: if you have SQL Server Enterprise edition available, you should use table partitioning.

If this is not the case, my original post may be helpful:


[ORIGINAL POST]

Deleting a large amount of data is always difficult. I faced the same problem and I went with the following solution:

Depending on your requirements this may not work, but maybe you can get some ideas from it.

Instead of using one table, use two tables with the same schema. Create a synonym (I assume you are using MS SQL Server) that points to the "active" table of the two (active meaning the table that you currently write to). Use this synonym for the inserts in your application; or, instead of using the synonym, the application could just change which table it writes to every x days.

Every x days you can truncate the old/inactive table and afterwards recreate the synonym to target the truncated table (if you use the synonym solution), so effectively you are partitioning the data by time.

You have to synchronize the switch of the active table. I automated this completely by using a shared app lock for the application and an exclusive app lock when changing the synonym (i.e., blocking the writing application during the switching process).
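
As a rough illustration only (not the author's actual code; the object names EmailList_A, EmailList_B, dbo.EmailListActive and the app-lock resource name are hypothetical), the switch step could look roughly like this:

BEGIN TRAN;

-- Block writers that take a shared lock on the same resource name.
EXEC sp_getapplock @Resource = 'EmailListSwitch',
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction';

-- Empty the inactive table (minimally logged compared to a huge DELETE).
TRUNCATE TABLE dbo.EmailList_B;

-- Repoint the synonym so new inserts go to the freshly truncated table.
DROP SYNONYM dbo.EmailListActive;
CREATE SYNONYM dbo.EmailListActive FOR dbo.EmailList_B;

COMMIT;   -- releases the app lock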

If changing your application's code is not an option, consider using the same principle, but instead of writing to the synonym you could create a view with INSTEAD OF triggers (the insert operation would insert into the "active" partition). The trigger code would need to synchronize using something like the app lock mentioned above (so that writes during the switching process still work).
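
Again purely as a hedged sketch (view, trigger, and table names are hypothetical, only the EmailId column is shown, and the switch step would also have to re-point the view and the INSERT target):

CREATE VIEW dbo.EmailListActive AS
    SELECT EmailId FROM dbo.EmailList_A;   -- re-pointed during each switch
GO

CREATE TRIGGER dbo.trg_EmailListActive_Insert
ON dbo.EmailListActive
INSTEAD OF INSERT
AS
BEGIN
    -- A shared app lock makes inserts wait while the switch (which holds an
    -- exclusive lock on the same resource) is running.
    EXEC sp_getapplock @Resource = 'EmailListSwitch',
                       @LockMode = 'Shared',
                       @LockOwner = 'Transaction';

    INSERT INTO dbo.EmailList_A (EmailId)
    SELECT EmailId FROM inserted;
END
GO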

My solution is a little more complex, so I currently cannot post the code here, but it works without problems for a high-load application and the switching/cleanup process is completely automated.

Bernhard Kircher
  • Deleting large amounts by date in a scenario like this is trivial and runs in sub-second time without any issues. There are particular technologies developed for this, i.e. table partitioning. Anyone dealing with huge data amounts should know his way around this. – TomTom Jul 06 '12 at 20:18
  • thanks. it sounds like a good idea. will check it later. right now, I want to figure out whether there is a simpler way to achieve the goal, so it is unnecessary to change too much of the C# code. still, I appreciate you pointing out a promising solution. thanks again. – urlreader Jul 06 '12 at 20:28
  • @TomTom yes there are technologies, but if you are using MS SQL this requires Enterprise edition, which was not available in my situation (other db systems have similar concepts too, but I assumed that this is MS SQL Server and not Enterprise edition - which may be wrong). But this is a good point - if there are technologies available, use them. – Bernhard Kircher Jul 06 '12 at 20:41
  • Then, sorry, you do not have HUGE amounts of data. You do not even have VERY LARGE amounts of data. Sure, small databases have problems here, but in my world "huge" is "a billion rows or more", or more exactly "100 million rows per day, stored for some years". It is larger than very large, and VLDB is a defined size. – TomTom Jul 06 '12 at 20:42
  • Yes, this is far from my experience, and yes, I assume in these scenarios you have the budget and infrastructure to handle these kinds of problems. But I still think that there are some scenarios where my post may be helpful. And as you mentioned - huge/large are relative values... – Bernhard Kircher Jul 06 '12 at 20:51
  • very large is NOT. VLDB is an accepted database size these days - it is used in a lot of metrics. – TomTom Jul 06 '12 at 21:10

Have you tried partitioning by date? Then you can just drop the partitions for the days you are not interested in anymore. Given a "huge" database you surely run the Enterprise edition of SQL Server (after all, huge is bigger than very large), and that has table partitioning.
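
Purely as a hedged sketch of the idea (all names are hypothetical, and at the time of this question table partitioning requires Enterprise edition; it only became available in all editions with SQL Server 2016 SP1):

-- Partition the data by date so old days can be removed without a huge DELETE.
CREATE PARTITION FUNCTION pfEmailDate (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2012-07-01', '2012-07-02', '2012-07-03');

CREATE PARTITION SCHEME psEmailDate
    AS PARTITION pfEmailDate ALL TO ([PRIMARY]);

-- The Emails table would then be created (or rebuilt) on psEmailDate(EmailDate).
-- An old partition is removed almost instantly by switching it out into an
-- empty staging table (same structure, same filegroup) and truncating that:
ALTER TABLE dbo.Emails SWITCH PARTITION 1 TO dbo.Emails_Staging;
TRUNCATE TABLE dbo.Emails_Staging;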

TomTom
  • thanks for mentioning table partitioning. will check it. however, the program is used by different users and I cannot assume they all use the Enterprise version. maybe I have to check the data size and SQL version in the C# code first, then use a different way to delete. but it seems there should be a better/more straightforward way to do this, although I could be wrong. thanks. – urlreader Jul 06 '12 at 20:45
  • Ah - no. It would be enough to stand up and say you do not have "huge amounts of data". Huge sounds larger than VLDB, which is like hundreds of gigabytes now - my last db was 21000 GB ;) You don't run that with the normal Standard edition. – TomTom Jul 06 '12 at 21:10
  • alright, fair enough. so, for a not-huge-data database, what's the suggestion? batch delete? do I need to manually do something between the batches, i.e. truncate? or let SQL Server do it automatically? thanks – urlreader Jul 06 '12 at 21:23
  • Delete - but not all of them, only 10,000 to 50,000 items per transaction, in a loop. And the log WILL go up - live with that, this is how it is done; tempdb will not grow with this approach. Run the loop until you delete only 0 items, then all the stuff is gone. – TomTom Jul 07 '12 at 02:47
  • I tried to use a loop to delete. The database recovery model is simple, and the batch size is 2000. It takes a much, much longer time to delete. Is that normal? If I increase the batch size, I would guess it will be faster. However, my question is: what number is a good choice for the batch size, so that it will not slow things down too much and will keep the log file size in a reasonable range? Will it be better if I check how many rows will be deleted first, then set the batch as 10% of that number? – urlreader Jul 10 '12 at 01:02