7

I am curious, should I ever delete anything? Right now I am building a site (for myself) which allows you to subscribe to users which you then receive a message every time the user has upload content.

Or comments, if there is a thread and someone writes a direct comment to your comment you get a message saying so. Should I ever delete those or just simply hide it?

Each subscription has three (64bit) int. id, commentId, recipientId. You can find out who wrote to you looking at the comment table via commentId. If i dont use delete it will have a 4th int saying the status (show, hidden/delete).

Should I leave them or delete them? If I should delete them, then why? I can see maybe when there is personal user that you must delete on request but other then that should I ever delete?

I don't know which SQL DB I'll be using.

-edit-

Thanks guys. Right now I'll delete nothing except for things i can generate. Such as that subscription thing about.

splattne
  • 28,508
  • 20
  • 98
  • 148
  • 1
    64bit INT's? How many posts are you expecting? A 64bit INT for a status code is also a bit OOT. Personally, I use CHAR(1) for status codes and use letters D=DELETED, N=NEW, U=UPDATED etc. It makes it easier to add new "states" later on. – Guy Jun 25 '09 at 09:58
  • @guy, AFAIK i can only have 64bit ints in sqlite3 db but i am very likely to be wrong and just not know how to create them properly. –  Jun 25 '09 at 10:53

8 Answers8

14

The company I work for offers software to people in certain regulated industries so generally I have the "never delete anything" attitude because if you delete anything you have lost the completeness of your audit trail. Instead mark the information as deleted (or move it to an archive version of the tables) and record who "deleted" it and when.

The only reasons to really delete stuff are

  • if you are running out of space (but disk is cheap these days)
  • for efficiency (but if your data structure is well indexed and not badly fragmented this will make little difference)
  • for legal reasons (if you are asked to remove someone's details you will most likely have to comply, depending on local data protection laws, or if the content itself infringes some law)

Your users may be thankful of nothing getting really deleted, if they accidentally delete something useful and you can get it back. And if an upset user who has previously provided valuable information to the site throws a hissy fit and deletes all their posts in vengeance you can retract the deletions easily.

One extra very important point: you shuold make it clear in your terms of service that information may not be really deleted when the user can not see it any more, and provide a route (if only "email x@x.x and ask for it to be done") for them to really delete data they have the right under relevant laws to ask to be deleted.

David Spillett
  • 22,754
  • 45
  • 67
  • Even moving something to a 'deleted' table can ruin your referential integrity. Just have a bit field for 'deleted' and maybe some auditing info (deleted date/time, user, etc) – Mark Henderson Aug 02 '09 at 21:33
  • 2
    If status (deleted, not deleted) changes very infrequently, for better performance you could partition the tables on this value. – Philip Kelley Aug 02 '09 at 22:33
  • I usually prefer the "marking deleted" approach, but sometimes (particularly when retro-fitting the not deleting of data to an existing structure that is used by a lot of existing code) a separate archive table can be more practical. – David Spillett Aug 02 '09 at 23:04
6

Typically, today's modern disk sizes and IO performance means you don't have to delete records to save space or maintain performance. Usually a "record deleted" field on the record can mark the record as deleted (or as other status's) with an audit trail.

Some industries mandate that you never delete "transactional" data for regulatory reasons. You would already know if you need to do this. If there is any payment information you will typically need to keep the data (or make the data available) for 7 years (UK accounting law).

For other purposes, there is actually a good reason to physically delete data.

If it's not there, it's not discoverable.

The Freedom of Information Act (in the UK) states that if the data is discoverable then it is included in the scope for any search. This includes "soft deleted" records and historical backups.

For some systems, we ensure we PURGE old records and reuse / destroy old backup tapes / files after 'so many' months to ensure that it's not available for FOI requests. (Servicing an FOI request that goes back several years and requires restoring of hundreds of old mailboxes from archive backups is VERY costly).

This is different from OPERATIONAL backups. We keep backups so we can restore in the event of disaster. We also have a "Records Store" for both paper based and electronic media that MUST be kept, and we copy out emails and such to that store.

Guy
  • 2,668
  • 2
  • 20
  • 24
  • 2
    In addition the Data Protection Act (again the UK) requireds that data is not kept for longer than is necessary, if a record that pertais to personal information is marked as deleted rather than deleted it is being kept, and by definition is no longer necessary so should be deleted. – Richard Slater Jun 25 '09 at 09:56
  • Absolutely (plus lame "minimum number of letters" filter) – Guy Jun 25 '09 at 10:00
0

My gut instinct is to never delete anything. You never know when you might need it. If I have to remove data from working tables for any reason I tend to move it to an archive table.

Having said that, this may be overkill if it's data for your own use and it's inconceivable there'd ever be any legal reason to see old data. You don't say that much about your application, but could one user demand to see old data on the grounds another use has libelled them?

JR

John Rennie
  • 7,776
  • 1
  • 23
  • 35
0

Deleting or not depends on the amount of resources that you have available and the amount of data you will collect. I've worked on projects before where deletions are not allowed. It just meant that all data items will get a start-date and an end-date. The data item would be valid during this period, not before, not after. Thus, you could "delete" something by setting the end-date to today.
Unfortunately, this also means that you would have to have to check the current date with this period for every data item that you want to select. With SQL, this would require an additional condition to your queries.
Actually, to make things worse, you could consider to even disable edits. When a data item gets edited, you just set the end-date to now and create a new data item with the same keys and the modifications. That way, you will collect a huge collection of data, but it will be very historical and nothing gets deleted. In this case, the start/end dates should also contain a time component. (And you have to worry about summertime when the clocks are turned backwards one hour.) But basically, your system would then only insert new items, not modify or delete anything.

Wim ten Brink
  • 1,055
  • 1
  • 6
  • 13
0

You have to decide if your data is worth saving for ever! Everyone say that disk is cheap, but thats not the whole truth. It depends on your storage solution and your environment.

If you use fibre channel disks on a SAN and you are running out of disk space, it is not cheap anymore when you need to add another diskarray due to lack of space in your array.

In your case it doesnt seem like you will be storing large amount of data, and diskspace may not be an issue, but how relevant is your data in 10 year?

Another thing to think of is overall performance, not only diskspace. I think it is a good idea to store historical data in another table or even in another database. This way i will have less maintenance etc. I know, there are other solutions to archive historical data, like partitioning, but if its data that is not used on a regular basis, why implement more complexity?

I have been working in large databases the last 6 years and the indexing strategy is crusial when you have a table with 500 000 000 records. :) If your query is using an index seek, but the index does not contain all data you need then a clustered index lookup will be used for every record you found in the index. Lets say you get 10 % of the table, you will end up with 50 000 000 clustered index lookup, and that is not cheap at all. It does not cost you money, but it will cost you performance.

/Håkan Winther

Hakan Winther
  • 481
  • 2
  • 5
0

Reasons why you shouldn't delete something:

  1. You might want it later

Reasons why you should delete something:

  1. You want to ensure no unauthorized person can read it again (e.g. a stored credit card number: if you erase it an intruder can't get it)
  2. You want to ensure that the information can't be requested from you (e.g. through Freedom of Information Act requests)
  3. You want to keep the data size small for space or speed reasons (proper indexing and partitioning can help with the speed issue).
  4. You are required to delete it by law (e.g. privacy laws).

It's always a trade-off, but the legal implications of keeping too much data around are important. Privacy and security are things often overlooked these days. Actual database performance may not require deleting data, unless the data sets are huge. Even a table with millions of rows and dozens of columns may not need deleting if you partition it properly and ensure your queries always use the proper partitions. As for a court order or FOIA request asking you for stored data, well, only you can decide how you feel about that, and how your customers feel. One reason I limit my use of Gmail is precisely this reason: my data is stored in the US (I'm in Canada) and US agencies can potentially access even my deleted mail.

Also keep in mind that privacy, security and FOIA laws vary from country to country; you need to be aware of these laws in every country where you operate. Maybe if your servers are all in one country that limits the reach of foreign laws, but maybe it doesn't. Consult a lawyer if your data is sensitive.

0

The question you really have to ask yourself is this: Is the cost of keeping the data (increased storage costs, liability of keeping data that can be deleted) cheaper than the cost of deleting the data (man-hours to write the deletion query, liability of deleting data that needs to be kept, and the possibility of downtime or reduced performance due to running the deletion query)? Whichever is cheaper, go with it.

phuzion
  • 2,213
  • 1
  • 19
  • 23
  • 1
    "The question you really have to ask yourself is"... do you feel lucky, punk? Well do ya? – Chris Aug 02 '09 at 16:01
0

One case where I can see archiving off-line and/or deleting data is when you run an OLAP query to summarize data, and store it in a summary table.

Monthly website statistics are a great example of this. Once you've generated a number of page views for June 2009, that's never going to change. And it's faster to add all of the page views from the summary table, then scan the table that contains the on-line transactions from the current month, than it is to scan an entire year's worth of logs and generate an entirely on-line report.

If it were me, I would be sure to copy the on-line table to 'june 2009', run the summary query and save the data to the summary table, and then archive the copied on-line table before deleting all entries from the original on-line table. But I'm also somewhat paranoid!

Generally, anywhere where it's more efficient to use OLAP to generate a summary against data that's static from that point forward, it is possible to archive/delete old data. Otherwise, no, I use a delete flagging system to avoid breaking relational integrity with my typically-extensive activity logging systems.

Karl Katzke
  • 2,596
  • 1
  • 22
  • 24