I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.
The obvious issue is that storing multiple backups of that much data is prohibitively expensive on enterprise-class storage - heck, even on plain RAID-5.
Options I see are as follows:
- Create a mirror copy of the data in another data center and continually ship differences to it, using whatever mechanism your data source provides - e.g. log shipping or database mirroring with SQL Server (there's a first sketch of this after the list)
- Take regular backups using a hefty compression algorithm - probably only suitable if the data lends itself well to being heavily compressed (second sketch below)
- Take piecemeal backups of only the critical/changing parts of the data (third sketch below).
- Don't back up the data and trust the corruption-gods.
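To make #1 concrete, here's a minimal sketch of a hand-rolled log-shipping loop in T-SQL (the database name and the UNC path are invented for illustration; SQL Server's built-in log shipping automates this same pattern):

```sql
-- On the primary: take frequent log backups to a share the DR site can reach.
BACKUP LOG VeryLargeDB
    TO DISK = N'\\dr-site\logship\VeryLargeDB_1200.trn';

-- On the warm standby in the other data center: apply each log backup in
-- sequence, staying in NORECOVERY so subsequent backups can be restored.
RESTORE LOG VeryLargeDB
    FROM DISK = N'\\dr-site\logship\VeryLargeDB_1200.trn'
    WITH NORECOVERY;
```

After the initial full restore, only the changes cross the wire rather than the full data volume, which is the whole point at this scale.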
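For #2, note that native backup compression only exists in SQL Server 2008 and later; on earlier versions you'd need a third-party backup tool. A minimal sketch, again with invented names:

```sql
-- Compressed full backup; CHECKSUM verifies page checksums as pages are
-- read, and STATS = 10 reports progress every 10%.
BACKUP DATABASE VeryLargeDB
    TO DISK = N'X:\backups\VeryLargeDB_full.bak'
    WITH COMPRESSION, CHECKSUM, STATS = 10;
```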
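And for #3, if the historical portion of the data can be carved into read-only filegroups, a partial backup strategy means the routine backups only touch what can actually change. A sketch, with an invented filegroup name:

```sql
-- Regular backup of just the read-write filegroups.
BACKUP DATABASE VeryLargeDB
    READ_WRITE_FILEGROUPS
    TO DISK = N'X:\backups\VeryLargeDB_rw.bak'
    WITH CHECKSUM;

-- One-off backup of a filegroup after it's been marked read-only;
-- it never needs backing up again unless it changes.
BACKUP DATABASE VeryLargeDB
    FILEGROUP = N'Archive2008'
    TO DISK = N'X:\backups\VeryLargeDB_Archive2008.bak';
```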
I'm seeing option #4 adopted as the default, and as an HA/DR expert I find that really scary - but what do I advise as an alternative? I think #1 is the best approach, but "I don't think so" is the usual response whenever anything other than #4, and possibly #3, is suggested.
Now, of course it depends on the change rate and criticality of the data. No need to answer with that - I used to be responsible for all the HA features of SQL Server while I worked at Microsoft, so I'm well-versed in the 'it depends' arguments. That's my catch-phrase :-)
I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.
Thanks in advance - due credit will be given to all well thought-out and expressed answers.