
I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.

The obvious issue is that storing multiple backups of that much data is prohibitively expensive using enterprise-class storage - heck, even just RAID-5.

Options I see are as follows:

  1. Create a mirror copy of the data in another data-center, and continually ship differences to it (using whatever mechanism is available for your data source - e.g. log-shipping or database mirroring with SQL Server)
  2. Take regular backups using a hefty compression algorithm (probably only suitable if the data lends itself well to being heavily compressed)
  3. Take piecemeal backups of the critical/changing parts of the data (see the T-SQL sketch after this list for what #2 and #3 might look like).
  4. Don't back up the data and trust the corruption-gods.
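
For anyone who doesn't live in SQL Server every day, here's a minimal T-SQL sketch of what #2 and #3 can look like (all database, filegroup, and path names are invented; WITH COMPRESSION assumes SQL Server 2008's native backup compression or a third-party equivalent):

    -- Option 2: full backup with compression (invented names and paths).
    BACKUP DATABASE BigDB
    TO DISK = N'Y:\Backups\BigDB_Full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;

    -- Option 3: piecemeal - back up only the filegroup that actually changes,
    -- assuming the static data has been isolated into its own filegroup.
    BACKUP DATABASE BigDB
    FILEGROUP = N'CurrentData'
    TO DISK = N'Y:\Backups\BigDB_CurrentData.bak'
    WITH COMPRESSION, CHECKSUM;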

I'm seeing option #4 being adopted as the default, and as an HA/DR expert I find that really scary, but what do I advise as an alternative? I think #1 is the best approach, but "I don't think so" is the usual answer when any alternatives apart from #4 and possibly #3 are suggested.

Now, of course it depends on the change-rate and criticality of the data. No need to answer with that, as I used to be responsible for all the HA features of SQL Server while I worked at Microsoft, so I'm well-versed in the 'it depends' arguments - that's my catch-phrase :-)

I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.

Thanks in advance - due credit will be given to all well thought-out and expressed answers.

Paul Randal
  • Having some idea of the scale of updates to the database(s) would make a difference in backup options. – Dave Dustin May 31 '09 at 07:20
  • And the follow-up question - Is there a good way to restore a backup of a petabyte database? – Rob Boek May 31 '09 at 07:22
  • "it depends" is also Joel Spolsky's catch phrase. You may have to fight him for it! – Nick Kavadias May 31 '09 at 11:35
  • I just love how all the responses bypass the main question of "how to store the data" with "why do you need to store the data?" It's like that joke about the hammer: Do you have a hammer I could borrow? Why do you need it? I need to hammer a nail. Why do you need to do that? To hold down the roof. Why do you need a roof? So that the rain does not pour into my house. Oh - no, sorry, I don't have a hammer. – Andriy Drozdyuk Jun 01 '09 at 20:28
  • Drozzy - but that's an orthogonal question to what I'm asking. Assume they need to store the data and the vast majority needs to be online. Think Hotmail for instance, one of our customers. – Paul Randal Jun 01 '09 at 20:54
  • You don't back up petabyte-scale storage; you use a forward error correction mechanism (like erasure coding) instead and make the storage itself fail-safe. Erasure coding is a generalization of RAID 6 and can withstand multiple failures at the cost of a reasonable data overhead. – Kozuch Apr 08 '17 at 13:47

12 Answers


Off-the-wall idea - is all of the stored information needed or even useful?

How much is the information actually worth? It seems obviously ridiculous to spend more in upkeep and management than the data is worth.

Is the data in the database appropriate for storage in a database? For example, does keeping compressed multi-gigabyte core files in the support organization's database really provide any actual benefit?

Is there a lot of duplicated data in the database? For example, are a thousand people keeping ten copies each of a weekly 10MB newsletter?

Does some of the data have an "expiration date" after which it does not provide any value? Returning to the support organization example, for various reasons there is virtually no benefit in keeping around customer core files more than a few months after a fix has been delivered.

Another thought - is keeping that much data opening the company up to liabilities? Some data one must, by law, keep. Some data, however, should be "shredded" because of the risks posed if it is accidentally, or maliciously, released to inappropriate parties.

pcapademic

Yeah, another option is storage virtualization: a device that sits between your servers and the SAN, like IBM SVC. SVC manages SAN-to-SAN copies, and can do remote replication (although that's obviously pretty painful at the petabyte level unless you have really low data change rates and really high bandwidth.)

The slick part is that the whole process is invisible to the servers involved. If you're using SQL Server, you design your filegroups to keep things with a low change rate together (like sales archives from >3 years ago), and things with a high change rate (like current sales) on a separate filegroup. They don't even have to be completely read-only - you just want to design it so that you can use different replication methods for each filegroup. The SAN gear can sync LUNs via network, tape, or via SANs - meaning, you can ship parts of the SAN back and forth. This is more effective with gear like LeftHand's, where the SAN is made up of a pool of participating units.
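
A minimal T-SQL sketch of that kind of filegroup split (all names invented):

    -- Invented names: one filegroup for cold archive data, one for hot data,
    -- so each can be backed up and replicated on its own schedule.
    ALTER DATABASE Sales ADD FILEGROUP ArchiveFG;
    ALTER DATABASE Sales ADD FILE
        ( NAME = SalesArchive1, FILENAME = N'F:\Data\SalesArchive1.ndf' )
        TO FILEGROUP ArchiveFG;

    ALTER DATABASE Sales ADD FILEGROUP CurrentFG;
    ALTER DATABASE Sales ADD FILE
        ( NAME = SalesCurrent1, FILENAME = N'G:\Data\SalesCurrent1.ndf' )
        TO FILEGROUP CurrentFG;

    -- The archive table lives on the low change rate filegroup...
    CREATE TABLE dbo.SalesArchive ( OrderID int, OrderDate datetime, Amount money )
        ON ArchiveFG;

    -- ...and the current table on the high change rate one.
    CREATE TABLE dbo.SalesCurrent ( OrderID int, OrderDate datetime, Amount money )
        ON CurrentFG;

Once the archive filegroup is marked read-only, it only has to be backed up (or shipped) once.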

Then you can sync the low change rate stuff over the wire automatically, and sync the high change rate stuff with sneakernet. (Sounds like I've got that backwards, but it's true - you can't sync the high change rate stuff over the wire due to volume.) Even some of the low-end gear accommodates this now: LeftHand lets you replicate to other LeftHand units in your datacenter, and then ship them to your offsite datacenter. Plug 'em in, join them to the remote side by changing IPs and groups, and now they're part of your remote backup SAN. The LeftHand sales pitch on this is just brilliant: set up your two SANs side-by-side in your primary datacenter, get them in sync, then you can ship parts of them over to the remote datacenter while some of them stay in your current datacenter to keep in sync. Gradually move 'em over without getting way out of sync.

I haven't done this at the petabyte level, though. You know what they say - in theory, theory and practice are the same. In practice...

Brent Ozar
  • Hi Brent, is there hardware available which compresses data at the SAN level? – SuperCoolMoss May 31 '09 at 14:46
  • SuperCoolMoss - yep, absolutely. NetApp bundles dedupe into its SANs for free now, for example. Check with your SAN vendor and ask what dedupe solutions they offer. – Brent Ozar Jun 01 '09 at 11:54
  • And you're welcome, Paul. :-D – Brent Ozar Jun 01 '09 at 11:54
  • We were running the incipient virtualization software for a while. Ended up uninstalling from the switches due to some issues. Sounded great, but didn't work out for us. – Sam Jun 01 '09 at 23:36

Option 1 is mirroring, which is almost as bad as #4: any bug that corrupts data, and isn't discovered immediately, will corrupt both copies.

If the data is critical, consider dedicated solutions; read about IBM's Shark products, for example, or competing products from EMC, etc. They have features like FlashCopy, which let you instantly create a logical copy of the file without doubling disk requirements; you can then back up this copy to (e.g.) tape. Look into robotic tape backup as well.
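
(For comparison, SQL Server's own database snapshots give a similar copy-on-write "logical copy" at the database level - a rough sketch with invented names below, assuming a single data file - though, as the comment below points out, a snapshot still depends on the health of the source database, unlike a hardware split mirror.)

    -- Invented names: a copy-on-write snapshot of BigDB. Only pages changed
    -- after creation are copied into the sparse snapshot file, so disk isn't
    -- doubled - but the snapshot is useless if the source database is damaged.
    CREATE DATABASE BigDB_Snap_20090531
    ON ( NAME = BigDB_Data, FILENAME = N'E:\Snaps\BigDB_Data_20090531.ss' )
    AS SNAPSHOT OF BigDB;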

  • Database mirroring in SQL Server ships log records, not physical pages, so most corruptions don't get copied to the mirror. Yup, anything that allows a split-mirror+backup to be taken works, but you're still left with the problem of where to put the damn thing if it's a PB. But anything that is diffs-only-from-the-original (e.g. db snapshots in SQL Server) is heavily susceptible to corruption of the underlying source data, making the diffs useless too. Have you tried storing a PB on tape + restoring it during disaster recovery? *Days* of downtime :-( Although still better than total data-loss. Thanks for the answer! – Paul Randal May 31 '09 at 06:48

Point out to those that want to store a Petabyte of data that storage ain't cheap.

I get so fed up with people moaning about not having an extra Terabyte of online storage because disc is cheap - disc may be, but managed storage sure as hell isn't.

If it's prohibitively expensive to store the backups then it's prohibitively expensive to store the data in a safe manner, so the proposed solution isn't viable.

One of the most important reasons for having backups is protection from user error (most hardware failure problems can be dealt with by hardware solutions) but even database mirroring is no protection against a dropped table (OK, you can protect against that, but it's still possible to get unremovable guff into your DB - unless the reason the DB is so big is that it only ever issues inserts).

As I see it, tape is no longer a viable solution - it is now cheaper to just work with disc arrays (though physical storage can be awkward). So I think your only option is some method of splitting the data into chunks small enough to be restored in a sensible timeframe, and then getting them onto disc storage on a regular basis (and here EMC-type solutions can help, if you've got the cash).
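
If those chunks are SQL Server filegroups, a piecemeal restore is one way to get the critical part back in a sensible timeframe - a rough sketch with invented names (Enterprise edition can do the later stages online):

    -- Stage 1: bring back just the primary and hot filegroups (invented names),
    -- so the database is usable again quickly.
    RESTORE DATABASE BigDB
        FILEGROUP = N'PRIMARY', FILEGROUP = N'CurrentData'
        FROM DISK = N'Y:\Backups\BigDB_Partial.bak'
        WITH PARTIAL, NORECOVERY;
    RESTORE LOG BigDB
        FROM DISK = N'Y:\Backups\BigDB_Log.trn'
        WITH RECOVERY;

    -- Stage 2: restore the cold archive filegroup later, at leisure.
    -- (If it was read-only when backed up, no further log needs to be applied.)
    RESTORE DATABASE BigDB
        FILEGROUP = N'ArchiveData'
        FROM DISK = N'Y:\Backups\BigDB_Archive.bak'
        WITH RECOVERY;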

  • Yup - I'm proposing option #3 more and more - use data-based partitioning of the data if you can and only backup the most recent data frequently - but you'd be surprised at the number of people who want to support VLDBs with archaic schemas and still expect to be able to efficiently backup, manage, and maintain the data. I'd have to agree with you about tape, for VLDBs you may as well go with disk and pay the cost as a trade-off against fast recovery time. Thanks for the answer! – Paul Randal May 31 '09 at 07:08
  • I agree. If you can't afford a backup solution, you can't afford the storage. Too many people see storage as just the price of the disks. – Mark Henderson Jun 18 '09 at 21:24

Interesting video detailing myspace.com's architecture (SQL2005 backend). Not sure if they have individual petabyte dbs as they scale out with multiple dbs. They use SAN snap backups.

http://wtv.watchtechvideos.com/topic70.html

SuperCoolMoss

ZFS. Sure, it's still just getting started, but there are a number of areas where ZFS is designed to handle just this sort of thing. First off, there's its ability to handle a large amount of data, as well as a multitude of different storage devices (local, SAN, fiber, etc.), all while keeping data safe with checksums and "layer violating" awareness of device health and failures. How, though, does this help with backing up this much data?

One method is to use snapshots. Take a snapshot, and send it to tape/disk/net for transfer to the remote site. Subsequent snapshots only send the data that has changed since the last one, and you can keep live data on both ends if need be.

The other is to use Solaris Cluster software, where (so long as you have sufficient network bandwidth) you can have live mirroring between two servers, and if one goes down, the second can take over. It's more for use where high availability (HA) is important, but I would guess that most places with that much data want HA.

And if you say that ZFS isn't supported on Windows - the usual place you might find SQL Server - maybe you run Sun/ZFS on the backend and connect via iSCSI. Maybe that's a horrid idea also, but it's at least worth giving some thought so you know what not to do.

jasonrm

Have you looked into Amazon Glacier as an option?

alex9183

IMO, unless you have some kind of godzilla-level hardware, if you have that much data you should be using a backup compression technology. I'm most familiar with LiteSpeed, but there are similar products from other vendors and (of course) a similar feature is built into SQL2008. You might not get 10-to-1 compression, but it does cut storage requirements for the backup down, and can also shrink your backup window requirements. If your goal is to keep multiple backup sets (yesterday plus the day before that, plus one from last week and one from last month, or a series of differentials plus fulls, which can get plenty big if you change a lot of the data in the database), it's a simple matter of storage space.
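
As a rough illustration of the "differentials plus fulls" rotation (invented names and paths; WITH COMPRESSION is the native SQL Server 2008 syntax, third-party tools have their own):

    -- Weekly compressed full, daily compressed differential (invented paths).
    -- A differential contains only the extents changed since the last full,
    -- so the daily files stay small when the change rate is modest.
    BACKUP DATABASE BigDB
    TO DISK = N'Y:\Backups\BigDB_Full_Sun.bak'
    WITH COMPRESSION, CHECKSUM, INIT;

    BACKUP DATABASE BigDB
    TO DISK = N'Y:\Backups\BigDB_Diff_Mon.bak'
    WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, INIT;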

Filegroup-based backup (IOW, put non-volatile data onto certain FGs and then back them up infrequently) never seems to fly, because devs or users won't or can't decide what data is volatile and what isn't, and in brownfield scenarios you often can't take the risk.

If a failover site is a requirement, in addition to thinking about database mirroring you might want to talk to your clients' storage vendor to see if they offer something like SRDF, which is a hardware-based data replication technology. Naturally, replication (of any sort, but particularly realtime or near-realtime replication) is not a substitute for backups.

Darin Strait
  • I'm really looking forward to the time when I can get a data dedup storage solution. It's not going to be any time soon, but the nature of my data would probably lead to a cut in size-on-disk of like 75% – Matt Simmons May 31 '09 at 12:52
  • Yup - backup compression is my option 2, but often another DC is required. I like the idea of having a remote SAN with different ways of syncing LUNs. Thanks – Paul Randal May 31 '09 at 15:21

I don't think you have much of a choice here on tape v. disk. Tape won't likely cut it in a regular backup window unless you stripe it, and I'm not sure the reliability is there.

So you are down to disk backups. Are you versioning? Meaning do you worry about going back to backup 2 (current db minus 2 backups)? Or backup 3? In that case, you might have issues, but likely what you have to handle is log backups, not so much data backups.

If you can split off some of the data as read-only/non changing, then perhaps you have manageable backup sizes/windows. Or at least you are hoping that backup technology and bandwidth is catching up with data growth.

I don't think you're backing up as much as you are keeping a 2nd copy around in order to recover from issues with your primary. That means hardware, corruption, etc., and you are praying daily that errors aren't being shipped to the second copy. The copies are most likely being made SAN-to-SAN with some snapshotting technology, although the original copy might be via FedEx rather than across the wire. Bandwidth to move 100TB is not easy to come by for anyone.

I think you need a combination of 1, 2, and 3 (not 4), with excellent log backup management.

Actually I think that at any point in time you are really looking at 3 copies of your data. Running CHECKDB on 1 of the copies while the 2nd copy is being used to actually receive changes. Then you snapshot that 2nd copy to the first and continue. With this much data, I'd imagine that you would need some diligence here. Paul, how does checkdb work on a multi-user, 100TB db that is online?
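
For reference, the kind of check meant here, run against the offloaded copy rather than production (invented name; PHYSICAL_ONLY trades depth for speed, which matters at this scale):

    -- Run against the restored/offloaded copy, not the production database.
    DBCC CHECKDB (N'BigDB_Copy') WITH PHYSICAL_ONLY, NO_INFOMSGS;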

As mentioned, aren't log backups, and probably a log reader, critical? Don't you need to recover drop tables/user error from the logs rather than a backup? You can potentially shortcut this by sending SAN copies through some delay, but I haven't seen that technology. A Log Shipping SAN that can delay changes by 4 hours (or some interval) to allow you to recover from issues before overwriting data. Or some log-reader-of-SAN-block-changes tool? Without that, you need to manage those transaction logs, which might be a whole other level of tracking those backups on various file systems for some xxx hours to allow you to potentially recover from non-fatal errors.
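
Without a log reader, the fallback for the dropped-table case is a point-in-time restore of the full-plus-log chain into a side database - a rough sketch with invented names, logical file names, and times:

    -- Restore a copy to just before the accidental DROP TABLE (invented names).
    RESTORE DATABASE BigDB_Recover
        FROM DISK = N'Y:\Backups\BigDB_Full.bak'
        WITH NORECOVERY,
             MOVE N'BigDB_Data' TO N'Z:\Recover\BigDB_Data.mdf',
             MOVE N'BigDB_Log' TO N'Z:\Recover\BigDB_Log.ldf';

    RESTORE LOG BigDB_Recover
        FROM DISK = N'Y:\Backups\BigDB_Log_1400.trn'
        WITH STOPAT = N'2009-05-31 13:59:00', RECOVERY;

Then you copy the dropped table back across from the side database - which at 100TB is exactly the kind of pain that makes careful log management worth it.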

Steve Jones
  • Hey Steve - some customers need versions, some don't. Depends on how advanced their HA/DR thinking is and how much money they have. CHECKDB on a 100TB database? No idea - I never tested it above several TB and AFAIK it hasn't been tested > 10 TB. I'd love to hear how it does in 2005/2008. Thanks – Paul Randal Jun 01 '09 at 20:48
  • Hey, you're the guy that should ask for a test. Maybe Mr. Cox at SQLCAT can run one. The HA/DR situation matters. Amazon might not care about versions. Others might depend on legal/regulatory issues. It's something to think about. – Steve Jones Jun 02 '09 at 02:28

Technically, storage is cheap, but at the petabyte level, not so much. It really depends on the application, but I'd say some combination of strategy #2 and #3 is going to be the answer, with #2 a given and #3 depending on how much investment you can make in storage, and the kind of storage and IO/computational power that will let you get away with as little incrementalism and as much discrete, full backup as possible.

Alternatively, something like Amazon S3 may also come into play depending on your bandwidth and how much change there is in the data -- at this volume, putting at least some of it on someone else's servers and letting them worry about redundancy gets more and more cost effective.

nedm
  • I've got to agree with the person who asked the question. Storage is cheap. /Managed/ storage is expensive as hell. – Matt Simmons May 31 '09 at 12:34

Speak to your storage vendor; they'll have a deduplication product that they've used before, and combined with regular compression you can often reduce your data footprint by 70%. Of course anyone with the money to spend on a petabyte of storage is also likely to have the budget to buy a decent backup solution - if they don't, you just need to ask them what losing that petabyte would cost their business.

Chopper3
  • Yup - had the compression as option 2, and most of these customers don't have a lot of duplication in their data. Disagree about the extra money - sometimes (and often) the data volume growth outstrips budget for redundant storage. Several Fortune-100 companies I work with are in that state for some of their applications. – Paul Randal Jun 01 '09 at 20:50

In a large enterprise data warehouse, much of the data comes from sources that are already backed up. I've worked on Teradata and ODW installations where they've taken option #4, knowing that they could restore a day or two of transactional data and transform it from the source systems.

At one retail client (at the time they had one of the top 5 largest DWs in the world, at about 200TB ... which gives you an idea of how long ago this was), they went with option #1 after purchasing a new petabyte-class Teradata server. The old nodes would be used for a snapshot of the prior day's system, while the new one maintained the existing system. This was also nice from a failover perspective - every once in a while they'd take the whole thing down for maintenance and we'd just switch over to using the old slow server with day-old data.

Honestly though, it seemed like a big waste of processing/storage/etc to keep the thing going ... particularly when the biggest advantage was that their admins and NCR techs had to work fewer evenings to perform irregular maintenance.

Beep beep