
I'm really new to this topic, so apologies in advance for any stupid questions.

I have a school project and I want to know how to store 250TB of data with an 18-month life cycle: every record is kept for 18 months and can be deleted after that period.

There are 2 issues:

  1. store data
  2. backup data

Due to the amount of data I will probably need to combine data tapes and hard drives. I'd like to have "fast" access to the most recent 3 months of data, so ~42TB on disk. I really don't know which RAID I should use, or whether there is a better solution than combining disks and data tapes.

Thanks for any advice, article, anything. I'm getting lost.

luccio
  • What kind of school project are you doing that has a budget that'll cover 250TB of storage?! – ceejayoz Apr 13 '12 at 15:23
  • With that amount of data and requirements I think you should get a storage expert involved. Choosing the RAID is the easy part. – Lucas Kauffman Apr 13 '12 at 15:23
  • Fast access? While you can use drive storage for this, you usually wouldn't need to constantly go back to look at backed-up data. That's what the live storage space is for. Or you have very poor management over not deleting data you need. With tapes, the backup software will usually tell you which tape you need to insert, and it'll pull the data back off without much time wasted. – Bart Silverstrim Apr 13 '12 at 15:28

2 Answers


250TB is a lot of data. I will give you an example of how I would accomplish this task in the enterprise, in a way that is fairly budget-conscious (since I assume you want this on the cheap) yet not overly concerned with finding the best free products for the job.

Just an FYI - I am writing this with 8 years of professional experience in both the storage world and the backup/disaster-recovery world.

I take it this school project is more about writing up how to do this, rather than actually doing it?

First of all, the storage.

Since you did not mention any specific availability or redundancy requirements, I would suggest building a basic JBOD array of "nearline" 3TB SATA disks. At your estimate of 42TB online you would need at least 14 of them, ignoring RAID overhead. For example, if you chose RAID-6 with a 16-disk RAID group size, you would need at least 16 disks to get 42TB usable, and you would still have no hot spares. Until I had a better idea of your reliability, performance, redundancy, and availability requirements, I could not recommend specific disk types, RAID levels, or controllers.
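
To make the arithmetic explicit, here is a quick back-of-the-envelope sketch in Python (a check, not a sizing tool; the 3TB drive size and the 16-disk RAID-6 group come straight from the paragraph above):

    import math

    TARGET_TB = 42     # the "fast" online tier from the question
    DISK_TB = 3        # nearline 3TB SATA drives
    RAID6_PARITY = 2   # RAID-6 spends two disks per group on parity

    # Raw capacity, ignoring RAID overhead entirely:
    raw_disks = math.ceil(TARGET_TB / DISK_TB)
    print(raw_disks)   # 14 disks

    # One 16-disk RAID-6 group: 14 data disks + 2 parity disks.
    group_size = 16
    usable_tb = (group_size - RAID6_PARITY) * DISK_TB
    print(usable_tb)   # 42 TB usable, and still no hot spares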

In its very simplest form you could build an array like this using fairly cheap commodity hardware and Linux, along with some open-source tools like LVM, FreeNAS, OpenFiler, etc. Beyond that you are starting to get into the pricey enterprise storage space.

Also keep in mind that using cheap commodity hardware for this doesn't address redundancy concerns beyond the disks themselves (power supplies, controllers, operating system, etc.).

In the enterprise space I will assume you need substantial read/write performance and high availability. As an example, you could use a NetApp enterprise storage array with highly available clustered redundant controllers. Attached to these would be drawers of 24 600GB 15k RPM SAS disks. To get 42TB out of a setup like this, which would perform extremely well and be highly available and redundant, you would need (assuming 64-bit NetApp aggregates with a size limit above 16TB) an aggregate containing roughly five 16-disk RAID groups at the default RAID-DP (double-parity) RAID level.

That's at least 80 600GB 15k RPM SAS disks across 4 shelves of storage, attached to the redundant controllers.
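
The disk math behind that figure, as a sketch (RAID-DP, like RAID-6, reserves two parity disks per RAID group; the 24-disk shelf size comes from the drawer description above, and real aggregates lose some further capacity to filesystem overhead and spares):

    import math

    DISK_TB = 0.6     # 600GB 15k RPM SAS drives
    GROUP_SIZE = 16   # disks per RAID-DP group
    PARITY = 2        # RAID-DP: two parity disks per group
    SHELF_SIZE = 24   # disks per shelf

    usable_per_group_tb = (GROUP_SIZE - PARITY) * DISK_TB  # 8.4 TB
    groups = math.ceil(42 / usable_per_group_tb)           # 5 groups
    disks = groups * GROUP_SIZE                            # 80 disks
    shelves = math.ceil(disks / SHELF_SIZE)                # 4 shelves
    print(groups, disks, shelves)                          # 5 80 4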

At this point you are in need of racks and some serious power and cooling, and your budget well exceeds $200k.

Now for archiving.

You have a plethora of options here; there are literally countless products and methods you could use to accomplish this part of your task. As such, I am going to write from the perspective of a specific application which I know can do this job well: IBM's Tivoli Storage Manager (TSM). I am also going to assume that you don't have any off-site disaster recovery requirements and simply need to store lots of data once disk has become too expensive.

So to set up TSM you need another server, as well as some number of tape drives and/or an automated tape library (ATL).

The server where the data is mounted would run a TSM client, and you can schedule standard backup jobs or archive jobs depending on your needs. This scheduled job could be scripted or otherwise set up to archive data to tape and subsequently delete it from disk, making it available offline on tape. For example, you could have the script archive any data older than 90 days to tape and then delete it. This is another area where there are countless ways to accomplish the task.
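
As a rough illustration of that 90-day sweep, here is a minimal Python sketch. The data path is hypothetical, and while "dsmc archive -deletefiles" mirrors the TSM command-line client, treat driving it per file from a script as an assumption - in practice you would more likely lean on TSM's own scheduler and archive copy groups.

    import os
    import subprocess
    import time

    DATA_ROOT = "/data"   # hypothetical mount point of the online storage
    MAX_AGE_DAYS = 90     # keep ~3 months "fast" on disk, per the question
    cutoff = time.time() - MAX_AGE_DAYS * 86400

    for dirpath, _dirs, filenames in os.walk(DATA_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                too_old = os.path.getmtime(path) < cutoff
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if too_old:
                # Assumed TSM client call: "dsmc archive" copies the file
                # to the archive pool (tape); -deletefiles removes it from
                # disk only after a successful archive.
                subprocess.run(["dsmc", "archive", path, "-deletefiles"],
                               check=True)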

As for the hardware side of things, LTO tape might be the best option; LTO-5 can hold around 1.5TB of uncompressed data per cartridge. Since you need over 200TB of data on tape, with the other ~42TB on disk, you are looking at needing at least 140 tapes for this project.
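
A quick sanity check on that cartridge count (a sketch using LTO-5's native, uncompressed capacity; compression and how full each tape actually gets will change the real number):

    import math

    total_tb = 250        # everything under the 18-month life cycle
    online_tb = 42        # ~3 months kept on disk
    lto5_native_tb = 1.5  # LTO-5 native capacity per cartridge

    on_tape_tb = total_tb - online_tb                # 208 TB
    tapes = math.ceil(on_tape_tb / lto5_native_tb)   # 139 cartridges
    print(on_tape_tb, tapes)  # 208 139 -> call it 140 with a little slack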

Bringing it all together

So we have a storage array of some sort, and a "backup infrastructure" in place. Let's assume all of this life-cycle work is happening on one server. You need a way to tie it all together. Will the disk be attached to the server over a SAN? Over a network? What protocol will you use? All of these decisions impact what type of hardware you need. Looking at the tape requirements alone, you would likely need at least a small ATL, which would pretty much guarantee you need a Fibre Channel SAN, along with SAN switches, adapters, etc. On top of that, you would need network infrastructure for any network communication requirements.

The more I wrote, the more I realized there is no way this project could be real, so I got less and less specific. Keep in mind this was written with a number of wild assumptions and very conservative estimates. The TL;DR version is: you would need a ton of hardware, loads of expertise, and lots of money to get this done, even if done in the most unreliable, cheap way possible. If you need any more help or information, feel free to ping me.

WerkkreW
  • Project is more about analysis. Great reading, it is definitely a good start for me. Thanks! – luccio Apr 13 '12 at 18:50
  • All the info about the project requirements is in my first post. We have no more information, just the amount of data we need to manage. School project :) – luccio Apr 13 '12 at 18:59

Since this is a school project, I'm assuming that you don't need to actually build this, just spec it out. Either way though, you should read these two articles:

Petabytes on a Budget v2.0: Revealing More Secrets

Why you should never build a Backblaze pod

xofer