
I have a SaaS application running on 6+ servers in HPCloud which creates large amounts of data (GB/TB). Users talk to the application through a RESTful API, which replies with a link to our CDN where they can download their file.

My questions:

  1. From my research and a previous question here on SF, storing all generated data on some kind of centralized storage (e.g., via NAS/SAN) would be the best solution, so my CDN always knows where the files are to serve up -- which would also enable better scaling in the future. Since I'm on a cloud similar to Rackspace, what are my options for doing this?

  2. For my own reference, how do companies like MediaFire store TB/PB of data and load-balance their downloads at the same time? Do they just have tons of servers connecting to the same NAS/SAN?

UPDATE

Data requested by Ablue:

Are you creating files to be served by http? Yes these files will be primarily downloaded through HTTP

Do you need block level storage? Not currently, but in the future this may be the case

HOW MUCH STORAGE DO YOU WANT? Currently I could get away with having ~300GB, but I'd need to be able to scale out in the future

What sort of access speeds do you want/need? The faster the better for writing, but read times don't matter as much. The main thing here is that using a system like S3 increases latency because of how long it can take to copy data over.

Do you have a budget? Yes/No... for the cloud I'm in I can basically spin up 3-5 more servers with around 120GB storage each

Obto

2 Answers


TL;DR

1) On a cloud, there aren't many cheap options unless you want to go for an S3-like system. With a centralized system you can only scale so far before you start running into issues (see scaling up vs. scaling out), so if you are rolling your own solution you'd probably be best off starting with a distributed system that lets you add and remove servers on demand, rather than buying a big SAN and just adding disks to it.

2) They will almost certainly use dedicated hardware, co-located or in private datacenters. If you go to a storage provider and say "hey, I want to buy 2000 disks", they'll give you some pretty decent discounts if you know what you're doing. Storing 100TB of data will always be cheaper (per GB) than storing 100GB; the more you store, the cheaper it gets.


Have a look into a distributed data store like HDFS or Riak. I've never used HDFS, but we're running a Riak cluster on 4 nodes with 10TB of storage. Riak has an HTTP API, so with a little careful configuration you can just point your CDN at your Riak cluster. Alternatively, just use S3, Rackspace Cloud Files, Google Storage etc. and let someone else worry about it for you. Since pre-existing storage providers are already at multi-TB/PB scale, they can most likely do it cheaper than you would be able to rolling your own.
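For illustration, here's a minimal sketch of the "point your CDN at Riak" idea over its HTTP API; the node address, bucket, and key names are all hypothetical:

```shell
# Hypothetical Riak node address and object names -- adjust to your cluster.
RIAK_NODE="http://riak1.internal:8098"
BUCKET="downloads"
KEY="report.zip"

# Riak exposes objects at /riak/<bucket>/<key>; storing one is a plain PUT:
#   curl -X PUT -H "Content-Type: application/zip" \
#        --data-binary @report.zip "$RIAK_NODE/riak/$BUCKET/$KEY"

# ...and this is the URL shape a CDN origin pull would GET back:
OBJECT_URL="$RIAK_NODE/riak/$BUCKET/$KEY"
echo "$OBJECT_URL"
```

Because the objects are just HTTP resources, the CDN doesn't need to know anything Riak-specific; it fetches them like any other origin.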

That being said, BackBlaze (an online backup company) "open sourced" the designs for their storage "pods", which store ridiculous amounts of data very cheaply. They are more suited to "write once, sit there doing nothing for years", as is the nature of backups, but it's still an interesting read. You could also look into something like the BroadBerry storage servers; their top-end model has 36 hot-swap drive bays but costs $5k+ without drives (filling it with 2TB enterprise 7200RPM drives you're looking at more like $25k, or $15k with cheap drives; which to pick depends entirely on your workload). OVH provide some "backup" servers with ~20TB of un-RAIDed storage for around £200/mo, if I remember correctly.

You also need to think about tiered storage. Basically, this means you split your data up into "tiers" based on what you need. If some of your objects must be kept at all costs and need to be accessed quickly, they should be on top, or "gold", tier storage: fast, reliable disks on servers well equipped to handle the load. This might be the sort of thing you would put on a high-end SAN with lots of lovely SAS or even SSD disks. If you have objects which are re-generatable and don't need to be accessed quickly (say, thumbnails for images that are normally cached on CDN edges), you can put those on "silver" tier storage: cheaper disks on slower servers. Then you have your backups: while you may never need them, and they might not need to be available immediately, you want to keep them for as long as possible, as cheaply as possible. You might put those on "bronze" storage, like tapes.
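As a rough sketch of that gold/silver/bronze split, the tiering decision boils down to a couple of object properties. The tier names and rules here are made up purely to illustrate the idea:

```shell
# Pick a storage tier for an object based on whether it can be
# regenerated and whether it needs fast access. Hypothetical policy.
choose_tier() {
  local regenerable="$1" hot="$2"   # both "yes" or "no"
  if [ "$hot" = "yes" ] && [ "$regenerable" = "no" ]; then
    echo "gold"      # fast, reliable disks (SAS/SSD on a SAN)
  elif [ "$regenerable" = "yes" ]; then
    echo "silver"    # cheaper disks; can be rebuilt if lost
  else
    echo "bronze"    # cold archives, e.g. tape
  fi
}

choose_tier no yes    # prints "gold"
choose_tier yes yes   # prints "silver"
choose_tier no no     # prints "bronze"
```

In practice you'd attach the tier as metadata when the object is written, and let a background job move objects between tiers as their access patterns change.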

The storage levels I described are for a purely fictional situation, it's entirely possible to have 50 tiers of storage, and you can call them whatever you want. It might be that even your lowest tier of storage requires super-fast access, that all depends on your usage.

  • First, thanks for taking the time to explain all of this! Second, the main reason I'm not using a service like S3 is because copying the generated files over to it would drastically increase latency, but may be the best option for now. Third, if I could spin up 3 or so servers to make my own storage solution, what software/setup would you recommend? – Obto Feb 08 '12 at 09:42
  • @Obto Software is personal preference really. I've had some great experiences with Riak. Set up a cluster and put an NGINX server in front, pointing to all the Riak nodes as an upstream (make sure you only allow HEAD and GET requests, and restrict which URLs people can access so they can't get server statistics). You could also look into something like MongoDB's GridFS, HDFS (Hadoop Filesystem), or pretty much any NoSQL object storage system that lets you store blobs. –  Feb 08 '12 at 09:55
  • I'll try out Riak then and see how it goes. Do you know of any recommended tutorials etc. on setting this up? – Obto Feb 08 '12 at 10:16
  • @Obto If you go to Basho's site, they have a 'getting started' guide that shows you how to set up a cluster, but it is literally as simple as `riak-admin join nodename@server` to get started –  Feb 08 '12 at 10:17
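The NGINX front-end described in the comments above could look roughly like this; the node addresses, port, and URL pattern are hypothetical, and the point is just to allow only GET/HEAD on object URLs while hiding everything else:

```nginx
# Sketch: load-balance GET/HEAD object fetches across Riak nodes,
# and deny everything else (stats, admin URLs, writes).
upstream riak_cluster {
    server 10.0.0.11:8098;
    server 10.0.0.12:8098;
    server 10.0.0.13:8098;
}

server {
    listen 80;

    # Only expose /riak/<bucket>/<key> fetches.
    location ~ ^/riak/[^/]+/[^/]+$ {
        limit_except GET HEAD { deny all; }
        proxy_pass http://riak_cluster;
    }

    location / { return 403; }
}
```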

It is kind of important to know what files need to be accessed and how.

  • Are you creating files to be served by http?
  • Do you need block level storage?
  • HOW MUCH STORAGE DO YOU WANT?
  • What sort of access speeds do you want/need?
  • What level of service do you need to provide?
  • Do you have a budget?

When people want to store large amounts of data with low latency and high speeds, a SAN is normally used. Fibre Channel is often used for the best possible latency, but iSCSI and NFS perform very well too. Obviously you can't connect fibre to a VPS, and iSCSI and NFS perform best when isolated (separate NICs and VLAN) and with the largest MTU you can handle, so a VPS isn't suited here.

In this scenario you would need to colocate your own physical servers.

This is all assuming that is a requirement for the files you need to access, and assuming you don't just buy more storage from your provider.

You really at least need to address the points above before you can start getting into any specifics.

Edit (response to question edit):

You mentioned load-balancing. If you are using your own hardware, you will probably want some sort of active:active HA cluster.

Sam's suggestion of using Riak is a really good idea given your criteria.

Personally I think if you are going to invest in hardware and colocation you should have a solid plan on how you want to or expect to grow; this should help prevent you from investing in wrong areas.

At this stage in the game you may want to go with Sam's suggestion. Another thought may be to purchase some VPSes located around the world where you expect your usage to be, each with the storage you want (300GB should be pretty inexpensive), then replicate the data between them. You can use DNS to load balance using round-robin, or something more complicated if you like (referral by geolocation or something). Expanding storage is pretty painless on a VPS.
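The round-robin DNS part is just multiple A records on the same name; a hypothetical zone fragment (names and addresses are documentation examples):

```
; Resolvers hand back the list in rotating order, spreading downloads
; across the three mirrors. A short TTL keeps failover reasonably quick.
downloads.example.com.  300  IN  A  203.0.113.10
downloads.example.com.  300  IN  A  203.0.113.20
downloads.example.com.  300  IN  A  203.0.113.30
```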

Running your own hardware at this stage will be extremely expensive with little benefit. When/if you need TB/PB storage, it may be time to invest in hardware, in which case you just purchase hardware to provide whatever you are currently getting hosted.

Matt