
I have a web application that serves binary files (images, etc.). Our application runs on Amazon EC2. We were originally going to use Amazon S3 to store and serve these files, but that is no longer an option.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons. Amazon offers Elastic Block Storage (EBS), which allows you to mount a volume of up to 1 TB to a single instance. We will have multiple instances accessing this data in parallel.

What I was thinking of doing is using a distributed file system like MogileFS/GlusterFS/[insert-more-here] on top of Elastic Block Storage (EBS).

So my question is: What are others currently doing to create a scalable, redundant file storage system (a few hundred TBs) on Amazon EC2 without using Amazon S3? Data will still be backed up to Amazon S3, but all reads would come off the file system.

Thanks in advance. If anyone needs clarification on anything, please feel free to ask.

William
Could you run Pound or some other HTTPS proxy on a small instance and still keep everything in S3? – cagenut Dec 20 '09 at 21:09

4 Answers


At Azouk (the formerly linked domain is now dormant/parked) we don't use Amazon EC2, but we do use GlusterFS (1.4.0qa92) for serving all content such as PDFs, user files and thumbnails, and also for offline data analysis. IMHO there should be no problem deploying the same architecture on Amazon's cloud; we already make heavy use of virtualization ourselves (OpenVZ in particular). The only potential constraint is mounting GlusterFS via FUSE (the virtualization layer could forbid this), but AFAIK that is possible on Amazon.

So, I recommend Gluster and sorry I can't help specifically with Amazon :)
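For illustration, mounting a GlusterFS volume over FUSE on a client (such as an EC2 instance) looks roughly like this with current GlusterFS releases; the hostname gluster1.internal, the volume name content, and the mount point are placeholders:

    # Hypothetical example: mount a GlusterFS volume on a client via FUSE
    mount -t glusterfs gluster1.internal:/content /mnt/content

    # Or make it persistent in /etc/fstab:
    # gluster1.internal:/content  /mnt/content  glusterfs  defaults,_netdev  0 0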

Steffen Opel

A terribly old question that suddenly bubbled up on the front page again... :-)

So my question is: What are others currently doing to create a scalable, redundant file storage system (a few hundred TBs) on Amazon EC2 without using Amazon S3?

Nothing; on AWS you would use S3 for 100 TB of BLOB storage, as anything else would be nonsensical.

We need to transfer these files over HTTPS using a CNAME. This is obviously impossible with Amazon S3 for many technical reasons.

True, but it is possible by other means.

Since you need HTTPS access on your own domain name, you would set up a couple of HTTPS servers (or proxies) on EC2 nodes, to act as SSL encryption/decryption gateways between the Internet and S3.

I have never worked with Apache Traffic Server (formerly Inktomi), but it looks like a great fit for this. Otherwise nginx or Apache could be used for the SSL handling, together with Squid or Varnish if you want caching.

At a high level, the request-response flow looks something like this:

Internet request via HTTPS -->
(optional) Elastic Load Balancing -->
EC2 instance with an SSL-capable HTTP proxy (e.g. nginx) -->
plain unencrypted HTTP to S3

In addition, you'll need a deterministic way to handle URL rewriting, e.g. https://secure.yourdomain.com/<id> is rewritten to http://<bucket>.s3.amazonaws.com/<id>.
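As a rough sketch of such a gateway (not a drop-in config), an nginx server block that terminates SSL for secure.yourdomain.com and proxies requests to the bucket over plain HTTP might look like this; the certificate paths and <bucket> are placeholders:

    # Hypothetical nginx SSL gateway in front of S3; adjust names and paths
    server {
        listen 443 ssl;
        server_name secure.yourdomain.com;

        ssl_certificate     /etc/nginx/ssl/secure.yourdomain.com.crt;
        ssl_certificate_key /etc/nginx/ssl/secure.yourdomain.com.key;

        location / {
            # https://secure.yourdomain.com/<id> --> http://<bucket>.s3.amazonaws.com/<id>
            proxy_set_header Host <bucket>.s3.amazonaws.com;
            proxy_pass http://<bucket>.s3.amazonaws.com;
        }
    }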


I know that Acquia runs Gluster on EBS with EC2. So technically it appears to work.

3dinfluence
wow I just realized this is a really old question... I wonder how it got up to the top of the list where I saw it. – 3dinfluence Sep 09 '10 at 23:39

I am currently working on building a replicated clustered file system based on Gluster 3.1 and EBS, with access via FUSE client.

If you have a substantial investment in a web app with lots of file calls baked into it, and you want to move to multiple load-balanced app servers and scalable, replicated storage without rewriting all of your file-access code, this seems to be pretty much your only simple option.

I haven't completed the project, so I don't have much feedback on a finished result yet. There is a simple tutorial here.
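For context, creating a two-way replicated Gluster 3.1 volume over EBS-backed bricks looks roughly like this; the hostnames, brick paths and volume name are placeholders, and each /export/brick directory is assumed to live on an attached EBS volume:

    # Hypothetical sketch: two EC2 instances, each with an EBS-backed brick
    gluster peer probe server2.internal
    gluster volume create webfiles replica 2 transport tcp \
        server1.internal:/export/brick server2.internal:/export/brick
    gluster volume start webfiles

    # On each app server, mount the volume via the FUSE client:
    mount -t glusterfs server1.internal:/webfiles /var/www/files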