2

I need to share user-uploaded content across multiple EC2 application servers. I have looked at rsync, mounted NFS, and S3 as potential options for sharing this data in near real-time. The uploaded and downloaded user files are almost always between 1 and 10 MB. Some are accessed a lot and some just once and then deleted.

My newest approach involves launching an EC2 instance strictly as a file server, separate from the application servers. With this option, when a user wants to download a file, they connect to one of the application servers, which queries the database for metadata about the file. The user is then prompted to download, which connects them to the file server for the actual transfer.

I feel like this option will be faster than my other options. The only downside I see is that I can't autoscale file servers up/down. I can, however, scale up manually and add a column to the database indicating which file server each file is located on.
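A minimal sketch of that lookup, assuming a hypothetical `files` table with `server` and `path` columns (the schema, host names, and paths here are illustrative, not part of the original question):

```python
# Sketch of routing a download to the right file server via a DB column.
# The table/column names (files.server, files.path) and the host names
# are hypothetical -- adapt them to your actual schema.
import sqlite3  # in-memory stand-in for your real database


def download_url(conn, file_id):
    """Look up which file server holds the file and build its URL."""
    row = conn.execute(
        "SELECT server, path FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown file id {file_id}")
    server, path = row
    return f"https://{server}/{path}"


# Example setup with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER, server TEXT, path TEXT)")
conn.execute(
    "INSERT INTO files VALUES (1, 'files1.example.com', 'u/42/report.pdf')"
)
print(download_url(conn, 1))  # https://files1.example.com/u/42/report.pdf
```

The application server only ever hands out URLs; the byte transfer happens against the file server, which keeps the app tier stateless.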

Is this a good approach, or am I missing something? Also, what is a good way to determine how many concurrent uploads/downloads the file server can handle based on its specs, with files between 1 and 10 MB, or is that best determined through load testing?

Also in terms of scaling, will it be a problem if 1 particular file located on just 1 file server becomes extremely popular? Would using a CDN solve this problem?

user2093708
  • I think it's a good approach. You can start with the single server instance and if / when your app grows you can scale by moving to CDN and do a database update to change the pointers to the new locations of those files. – user16081-JoeT Dec 19 '13 at 17:34
  • Curious, what made you rule out S3? – rtf Dec 19 '13 at 17:46
  • You may want to consider something like GlusterFS – ceejayoz Dec 19 '13 at 17:48
  • I haven't completely ruled out S3 but in some of my testing, latency has been very inconsistent during uploads to S3. I have also always felt S3 is more for long term storage and not for delivering, in this case, media content. If I do end up using S3, I would pair it with Cloudfront. However, I don't know if I would have the flexibility I need. Ideally, I would like to only add a file to cloudfront when it has been shared with X number of users and while I don't know for sure, this would probably be easier to do with an EC2 file server than S3 – user2093708 Dec 19 '13 at 17:54
  • I haven't really thought the following through but with Cron I could move a popular file to a different folder on the file server. This folder would be the CDN origin. Correct me if I am wrong but I think with S3 all my files would be put into CDN which doesn't make sense for me when some of the files will only be seen and viewed once by one person. – user2093708 Dec 19 '13 at 18:04
  • In my experience, S3 has been fast. Are you getting your S3 credentials from the instance profile? If so, are you caching the credentials? My experience is that there is great latency in fetching creds from profile. – Edwin Dec 19 '13 at 19:47
  • I'll give it a try that way – user2093708 Dec 19 '13 at 19:52

3 Answers

1

A CDN would be the better option for you; using S3 with CloudFront would work well. My recommendation would be to decouple the user-generated content from the application server(s): keeping your servers disposable when scaling up or down within your architecture is good design practice.

1

S3 and CloudFront would be the first option, but if you find the latency unacceptable, there are others.

If a single file server is working well for you, you could transition to a scalable, distributed file server platform like GlusterFS. This allows you to store files across multiple EC2 instances and have them appear as a single mount. You can use the "replica 2" option to create two copies of each file for redundancy, then place the two instances in different Availability Zones to increase availability. The files themselves are stored on any EC2-supported disk, which includes EBS with provisioned IOPS or even ephemeral SSD storage (I've done this before - the redundancy of Gluster makes the volatility of ephemeral storage less of a concern, so you can get the benefit of fast SSD I/O for your critical data).

1

You want to architect your EC2 instances so they don't have any unique data on them; think of them simply as compute machines.

You have a few options.

S3

A scalable and reliable service to store and retrieve files. It doesn't behave like a file system, so if you're doing lots of small reads and writes it's not a great fit.

CloudFront (CDN)

Static files (CSS, JS, images) can be served out of CloudFront, which can source its data from S3 or your EC2 instances. This greatly improves performance, so you could store your files in S3 and serve them through CloudFront.
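The asker mentioned wanting to put only widely-shared files behind the CDN. A minimal sketch of that routing decision, with the domain names and threshold purely illustrative (in practice CloudFront caches on demand, so fronting the whole bucket is also fine):

```python
# Sketch: serve popular files via a CloudFront distribution, the rest
# straight from the S3 origin. Domains and threshold are hypothetical.
CDN_DOMAIN = "d111111abcdef8.cloudfront.net"
S3_DOMAIN = "my-bucket.s3.amazonaws.com"
SHARE_THRESHOLD = 10  # promote a file to the CDN once shared this often


def file_url(key, share_count):
    """Return a CDN URL for hot files, an S3 origin URL otherwise."""
    domain = CDN_DOMAIN if share_count >= SHARE_THRESHOLD else S3_DOMAIN
    return f"https://{domain}/{key}"


print(file_url("u/42/cat.jpg", 25))     # hot file -> CloudFront
print(file_url("u/42/receipt.pdf", 1))  # one-off file -> S3 direct
```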

GlusterFS

You can use a cluster of EC2s as network attached storage. Of course this adds a little more complexity to your setup and isn't the fastest solution.

ElastiCache / Memcached

You can host your own Memcached or use the ElastiCache service. This solution isn't file storage, but it is useful as a high-performance, distributed in-memory object caching system.
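A sketch of the cache-aside pattern this enables, e.g. caching file metadata in front of the database. A plain dict stands in for Memcached here; against ElastiCache you would use a Memcached client library pointed at your cluster endpoint:

```python
# Cache-aside sketch: consult the cache before the database.
# The dict is a stand-in for Memcached; the DB lookup is hypothetical.
cache = {}


def fetch_metadata_from_db(file_id):
    # Hypothetical slow lookup -- your real DB query goes here.
    return {"id": file_id, "server": "files1.example.com"}


def get_metadata(file_id):
    """Return cached metadata, populating the cache on a miss."""
    if file_id in cache:
        return cache[file_id]
    meta = fetch_metadata_from_db(file_id)
    cache[file_id] = meta
    return meta


get_metadata(7)    # miss: falls through to the "database"
get_metadata(7)    # hit: served from memory
print(7 in cache)  # True
```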

Drew Khoury