
I am interested in making an app where users can upload large files (~2 MB) that are converted into HTML documents. This application will not have a database. Instead, these HTML files are stored in a particular writable directory outside of the document source tree, so this directory will grow larger and larger as more files are added to it. Users should be able to view these HTML files by visiting the appropriate URL. All security concerns aside, what do I need to worry about if this directory continues to grow? Will accessing the files inside take longer when there are more of them? Could the app crash because of this? Should I create a new directory every 100 files or so to prevent this?

If it is important, I want to make this app using Pyramid and Python.

BigBoy1337

2 Answers


You might want to partition the directories by user, app, or similar so that they're easy to manage anyway; for example, if a user stops using the service you could just delete their directory. Also, I presume you'll be zipping the files up. If you keep the storage well decoupled, you'll be able to change your mind later.
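
To make that concrete, here is a minimal sketch of per-user partitioning; the storage root and function names are my own assumptions, not anything from the question:

    import os
    import shutil

    # Hypothetical writable directory outside the document source tree.
    STORAGE_ROOT = "/var/app-data/documents"

    def user_dir(username):
        """Return the per-user storage directory, creating it if needed."""
        path = os.path.join(STORAGE_ROOT, username)
        os.makedirs(path, exist_ok=True)
        return path

    def save_document(username, filename, html):
        """Write a converted HTML document into the user's directory."""
        with open(os.path.join(user_dir(username), filename), "w") as f:
            f.write(html)

    def delete_user(username):
        """If a user stops using the service, drop their whole directory."""
        shutil.rmtree(user_dir(username), ignore_errors=True)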

I'd be interested to see how something like SQLite would work for you, as you could have a SQLite db per partitioned directory.
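
As a rough illustration of that idea (the schema and the file name index.db are made up), each partition could carry its own small index:

    import os
    import sqlite3

    def open_index(partition_dir):
        """Open (or create) a small SQLite index living inside one partition."""
        db = sqlite3.connect(os.path.join(partition_dir, "index.db"))
        db.execute(
            """CREATE TABLE IF NOT EXISTS documents (
                   name TEXT PRIMARY KEY,
                   original_size INTEGER,
                   created REAL
               )"""
        )
        return db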

I presume the HTML files are larger than the files that were uploaded, so why store the big HTML file?

Are things like MongoDB etc. out of the question? As your app scales to multiple servers, you have the issue of accessing files that live on a different server, unless you pick the right server in the first place using some technique. Then it's possible you've got servers sitting idle because no one wants their documents.
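
One common technique for "picking the right server in the first place" is to hash the document name onto a fixed server list, so any node can compute where a file lives without a lookup. A toy sketch (the server list is invented):

    import hashlib

    # Hypothetical pool of storage servers.
    SERVERS = ["files1.example.com", "files2.example.com", "files3.example.com"]

    def server_for(name):
        """Deterministically map a document name to one server."""
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

Note that adding or removing a server reshuffles almost every mapping; consistent hashing is the usual fix if the pool changes often.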

Why the limitation of just storing files in a directory? Is it a POC?

EDIT

I find value in reading things like http://blog.fogcreek.com/the-trello-tech-stack/ and I'd advise you to find a site already doing what you do and read about their tech stack.

As someone already commented, why not use Amazon S3 or similar?

Ask yourself realistically how many users you imagine. Do you really want to spend a lot of energy worrying about being the next Facebook and building the ultimate backend tech stack, when you could be getting your stuff out there and used?

Years ago I worked on a system that stored insurance certificates on the filesystem; we used to run out of inodes!

Dare I say it's a case of suck it and see what works for you and your app.

EDIT

HAProxy, I believe, is meant to handle those load-balancing concerns.

As a user, I imagine I'd want http://docs.yourdomain.com/myname/document.doc, although I presume there are security concerns with the name being so obvious.
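
Since the question mentions Pyramid, here's a rough sketch of serving such URLs from the writable directory. The route pattern, storage root, and the realpath containment check are assumptions of mine, not a vetted security measure:

    import os
    from pyramid.config import Configurator
    from pyramid.httpexceptions import HTTPNotFound
    from pyramid.response import FileResponse

    STORAGE_ROOT = "/var/app-data/documents"  # hypothetical writable directory

    def serve_document(request):
        user = request.matchdict["user"]
        name = request.matchdict["name"]
        path = os.path.join(STORAGE_ROOT, user, name)
        # Refuse paths that escape the storage root (e.g. "../" tricks).
        root = os.path.realpath(STORAGE_ROOT) + os.sep
        if not os.path.realpath(path).startswith(root):
            raise HTTPNotFound()
        if not os.path.isfile(path):
            raise HTTPNotFound()
        return FileResponse(path, request=request, content_type="text/html")

    def make_app():
        config = Configurator()
        config.add_route("document", "/{user}/{name}")
        config.add_view(serve_document, route_name="document")
        return config.make_wsgi_app()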

sotapme
  • what about creating a different directory for every server? So for servers 1-... the URL would be www.domain.com/1/dlfksjd.html or www.domain.com/2/sldkjrr.html, for example. How could I arrange this with a web hosting provider? – BigBoy1337 Feb 15 '13 at 19:15
  • concerning MongoDB or other similar datastores, I am under the impression that while these have the advantage of not being bogged down by large files, the read access (something very important to me) would not be as fast as it would be with filesystem storage. Please correct me if I am wrong. – BigBoy1337 Feb 15 '13 at 19:30
  • Read something like http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of. You could argue it's similar, as they're storing images that have been transformed somehow. I'd read the real-life experiences of people actually doing it and not the opinions of SO folk. :D – sotapme Feb 15 '13 at 19:39

This greatly depends on your filesystem. You might want to look up which problems the git folks encountered (git also uses a purely filesystem-based database).

In general, it is wise to split that directory up, for example by taking the first two or three letters of the file name (or of a hash of it) and grouping the files into subdirectories based on that key. You'd have a structure like:

uploaddir/
    00/
        files whose name sha1 starts with 00
    01/
        files whose name sha1 starts with 01

and so on. This takes some load off the filesystem by partitioning the possibly large directories. If you want to be sure that no user can mount a denial-of-service attack by deliberately uploading files whose names hash to the same initial characters, you can also seed the hash differently, salt it, or something like that.
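
A minimal sketch of that sharding scheme, with a server-side salt so the prefixes aren't predictable (the salt value, storage root, and prefix length are assumptions):

    import hashlib
    import os

    STORAGE_ROOT = "uploaddir"    # the upload directory from the example above
    SECRET_SALT = b"change-me"    # server-side salt; keep it out of user hands

    def shard_path(filename):
        """Map a file name to uploaddir/<first two hex chars of its sha1>/name."""
        digest = hashlib.sha1(SECRET_SALT + filename.encode("utf-8")).hexdigest()
        directory = os.path.join(STORAGE_ROOT, digest[:2])
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, filename)

With two hex characters you get 256 buckets; take the first three for 4096 buckets if the partitions still grow too large.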

Specifically, the effects of large directories are pretty filesystem-specific. Some filesystems might become slow, some may cope really well, and others may have per-directory limits on the number of files.

Jonas Schäfer