5

I need to serve around 70,000 static files (jpg) using nginx. Should I dump them all in a single directory, or is there a better (more efficient) way? Since the filenames are numeric, I considered having a directory structure like:

xxx/xxxx/xxx

The OS is CentOS 5.1

Ahsan
  • How large are the image files? If they're all (quite) small then a Squid cache or just the filesystem caching will make a huge difference, as most (or all) of them could be cached in memory. – David Gardner Nov 19 '09 at 12:15

12 Answers

6

It really depends on the file system you're using to store the files.

Some filesystems (like ext2 and, to a lesser extent, ext3) are hideously slow when you have thousands of files in one directory, so using subdirectories is a very good idea.

Other filesystems, like XFS or reiserfs(*), don't slow down with thousands of files in one directory, so it doesn't matter whether you have one big directory or lots of smaller subdirectories.

(*) reiserfs has some nice features, but it's an experimental toy with a history of catastrophic failures. Don't use it on anything even remotely important.

cas
4

Benchmark, benchmark, benchmark! You'll probably find no significant difference between the two options, meaning that your time is better spent on other problems. If you do benchmark and find no real difference, go with whichever scheme is easier -- what's easy to code if only programs have to access the files, or what's easy for humans to work with if people need to frequently work with the files.

As to which one is faster: directory lookup time is, I believe, proportional to the logarithm of the number of files in the directory. So each of the three lookups for the nested structure will be faster than one big lookup, but the total of all three will probably be larger.

But don't trust me, I don't have a clue what I'm doing! Measure performance when it matters!
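
For what it's worth, here's a rough sketch (in Python) of the kind of measurement I mean, assuming you've copied the images into the two candidate layouts under ./flat and ./nested (the paths and sample size are made up for illustration):

    import os
    import random
    import time

    def bench_open(paths, sample_size=5000):
        # Time opening (and immediately closing) a random sample of files.
        sample = random.sample(paths, min(sample_size, len(paths)))
        start = time.time()
        for path in sample:
            fd = os.open(path, os.O_RDONLY)  # bypass Python's buffering layer
            os.close(fd)
        return time.time() - start

    flat = [os.path.join("flat", name) for name in os.listdir("flat")]
    nested = [os.path.join(root, name)
              for root, dirs, files in os.walk("nested") for name in files]

    print("flat:   %.3f s" % bench_open(flat))
    print("nested: %.3f s" % bench_open(nested))

Drop the page cache between runs (as root: echo 3 > /proc/sys/vm/drop_caches) if you want to compare cold directory lookups rather than cached ones.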

kquinn
    You're absolutely correct about the need to measure, but you are incorrect on the lookup time. It's dependent on filesystem, and many filesystems start showing degraded performance at well below 70k files. – Christopher Cashell Jul 12 '09 at 06:37
  • Sorry if this is a silly question but ... how do I benchmark this ? – Ahsan Jul 12 '09 at 06:40
  • P.S. Note that I'm using ext3 – Ahsan Jul 12 '09 at 06:43
  • I'd imagine you just want to wrap a typical call to fopen() in a loop, then pound away opening (and quickly close()ing) a typical set of files by name. Make sure fopen() isn't lazy before you trust those results, though. @Christopher Cashell: Hence the big fat disclaimer :) – kquinn Jul 12 '09 at 07:55
4

As others have said, directory hashing is very probably the optimal approach.

What I would suggest, though, is making your URIs independent of whatever directory scheme you use, via nginx's rewrite module: e.g. map example.com/123456.jpg to /path/12/34/123456.jpg.

Then, if your directory structure ever needs to change for performance reasons, you can change it without changing your published URIs.
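
For example, something along these lines (the /var/www/images root, the two-digit split, and the assumption that names are at least five digits long are all just for illustration):

    # publish example.com/123456.jpg, store /var/www/images/12/34/123456.jpg
    server {
        listen       80;
        server_name  example.com;
        root         /var/www/images;

        location ~ "^/\d+\.jpg$" {
            # first two digits -> level-1 dir, next two digits -> level-2 dir
            rewrite "^/(\d\d)(\d\d)(\d+\.jpg)$" /$1/$2/$1$2$3 break;
        }
    }

If you later decide on a different hashing scheme, only the rewrite changes; the public URIs stay the same.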

Alnitak
3

Doing some basic directory hashing is generally a good idea. Even if your file system deals well with 70k files, having, say, millions of files in one directory would become unmanageable. Also consider how well your backup software copes with many files in one directory, and so on.

That being said: to get replication (redundancy) and easier scalability, consider storing the files in MogileFS instead of just in the file system. If the files are small-ish and some files are much more popular than others, consider using Varnish (varnish-cache.org) to serve them Very Quickly.

Another idea: use a CDN -- they are surprisingly cheap. We use one that costs basically the same as what we pay for "regular bandwidth", even at low usage (10-20 Mbit/sec).

Ask Bjørn Hansen
3

You could put a Squid cache in front of your nginx server. Squid can either keep the popular images in memory or use its own file layout for fast lookups.

For Squid, the default is 16 first-level directories and 256 second-level directories. These are reasonable defaults for my file systems.

If you don't use a product like Squid and instead create your own file structure, then you'll need to come up with a reasonable hashing algorithm for your files. If the file names are randomly generated, this is easy, and you can use the file name itself to divide them up into buckets. If all your files look like IMG_xxxx, then you'll either need to use the least significant digits, or hash the file name and divide up based on that hash.
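
As a rough illustration of both approaches in Python (the two-level split and the example names are arbitrary, not a tuned recommendation):

    import hashlib
    import os

    def bucket_by_digits(name):
        # least significant digits of a numeric name, e.g. IMG_123456.jpg -> 34/56
        digits = "".join(ch for ch in name if ch.isdigit()).zfill(4)
        return os.path.join(digits[-4:-2], digits[-2:])

    def bucket_by_hash(name):
        # hash of the whole name, for files with no usable structure, e.g. a1/c3
        h = hashlib.md5(name.encode("utf-8")).hexdigest()
        return os.path.join(h[:2], h[2:4])

    for name in ("IMG_000001.jpg", "IMG_123456.jpg", "holiday.jpg"):
        print(name, "->", bucket_by_digits(name), "or", bucket_by_hash(name))

The digit-based scheme keeps related files next to each other but only works if the names really are numeric; the hash-based one spreads any set of names evenly across buckets.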

brianegge
  • Anything that contains the words "in memory", though he didn't tell us the size of those files. – gbarry Aug 04 '09 at 04:59
  • Linux will deliver the popular files from memory anyway, without touching the file system. The backend will probably need the hashing anyway, as the files will still need to be backed up, published to, administered, etc. – Matt Aug 23 '12 at 13:11
  • @mindthemonkey do you know where I could find more information on this? E.g. how to monitor what is in memory, how to adjust config etc.? Thanks – UpTheCreek Oct 30 '12 at 08:31
  • @UpTheCreek here's a decent [overview](http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files) of the internals. Overall usage of the page cache can be seen with `free -m` or `top` or `nmon` (buffers/cached). Specific usage for files can be interrogated with `fincore` [ftools](https://code.google.com/p/linux-ftools/). And you can poke about the cache with [`vmtouch`](http://hoytech.com/vmtouch/) – Matt Nov 05 '12 at 10:00
1

As others have mentioned, you need to test to see what layout works best for your setup and usage pattern.

However, you may also want to look at the open_file_cache parameter inside nginx. See http://wiki.nginx.org/NginxHttpCoreModule#open_file_cache
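
Something like the following is a minimal example (the numbers are illustrative, not tuned values):

    # inside the http {}, server {} or location {} block
    open_file_cache          max=10000 inactive=60s;  # cache descriptors/metadata for up to 10,000 files
    open_file_cache_valid    120s;                    # re-check cached entries every two minutes
    open_file_cache_min_uses 2;                       # only cache files requested at least twice
    open_file_cache_errors   on;                      # cache "not found" lookups as well

This caches open file descriptors, sizes and modification times, which helps most when the same files are requested over and over.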

Jauder Ho
1

By all means benchmark and use that information to help you make a decision, but if it were my system I would also give some consideration to long-term maintenance. Depending on what you need to do, it may be easier to manage things if there is a directory structure instead of everything in one directory.

John Gardeniers
0

Splitting them into directories sounds like a good idea. Basically (as you may know) the reason for this approach is that having too many files in one directory makes the directory index huge and causes the OS to take a long time to search through it; conversely, having too many levels of (in)direction (sorry, bad pun) means doing a lot of disk lookups for every file.

I would suggest splitting the files into one or two levels of directories - run some trials to see what works best. If there are several images among the 70,000 that are significantly more popular than the others, try putting all those into one directory so that the OS can use a cached directory index for them. Or in fact, you could even put the popular images into the root directory, like this:

images/
  021398012.jpg
  379284790.jpg
  ...
  000/
    000/
      000000000.jpg
      000000001.jpg
      ...
    001/
      ...
    002/
      ...

...hopefully you see the pattern. On Linux, you could use hard links for the popular images (but not symlinks, that decreases efficiency AFAIK).
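
A hypothetical one-off script for that layout, in Python (the images/ path, the nine-digit padding and the "popular" set are all made up; adjust to taste):

    import os

    IMAGES = "images"
    POPULAR = set(["021398012.jpg", "379284790.jpg"])  # whatever your hot files are

    def place(src):
        name = os.path.basename(src)              # e.g. "000000001.jpg"
        stem = name.split(".")[0].zfill(9)        # zero-pad to nine digits
        subdir = os.path.join(IMAGES, stem[0:3], stem[3:6])
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
        dest = os.path.join(subdir, name)
        os.rename(src, dest)                      # assumes source and target are on the same filesystem
        if name in POPULAR:
            # hard link (not a symlink) so the popular image also appears in the root directory
            os.link(dest, os.path.join(IMAGES, name))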

Also think about how people are going to be downloading the images. Is any individual client going to be requesting only a few images, or the whole set? Because in the latter case, it makes sense to create a TAR or ZIP archive file (or possibly several archive files) with the images in them, since transferring a few large files is more efficient than a lot of smaller ones.

P.S. I sort of got carried away with the theory, but kquinn is right: you really do need to run some experiments to see what works best for you, and it's very possible that the difference will be insignificant.

David Z
0

I think it's a good idea to break the files up into a hierarchy, if for no other reason than that an ls on the directory will take less time if you ever need to drop down and run one.

0

I don't know about ext4, but stock ext2 cannot handle that many files in one dir; reiserfs (reiser3) was designed to handle that well (though an ls will still be ugly).

Ronald Pottol
0

Would it be worth it to you to dump those files into an Amazon S3 bucket and serve them from there?

Let them worry about optimization.

Gaia
0

The organization of the files has more to do with file system performance and stability than with delivery performance. I'd avoid ext2/ext3 and go with XFS or reiserfs.

You will really want to look into caching, whether it be the web server's built-in caching or a third-party cache like Varnish.

As mentioned by kquinn, benchmarking will be the real indicator of performance gains/losses.

David