5

I need to serve around 70,000 static files (jpg) using nginx. Should I dump them all in a single directory, or is there a better (more efficient) way? Since the filenames are numeric, I considered having a directory structure like:

xxx/xxxx/xxx

The OS is CentOS 5.1

Ahsan
  • How large are the image files? If they're all (quite) small then a Squid cache or just the filesystem caching will make a huge difference, as most (or all) of them could be cached in memory. – David Gardner Nov 19 '09 at 12:15

12 Answers

6

It really depends on the file system you're using to store the files.

Some filesystems (like ext2 and, to a lesser extent, ext3) are hideously slow when you have thousands of files in one directory, so using subdirectories is a very good idea.

Other filesystems, like XFS or reiserfs(*), don't slow down with thousands of files in one directory, so it doesn't matter whether you have one big directory or lots of smaller subdirectories.

(*) reiserfs has some nice features, but it's an experimental toy with a history of catastrophic failures. Don't use it on anything even remotely important.

cas
4

Benchmark, benchmark, benchmark! You'll probably find no significant difference between the two options, meaning that your time is better spent on other problems. If you do benchmark and find no real difference, go with whichever scheme is easier -- what's easy to code if only programs have to access the files, or what's easy for humans to work with if people need to frequently work with the files.

As to which one is faster: directory lookup time is, I believe, proportional to the logarithm of the number of files in the directory. So each of the three lookups for the nested structure will be faster than one big lookup, but the total of all three will probably be larger.

But don't trust me, I don't have a clue what I'm doing! Measure performance when it matters!
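
For what it's worth, here's a rough sketch (in Python) of the kind of measurement I mean, assuming you've copied the images into the two candidate layouts under ./flat and ./nested (the paths and sample size are made up for illustration):

    import os
    import random
    import time

    def bench_open(paths, sample_size=5000):
        # Time opening (and immediately closing) a random sample of files.
        sample = random.sample(paths, min(sample_size, len(paths)))
        start = time.time()
        for path in sample:
            fd = os.open(path, os.O_RDONLY)  # bypass Python's buffering layer
            os.close(fd)
        return time.time() - start

    flat = [os.path.join("flat", name) for name in os.listdir("flat")]
    nested = [os.path.join(root, name)
              for root, dirs, files in os.walk("nested") for name in files]

    print("flat:   %.3f s" % bench_open(flat))
    print("nested: %.3f s" % bench_open(nested))

Drop the page cache between runs (as root: echo 3 > /proc/sys/vm/drop_caches) if you want to compare cold directory lookups rather than cached ones.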

kquinn
    You're absolutely correct about the need to measure, but you are incorrect on the lookup time. It's dependent on filesystem, and many filesystems start showing degraded performance at well below 70k files. – Christopher Cashell Jul 12 '09 at 06:37
  • Sorry if this is a silly question but ... how do I benchmark this ? – Ahsan Jul 12 '09 at 06:40
  • P.S. Note that I'm using ext3 – Ahsan Jul 12 '09 at 06:43
  • I'd imagine you just want to wrap a typical call to fopen() in a loop, then pound away opening (and quickly close()ing) a typical set of files by name. Make sure fopen() isn't lazy before you trust those results, though. @Christopher Cashell: Hence the big fat disclaimer :) – kquinn Jul 12 '09 at 07:55
4

As others have said, directory hashing is very probably the optimal approach.

What I would suggest, though, is making your URIs independent of whatever directory scheme you use, via nginx's rewrite module: e.g. map example.com/123456.jpg to /path/12/34/123456.jpg.

Then, if your directory structure ever needs to change for performance reasons, you can change it without changing your published URIs.
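
For example, something along these lines (the /var/www/images root, the two-digit split, and the assumption that names are at least five digits long are all just for illustration):

    # publish example.com/123456.jpg, store /var/www/images/12/34/123456.jpg
    server {
        listen       80;
        server_name  example.com;
        root         /var/www/images;

        location ~ "^/\d+\.jpg$" {
            # first two digits -> level-1 dir, next two digits -> level-2 dir
            rewrite "^/(\d\d)(\d\d)(\d+\.jpg)$" /$1/$2/$1$2$3 break;
        }
    }

If you later decide on a different hashing scheme, only the rewrite changes; the public URIs stay the same.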

Alnitak
3

Doing some basic directory hashing is generally a good idea. Even if your file system deals well with 70k files, having, say, millions of files in one directory would become unmanageable. Also consider how well your backup software copes with many files in one directory, and so on.

That being said: to get replication (redundancy) and easier scalability, consider storing the files in MogileFS instead of just in the file system. If the files are small-ish and some files are much more popular than others, consider using Varnish (varnish-cache.org) to serve them Very Quickly.

Another idea: use a CDN -- they are surprisingly cheap. We use one that costs basically the same as what we pay for "regular bandwidth", even at low usage (10-20 Mbit/sec).

Ask Bjørn Hansen
3

You could put a Squid cache in front of your nginx server. Squid can either keep the popular images in memory or use its own file layout for fast lookups.

For Squid, the default is 16 first-level directories and 256 second-level directories. These are reasonable defaults for my file systems.

If you don't use a product like Squid and instead create your own file structure, then you'll need to come up with a reasonable hashing algorithm for your files. If the file names are randomly generated, this is easy, and you can use the file name itself to divide them up into buckets. If all your files look like IMG_xxxx, then you'll either need to use the least significant digits, or hash the file name and divide up based on that hash.
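
As a rough illustration of both approaches in Python (the two-level split and the example names are arbitrary, not a tuned recommendation):

    import hashlib
    import os

    def bucket_by_digits(name):
        # least significant digits of a numeric name, e.g. IMG_123456.jpg -> 34/56
        digits = "".join(ch for ch in name if ch.isdigit()).zfill(4)
        return os.path.join(digits[-4:-2], digits[-2:])

    def bucket_by_hash(name):
        # hash of the whole name, for files with no usable structure, e.g. a1/c3
        h = hashlib.md5(name.encode("utf-8")).hexdigest()
        return os.path.join(h[:2], h[2:4])

    for name in ("IMG_000001.jpg", "IMG_123456.jpg", "holiday.jpg"):
        print(name, "->", bucket_by_digits(name), "or", bucket_by_hash(name))

The digit-based scheme keeps related files next to each other but only works if the names really are numeric; the hash-based one spreads any set of names evenly across buckets.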

brianegge
  • Anything that contains the words "in memory", though he didn't tell us the size of those files. – gbarry Aug 04 '09 at 04:59
  • Linux will deliver the popular files from memory anyway, without touching the file system. The backend will probably need the hashing anyway, as the files will still need to be backed up, published to, administered, etc. – Matt Aug 23 '12 at 13:11
  • @mindthemonkey do you know where I could find more information on this? E.g. how to monitor what is in memory, how to adjust config etc.? Thanks – UpTheCreek Oct 30 '12 at 08:31
  • @UpTheCreek here's a decent [overview](http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files) of the internals. Overall usage of the page cache can be seen with `free -m` or `top` or `nmon` (buffers/cached). Specific usage for files can be interrogated with `fincore` [ftools](https://code.google.com/p/linux-ftools/). And you can poke about the cache with [`vmtouch`](http://hoytech.com/vmtouch/) – Matt Nov 05 '12 at 10:00
1

As others have mentioned, you need to test to see what layout works best for your setup and usage pattern.

However, you may also want to look at the open_file_cache parameter inside nginx. See http://wiki.nginx.org/NginxHttpCoreModule#open_file_cache
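
Something like the following is a minimal example (the numbers are illustrative, not tuned values):

    # inside the http {}, server {} or location {} block
    open_file_cache          max=10000 inactive=60s;  # cache descriptors/metadata for up to 10,000 files
    open_file_cache_valid    120s;                    # re-check cached entries every two minutes
    open_file_cache_min_uses 2;                       # only cache files requested at least twice
    open_file_cache_errors   on;                      # cache "not found" lookups as well

This caches open file descriptors, sizes and modification times, which helps most when the same files are requested over and over.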

Jauder Ho
1

By all means benchmark and use that information to help you make a decision, but if it were my system I would also give some consideration to long-term maintenance. Depending on what you need to do, it may be easier to manage things if there is a directory structure instead of everything in one directory.

John Gardeniers
0

Splitting them into directories sounds like a good idea. Basically (as you may know) the reason for this approach is that having too many files in one directory makes the directory index huge and causes the OS to take a long time to search through it; conversely, having too many levels of (in)direction (sorry, bad pun) means doing a lot of disk lookups for every file.

I would suggest splitting the files into one or two levels of directories - run some trials to see what works best. If there are several images among the 70,000 that are significantly more popular than the others, try putting all those into one directory so that the OS can use a cached directory index for them. Or in fact, you could even put the popular images into the root directory, like this:

images/
  021398012.jpg
  379284790.jpg
  ...
  000/
    000/
      000000000.jpg
      000000001.jpg
      ...
    001/
      ...
    002/
      ...

...hopefully you see the pattern. On Linux, you could use hard links for the popular images (but not symlinks, that decreases efficiency AFAIK).
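
A hypothetical one-off script for that layout, in Python (the images/ path, the nine-digit padding and the "popular" set are all made up; adjust to taste):

    import os

    IMAGES = "images"
    POPULAR = set(["021398012.jpg", "379284790.jpg"])  # whatever your hot files are

    def place(src):
        name = os.path.basename(src)              # e.g. "000000001.jpg"
        stem = name.split(".")[0].zfill(9)        # zero-pad to nine digits
        subdir = os.path.join(IMAGES, stem[0:3], stem[3:6])
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
        dest = os.path.join(subdir, name)
        os.rename(src, dest)                      # assumes source and target are on the same filesystem
        if name in POPULAR:
            # hard link (not a symlink) so the popular image also appears in the root directory
            os.link(dest, os.path.join(IMAGES, name))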

Also think about how people are going to be downloading the images. Is any individual client going to be requesting only a few images, or the whole set? Because in the latter case, it makes sense to create a TAR or ZIP archive file (or possibly several archive files) with the images in them, since transferring a few large files is more efficient than a lot of smaller ones.

P.S. I sort of got carried away with the theory, but kquinn is right: you really do need to run some experiments to see what works best for you, and it's very possible that the difference will be insignificant.

David Z
0

I think it's a good idea to break the files up into a hierarchy, if for no other reason than that an ls on the directory will take less time if you ever need to drop down and run one.

0

I don't know about ext4, but stock ext2 cannot handle that many files in one dir; reiserfs (reiser3) was designed to handle that well (though an ls will still be ugly).

Ronald Pottol
0

Would it be worth it to you to dump those files into an Amazon S3 bucket and serve them from there?

Let them worry about optimization.

Gaia
0

The organization of the files has more to do with file system performance and stability than with delivery performance. I'd avoid ext2/ext3 and go with XFS or reiserfs.

You will really want to look into caching, whether it be the web server's built-in caching or a third-party cache like Varnish.

As mentioned by kquinn, benchmarking will be the real indicator of performance gains/losses.

David