4

I need to set up some VPSs for serving static content (many small files). I plan on using Nginx for this, and would like to set it up so that we are able to scale out relatively easily. The requirements are:

  • Many files (at least hundreds of thousands).
  • Small file sizes (less than 10KB).
  • Files are being added constantly by an application on neighbouring servers.
  • New files must be immediately available to all Nginx servers.

My current plan is:

  • Have a 'master' server with an NFS share containing all of the files.
  • Application producing new files interacts only with master.
  • Have multiple Nginx servers mounting this NFS share.
  • Load balance across Nginx instances.

One obvious problem with this is that the 'master' server is a single point of failure (any remedy for this?). Are there other potential issues which I have overlooked? Are there elements here which will not scale well in this way? Would anyone suggest an alternative approach?

Regarding memory requirements, I'm assuming I should give each Nginx server as much as possible so that hot files can be cached (by the OS? by Nginx?) and not have to be requested from the NFS share constantly.
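Here's roughly the kind of config I have in mind; the mount point is hypothetical, and as I understand it the file contents themselves would be cached by the OS page cache, with Nginx only caching open file descriptors and metadata:

```
server {
    listen 80;
    root /mnt/nfs/files;          # assumed NFS mount point

    sendfile on;                  # serve straight from the kernel page cache

    # cache file descriptors/metadata, not file contents
    open_file_cache          max=100000 inactive=60s;
    open_file_cache_valid    30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors   on;
}
```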

Lastly, am I crazy not to use a CDN?

UpTheCreek
  • 1,628
  • 10
  • 32
  • 48
  • 3
    If you keep going down this road, you'll eventually implement a CDN. Save yourself all the trouble :) – Michael Hampton Mar 23 '13 at 16:23
  • CDNs have significant drawbacks in some use cases: for instance, inconsistency with cache invalidation, the requirement to store and retrieve files from certain geo-locations, compression and custom headers, and eventually cost (for instance, CloudFlare is way cheaper than CloudFront for serving the content of an S3 bucket). Some people live with these limits and use a CDN; some don't. Serving many small files via SPDY multiplexing can in some scenarios replace a CDN with no visible difference. An Nginx S3 proxy is the other approach, where you pay a little to AWS for S3 storage but serve content through a micro EC2 instance. – Anatoly Jul 14 '15 at 19:40

4 Answers

8

NFS does not scale. It adds latency to every request and will eventually become too big a bottleneck. We have a similar issue at work, but with photos (so, much larger files) and wrote our own software to shard and distribute them. For a few GB of files like you have, you might be able to get away with the upload process doing an HTTP PUT to all servers and doing resyncs when servers have been offline.
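A rough sketch of that fan-out approach (hostnames are placeholders, and the web servers would need PUT support, e.g. via nginx's dav module or a small upload handler):

```
#!/bin/sh
# push one new file to every web server; queue failures for a later resync
SERVERS="web1.example.com web2.example.com web3.example.com"
FILE="$1"                       # e.g. img/abc123.png

for host in $SERVERS; do
    curl -sf -T "$FILE" "http://$host/$FILE" \
        || echo "$host $FILE" >> /var/spool/resync-queue
done
```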

Or tackle it another way: have a (set of) central server(s) with all files and caching reverse proxies (squid, pound, varnish) that actually serve the files to customers.
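For the proxy variant, a minimal Varnish config could look like this (VCL 4.0 syntax; the hostname is a placeholder):

```
vcl 4.0;

backend master {
    .host = "files-master.internal";   # the central file server
    .port = "80";
}

sub vcl_backend_response {
    set beresp.ttl = 1h;               # cache small static files for an hour
}
```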

And you're not crazy not to use a CDN. You're crazy if you don't investigate whether it's worthwhile though :-)

Dennis Kaarsemaker
  • 19,277
  • 2
  • 44
  • 70
  • I see - good to know, thank you! I was really hoping to avoid having the application servers be aware of the number/configuration of nginx servers and doing multiple operations (e.g. having to handle the situation where one PUT fails). Regarding resyncs - you mean using rsync? Might it be viable just to update one server and have rsync deal with propagation after each file put? – UpTheCreek Mar 23 '13 at 10:25
  • 1
    Resyncs can be done with rsync indeed. The rsync-after-upload trick is what we used to do, but it doesn't scale all that well. If you don't want your appservers to be intelligent (which is not a bad idea), the proxy suggestion might work better for you. – Dennis Kaarsemaker Mar 23 '13 at 10:27
  • Thanks. One last question then :) - I've no experience with caching proxies - do they cache in RAM only? Or do they also cache to local disk? Just wondering what amount of RAM would be required to keep them going back to the content servers. – UpTheCreek Mar 23 '13 at 10:57
  • Ah, looks like varnish can use ram and disk. – UpTheCreek Mar 23 '13 at 11:04
  • If you're going to cache files or use a CDN, you need a careful file naming strategy. Most important, every modification to a file _must_ generate a new filename -- you can never change a file while keeping the name the same. Also think carefully about how you will delete files and invalidate them in the cache or CDN. – Mike Scott Jul 14 '15 at 16:55
2

Use cachefilesd (and a recent Linux kernel with FS-Cache support) to cache NFS files to a local disk. Every read from the NFS share will copy the file to a directory under /var/cache/fscache, and subsequent reads will be served from there, with the kernel checking against the NFS server whether the content is still valid.

This way you can have a central NFS share without losing the performance of local files.

Cachefilesd will take care of culling old files when free space/inodes fall below a configured level, so you can serve uncommon data from NFS and common requests from the local disk.
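A minimal sketch of the two pieces involved (the export path, mount point and thresholds below are examples, not recommendations):

```
# /etc/fstab -- mount the share with 'fsc' so reads go through FS-Cache
master:/export/files  /mnt/files  nfs  defaults,fsc  0 0

# /etc/cachefilesd.conf -- where to cache and when to cull
dir /var/cache/fscache
tag mycache
brun  10%   # stop culling once free space rises back above 10%
bcull  7%   # start culling when free space drops below 7%
bstop  3%   # stop caching entirely below 3% free
```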

After setting this up, use Varnish to deliver the content; it will cache the most-used requests, saving a ton of requests to nginx/NFS.
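Varnish can keep its cache in RAM or back it with a file on local disk; the sizes and backend address below are placeholders:

```
# RAM-only cache:
varnishd -a :80 -b localhost:8080 -s malloc,2G

# or a mostly-on-disk cache (note: file storage does not persist
# across restarts):
varnishd -a :80 -b localhost:8080 -s file,/var/lib/varnish/cache.bin,20G
```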

Here is a small cachefilesd howto.

higuita
  • 1,173
  • 9
  • 13
1

I would recommend getting a single (potentially dedicated) server for this, instead of using several individual VPS servers and separate nginx instances connected through NFS. If you're thinking about using VPS and NFS, I don't think your concerns about scalability are justified.

nginx does almost all of its caching through the filesystem of the machine, so if you're going to use nginx for this, you must make sure you have an operating system with excellent filesystem performance and caching. Make sure your kernel has enough vnodes, etc.

If you're still thinking about separate machines (my suggestion, as above, is to use one machine with one nginx), then it might make sense to investigate varnish. Varnish does all of its caching in virtual memory, so you wouldn't have to worry about vnodes or cache inefficiencies with smaller files. Since it's using virtual memory, its cache can be as large as physical memory + swap.

I would highly recommend against squid. If you want to know why, just look at a varnish presentation, which describes why virtual memory is the best way to go for an acceleration proxy. But varnish only does acceleration, so if you're using a single host with static files and good filesystem caching (e.g. FreeBSD), then nginx would probably be the best choice (otherwise, with varnish, you'll end up with the same content double-cached in multiple places).

cnst
  • 13,848
  • 9
  • 54
  • 76
  • Thanks. Not sure I understand this sentence `If you're thinking about using VPS and NFS, I don't think your concerns about scalability are justified.` - could you clarify? At some point 1 machine is not going to cut it (VPS or not), so I'd rather have things set up for scale-out before that. – UpTheCreek Mar 25 '13 at 17:33
  • That's in the sense that NFS is not particularly known to be a performant solution; also, planning for scalability is a good idea, but if you know that your site can run out of a small VPS, then I think it might be just a little bit premature to think about scalability, and, thus, if you're really concerned, you might as well spend more money on a dedicated server, and think about scalability later. Again, with static content, scalability is really not a concern in the first place: it could be implemented with rsync and one more server in a snap. – cnst Mar 26 '13 at 16:07
  • @UpTheCreek, let me put it to you this way: do you plan on having more than 100Mbps throughput all the time, is that why you're concerned about scalability? If not, then your static content, served through nginx, will never outgrow a single dedicated server (perhaps even with a platter-based HDD, since 100Mbps is only 10MB/s after you convert bits to bytes and account for some overhead). If your files are really tiny, and can't be cached in RAM due to access being really random, and traffic is between 100Mbps and 10Mbps, then a platter-based HDDs might not work, but an SSD still would. – cnst Mar 26 '13 at 16:21
  • `That's in the sense that NFS is not particularly known to be a performant solution`, but that's the reason for my question, and as Dennis mentioned, NFS might not be a good fit. `but if you know that your site can run out of a small VPS` - I never said this, only that we are using VPS as the basis. There are many sites that have grown to a large scale on VPS (e.g. Pinterest, Reddit - which has 200-odd EC2 instances AFAIK). The whole reason for my question is that I want to consider scalability of this part of the architecture now rather than later. – UpTheCreek Mar 26 '13 at 18:23
1

No production server design can have a single point of failure.

Therefore you need at least two machines as load balancers; you can use a load balancer like HAProxy, which has all the features you may need (check this HAProxy architecture example). The actual request load you will face is lots of small-file requests hitting an NFS storage system.
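As an illustration, a minimal haproxy.cfg fragment for this layer might look like the following (hostnames, addresses and the health-check path are placeholders):

```
frontend static_in
    bind *:80
    default_backend caches

backend caches
    balance roundrobin
    option httpchk GET /health
    server cache1 10.0.0.11:80 check
    server cache2 10.0.0.12:80 check
```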

The number of cache servers depends on your resources and on client demand; HAProxy can be configured to add or remove cache servers.

The NFS file request is the most expensive operation, so you need some form of caching on your "cache" machines.

The cache server has 3 storage layers; you want the most common files to be available locally, and preferably in RAM.

  • NFS, by far the slowest. (REMOTE)
  • Local Storage, fast. (LOCAL)
  • RAM, ultra fast. (LOCAL)

This can be solved with nginx, Squid, or Varnish.

Nginx can cache files locally using the slowfs_cache module; this is a good slow fs tutorial.

With this setup Nginx stores files on the local filesystem and serves them from there. You can use PURGE to remove a modified file from the cache; it is as simple as making a request with the word "purge" in the request string.
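A sketch based on the third-party ngx_slowfs_cache module's documented directives (paths, zone name and the purge ACL are examples):

```
# inside http {}: where the local copies live
slowfs_cache_path /var/cache/nginx/slowfs levels=1:2 keys_zone=fscache:10m;
slowfs_temp_path  /var/cache/nginx/tmp;

server {
    location / {
        root               /mnt/nfs/files;   # the slow NFS mount
        slowfs_cache       fscache;
        slowfs_cache_key   $uri;
        slowfs_cache_valid 1d;
    }

    # e.g. GET /purge/img/logo.png drops that file from the local cache
    location ~ ^/purge(/.*) {
        allow 127.0.0.1;                     # only trusted hosts may purge
        deny  all;
        slowfs_cache_purge fscache $1;
    }
}
```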

With slowfs, Nginx also benefits from the RAM the OS provides as page cache, so increasing the RAM available to the OS will improve average request speed. However, if your data set exceeds the server's RAM size, you still need to cache the files on the local filesystem.

Nginx is a multipurpose server and is not extremely fast, at least not as fast as dedicated static caching servers such as Squid or Varnish. However, if NFS is your problem, Nginx solves 90% of it.

Squid and Varnish are very fast and have APIs to remove files from the cache.

Squid uses RAM and the local filesystem for its cache. Varnish keeps its cache in RAM.
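For reference, a minimal squid.conf fragment showing that RAM/disk split (the sizes are placeholders):

```
cache_mem 512 MB                               # hot objects kept in RAM
maximum_object_size_in_memory 16 KB            # our files are under 10KB
cache_dir aufs /var/spool/squid 20000 16 256   # ~20GB on-disk cache
```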

Squid is old, and most benchmarks show that Varnish is faster than Squid at dispatching static files.

However, when Varnish crashes, the RAM cache is lost and the whole server can take a long time to recover. Therefore a crash is a big problem for Varnish.

Squid handles crashes better because it also uses the local storage disk and can warm some of its cache from there instead of hitting the NFS.

For optimal performance serving small static files, you can use Nginx and either Squid or Varnish.

Other file sizes require a different approach.