My current web project has the following characteristics:

  • A website that is basically a read-only archive of information. There are no interactive actions a visitor can perform.
  • All pages of the website (currently around 15k) are pre-generated HTML files and graphics that are created on another machine.
  • The motivation behind this approach: since there is no dynamic processing and no database, the complexity of several web security aspects is much lower. Apart from that, the hope is to achieve good performance (in other words, to reduce runtime costs), since the whole website is essentially one big cache serving static files.

However, I underestimated the performance impact of keeping a large number of files in a small number of directories. Currently, the URLs of the website are mapped directly to the pre-generated directory structure on the file system: the address domain.com/categoryA/... maps to the directory webroot/pages/categoryA/..., which contains a large number of HTML pages, and reading files becomes slower and slower with every additional file added to that directory.
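To make the current setup concrete, here is a minimal sketch of the mapping described above (the web root path and file names are illustrative, not taken from the real project). The request path is simply appended to the web root, so every page of a category ends up in one flat directory:

```python
from pathlib import Path

WEBROOT = Path("/var/www/webroot/pages")  # hypothetical web root

def url_to_file(url_path: str) -> Path:
    """Translate a request path like '/categoryA/page-0001.html'
    into the corresponding pre-generated file under the web root."""
    candidate = (WEBROOT / url_path.lstrip("/")).resolve()
    # Refuse paths that escape the web root (e.g. via '..').
    if WEBROOT.resolve() not in candidate.parents:
        raise ValueError("path escapes web root")
    return candidate

# All pages of categoryA share one directory, which is exactly
# what degrades as the number of files grows:
print(url_to_file("/categoryA/page-0001.html"))
# -> /var/www/webroot/pages/categoryA/page-0001.html
```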

How could I solve this problem? Are there any web servers or server-side technologies that specifically address the problem of serving large numbers of static pages? An SEO-friendly URL structure should be preserved. Apart from that, I'm open to any suggestions.

JFo
  • How about running the lot out of AWS S3? It's ideal for static sites; there's no processing to be done, after all. I've no idea how the files-per-directory thing would work out, but it might be worth a try. – Chopper3 Apr 30 '17 at 11:09
  • What filesystem and OS are you running on the server? Virtualized or bare metal? – camelccc Apr 30 '17 at 11:21
  • I second @Chopper3's suggestion of S3. It doesn't care about how many files you have in a directory, because S3 doesn't actually *have* directories. – ceejayoz Apr 30 '17 at 11:47
  • @Chopper3 So you mean serving the files directly from S3, without any intermediary web server? That sounds very promising, although I'm not sure how the aspect of SEO-friendly URLs would work out. I will look into it. – JFo Apr 30 '17 at 12:01
  • @ceejayoz (see comment above) – JFo Apr 30 '17 at 12:01
  • @camelccc The test environment is a local bare-metal ext4 Debian server. However, the goal is to run the production environment on Amazon or Google, so probably virtualized. – JFo Apr 30 '17 at 12:02
  • @JFo Yeah, direct from S3. It has a "host as website" setting, and you can put CloudFront in front of it for a real domain name and SSL. SEO-friendly URLs can be done with some configuration (see the sketch after these comments). – ceejayoz Apr 30 '17 at 12:03
  • I read through the docs and S3 seems to be a great idea. One of you could add an answer if you want. – JFo Apr 30 '17 at 12:49
  • OK, do your web pages have links to all the files, or are you generating directory listings every time someone views a page (letting Apache search for files)? Your configuration should not care about the number of files per se, but anything that involves ls can become painfully slow. – camelccc Apr 30 '17 at 13:32
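Picking up the SEO-friendly URL question from the comments: one common way to handle this with S3 website hosting (a sketch with placeholder names, not the asker's actual layout) is to upload each pre-generated page under an extensionless key and set the content type explicitly, so the public URL keeps its clean, category-based structure:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET = "example-archive-bucket"   # placeholder bucket name
PAGES = Path("webroot/pages")       # local pre-generated pages

# webroot/pages/categoryA/some-page.html -> key "categoryA/some-page"
# The explicit ContentType makes S3 serve the object as HTML even
# though the key has no .html extension.
for html_file in PAGES.rglob("*.html"):
    key = html_file.relative_to(PAGES).with_suffix("").as_posix()
    s3.upload_file(
        str(html_file), BUCKET, key,
        ExtraArgs={"ContentType": "text/html; charset=utf-8"},
    )
```

Whether or not the keys keep the .html extension, S3 itself has no per-directory limits; the prefix is just part of the object name.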

1 Answer

How about running the lot out of AWS S3? It's ideal for static sites; there's no processing to be done, after all. I've no idea how the files-per-directory thing would work out, but it might be worth a try.
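For reference, a minimal boto3 sketch of the "host as website" setting mentioned in the comments; the bucket name and the index/error documents are placeholders, and in production you would typically put CloudFront in front of the website endpoint for a custom domain and SSL:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-archive-bucket"  # placeholder bucket name

# Enable static website hosting on the bucket.
s3.put_bucket_website(
    Bucket=BUCKET,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "404.html"},
    },
)

# The website endpoint serves objects anonymously, so they need to be
# publicly readable, e.g. via a bucket policy like this one.
s3.put_bucket_policy(
    Bucket=BUCKET,
    Policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadForWebsite",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }],
    }),
)
```

On newer buckets the "Block Public Access" settings have to be relaxed before such a policy is accepted; that is one more reason to put CloudFront in front of the bucket rather than exposing the website endpoint directly.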

Chopper3
  • After some experiments I can confirm that using S3 is a really good approach (the only problem, albeit a non-technical one, is that DDoS attacks could be really painful from a financial perspective, since there is no opt-in hard limit on bandwidth). – JFo Apr 30 '17 at 17:08
  • Do some digging on the whole AWS site; there are ways to deal with that issue, they're just not leaping to mind on a quiet Sunday afternoon :) – Chopper3 Apr 30 '17 at 17:46
  • Look at their Web Application Firewall. – Matt McDonald May 01 '17 at 04:19
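For anyone reading later: the WAF route mentioned above could be sketched roughly as follows, using the classic AWS WAF API that was current at the time (the rule name, metric name, and limit are placeholders; the newer WAFv2 API has an equivalent rate-based statement). Note that this caps request rates per client IP rather than total bandwidth, so it mitigates rather than eliminates the cost concern:

```python
import boto3

# Classic AWS WAF is a global service, used together with CloudFront.
waf = boto3.client("waf")

# Every mutating call needs a fresh change token.
token = waf.get_change_token()["ChangeToken"]

# Hypothetical rate-based rule: block any single IP that sends more than
# 2000 requests within a 5-minute window (the minimum the classic API accepts).
response = waf.create_rate_based_rule(
    Name="static-archive-rate-limit",     # placeholder rule name
    MetricName="staticArchiveRateLimit",  # placeholder metric name
    RateKey="IP",
    RateLimit=2000,
    ChangeToken=token,
)
print("Created rule:", response["Rule"]["RuleId"])
```

The rule still has to be added to a web ACL, and the web ACL associated with the CloudFront distribution, before it takes effect.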