
[I asked this on stackoverflow.com, but they thought that this list would be better]

I have a slowly evolving dynamic website served from J2EE. The response time and load capacity of the server are inadequate for client needs. Moreover, ad hoc requests can unexpectedly affect other services running on the same application server/database. I know the reasons and can't address them in the short term. I understand HTTP caching hints (expiry, ETags, ...) and, for the purpose of this question, please assume that I have maxed out the opportunities to reduce load that way.

I am thinking of doing a brute-force traversal of all URLs in the system to prime a cache and then copying the cache contents to geo-dispersed cache servers near the clients. I'm thinking of Squid or Apache httpd's mod_disk_cache. I want to prime one copy and (manually) replicate the cache contents; I don't need a federation or any intelligence amongst the slaves. When the data changes and invalidates the cache, I will refresh my master cache and update the slave versions, probably once a night.
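To make that concrete, the priming step I have in mind is roughly the following sketch (hostnames, port and paths are placeholders; the spider is pointed through the caching proxy so that every response lands in its disk cache):

    # Prime the master cache by spidering every URL through the caching proxy.
    # cache-master.example.com:3128 and app.example.com are placeholders.
    http_proxy=http://cache-master.example.com:3128 \
        wget --recursive --level=inf --no-verbose \
             --delete-after --no-directories \
             http://app.example.com/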

Has anyone set up an HTTP cache and then replicated it? Is it a good idea? Are there other technologies I should investigate? I can program this, but I would prefer a solution built by configuring open-source technologies.

Thanks

PS (context): The root problems are certainly:

  1. Database query load on the DB server.
  2. Business logic load on the web/application server.

The response time is often dozens of seconds (please, don't ask). As mentioned, I cannot address them in the short term (or rather, I am addressing them, but there are a great many of them, and they are not JSP-based, and ...). I have clients with USA, European and Asian users, so I would very much like to replicate the cache once I have primed it. For internal corporate users, an Akamai-like CDN is not appropriate. I'd like to tar/gzip the cache and FTP it out to the slaves. In other cases, the cache server, but not the app, needs to sit in a DMZ.
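For what it's worth, the nightly replication I have in mind is no fancier than this sketch (hostnames, paths and credentials are made up, and I realise the slave's cache service may need to be stopped or restarted around the copy):

    # Package the master's primed disk cache and push it to one slave.
    # All names below are placeholders.
    CACHE_DIR=/var/cache/www
    STAMP=$(date +%Y%m%d)

    tar czf "/tmp/cache-$STAMP.tar.gz" -C "$CACHE_DIR" .

    # Plain FTP upload; repeat (or loop) for each regional slave.
    ftp -n slave-eu.example.com <<EOF
    user cacheuser cachepassword
    binary
    put /tmp/cache-$STAMP.tar.gz cache-$STAMP.tar.gz
    bye
    EOF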

2 Answers


In theory, a big wget -r followed by tarballing could do the trick. The problem is that, in practice, if you can get by with a wget run (no actual content changes between runs), you can usually produce a largely-static site just as easily (replace the dynamic pages with static ones).
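Something along these lines (names made up; --adjust-extension is called --html-extension on older wgets) would pull a static snapshot that any plain web server can serve with no app server or database behind it:

    # Mirror the dynamic site into a static tree that a plain httpd can serve.
    # app.example.com and /srv/www/static are made-up names.
    wget --mirror --page-requisites --convert-links \
         --adjust-extension --no-parent \
         --directory-prefix=/srv/www/static \
         http://app.example.com/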

If you weren't so keen on geolocation, a brutally over-caching Varnish setup could do the job -- you can do wonderful things with Varnish around automatically invalidating cache entries when the data backing a cached page changes.

I'm not sure what your need for geolocation is. If you're thinking it'll move performance from "unacceptable" to "acceptable", that's very unlikely; whereas if you need geolocation for a reason unrelated to your current performance woes, it might be a matter of deciding which is the lesser of two evils -- fix your site's performance issues with Varnish, or fix whatever geolocation is supposed to fix. With a site that takes seconds(!) to render a page, it'd have to be something pretty major to have a higher priority for a fix.

I saw a customer do this Varnish trick with a horribly inefficient Tomcat site running an online store; they cached the bejesus out of everything (with ESI to handle the customer login stuff) and had the admin interface prod Varnish to say "clear out these URLs" whenever someone changed, say, a product price or description. It was ugly, but it worked well enough to keep them afloat until their app was fixed.
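The "prod Varnish" part was nothing more exotic than the admin app shelling out to something like the command below whenever content changed; the URL pattern is illustrative, and the exact CLI command depends on your Varnish version (ban in 3.x and later, purge/url.purge in 2.x):

    # Invalidate every cached page for product 1234 (pattern is illustrative).
    # Needs access to Varnish's admin port; adjust the command name for your version.
    varnishadm -T localhost:6082 -S /etc/varnish/secret \
        "ban req.url ~ ^/products/1234"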

womble

Have you considered using memcached (http://memcached.org/)?

servermanfail