
To speed up a MediaWiki site whose content uses a lot of templates, but is otherwise essentially static once the templates have done their job, I'd like to set up a Squid server; see

https://www.mediawiki.org/wiki/Manual:PurgeList.php

and

https://www.mediawiki.org/wiki/Manual:Squid_caching

and then fill the Squid server's cache "automatically" with a script that makes wget/curl calls hitting all pages of the MediaWiki. My expectation is that after this procedure every single page is in the Squid cache (if I make it big enough) and each subsequent access is served by Squid.
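The warm-up script I have in mind would look roughly like this (only a sketch: it assumes jq is installed, that api.php lives under the same /mediawiki path, and that the requests go through the Squid address so they actually populate its cache):

#!/bin/sh
# Walk the wiki's page list via the API and request every page once,
# so Squid gets a chance to cache the rendered HTML.
WIKI=http://XXXXXX/mediawiki   # placeholder host, same as below
APCONTINUE=
while :; do
  RESP=$(curl -s -G "$WIKI/api.php" \
    --data-urlencode "action=query" \
    --data-urlencode "list=allpages" \
    --data-urlencode "aplimit=500" \
    --data-urlencode "format=json" \
    --data-urlencode "apcontinue=$APCONTINUE")
  echo "$RESP" | jq -r '.query.allpages[].title' | while read -r TITLE; do
    curl -s -o /dev/null -G --data-urlencode "title=$TITLE" "$WIKI/index.php"
  done
  APCONTINUE=$(echo "$RESP" | jq -r '.continue.apcontinue // empty')
  [ -z "$APCONTINUE" ] && break
done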

How would I get this working? For example:

  1. How do I check my configuration?
  2. How would I find out how much memory is needed?
  3. How could I check that the pages are in the squid3 cache?

What I tried so far

I started out by finding out how to install squid3.
I figured out my IP address xx.xxx.xxx.xxx (not disclosed here) via ifconfig eth0.

In /etc/squid3/squid.conf I put:

http_port xx.xxx.xxx.xxx:80 transparent vhost defaultsite=XXXXXX
cache_peer 127.0.0.1 parent 80 3130 originserver

acl manager proto cache_object
acl localhost src 127.0.0.1/32

# Allow access to the web ports
acl web_ports port 80
http_access allow web_ports

# Allow cachemgr access from localhost only for maintenance purposes
http_access allow manager localhost
http_access deny manager

# Allow cache purge requests from MediaWiki/localhost only
acl purge method PURGE
http_access allow purge localhost
http_access deny purge

# And finally deny all other access to this proxy
http_access deny all

Then I configured my apache2 server

# /etc/apache2/sites-enabled/000-default.conf   
Listen 127.0.0.1:80

I added

$wgUseSquid = true;
$wgSquidServers = array('xx.xxx.xxx.xxx');
$wgSquidServersNoPurge = array('127.0.0.1');

to my LocalSettings.php

Then I restarted apache2 and started squid3 with

service squid3 restart

and did a first access attempt with

wget --cache=off -r http://XXXXXX/mediawiki

The result is:

Resolving XXXXXXX (XXXXXXX)... xx.xxx.xxx.xxx
Connecting to XXXXXXX (XXXXXXX)|xx.xxx.xx.xxx|:80... failed: Connection refused.
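To make questions 1-3 above more concrete, these are the kinds of checks I am after (a sketch; it assumes Ubuntu's squid3 binary and the separately packaged squidclient tool):

# 1. Does the configuration parse at all?
sudo squid3 -k parse

# 2. Is anything listening on port 80 of the public address?
sudo netstat -tlnp | grep ':80'

# 3. Once Squid answers, is it reporting cache hits? Squid adds an
#    X-Cache header (HIT or MISS) to the responses it handles.
curl -sI 'http://XXXXXX/mediawiki/index.php?title=Main_Page' | grep -i x-cache

# 4. Memory use and object counts appear in the cache manager report,
#    e.g. via squidclient (the manager ACL above only allows requests
#    whose source address is 127.0.0.1, so this may need adjusting).
squidclient -h xx.xxx.xxx.xxx -p 80 mgr:info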
Wolfgang Fahl

1 Answer

Assuming Apache 2.x.

While not Squid-related, you can achieve this with Apache modules alone. Have a look at mod_cache: https://httpd.apache.org/docs/2.2/mod/mod_cache.html

You can simply add this to your Apache configuration and ask Apache to do disk caching of rendered content.

You need to ensure your content has appropriate cache expiry information in the resulting PHP response; MediaWiki should take care of this for you.
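If the responses turn out not to be cacheable out of the box, the knobs on the MediaWiki side sit in LocalSettings.php; a minimal sketch (18000 seconds is, as far as I know, the shipped default for $wgSquidMaxage):

$wgUseSquid = true;        // tell MediaWiki a shared HTTP cache sits in front
$wgSquidMaxage = 18000;    // s-maxage value MediaWiki puts into Cache-Control

With $wgUseSquid enabled, MediaWiki emits Cache-Control: s-maxage=... headers that a shared cache such as mod_cache or Squid can honour.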

Adding such a cache layer may not have the desired outcome, as this layer does not know when a page has changed; cache management is difficult here, and this approach should only be used for genuinely static content.

Ubuntu:

a2enmod cache cache_disk

Apache configuration:

CacheRoot /var/cache/apache2/mod_disk_cache
CacheEnable disk /
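A quick way to see whether a page is actually being served from the disk cache is to look at the response headers; a cached response carries an Age header added by mod_cache (a sketch, the URL is a placeholder):

# The first request should miss; repeating it within the expiry window
# should come back with an "Age" header set by the cache.
curl -sI 'http://XXXXXX/mediawiki/index.php?title=Main_Page' | grep -i '^age'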

I would not recommend pre-filling your cache by accessing every page. This will only cause dormant (not frequently used) pages to take up valuable space / memory. If you still wish to do this, you may look at wget:

Description from: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.org \
     --no-parent \
         www.website.org/tutorials/html/

This command downloads the Web site www.website.org/tutorials/html/.

The options are:

    --recursive: download the entire Web site.

    --domains website.org: don't follow links outside website.org.

    --no-parent: don't follow links outside the directory tutorials/html/.

    --page-requisites: get all the elements that compose the page (images, CSS and so on).

    --html-extension: save files with the .html extension.

    --convert-links: convert links so that they work locally, off-line.

    --restrict-file-names=windows: modify filenames so that they will work in Windows as well.

    --no-clobber: don't overwrite any existing files (used in case the download is interrupted and
    resumed).
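On a MediaWiki, a plain recursive crawl will also follow edit, history and Special: links that are not worth caching, so it is worth filtering those out; a rough sketch (needs wget 1.14+ for --reject-regex; host and regex are illustrative):

# Crawl article views only and skip action/Special/old-revision links
# that would just bloat the cache.
wget --recursive --level=inf --no-clobber \
     --reject-regex '(action=|Special:|oldid=)' \
     'http://XXXXXX/mediawiki/index.php?title=Main_Page'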

A better option: Memcached

MediaWiki also supports the use of Memcached as a very fast in-memory caching service, for data and templates only. This is not as blunt an instrument as a site-wide cache like Squid or Apache mod_cache. MediaWiki manages Memcached itself, so any changes are immediately reflected in the cache store, meaning your content will always be valid.

Please see the installation instructions at MediaWiki here: https://www.mediawiki.org/wiki/Memcached
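For reference, the MediaWiki side is only a few lines in LocalSettings.php once memcached itself is running; a minimal sketch (127.0.0.1:11211 is the usual default address, adjust as needed):

$wgMainCacheType = CACHE_MEMCACHED;                 // main object cache
$wgParserCacheType = CACHE_MEMCACHED;               // cache of rendered (post-template) parser output
$wgMemCachedServers = array( '127.0.0.1:11211' );   // memcached instance(s) to use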

My recommendation is not to use Apache mod_cache or Squid for this task, and instead to install Memcached and configure MediaWiki to use it.

Drew Anderson
  • Thank you for looking into this and the detailed answer. The reason I was asking for squid has now been added to the question. Mediawiki can talk to squid but I don't know whether it would talk to the apache proposal you are making here. – Wolfgang Fahl Nov 15 '15 at 06:39
  • Traditionally MediaWiki was designed to be deployed in a LAMP-style environment (Linux, Apache, MySQL, PHP) and I have deployed it in this style a number of times. Assuming you are following that structure, the above will work fine, as you will already be using Apache as your HTTP frontend. mod_cache within Apache would behave similarly to the Squid cache you are trying to set up here. Still, I don't believe a content cache like Squid / Apache will be a good choice here; a cache accelerator like Memcached, which is designed into MediaWiki, would be better suited. – Drew Anderson Nov 16 '15 at 22:11
  • This is an absolutely wrong and harmful answer: you're mixing different cache layers and not taking purges into account. Memcached is an *object* cache. MW uses it to cache results of expensive queries/computations. It is necessary for a performant wiki (or, with a single appserver, APC shared storage), however it's not sufficient. Squid (or, better, Varnish) is an HTTP cache. It caches the resulting HTTP responses and as such it allows a tremendous reduction of Apache load. Wikipedia wouldn't be able to function at the present load levels without well-tuned HTTP caching. – MaxSem Nov 20 '15 at 22:22
  • (continued) Speaking of mod_cache, I haven't heard of anyone using it with MW and as such I doubt it's fully supported. For example, does it support HTCP purges on page edits to avoid serving stale data? – MaxSem Nov 20 '15 at 22:24