
I have a minor problem, hoping someone can shed some light on the situation.

THE SITUATION:

I have a custom partial-caching mechanism built into a PHP CMS. In short, when a template in the CMS is processed, the 'cacheable' PHP code is executed while the 'non-cacheable' PHP code is left untouched, and the resulting code is saved to a file that is processed on future visits to that page.

THE PROBLEM:

I am experiencing file-access delays when the system is 'looking' for the cached file. With roughly 250,000 cached files, using glob() to find the matching file takes about 1/4 second when there is no traffic on the site, and sometimes 10-15 seconds during traffic spikes. It almost seems like two separate client sessions cannot run glob() simultaneously, so they bottleneck.

WHAT I AM LOOKING FOR:

... is an alternative method, or an optimization, that provides block caching without the bottleneck issues, while taking my particular constraints (outlined below) into account. I need a faster way to access these files or an alternative partial-page-cache direction to go in :/

==============================================

ABRIDGED CODE:

// var to hold cached page path, if found
$pageCache = NULL;

// get the URL for current page
// there is actually some other code here that could alter the 'theURL4Cache' var for various reasons, but for simplicity in this example let's just keep it as the REQUEST_URI
$GLOBALS['theURL4Cache'] = $_SERVER['REQUEST_URI'];

// check if existing cache file is in place
$filePattern = 'parsed/page_cache/*^' . $_SERVER['SERVER_PORT'] . '^' . $_SERVER['HTTP_HOST'] . '^' . (($_SESSION['isMobile']) ? 'M' : 'D') . '^L' . $language . '^T*^P' . $attributes['pageId'] . '^' . md5($GLOBALS['theURL4Cache']) . sha1($GLOBALS['theURL4Cache']) . '.php';
$fileArr = glob($filePattern);

// possible multiple files found that fit / expired files found that fit the pattern
// lets grab the newest file and try to use it
if(count($fileArr)){
    rsort($fileArr);
    $file = $fileArr[0]; // get file with latest expire date
    if($file > 'parsed/page_cache/' . date('Y-m-d-H-i-s')) $pageCache = $file; // set an attribute to hold the valid file path to the cached file
    // remove files that are no longer current
    for($i=(($pageCache === NULL) ? 0 : 1); $i<count($fileArr); $i++) unlink($fileArr[$i]);
}
if($pageCache){
  // cached page is found, lets process and output this puppy
  include($pageCache);
} else {
  // cached page is not found, let's build a cacheable page from the CMS template
  $newCode = // ..... various code is processed here to isolate the cacheable code blocks and process while leaving the non-cacheable blocks intact ..... //
  // create the new file path where the cached code will be placed
  // first we need an expiration date
  $cacheDate = date_create();
  date_add($cacheDate, date_interval_create_from_date_string( $cache_increment . ' ' . $cache_interval)); // $cache_increment and $cache_interval are stored in the CMS DB for each page, giving the content manager control over the expiration of the page in cache
  $filePath = $GLOBALS['iProducts']['physicalRoot'] . "/parsed/page_cache/" . date_format($cacheDate,'Y-m-d-H-i-s') . "^" . $_SERVER['SERVER_PORT'] . '^' . $_SERVER['HTTP_HOST'] . '^' . (($_SESSION['isMobile']) ? 'M' : 'D') ."^L" . $language . "^T" . $templateId . "^P" . $attributes['pageId'] . "^" .md5($GLOBALS['theURL4Cache']) . sha1($GLOBALS['theURL4Cache']) . '.php'; // create the file path
  if(file_exists($filePath)) unlink($filePath); // delete the file if it already exists
  $fp = fopen($filePath,"w"); // create the new file
  flock($fp,LOCK_EX);
  fwrite($fp,$newCode); // write the cache file
  flock($fp,LOCK_UN);
  // now output this puppy
  eval($newCode);
}

WHY IS YOUR FILENAME SUCH A MESS, YOU ASK?

Well, I'm glad you asked! Another part of the CMS includes 'smart cache management': if a page or a template is modified by a content manager, all cached pages affected are purged from the system. In addition, the content of a page can vary due to the attributes in the URL query string, whether it is rendered for a mobile device or not, SSL vs non-SSL, the domain name, or the current session language (the engine supports multiple-language content, all associated with the same page and conditionally output based on the session language).

So here is a cached page file name example: 2014-02-14-10-36-36^80^www.mydomain.com^M^L^T42^P41^a067036ef358f12a0049740f035a7ee688dbb0033c19a70163d6c453dbc5b84f1889ffe2.php

Here are the components of the file name: expire-date^port^domain^mobileOrDesktop^Language^Template^Page^md5+sha1OfURL.php

Here are the components explained:

  • expire-date: the calculated date/time this cache file should expire, based on the content manager's entry in the CMS. This lets glob() filter out all expired files so a CRON job can delete them for cleanup, and it is also used at the beginning of the code to decide whether the cached page is fresh enough to display.
  • port: 80 or 443 to signify if this was retrieved over SSL. Content may be conditionally different based on the SSL state.
  • domain: "www.mywebsite.com". Multiple domain names can be attached to a single CMS installation, so this differentiates them and keeps two domains with the same REQUEST_URI from showing each other's content.
  • mobile or desktop: "M" or "D" - to allow same URL to 'sniff out' client and serve content accordingly.
  • language: "L" if not using multiple languages, "L-ENG / L-GER /..." if using multi language
  • template: "T#" so templateId 47 would be "T47" - allows for easy filter by file name to identify all cached pages using a given template to remove when the template is modified in the CMS.
  • page: "P#" so pageId 12 would be "P12" - allows for easy filter by file name to identify all cached versions of a given page to remove when the page is modified in the CMS.
  • md5+sha1OfURL.php: takes $_SERVER['REQUEST_URI'] and hashes it TWICE (once with MD5, once with SHA1), concatenating the results to give a (reasonably) unique ID representing the URL (since query strings can impact content). A small parsing sketch for this filename format follows below.
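
For reference, here is a rough sketch (not the production code) of how a filename in this shape could be split back into its parts for the CRON cleanup and the 'smart' purge filters; the function and array key names are mine, purely illustrative:

// illustrative only - splits a cache filename of the form
//   expire^port^host^M|D^L...^T#^P#^md5sha1.php
// back into its components
function parseCacheName($path) {
    $parts = explode('^', basename($path, '.php'));
    if (count($parts) !== 8) return NULL; // not a cache file
    return array(
        'expires'  => $parts[0], // e.g. 2014-02-14-10-36-36
        'port'     => $parts[1], // 80 or 443
        'host'     => $parts[2],
        'device'   => $parts[3], // 'M' or 'D'
        'language' => $parts[4], // 'L' or 'L-ENG', ...
        'template' => $parts[5], // 'T42'
        'page'     => $parts[6], // 'P41'
        'urlHash'  => $parts[7]  // md5 . sha1 of the URL
    );
}

// e.g. the CRON cleanup could then do:
// if ($info['expires'] < date('Y-m-d-H-i-s')) unlink($file);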

Any ideas or advice is welcome. Thanks in advance!

mwex501
  • 250k files in a single directory? that's... crazy. You're forcing Linux to load up the directory file and parse through all 250k entries every time you run your glob operation. You should optimize by auto-splitting into sub-directories, e.g. if your file hash is `abcdefg`, you should put the file into an `a/b/c/abcdefg` type sub-dir pattern. Do whatever level of layering makes the most sense, and also keep the number of files per-dir to some manageable number. – Marc B Feb 18 '14 at 04:43
  • Thanks, Marc B. I have considered auto splitting into sub directories but then I run into other issues, such as searching based on the cache expiration date for the CRON clean up, and searching based on other factors like the pageId and templateId (for the 'smart' clean up of cache files as pages/templates are updated in the CMS). Basically it steals from Peter to give to Paul. – mwex501 Feb 18 '14 at 04:56
  • `find` can scan files based on inode data, such as last access time or last modification time. there's your scan-for-deletion in a nutshell. – Marc B Feb 18 '14 at 04:57
  • You might consider symlinking those files in a separate directory that the cron will look into? – Ja͢ck Feb 18 '14 at 04:57
  • Marc B: Filtering by access time/last modification time would work great if all pages had the same expiration, but unfortunately they do not; they can be set individually by page or by template. – mwex501 Feb 18 '14 at 05:00
  • Jack: I'm not sure where you are headed with your recommendation, could you provide more details? – mwex501 Feb 18 '14 at 05:01

1 Answer


A few ideas:

  1. You may want to consider an in-memory caching system such as memcached or Redis rather than an on-disk one. The I/O load of reading from and writing to disk can be substantial.

  2. There are templating engines (like SMARTY http://www.smarty.net/) which will do a lot of this heavy lifting for you.

  3. Rather than using globs to search directories, you should probably use a consistent hash rule for saving file names and then a secondary set of hashes for your tags. For example, save all files like this:

    $fname = sha1($GLOBALS['theURL4Cache'] . ALL_THE_REST_OF_THOSE_VARIABLES);
    $cachedir1 = substr($fname, 0, 1);
    $cachedir2 = substr($fname, 0, 2);
    $finalname = 'parsed/page_cache/' . $cachedir1 . '/' . $cachedir2 . '/' . $fname . '.php';
    

This would mean a file with a sha1 of abcdefgh would be saved as 'parsed/page_cache/a/ab/abcdefgh.php'. This will help reduce file system degradation.
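
The read side then becomes a direct stat and include instead of a directory scan - a rough sketch under the same naming rule, assuming the expiry now lives in the file's mtime (or inside the file itself) since a deterministic name can no longer carry a variable expire date ($ttlSeconds here is a stand-in for whatever per-page lifetime the CMS stores):

    // rebuild the same deterministic name for the incoming request
    $fname = sha1($GLOBALS['theURL4Cache'] . ALL_THE_REST_OF_THOSE_VARIABLES);
    $cachePath = 'parsed/page_cache/' . substr($fname, 0, 1) . '/' . substr($fname, 0, 2) . '/' . $fname . '.php';

    // direct lookup - no glob(); a file older than its TTL is treated as a miss
    if (is_file($cachePath) && filemtime($cachePath) > time() - $ttlSeconds) {
        include $cachePath;
    } else {
        // rebuild the page, then write it back to $cachePath as before
    }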

Now, as long as you know the file name, you will be able to find the file - essentially you're using the directory path as a big hash table. For your tags, set up another set of hashes - these can be saved as files on disk or in a database or in memory. Whenever a page is added to one of your tags, add the sha1 digest or file name to the appropriate hash. For example in a file called 'parseHashes/is_mobile.php' you could have an array like

    $arr = array(
        "abcd",
        "defg"
    );

Each entry would be the sha1 digest that you need to look up and/or delete. When you need to manipulate the set of files in a tag, you can just iterate over the stored sha1 digests and you'd know where to find the file contents.
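
Purging a tag then just means walking that stored list - a minimal sketch, assuming the files are sharded as above and that the tag file returns its array (the file and tag names are illustrative):

    // hypothetical purge of everything tagged with template 42
    $digests = include 'parseHashes/template_42.php'; // the tag file would need to `return $arr;`
    foreach ($digests as $fname) {
        $path = 'parsed/page_cache/' . substr($fname, 0, 1) . '/' . substr($fname, 0, 2) . '/' . $fname . '.php';
        if (is_file($path)) unlink($path);
    }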

All that being said, you should really look into a memory-based or even a database-backed solution for this. Doing this on the filesystem is a lot of work and doesn't scale well.
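
As a rough illustration of the memory-based route (assuming the phpredis extension and a local Redis server; the key and tag names here are made up), the rendered block and its tags could be stored like this:

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $key = 'page:' . sha1($GLOBALS['theURL4Cache']); // illustrative key scheme
    $redis->setex($key, 3600, $newCode);             // TTL replaces the expire date in the filename
    $redis->sAdd('tag:template:42', $key);           // tag sets replace the T#/P# filename filters
    $redis->sAdd('tag:page:41', $key);

    // purging a template later:
    foreach ($redis->sMembers('tag:template:42') as $k) {
        $redis->del($k);
    }
    $redis->del('tag:template:42');

The fetched entry could then be eval'd the same way the cached file contents are now.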

dethtron5000
  • Great ideas! The system was originally a database (MySQL) based cache system, but I ran into row/table locking issues - after testing the current file-based method I found it to be a lot faster and more stable. Would a memcache solution be able to manage that volume of data (250,000 pages)? I'm going to toy around with some of the components of your recommendation and see if I can make it work with the rest of the requirements. – mwex501 Feb 18 '14 at 05:13
  • You may really want to check out Redis or Memcached. I think that Redis has tagging built in so it may be ideal for your use case. – dethtron5000 Feb 18 '14 at 05:16