
I've encountered a really strange bug that I've been trying, unsuccessfully, to resolve for the past few days. I have a class for caching API calls, and a class used in a WordPress plugin to create custom API endpoints. In a nutshell, the WordPress plugin hits an external API and caches the results with my caching class. Many individual items are accessed through the API, resulting in a few dozen local cache files.

The scenario

In my caching class, if a local cache has expired (it's older than the expiration time set on instantiation), the API gets hit again and results get cached as such:

file_put_contents($this->cache, $this->data, LOCK_EX);

In my WordPress plugin I want to loop through the cache files and remove any that haven't been accessed for N days. This method gets hit using cron. I'm checking the accessed time as such (this is still in development, printing for debug):

print($file . ': ' . date('F jS Y - g:i:s A', fileatime(UPLOADS_DIR . '/' . $file)));

Here's the full method so far:

public static function cleanup_old_caches($days = 30) {

    // Get the files
    $files = scandir(UPLOADS_DIR);

    // Save out .txt files
    $cache_files = array();

    // Loop through everything
    foreach ( $files as $file ) {
        if ( strstr($file, '.txt') ) {
            $cache_files[] = $file;
        }
    }

    // Loop through the cache files
    foreach ( $cache_files as $file ) {

        clearstatcache();
        print($file . ': ' . date('F jS Y - g:i:s A', fileatime(UPLOADS_DIR . '/' . $file)));
        echo "\n";
        clearstatcache();

    }

    return '';

}

You'll note I have a few clearstatcache() calls at the moment.

The problem

Any time a new cache file gets created, the accessed time reported by fileatime() for many other files in the same directory gets updated to the current time. Some of those timestamps are even a second later than the new cache file's.

Here's the full method that writes the cache:

private function hit_api() {

    // Set the return from the API as the data to use
    $this->data = $this->curl_get_contents($this->api_url);

    // Store the API's data for next time
    file_put_contents($this->cache, $this->data, LOCK_EX);

}

I can find another way to write my cleanup logic, but I'm concerned that PHP is actually touching each of these files (I've seen 12 out of 18 for one new file).
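
One way to check whether PHP itself is touching the other files is to snapshot access times before and after a single write and compare them. The following is only a minimal sketch, assuming a scratch directory under sys_get_temp_dir() instead of the real uploads directory; the helper names snapshot_atimes() and changed_atimes() are hypothetical and not part of the plugin or the caching class.

```php
<?php
// Hypothetical helper: record fileatime() for every .txt file in $dir.
function snapshot_atimes(string $dir): array {
    clearstatcache(); // make sure PHP's stat cache does not hide changes
    $atimes = array();
    foreach (glob($dir . '/*.txt') as $path) {
        $atimes[basename($path)] = fileatime($path);
    }
    return $atimes;
}

// Hypothetical helper: names present in both snapshots whose atime differs.
function changed_atimes(array $before, array $after): array {
    $changed = array();
    foreach ($before as $name => $atime) {
        if (isset($after[$name]) && $after[$name] !== $atime) {
            $changed[] = $name;
        }
    }
    return $changed;
}

// Assumption: a throwaway directory stands in for UPLOADS_DIR.
$dir = sys_get_temp_dir() . '/atime_repro_' . uniqid();
mkdir($dir);
foreach (array('a', 'b', 'c') as $name) {
    file_put_contents($dir . '/' . $name . '.txt', 'cached', LOCK_EX);
}

$before = snapshot_atimes($dir);
sleep(1); // let the clock move past the first timestamps
file_put_contents($dir . '/new.txt', 'fresh', LOCK_EX);
sleep(1); // give any background process a moment to react
$after = snapshot_atimes($dir);

// Lists the pre-existing files whose access time moved after the single write.
print_r(changed_atimes($before, $after));
```

If this prints any file names even though the script only wrote new.txt, something other than the script may be reading the other files.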

Things I've tried

  • clearstatcache() calls absolutely everywhere
  • Doing all the fopen(), fwrite(), fflush(), fclose() steps manually
  • Logging the name of the file being written at the point of the file_put_contents() call

If anybody has an idea what's going on here I'll be muuuch appreciative.

nathansh
  • I would recommend trying to reduce it down further, to the absolute minimum code to reproduce this. However, why are you using `fileatime` (access) rather than `filemtime` (modify)? – Alexander O'Mara Jan 24 '16 at 01:55
  • I'd love to remove files based on last _access_ instead of last modification. – nathansh Jan 24 '16 at 01:57
  • doesn't make sense... a file that is 3 months old could have an access time an hour old – charlietfl Jan 24 '16 at 02:05
  • @charlietfl That's the logic I want, remove files that haven't been _accessed_ in a while, not ones that were _created_ a while ago. – nathansh Jan 24 '16 at 02:13
  • Well, you should consider deleting on last modification date, as your cache should be up to date and that's not going to happen if the cached file gets often accessed while not hitting the necessary last access time to get deleted. – Charlotte Dunois Jan 24 '16 at 02:33
  • @CharlotteDunois I'm also concerned that PHP is actually doing more work than I want it to – nathansh Jan 24 '16 at 02:35
  • So yeah, I'll definitely consider using modified time. That could totally work. I'm just really confused as to why access time is being affected at all. – nathansh Jan 24 '16 at 02:38
  • `APIPenguin` will check cache before `hit_api()`? https://github.com/nathansh/APIPenguin/blob/master/APIPenguin.class.php#L90 – jsxqf Jan 24 '16 at 06:25
  • @jsxqf it does, it checks filemtime() and such, but only on that particular cache file, not other files in the directory – nathansh Jan 24 '16 at 06:32
  • Are you sure it only uses one cache file? Maybe other files are also used. Try switching to a different cache directory. – jsxqf Jan 24 '16 at 06:40
  • where does $this->curl_get_contents($this->api_url); go? Is it the same site? – Gavriel Jan 24 '16 at 07:00
  • @Gavriel that hits an external API which I'm caching locally – nathansh Jan 25 '16 at 23:24
  • Are you 100% sure that there aren't parallel requests? I mean if you look at the access.log, error.log, you only see your request? – Gavriel Jan 25 '16 at 23:27
  • @Gavriel yup, this is on my local – nathansh Jan 25 '16 at 23:42
  • This APIPenguin.class.php you mention above seems to have lots of file access functions. Can you try to remove it from the equation? – Gavriel Jan 25 '16 at 23:50

1 Answer

After a week of writing tests and recreating this issue with as little as a plain file_put_contents() call, I've found the source of the problem. Get ready for it... Spotlight was indexing these files. I excluded them from Spotlight, removed the cache files, started again, and the issue was gone.
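
Since access time can be bumped by anything that reads a file, including an indexer, a cleanup keyed on modification time (as suggested in the comments) would not be affected by this. The following is a rough sketch under that assumption, not the plugin's actual code: find_stale_caches() is a hypothetical helper, the directory is passed in rather than read from UPLOADS_DIR, and the nullable type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: list .txt files in $dir not modified within $days days.
// $now is injectable only to make the cutoff easy to test.
function find_stale_caches(string $dir, int $days = 30, ?int $now = null): array {
    $now = $now ?? time();
    $cutoff = $now - ($days * 86400); // 86400 seconds per day
    $stale = array();

    clearstatcache();
    foreach (scandir($dir) as $file) {
        $path = $dir . '/' . $file;

        // Only consider regular files whose extension is exactly "txt"
        if (pathinfo($file, PATHINFO_EXTENSION) !== 'txt' || !is_file($path)) {
            continue;
        }

        // filemtime() changes only when the file is written, not when it is read
        if (filemtime($path) < $cutoff) {
            $stale[] = $file;
        }
    }

    return $stale;
}

// Sketch of the cleanup itself: delete stale files, return how many were removed.
function cleanup_old_caches(string $dir, int $days = 30): int {
    $removed = 0;
    foreach (find_stale_caches($dir, $days) as $file) {
        if (unlink($dir . '/' . $file)) {
            $removed++;
        }
    }
    return $removed;
}
```

Called from the cron hook as cleanup_old_caches(UPLOADS_DIR, 30), this would remove cache files that have not been rewritten in 30 days. The trade-off is that it measures time since the last write, not time since the last read.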


nathansh